Skip to content

Profiling and debugging Python

Here we will describe several tools and strategies for profiling Python code. These are generally in order from simplest to most complex, and we recommend that you also profile your application in a similar order.

Keep it simple

Often a simple tool can provide almost as much information as a complex tool with far less headache. Our general profiling philosophy is to start with the easy tools and move to complex tools as you require them.

Arm Performance Reports for Python

Arm Forge offers an easy to use profiling tool called "Performance Reports". To use this tool, you'll just need to

module load allinea-forge
perf-report srun -n 32 -c 2 python -args

This automatically will write .txt and .html output files which can be viewed in your text editor or browser, respectively. To view in your browser, you'll probably want to scp the data back to your local machine. No additional instrumentation or customization is required. Outputs include memory usage and MPI information. An example performance report output.


Python has a nice, built-in statistical profiling module called cProfile. You can use it to collect data from your program without having to manually add any instrumentation. You can then visualize the data you collected with several tools including SnakeViz and gprof2dot.

cProfile can be used to collect data for a single process or many processes (for a Python code using mpi4py, for example.)

We will start with an example to profile a single process. In this example, we will profile a Python application on Cori and transfer the results to our local machine to visualize. This example will generate a .prof file which contains the data we ultimately want.

Now you can run you application and collect data using cProfile:

python -m cProfile -o path/to/your/script arg1 arg2

Once your application is done running you can use scp or Globus to transfer the file to your local machine.

Now let's examine a more complex case in which we profile several processes from a Python program using mpi4py based on this page. We might want to do this to check that we are adequately load-balancing among the MPI ranks, for example. This example will write many files; one for each MPI rank. To avoid making a mess we recommend creating a new directory for your results and cd into this directory before running your application. This will also write some human-readable cpu_0.txt files (which are nice as a sanity-check but ultimately not as useful).

You will need to modify the beginning of your script by adding:

import cProfile, sys
from mpi4py.MPI import COMM_WORLD

You will also need to surround the part of your script that calls the main function with:

pr = cProfile.Profile()

And then at the end of your program, you should add:

# Dump results:
# - for binary dump
pr.dump_stats('' %comm.rank)
# - for text dump
with open( 'cpu_%d.txt' %comm.rank, 'w') as output_file:
    sys.stdout = output_file
    pr.print_stats( sort='time' )
    sys.stdout = sys.__stdout__

Now you can run your script as you would normally (i.e. no cProfile command is required):

python /path/to/your/script arg1 arg2

This modification will make your application write results to both a .prof file and a .txt file for each mpi rank. Then you can use scp or Globus to transfer the output files to your local machine for further analysis.


Now that we have created a .prof file (or files) it's time to visualize the data. One browser-based option for this is SnakeViz.

You can install SnakeViz using pip:

pip install snakeviz

We recommend that you pip install inside a conda environment. Please see the NERSC page on pip installation for more information.

Then in a command line, navigate to the directory where your .prof file is located, and type:


This should open an interactive window in your default browser that displays the data in your .prof file. By default data are displayed in a sunburst plot:


The call-stack depth can be adjusted to show deeper functions. You can also click on a particular function which will generate a new plot with the selected function now at the center.

Data can also be displayed in an icicle plot:


SnakeViz also creates a table listing the most expensive functions in the call-stack in descending order:



Another option for visualizing data in a .prof file is gprof2dot. gprof2dot differs from SnakeViz in that it writes static image files (usually .png) which are perhaps easier to view and share than their browser-based cousins, but they are also not interactive.

gprof2dot displays the data in dot-graph format. Embedded in this graph is also the call-stack information. With gprof2dot it is easier to visualize the typical flow of your application than it is in SnakeViz.

Note: to use gprof2dot, you must also have graphviz installed.

You can install gprof2dot using pip install:

pip install gprof2dot

We recommend that you pip install inside a conda environment. For more information see using pip at NERSC.

Note that for gprof2dot to work correctly, you must use Python 3. Also note that you also need to either copy the file to the directory in which you will run the command, or you will need to add it to your search path through something like:

export PYTHONPATH=/path/to/where/pip/installed/gprof2dot:$PYTHONPATH

To generate the gprof2dot dot files from cProfile .prof data:

python -f pstats | dot -Tpng -o output.png

A typical gprof2dot image created from the same .prof data used in the SnakeViz section above is shown below.


If you have an MPI application and have created many .prof files, you can use shell script magic to gprof2dot your data in a single command. Navigate to the directory in which your .prof files are located. Here is an example for KNL (68 files):

for i in `seq 0 67`; do python3 -f pstats cpu_$ | dot -Tpng -o cpu_$i.png; done


Once you have discovered where the expensive parts of your application are, it is probably time for another tool called line_profiler. This is a particularly handy tool because it gives line-by-line performance information for a selected function.

line_profiler can be installed with pip:

pip install line_profiler

We recommend that you pip install inside a conda environment. For further reading see using pip at NERSC.

def _xypix(arg1,arg2,arg3)
   DESI sample function
   return result

Once you have decorated the functions for which you would like line-by-line performance data, you can run line_profiler:

kernprof -l

This will write an output file with the name of your script, i.e. script_to_profile.lprof.

You'll need one more step before you get nice, human-readable output:

python -m line_profiler script_to_profile.lprof > line_profiler_out.txt

Here is the line_profiler output for our expensive function_xypix:


Arm Map for Python

Arm Forge has added support for Python to their Map profiling tool. Since this is a "big tool" we recommend that you try this tool after you have profiled your code with something simple and have some idea what you're looking for. That said, this is probably the easiest "big tool" that we support for Python on our systems. It can profile multiprocessing and single or multinode MPI jobs out of the box. Perhaps the only major drawback is that it cannot display information for individual multiprocessing threads or MPI ranks; it will only show the collected statistics for all profiled threads/ranks. If you require simultaneous information for individual threads/ranks you may consider VTune or Tau instead (see below).

There are two ways to use Map on our systems: either using NX or using a Remote Client/reverse connection on your own system (i.e. your laptop).

Arm Forge Map using NX

Log onto NERSC using NX as usual. Then, open a connection to Cori and get an interactive node:

salloc --account <NERSC project> --qos interactive ...

Once you have a node:

module load allinea-forge
map python ...

Once you have started your program with the map command, the Arm Forge GUI window will open. Click run to start the profiling run. Don't panic: it will add some overhead to the normal runtime of your code. When the profiling data collection is finished and the statistics are aggregated, a window that displays the data from your application will automatically open. You will find the metrics displayed as a function of time, callstack information, MPI information, etc.

If you are profiling an MPI application, you will need to run: map srun python The difference is the srun command. One other MPI option is to use the --select-ranks option if you would like to profile only specific MPI ranks.

If you are profiling a Python multiprocessing application, you will run something like: map --no-mpi python Unlike MPI, you will not need an srun command and adding the --no-mpi flag saves a little time (although it is not required). Note: The number of cores displayed in the top left of the Map GUI will be incorrect. This is a known issue for Python multiprocessing that the Map team is addressing. The rest of the profiling information should be valid, but if you notice inconsistencies, please file a ticket and we will pass your information on to the Map team.


Arm Forge Map using Remote Client

The workflow for using the Map profiler from the Remote Client is similar to using NX, but with a few extra complications. The tradeoff for these additional steps is that you'll get faster response and better graphics resolution using the GUI then you would over NX.

On your local machine

You'll first need to visit Arm Forge's website to download a copy of their remote client for your laptop/desktop. You may need to scroll down the page a bit to find the remote client. Follow the steps to install the remote client software for your OS.

Open the GUI and click on the dropdown menu under Remote Launch. Click the Configure option. Then click Add.

**Connection name:** Cori  
**Host name:**  
**Remote installation directory:** /global/common/sw/cray/cnl6/haswell/allinea-forge/default  
**Remote script:** /global/common/sw/cray/cnl6/haswell/allinea-forge/remote-init  
**Uncheck the Proxy through login node box**

Hit ok-- this will save your connection configuration to Cori.

Then in the GUI main menu, under Remote Connection, hit Cori. You will have to enter your login credentials for Cori (password + OTP). If you are using sshproxy, you should not be prompted to enter your information. Once your remote client has successfully launched, you will begin your profiling run on Cori.

On Cori

After you have started the remote client, ssh to Cori as you would normally.

salloc --account <NERSC project> --qos interactive ...

Once you have an interactive node:

module load python
module load allinea-forge
map --connect python ...

Once you execute this command on Cori, you will be prompted to accept a reverse connection on your local machine. Click Accept and the profiling window will open and automatically start.


You must uncheck some metrics

Unlike the NX version, you will need to click the Metrics box and uncheck all but the first two CPU options. Otherwise you will have errors and the profiling will fail.

Once you have unchecked all but the first two CPU metrics, you can click OK and profiling will start in the same way that it did via NX. After profiling the GUI will display the results of your profiling.

Intel VTune

Intel VTune is best used after other types of profiling have been able to give you some idea what your problem (or problems) are. For example, do you have some lines of code that are slow for reasons you don't understand? Do you spend a lot of time in MPI calls? Do you have a load-imbalance problem (i.e. one of your MPI ranks is working much harder than the others)? If so it is likely time to start using heftier tools like VTune and Tau (described in the next section).

VTune shines in that it can tell if you memory access problems are slowing you down, where your MPI overhead is coming from, and if any MPI functions that have an explicit (or implicit) barrier are causing unnecessary waiting in your application.

Using VTune to profile Python codes requires a few changes from the workflow of profiling compiled C/C++/Fortran codes. NERSC recommends that users first familiarize themselves with the general VTune documentation before using VTune to profile Python code. Here we summarize the changes required in order to use VTune to profile Python code effectively on Cori.

Unfortunately VTune has limited functionality for Python applications: only three types of collection modes are currently supported:

  1. hotspots (general performance overview)
  2. memory-consumption (memory-specific information)
  3. threading (threading analysis)

VTune works best with Intel Distribution for Python

If one attempts to profile a code with VTune which uses Anaconda Python, typically VTune will stop profiling the code immediately, and will provide no useful warning or error message. It is strongly recommended that users instead use Intel Distribution for Python (IDP) when profiling their codes with VTune.

At this point we recommend logging on to Cori via NX to view the results in the VTune GUI. You can also use -X or -Y for ssh forwarding, but it will be very frustratingly slow. As a result we highly recommend NX for the next steps in this process.

To view results of a VTune profile of a Python code, first log onto Cori via NX. Then run

module load vtune

and then run


to start the VTune Profiler GUI. Use the file browser to navigate to the .amplxe file written in your results_dir.nid00058 directory and click open. An overview tab will display some high level results. For more specific information, try selecting the Bottom-up tab. It should look something like:


You can see that in our DESI example, much of the time is spent in MPI calls. This is something that would have been hard to figure out using other methods of profiling.

You can also try selecting the Caller/Callee tab which will display call-stack information about your application. Sometimes it can be difficult to find what you are looking for, especially for Python calls which are often intermixed with much lower-level functions. This is why we recommend profiling with a simpler tool first so you have an idea of what information you are looking for.


Known Vtune issue: Stack size errors require additional VTune flag to work around

VTune will often issue the following warning message when profiling Python codes, even when using IDP:

Stack size provided to sigaltstack is too small. Please increase the stack size to 64K minimum

Intel documents this error message and recommends that users add the VTune flag -run-pass-thru=--no-altstack to avoid the error.


Tau is probably the most complex profiler described on this page, but also the most useful.

If you can, build your own Tau

NERSC does offer a pre-built Tau module that you can access using module load tau but we don't recommend this approach for Python. Tau works best when it is built from scratch under exactly the conditions your application will use. This may sound like a difficult task, but if done carefully, it will produce excellent Tau functionality that will give you a great deal of information about how your Python code performs.

In this example, we will build Tau to profile the same DESI code we have been investigating throughout this page.

First download Tau and configure Tau with the options that match your needs:

gunzip tau.tgz; tar -xvf tau.tar;
cd tau-2.27.1
./configure -python -mpi -arch=craycnl
make install clean -j 8

A word about how we build Tau in this example: we chose the -python flag for our Python application and the -mpi flag since we are using MPI. Also since we are building Tau to use on Cori we specify -arch=craycnl since Cori is a Cray machine. Then we configure Tau with these options. The -j specifies to use 8 processors to configure Tau, but please don't exceed this number (doing so will really make things miserable for other users on your login node-- don't be that person).

Once we have successfully configured Tau, we need to set some very important environment variables:

export LD_LIBRARY_PATH=/path/to/your/conda/lib:$LD_LIBRARY_PATH
export PATH=/global/cscratch1/sd/elvis/tau-2.27.1/craycnl/bin:$PATH

LD_LIBRARY_PATH should point to the Python distribution you are using on Cori. In our case it is to the DESI conda version. PATH should point to the location in which you installed and configured Tau.

There is still one important step before you can run your Python MPI code. (If you do not have an MPI code you can skip this step.) You will need to edit the highest level function of the program you intend to profile (in our case it is At the top of this file you will need to add:

import tau

if __name__ == '__main__':
    args = extract.parse()"extract.main(args)")

where main is the name of the main function in your job. You MUST add this wrapper to your Python MPI code to enable Tau to profile it.

Now we're ready to run. Request an interactive job. Then either cd into the directory with your Python script (or specify the full path to your script) and try:

time srun -u -n 32 -c 2 tau_python ./ arg1 arg2

Once your application is done running, Tau should have written one (or many, in the case of MPI) profile.0.0.0 files in the directory where you executed your run command.

Use the Tau Paraprof GUI on your local machine

We recommend moving these files to your local machine to use the Tau analysis GUI. (The Tau GUI uses so much memory it will crash NX, so just move the files to your local system. The GUI response will also be much faster.)

To do this we must first pack the files on Cori:

paraprof --pack datafile.ppk

and then use scp or Globus to transfer the files to your local machine.

You should install and configure Tau on whichever machine you will use to view the results. You must also set an environment variable on your local machine to point to your Tau installation. On my mac it looks like:

export PATH=/Applications/tau-2.27.1./apple/bin:$PATH

Once you have configured Tau on your local machine you can now unpack the files we transferred from Cori:

paraprof --dump datafile.ppk

Once you have unpacked the files you can type

paraprof in your command prompt in the directory with your unpacked files. This should start the Tau Paraprof GUI. It should open a window that looks something like:


Clicking on one of the functions will open another window which shows how much execution time is spent on this function (in our case, our friend _xypix) for each MPI rank. It also displays the max, min, and standard deviation.


You can also click on a node number (but really, this should be labeled as rank number) and it will open another window which shows a top-down list of time spent in functions for this rank. It should look something like:


Using the custom-built Tau functionality makes it easy to see exactly what your Python program is doing, particularly if it is an MPI program. In our case we can see that only 20 of the 32 ranks are actually doing any work, which leads to a lot of wasted execution time. (Note that VTune also indicated this, but it was much more difficult to understand why exactly this was happening.) For the ranks that are doing some work, it is good to see that they are relatively load-balanced. If they were not, this would be an easy way to see that their workload needs to be distributed more effectively.

Tau has many additional features and functionality which are not described here (including tracing, which shows the performance of your application as a function of time). These may be worth investigating depending on your particular Python profiling needs.

Debugging Python

Parallel debugging can be a challenge.

Arm Forge has released a parallel Python debugger.

To use this tool at NERSC to debug an mpi4py code:

module load allinea-forge
ddt srun -n 2 python %allinea_python_debug%