HPCToolkit is an integrated suite of tools for measurement and analysis of application performance on both CPUs and GPUs. HPCToolkit’s measurements support analysis of a program’s execution cost, inefficiency, and scaling characteristics both within and across nodes of a parallel system.
HPCToolkit’s workflow is organized around four principal capabilities:
- measurement of context-sensitive performance metrics using call-stack unwinding while an application executes (hpcrun)
- binary analysis to recover program structure from application binaries (hpcstruct)
- attribution of performance metrics by correlating dynamic performance metrics with static program structure (hpcprof)
- presentation of performance metrics and associated source code (hpcviewer)
Loading the HPCToolkit Module on Cori
HPCToolkit may fail to work properly when the darshan module is loaded. Until darshan is unloaded automatically when hpctoolkit is loaded, please unload it before proceeding to the next step.
To use HPCToolkit, first load the hpctoolkit module to set up the correct environment:
module load hpctoolkit
If you are working on a GPU-accelerated application on the Cori-GPU system, first load the cgpu module before loading the toolkit to get the Cori-GPU user environment:
module purge
module load cgpu
module load hpctoolkit
There are two versions of HPCToolkit on our system: one for Cori and a separate installation for Cori-GPU. The GPU version supports -e gpu=nvidia and -e gpu=nvidia,pc, the latter for fine-grained measurements using PC sampling. You must load the cgpu module to access the GPU-enabled HPCToolkit.
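The module steps above can be collected into a small helper. The following is a minimal sketch, not a NERSC-provided script: the function name hpctk_module_cmds and its cpu/gpu argument are hypothetical, and the function only prints the commands described above so they can be reviewed before being run in a login shell.

```shell
# Print the module commands for a target system ("cpu" for Cori, "gpu" for Cori-GPU).
# hpctk_module_cmds is a hypothetical helper; it echoes commands instead of running them.
hpctk_module_cmds() {
    case "$1" in
        gpu)
            # Cori-GPU: start from a clean environment, then load cgpu before hpctoolkit.
            echo "module purge"
            echo "module load cgpu"
            echo "module load hpctoolkit"
            ;;
        cpu)
            # Cori: unload darshan first, since it may interfere with HPCToolkit.
            echo "module unload darshan"
            echo "module load hpctoolkit"
            ;;
        *)
            echo "usage: hpctk_module_cmds cpu|gpu" >&2
            return 1
            ;;
    esac
}

hpctk_module_cmds gpu
```

Printing first keeps the sketch safe to run anywhere; once the output looks right, the same lines can be typed or pasted into the shell where the module command is available.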
Compiling your application on Cori
For the most detailed attribution of application performance data using HPCToolkit, code must be compiled with line map information (such as -g -O3). Check your compiler documentation for the right set of options to generate line mappings without compromising optimization.
For GPU-accelerated applications, compile, build, and run on Cori-GPU and add -gpu=lineinfo to generate line-number information for device code. When using the NVIDIA compiler, additionally use -gopt -fast to generate line mappings without compromising performance.
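As an illustration of the options above, the sketch below assembles a compile line for each case. The compiler names (cc, nvc++), the source and output file names, and the helper hpctk_compile_cmd are assumptions made for illustration only; substitute whatever your build actually uses and confirm the flags against your compiler's documentation.

```shell
# Print a compile command that keeps line-map information for HPCToolkit.
# hpctk_compile_cmd, the compiler names, and the file names are illustrative assumptions.
# Arguments: <cpu|nvidia> <source> <output>
hpctk_compile_cmd() {
    case "$1" in
        cpu)
            # -g adds line-map information; -O3 keeps optimization enabled.
            echo "cc -g -O3 -o $3 $2"
            ;;
        nvidia)
            # -gpu=lineinfo emits line-number information for device code;
            # -gopt -fast keeps optimized code generation alongside debug information.
            echo "nvc++ -gpu=lineinfo -gopt -fast -o $3 $2"
            ;;
        *)
            echo "usage: hpctk_compile_cmd cpu|nvidia <source> <output>" >&2
            return 1
            ;;
    esac
}

hpctk_compile_cmd cpu app.c app
hpctk_compile_cmd nvidia app.cpp app_gpu
```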
Measuring performance using hpcrun
How you measure your application's performance with hpcrun depends on whether your program is dynamically or statically linked.
For dynamically linked applications:
srun [srun-options] hpcrun [hpcrun-options] <your_application> [app-arguments]
For statically linked applications, first link hpcrun’s monitoring code into the application using hpclink:
hpclink <linker> -o <your_application> <linker-arguments>
and then monitor the application by passing hpcrun options through environment variables:
export [hpcrun-options]
srun [srun-options] <your_application> [app-arguments]
To get a list of the environment variables that affect monitoring, use hpclink’s --help option:
hpclink --help
...
To control HPCToolkit's performance measurement library during an
application's execution, use the following environment variables:
  HPCRUN_EVENT_LIST=<event1>[@<period1>];...;<eventN>[@<periodN>] : Sampling event list; hpcrun -e/--event
  HPCRUN_TRACE=1 : Enable tracing; hpcrun -t/--trace
  HPCRUN_PROCESS_FRACTION=<f> : Measure only a fraction <f> of the execution's processes; hpcrun -f/-fp/--process-fraction
  HPCRUN_OUT_PATH=<outpath> : Set output directory; hpcrun -o/--output
...
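For a statically linked binary, the variables listed above take the place of hpcrun's command-line options. The sketch below builds the semicolon-separated HPCRUN_EVENT_LIST value and prints the resulting launch line as a dry run. The helper join_events, the event names and periods (CPUTIME@5000, PAPI_TOT_CYC@4000001), the srun option -n 4, the output path, and the binary name ./app are illustrative assumptions; which events are available depends on the installation.

```shell
# Join event specifications with ';', the separator HPCRUN_EVENT_LIST uses.
# join_events is a hypothetical helper; the subshell body keeps the IFS change local.
join_events() (
    IFS=';'
    printf '%s\n' "$*"
)

# Event names and periods are placeholders, not recommendations.
HPCRUN_EVENT_LIST="$(join_events "CPUTIME@5000" "PAPI_TOT_CYC@4000001")"
HPCRUN_TRACE=1                                  # corresponds to hpcrun -t/--trace
HPCRUN_OUT_PATH="hpctoolkit-app-measurements"   # corresponds to hpcrun -o/--output
export HPCRUN_EVENT_LIST HPCRUN_TRACE HPCRUN_OUT_PATH

# Dry run: show the settings and the launch line instead of calling srun.
echo "HPCRUN_EVENT_LIST=$HPCRUN_EVENT_LIST"
echo "srun -n 4 ./app"
```

When setting the event list by hand, the value needs to be quoted, because an unquoted semicolon ends the shell command.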
Recovering the program structure with hpcstruct
In this step the toolkit analyzes the application's binary and computes a representation of its static source code structure.
You can either run hpcstruct single-threaded or accelerate it using the -j n parameter, where n is the number of threads:
hpcstruct -j <n> <your_application>
hpcstruct saves this program structure in a file named <your_application>.hpcstruct, which is then passed to hpcprof in the next step.
On Cori-GPU, GPU-accelerated applications need an additional step to enable detailed analysis of GPU functions: computing program structure information for the application's GPU machine code (cubins for CUDA).
Use --gpucfg <yes|no> to control whether hpcstruct computes loop nesting structure for GPU machine code. Loop nesting structure is only useful when measurements are collected with PC sampling (-e gpu=nvidia,pc). The default is --gpucfg no, which is appropriate when PC sampling is not enabled.
# analyze CPU binaries
hpcstruct <your_application>
# analyze all the GPU binaries in the measurements directory
hpcstruct --gpucfg <yes|no> hpctoolkit-<your_application>-measurements
If your measurements directory contains fine-grained measurements collected using PC sampling, be aware that nvdisasm is very slow and fails to compute control flow graphs for some GPU functions. As recommended by the toolkit, before using the --gpucfg yes option, please refer to the notes in Section 12.5 of the HPCToolkit documentation.
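Put together, the structure-recovery step for a GPU run might look like the sketch below. It is a dry run that only prints commands; the helper hpctk_struct_cmds, the application name app, and the thread count are hypothetical, and the measurements directory name assumes the hpctoolkit-<your_application>-measurements pattern.

```shell
# Print the hpcstruct commands for an application and its measurements directory.
# hpctk_struct_cmds is a hypothetical helper.
# Arguments: <application> <threads> <gpucfg: yes|no>
hpctk_struct_cmds() {
    # CPU binary: -j sets the number of threads hpcstruct uses.
    echo "hpcstruct -j $2 $1"
    # GPU binaries recorded in the measurements directory;
    # --gpucfg yes is only useful for PC-sampling measurements (-e gpu=nvidia,pc).
    echo "hpcstruct --gpucfg $3 hpctoolkit-$1-measurements"
}

hpctk_struct_cmds app 8 no
```

Passing no as the last argument matches the default and avoids the slow control flow graph computation unless PC-sampling data actually needs it.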
Analyzing the profiles and mapping to the source code with hpcprof
Once you have the call path profile (dynamic measurement using hpcrun) and your static program structure (using hpcstruct), HPCToolkit combines them to interpret the profile and correlate it with the source code using either hpcprof or hpcprof-mpi. Both commands generate the same set of summary metrics over all threads and processes in an execution; hpcprof-mpi is designed to process (in parallel) the measurements from large-scale executions.
hpcprof-mpi must be run as a batch job with access to multiple MPI ranks, either on a single Cori node or on multiple nodes across the system.
To analyze the profile and generate a performance database:
hpcprof -S <your_application>.hpcstruct -I <your_application src>/+ \
    hpctoolkit-<your_application>-measurements1 [hpctoolkit-<your_application>-measurements2 ...]
hpcprof is capable of combining multiple measurement databases, as long as they are gathered against the same binary.
Parameters passed to hpcprof:
| Option | Description |
|---|---|
| -S/--structure | takes a program structure file; generated using hpcstruct |
| -I/--include | directory to application source code; |
| -O/--output | output database directory path. Default is |
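The steps described so far (measure, recover structure, attribute) can be strung together as in the sketch below. It is a dry run: the hypothetical helper hpctk_workflow only prints the commands, and the application name, source directory, event name, and srun options are placeholders. The output database directory is passed explicitly with -O so the example does not rely on the default.

```shell
# Print the measurement, structure, and attribution commands for one application.
# hpctk_workflow is a hypothetical helper; nothing is executed.
# Arguments: <application> <source_dir>
hpctk_workflow() {
    meas="hpctoolkit-$1-measurements"
    db="hpctoolkit-$1-database"
    # 1. Measure a dynamically linked application (CPUTIME is a placeholder event).
    echo "srun -n 4 hpcrun -e CPUTIME -o $meas ./$1"
    # 2. Recover static program structure from the binary.
    echo "hpcstruct $1"
    # 3. Correlate measurements with program structure and source code.
    echo "hpcprof -S $1.hpcstruct -I $2/+ -O $db $meas"
}

hpctk_workflow app ./src
```

For large-scale executions, the third line would use hpcprof-mpi inside a batch job instead of hpcprof.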
Presenting the performance profile using hpcviewer
To interactively view and analyze an HPCToolkit performance database generated in the previous step, use hpcviewer. Because hpcviewer is a GUI tool, we recommend using NoMachine / NX on Cori or ssh -Y when logging in to the system.
module load hpcviewer
hpcviewer hpctoolkit-<your_application>-database
To use hpcviewer to view the trace of sample events in a measurement database in addition to a profile, add the -t/--trace flag to hpcrun to enable tracing. Then follow the steps above to generate the performance database using hpcprof or hpcprof-mpi and open it in hpcviewer as shown above.
You can also install hpcviewer on your local system (download instructions), transfer the performance database to your machine, and view the profile interactively.
If your performance databases are large (on the order of hundreds of MB), we highly recommend downloading them to your local system and using the local version of hpcviewer to avoid performance lags. See here for NERSC-recommended file transfer options.