Parallelware Analyzer is a suite of command-line tools aimed at helping software developers build better-quality parallel software in less time. The tool can detect and fix defects related to parallelism with OpenMP and OpenACC, including data race conditions, which are very hard to detect and debug. It can also identify opportunities for OpenMP/OpenACC parallelization on CPUs and GPUs.
Parallelware Analyzer supports the C and C++ programming languages as well as multi-threading, SIMD vectorization and GPU offloading paradigms using both OpenMP and OpenACC.
Parallelware Analyzer provides several command-line tools for the key stages of the parallel development workflow:
- pwreport provides a structured report of the actionable items (defects, recommendations, remarks, opportunities for parallelization, ...) detected at the function and loop levels, followed by a code coverage summary and a performance metrics summary. You can control the amount of detail displayed, and you get clear suggestions on what your next actions should be, whether they are code changes or further invocations of Parallelware Analyzer to dig into more information.
- pwcheck performs data-race analysis, looking for defects such as race conditions, and issues recommendations on best practices. The resulting structured report shows the defects and recommendations detected for each function and loop in the code.
- pwloops provides insight into the parallel properties of loops found in the code which may constitute opportunities for OpenMP/OpenACC parallelism. Several sub-analyses are available, offering data-scoping insights, array memory footprints and access patterns, or the code annotated with parallelization opportunities.
- pwdirectives provides guided generation of parallel code for multicore CPUs and GPUs using multithreading, offloading, (loop-level) tasking or SIMD, either with OpenMP/OpenACC directives or with GCC/Clang/ICC compiler-specific directives.
## Using Parallelware Analyzer
You’ll need to load the pwanalyzer module:

```shell
module load pwanalyzer
```
You need to run Parallelware Analyzer on compute nodes in an interactive batch job. This is especially important when your code uses Cray MPI, Cray SHMEM, UPC, etc., as such code will fail to run on login nodes. Note also that running compute-intensive work on login nodes is against NERSC policy.
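For example, you might request an interactive compute node with something like the following (the QOS, constraint, and time values here are illustrative assumptions; check the NERSC documentation for the current settings for your system and allocation):

```shell
salloc --nodes=1 --qos=interactive --constraint=knl --time=30:00
```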
The following examples use a matrix multiplication example in C. You can find the code in the examples/matmul directory inside your Parallelware Analyzer installation. Copy the example to your working directory before starting.
This tool is under active development. As a result, the example commands below may show different output or results, depending on the version being used.
You should always start by invoking the pwreport tool on your code's hotspots. In this example, the hotspot is the matmul function located in the main.c source file. Invoke it as follows; note that the include directories must be specified in the command.
```
$ pwreport src/main.c:matmul -- -I src/include
Compiler flags: -I src/include

ACTIONS REPORT

  FUNCTION BEGIN at src/main.c:matmul:6:1
    LOOP BEGIN at src/main.c:matmul:8:5
      LOOP BEGIN at src/main.c:matmul:9:9
        2 remarks
        1 opportunity for parallelism (1 SIMD)
      LOOP END
      2 opportunities for parallelism (1 multi-threading and 1 offload)
    LOOP END
    LOOP BEGIN at src/main.c:matmul:15:5
      LOOP BEGIN at src/main.c:matmul:16:9
        LOOP BEGIN at src/main.c:matmul:17:13
        LOOP END
        1 recommendation and 3 remarks
      LOOP END
      1 recommendation and 4 remarks
      2 opportunities for parallelism (1 multi-threading and 1 offload)
    LOOP END
  FUNCTION END

CODE COVERAGE
  Analyzable files:     1 / 1  (100.00 %)
  Analyzable functions: 1 / 1  (100.00 %)
  Analyzable loops:     5 / 5  (100.00 %)
  Parallelized SLOCs:   0 / 17 (  0.00 %)

METRICS SUMMARY
  Total defects:         0
  Total recommendations: 2
  Total remarks:         9
  Total opportunities:   5
  Total data races:      0
  Total data-race-free:  0

SUGGESTIONS
  Use --level 1|2|3 to get more details, e.g:
      pwreport --level 2 src/main.c:matmul -- -I src/include

  If you want to get an overview of your whole codebase, not only the hotspot, you can use:
      pwreport --summary src -- -I src/include

1 file successfully analyzed and 0 failures in 148 ms
```
The hotspot analysis succeeds and a report is produced with the following sections:
- ACTIONS REPORT: structured report with actionable insights per function and loop.
- CODE COVERAGE: summary of how much code could be analyzed.
- METRICS SUMMARY: aggregated summary of the actionable insights detected in the analysis.
- SUGGESTIONS: general Parallelware Analyzer usage hints.
The CODE COVERAGE section shows that all the code was successfully analyzed, and the METRICS SUMMARY shows the different actionable insights detected. The ACTIONS REPORT provides a per-function and per-loop summary of those insights. As hinted in the SUGGESTIONS section at the end, you can add --level to increase the level of detail of the ACTIONS REPORT.
## Dig deeper into the actionable insights for your hotspots
Let's rerun pwreport with --level 3, the most detailed level. The output is verbose, but it even provides Parallelware Analyzer invocations that you can copy and paste. For instance, let's focus on the following excerpt from the output:
```
$ pwreport --level 3 src/main.c:matmul -- -I src/include
...
[OPP001] src/main.c:15:5 is a multi-threading opportunity
  SUGGESTION: use pwloops to get more details or pwdirectives to generate directives to parallelize it:
      pwloops --loop src/main.c:matmul:15:5 src/main.c -- -I src/include
      pwdirectives --omp multi src/main.c:matmul:15:5 --in-place -- -I src/include
...
```
You can see suggestions on how to use other tools of Parallelware Analyzer: pwloops to get details on the loop that constitutes an opportunity for parallelization, or pwdirectives to create a parallel version of the loop using multi-threading.
## Parallelize your hotspots
Let's give the latter a try and add multi-threading to the matrix computation. First, build and run matmul to see how long the sequential version takes to execute on a KNL node:
```
$ cc -I src/include -qopenmp src/matrix.c src/clock.c src/main.c -o matmul
$ ./matmul 1500
- Input parameters
    n = 1500
- Executing test...
    time (s)= 21.183979
size = 1500
chksum = 68432918175
```
Now copy the command suggested by pwreport (note that --in-place will modify the file; you can use -o matmul_omp.c instead to create a new file):
```
$ pwdirectives --omp multi src/main.c:matmul:15:5 --in-place -- -I src/include
Compiler flags: -I src/include

Results for file 'src/main.c':
  Successfully parallelized loop at 'src/main.c:matmul:15:5' [using multi-threading]:
      15:5: [ INFO ] Parallel forall: variable 'C'
      15:5: [ INFO ] Loop parallelized with multithreading using OpenMP directive 'for'
      15:5: [ INFO ] Parallel region defined by OpenMP directive 'parallel'
Successfully updated src/main.c
```
Build and run again to compare the performance. This time 272 OpenMP threads are used:
```
$ cc -I src/include -qopenmp src/matrix.c src/clock.c src/main.c -o matmul
$ export OMP_NUM_THREADS=272
$ ./matmul 1500
- Input parameters
    n = 1500
- Executing test...
    time (s)= 0.391569
size = 1500
chksum = 68432918175
```
On a Cori KNL node the execution went from 21 seconds to less than half a second: more than a 54x speedup.
Parallelware Analyzer is composed of several tools; pwreport is the link between all of them and will offer usage suggestions for different use cases. For instance, looking back at the previous example, you can see a suggestion to invoke:

```
pwloops --loop src/main.c:matmul:15:5 src/main.c -- -I src/include
```
Each tool composing Parallelware Analyzer has many different sub-analyses available. Use --help to get a listing of them along with the other options available. In general, pay attention to the suggestions in the more detailed levels of pwreport on what is available for each actionable insight.
## Analyzing files and directories
By default, you are required to provide a hotspot (either a function or a loop) to be analyzed. However, in many cases you need to analyze an entire file or directory, which you can do by passing --summary to pwreport. To avoid large outputs, the ACTIONS REPORT is not printed by default when --summary is used, unless --detail is also passed, for instance:

```
pwreport src --summary --detail -- -I src/include
```
All tools accept a configuration file through the --config argument. It can store compiler flags (such as -I src/include in the example) to be used when analyzing different files, integrate with build tools (e.g., to obtain the compiler flags from a JSON Compilation Database), or declare file dependencies to enable inter-procedural analysis across different source files. For more details, take a look at examples/config in the root folder of your Parallelware Analyzer installation.
## Integration with build tools
Supplying the required compiler flags can be a hassle. Parallelware Analyzer can consume a JSON Compilation Database, which can be generated with CMake or with tools such as bear that intercept the compilation commands of different build systems.

If you build the example using CMake, you will find a compile_commands.json file in the build directory. You can use the configuration file to instruct Parallelware Analyzer to use it or, if you don't need any other settings, pass it directly as the --config argument:

```
mkdir build
cd build
cmake ..
pwreport --config compile_commands.json ../src/main.c:matmul
```
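CMake only writes compile_commands.json when the corresponding option is enabled (it applies to Makefile and Ninja generators). A minimal CMakeLists.txt sketch for the example might look like the following; the project name and source list mirror the matmul example but are assumptions:

```cmake
cmake_minimum_required(VERSION 3.5)
project(matmul C)

# Ask CMake to emit compile_commands.json in the build directory
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

add_executable(matmul src/matrix.c src/clock.c src/main.c)
target_include_directories(matmul PRIVATE src/include)
```

Alternatively, enable the option on the command line with `cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..` without modifying CMakeLists.txt.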
For more details, take a look at examples/config in the root folder of your Parallelware Analyzer installation, as well as at the Using CMake's compilation database with Parallelware Analyzer blog post.
## Inter-procedural analysis across multiple files
Parallelware Analyzer supports inter-procedural analysis across multiple source files. This is required, for instance, when your hotspot invokes a function defined in another source file. In these cases, you will need to declare the file dependencies using the configuration file.
For more details, take a look at examples/config in the root folder of your Parallelware Analyzer installation, as well as at the Interprocedural analysis across source code files with Parallelware Analyzer blog post.
## Use Profiling Tools
Since the tool relies on static code pattern analysis to make OpenMP/OpenACC parallelization suggestions, it cannot predict how much performance improvement adopting the suggested changes will actually deliver. To assess the resulting performance, profile your code before and after the changes using profiling tools. If the suggested loop is not a performance hotspot, expect only minor gains. You are also expected to work further on optimizing your code (cache use optimizations, chunk scheduling, loop collapsing, etc.) with the help of a profiling tool.