Codee (previously known as Parallelware Analyzer) is a suite of command-line tools that automates code inspection from the performance perspective. Codee scans the source code without executing it and produces structured reports aimed at helping software developers build better-quality parallel software in less time. The tool can detect and fix defects related to parallelism with OpenMP and OpenACC, such as data races, which are very hard to detect and debug by hand. It can also identify opportunities for OpenMP/OpenACC parallelization on CPUs and GPUs.

Codee supports the C and C++ programming languages as well as multi-threading, SIMD vectorization and GPU offloading paradigms using both OpenMP and OpenACC.

Command-line tools

Codee provides several command-line tools for the key stages of the parallel development workflow:

  • pwreport provides a structured report displaying the actionable items (defects, recommendations, remarks, opportunities for parallelization, ...) detected at the function level and at the loop level, followed by a code coverage summary and a performance metrics summary. You can control the amount of detail to be displayed and you will get clear suggestions on what your next actions should be, whether they correspond to code changes or further invocations of Codee to dig into more information.
  • pwloops provides insight into the parallel properties of loops found in the code which may constitute opportunities for OpenMP/OpenACC parallelism. There are different sub-analyses available that offer data scoping insights, array memory footprint and access patterns or the code annotated with parallelization opportunities.
  • pwdirectives provides guided generation of parallel code for multicore CPUs and GPUs, with OpenMP or OpenACC, using multithreading, offloading, (loop-level) tasking or SIMD either with OpenMP or GCC/Clang/ICC compiler-specific directives.

Using Codee

You need to run your applications on compute nodes in a batch job. This is especially true when your code uses Cray MPI, Cray SHMEM, UPC, etc., as such code will fail to run on login nodes. Note also that running compute-intensive work on login nodes is against NERSC policy.

For this example, we start an interactive batch job on Perlmutter:

salloc -C gpu -A <GPU_allocation_account> -G 1 -N 1 -t 30 -q interactive

You’ll need to load the codee module:

module load codee

The following examples use a matrix multiplication example in C. You can find the code in the examples/matmul directory inside your Codee installation.

Copy the example to your working directory. To build, run:



This tool is under active development. As a result, the example commands below may produce different output or results depending on the version being used.

Analyze hotspots

pwreport is the starting point in most use cases. It provides the entry-level reports, notably the --evaluation report, which gives high-level metrics, and the --actions report, which shows the detected actionable insights per function and loop.

You should always start by invoking pwreport on your hotspots. In this example, the hotspot is the matmul function in the main.c source file. Invoke it as follows; note that the compiler flags needed to find included header files must be passed after the -- separator.

$ pwreport --evaluation --include-tags all src/main.c:matmul -- -I src/include
Compiler flags: -I src/include

Target            Lines of code Analyzed lines Analysis time # actions Effort Cost    Profiling
----------------- ------------- -------------- ------------- --------- ------ ------- ---------
src/main.c:matmul 55            14             57 ms         8         64 h   2094€   n/a

Target            Serial scalar Serial control Serial memory Vectorization Multithreading Offloading
----------------- ------------- -------------- ------------- ------------- -------------- ----------
src/main.c:matmul 0             0              3             3             1              1

Target : analyzed directory or source code file
Lines of code : total lines of code found in the target (computed the same way as the sloccount tool)
Analyzed lines : relevant lines of code successfully analyzed
Analysis time : time required to analyze the target
# actions : total actionable items (opportunities, recommendations, defects and remarks) detected
Effort : estimated number of hours it would take to carry out all actions (serial scalar, serial control, serial memory, vectorization, multithreading and offloading with 1, 2, 4, 8, 12 and 16 hours respectively)
Cost : estimated cost in euros to carry out all the actions, paying the average salary of 56,286€/year for a professional C/C++ developer working 1720 hours per year
Profiling : estimation of overall execution time required by this target

  You can specify multiple inputs which will be displayed as multiple rows (ie. targets) in the table, eg:
        pwreport --evaluation some/other/dir --include-tags all src/main.c:matmul -- -I src/include

  Use --actions to find out details about the detected actions:
        pwreport --actions --include-tags all src/main.c:matmul -- -I src/include

  You can focus on a specific optimization type by filtering by its tag (serial-scalar, serial-control, serial-memory, vectorization, multithreading, offloading), eg.:
        pwreport --actions --include-tags serial-scalar src/main.c:matmul -- -I src/include

1 file successfully analyzed and 0 failures in 57 ms

The entry-level performance optimization report lists the total number of actions found in the code, as well as the total number of lines of code analyzed and the time the tool needed to inspect the code. In addition, it breaks down the total number of actions by step of the performance optimization process: sequential optimization, memory optimization, vectorization, multithreading, and offloading to accelerator devices such as GPUs. Finally, the report suggests follow-up command-line invocations to guide the developer through the performance optimization process.
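The Effort and Cost columns can be reproduced by hand from the legend in the output above. The following is a minimal sketch of that arithmetic, not Codee's implementation: the function names effort_hours and cost_euros are hypothetical, and rounding to the nearest euro is an assumption. The per-category hours and the salary figures are the ones quoted in the report legend.

```c
#include <assert.h>
#include <stddef.h>

/* Hours per action for each category, in the order used by the report legend:
   serial scalar, serial control, serial memory, vectorization,
   multithreading, offloading. */
static const int HOURS_PER_ACTION[6] = {1, 2, 4, 8, 12, 16};

/* Salary figures quoted in the report legend. */
static const double SALARY_EUR_PER_YEAR = 56286.0;
static const double WORK_HOURS_PER_YEAR = 1720.0;

/* Total estimated effort in hours for the given per-category action counts. */
int effort_hours(const int counts[6]) {
    int total = 0;
    for (size_t c = 0; c < 6; c++) {
        assert(counts[c] >= 0);
        total += counts[c] * HOURS_PER_ACTION[c];
    }
    return total;
}

/* Estimated cost in euros for the given number of hours.
   Assumption: the report rounds to the nearest euro. */
long cost_euros(int hours) {
    assert(hours >= 0);
    double cost = hours * SALARY_EUR_PER_YEAR / WORK_HOURS_PER_YEAR;
    return (long)(cost + 0.5);
}
```

For the matmul row above, the counts are 0, 0, 3, 3, 1 and 1, so the effort is 3*4 + 3*8 + 1*12 + 1*16 = 64 hours. At 56,286/1720 (about 32.72 euros per hour), that comes to roughly 2094 euros, matching the table.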

Typically, the next step is invoking the pwreport tool with --actions to show the details about the actions detected in the code. The command-line invocation is as follows:

$ pwreport --actions --include-tags all src/main.c:matmul -- -I src/include
Compiler flags: -I src/include


  FUNCTION BEGIN at src/main.c:matmul:6:1
    6: void matmul(size_t m, size_t n, size_t p, double **A, double **B, double **C) {

    LOOP BEGIN at src/main.c:matmul:8:5
      8:     for (size_t i = 0; i < m; i++) {

      LOOP BEGIN at src/main.c:matmul:9:9
        9:         for (size_t j = 0; j < n; j++) {

        [RMK011] src/main.c:9:9 the vectorization cost model states the loop might benefit from explicit vectorization

        [OPP002] src/main.c:9:9 is a SIMD opportunity
      LOOP END

    LOOP BEGIN at src/main.c:matmul:15:5
      15:     for (size_t i = 0; i < m; i++) {

      LOOP BEGIN at src/main.c:matmul:16:9
        16:         for (size_t j = 0; j < n; j++) {

        LOOP BEGIN at src/main.c:matmul:17:13
          17:             for (size_t k = 0; k < p; k++) {

          [PWR010] src/main.c:17:13 'B' multi-dimensional array not accessed in row-major order
          [RMK010] src/main.c:17:13 the vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
        LOOP END
        [PWR039] src/main.c:16:9 consider loop interchange to improve the locality of reference and enable vectorization
      LOOP END
      [PWR035] src/main.c:15:5 avoid non-consecutive array access for variables 'A', 'B' and 'C' to improve performance

      [OPP001] src/main.c:15:5 is a multi-threading opportunity
      [OPP003] src/main.c:15:5 is an offload opportunity

  Analyzable files:            1 / 1     (100.00 %)
  Analyzable functions:        1 / 1     (100.00 %)
  Analyzable loops:            5 / 5     (100.00 %)
  Parallelized SLOCs:          0 / 14    (  0.00 %)

  Total recommendations:         3
  Total opportunities:           3
  Total defects:                 0
  Total remarks:                 2


  Use --level 0|1|2 to get more details, e.g:
        pwreport --level 2 --actions --include-tags all src/main.c:matmul -- -I src/include

  3 recommendations were found in your code, get more information with pwreport:
        pwreport --actions --include-tags pwr src/main.c:matmul -- -I src/include

  3 opportunities for parallelization were found in your code, get more information with pwloops:
        pwloops src/main.c:matmul -- -I src/include

  More details on the defects, recommendations and more in the Knowledge Base:

1 file successfully analyzed and 0 failures in 20 ms

The hotspot analysis succeeds and produces a report with the following sections:

  • ACTIONS REPORT: structured report with actionable insights per function and loop.
  • CODE COVERAGE: summary of how much code could be analyzed.
  • METRICS SUMMARY: aggregated summary of the actionable insights detected in the analysis.
  • SUGGESTIONS: general Codee usage hints.

In our matmul example, the Codee output shows that the source file was analyzed successfully (0 failures) and that all 5 loops of the code were analyzable. In total, Codee reported 3 opportunities for parallelization and 3 recommendations from the open catalog of best practices for performance optimization, covering memory optimization, vectorization, multithreading and offloading. As suggested by the tool, you can add --level to increase the level of detail of the report.
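To make the PWR010 and PWR039 recommendations concrete, here is a hand-written sketch of what a loop interchange could look like for this kernel. It is not Codee output. The function name matmul_interchanged is hypothetical, and the sketch assumes that the first loop nest of the original function (lines 8 and 9 in the report) zero-initializes C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of matmul with the j and k loops interchanged
   (i-k-j order instead of i-j-k). */
void matmul_interchanged(size_t m, size_t n, size_t p,
                         double **A, double **B, double **C) {
    assert(A != NULL && B != NULL && C != NULL);

    /* Assumption: C starts from zero, as in a plain matrix product. */
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            C[i][j] = 0.0;
        }
    }

    for (size_t i = 0; i < m; i++) {
        for (size_t k = 0; k < p; k++) {
            const double a_ik = A[i][k]; /* invariant in the inner loop */
            for (size_t j = 0; j < n; j++) {
                /* B[k][j] and C[i][j] are both walked along a row,
                   i.e. with unit stride in row-major storage. */
                C[i][j] += a_ik * B[k][j];
            }
        }
    }
}
```

In the original i-j-k order, the innermost loop steps through B[k][j] with k varying, which jumps from row to row. After the interchange the innermost loop varies j, so B and C are read and written consecutively. For each element C[i][j] the products are still accumulated in increasing k order, so the result should match the original loop order.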

Dig deeper into the actionable insights for your hotspots

Try adding --level 2, the most detailed level. The output is very verbose, but it includes Codee invocations that you can copy and paste. For instance, let's focus on the following excerpt from the output:

$ pwreport --actions --level 2 --include-tags all src/main.c:matmul -- -I src/include
      [OPP003] src/main.c:15:5 is an offload opportunity
        Compute patterns:
          - 'forall' over the variable 'C'

        SUGGESTION: use pwloops to get more details or pwdirectives to generate directives to parallelize it:
          pwloops src/main.c:matmul:15:5 -- -I src/include
          pwdirectives --omp offload src/main.c:matmul:15:5 --in-place -- -I src/include

        More information on:

You can see suggestions on how to use other Codee command-line tools: use pwloops to get more detailed information about the loop or pwdirectives to actually rewrite the code using offloading in this example.

Optimize the performance of your hotspots

Let's try the latter and add OpenACC offloading to the matrix computation. First, build and run matmul to see how long the sequential version takes to execute:

$ nvc -I src/include src/matrix.c src/clock.c src/main.c -o matmul
$ srun -n 1 ./matmul 1500
- Input parameters
n    = 1500
- Executing test...
time (s)= 12.826260
size    = 1500
chksum    = 68432918175

Now adapt the command suggested by pwreport to generate OpenACC directives (--acc). Note that --in-place modifies the source file; here we use -o src/main_acc.c instead to create a new file:

$ pwdirectives --acc src/main.c:matmul:15:5 -o src/main_acc.c -- -I src/include
Compiler flags: -I src/include

Results for file 'src/main.c':
  Successfully parallelized loop at 'src/main.c:matmul:15:5' [using offloading without teams]:
      [INFO] src/main.c:15:5 Parallel forall: variable 'C'
      [INFO] src/main.c:15:5 Parallel region defined by OpenACC directive 'parallel'
      [INFO] src/main.c:15:5 Loop parallelized with OpenACC directive 'loop'
      [INFO] src/main.c:15:5 Data region for host-device data transfers defined by OpenACC directive 'data'
Successfully created src/main_acc.c

Minimum software stack requirements: OpenACC version 2.0 with offloading capabilities

The modified code is as follows:

$ cat src/main_acc.c
    #pragma acc data copyin(A[0:m][0:p], B[0:p][0:n], m, n, p) copy(C[0:m][0:n])
    {
    #pragma acc parallel
    {
    #pragma acc loop
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < p; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    } // end parallel
    } // end data

Build and run again to compare the performance, using the default CUDA thread block configuration chosen by the system for this particular problem (128 CUDA threads per thread block, and 12 thread blocks):

$ nvc -acc -fast -gpu=cc80 -I src/include src/matrix.c src/clock.c src/main_acc.c -o matmulAcc
$ srun -n 1 -G 1 ./matmulAcc 1500
- Input parameters
n    = 1500
- Executing test...
time (s)= 1.277405
size    = 1500
chksum    = 68432918175

On a Perlmutter node with 1 GPU, the execution time went from 12.8 seconds to less than 1.3 seconds, roughly a 10x speedup.
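The same loop was also reported as a multi-threading opportunity (OPP001). For comparison with the OpenACC version, the following is a hand-written sketch of how that opportunity might be expressed with OpenMP on the CPU. It is not the output of pwdirectives, and the function name matmul_omp is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-written multithreaded variant of the loop nest
   at src/main.c:15:5; not generated by Codee. */
void matmul_omp(size_t m, size_t n, size_t p,
                double **A, double **B, double **C) {
    assert(A != NULL && B != NULL && C != NULL);

    /* Each iteration of i writes only to row C[i], so the iterations
       are independent and can be distributed across threads.
       Loop variables declared inside the loops are private per thread. */
    #pragma omp parallel for
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < p; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

The pragma only takes effect when OpenMP is enabled at compile time (for example -fopenmp with GCC or Clang, or -mp with nvc); otherwise the loop simply runs serially. This corresponds to the "forall over the variable C" pattern that Codee reports for the loop.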

Other analyses

Each Codee tool has many different sub-analyses available. Use --help to list them along with the other available options.

In general, you should pay attention to the suggestions in the more detailed level of pwreport on what is available for each actionable insight.

Integration with build tools

Supplying the required compiler flags for Codee to analyze your source code successfully can be a hassle (e.g., flag -I to include header files, flag -D to define compilation symbols). Codee supports several mechanisms for the user to provide compilation flags, the recommended option being the usage of a JSON Compilation Database. This can be generated using CMake or with tools such as bear that intercept compilation commands from different build systems.

If you build the example using CMake with -DCMAKE_EXPORT_COMPILE_COMMANDS=ON, you will find a compile_commands.json file in the build directory. You can point to it from a Codee configuration file or, if you don't need any other settings, pass it directly to --config:

mkdir build
cd build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
pwreport --config compile_commands.json ../src/main.c:matmul

For more details, take a look at docs/ and examples/config in the root folder of your Codee installation.

Use Profiling Tools

Because the tool relies on static analysis of code patterns to make its OpenMP/OpenACC parallelization suggestions, it does not know how much performance improvement the suggested changes will actually deliver. To assess the resulting performance, you need to profile the code with profiling tools before and after the changes. If the parallelized code was not a performance hotspot, you should expect only minor gains. You will typically need to optimize the code further (cache use optimizations, chunk scheduling, loop collapsing, etc.) with the help of a profiling tool.
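Before reaching for a full profiler, a quick wall-clock measurement around the kernel can serve as a first sanity check. The sketch below is one possible way to do that in standard C11; it is not part of Codee or of the matmul example, and the names elapsed_seconds, time_kernel and speedup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Seconds elapsed between two timestamps. */
double elapsed_seconds(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Wall-clock time in seconds taken by one call to kernel().
   timespec_get with TIME_UTC is standard C11; it reads the calendar
   clock, so very short kernels should be repeated several times. */
double time_kernel(void (*kernel)(void)) {
    assert(kernel != NULL);
    struct timespec start, end;
    timespec_get(&start, TIME_UTC);
    kernel();
    timespec_get(&end, TIME_UTC);
    return elapsed_seconds(start, end);
}

/* Speedup of the optimized run relative to the baseline run. */
double speedup(double baseline_s, double optimized_s) {
    assert(optimized_s > 0.0);
    return baseline_s / optimized_s;
}
```

Applied to the timings reported earlier, speedup(12.826260, 1.277405) gives a factor of about 10. A measurement like this only tells you whether the change helped overall; a profiler is still needed to see where the remaining time is spent.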