
Performance of C++ Parallel Programming Models on Perlmutter using Lulesh

In this study, we evaluate Lulesh performance on Perlmutter with several C++ parallel programming models: OpenMP, HPX, Kokkos, and NVC++ stdpar. The applications are compiled with gcc@11.2.0, clang@16.0.0, and nvhpc@22.9.

Lulesh (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) is a widely used proxy application that assesses how efficiently parallel computing architectures solve the partial differential equations of explicit shock hydrodynamics. For further details about Lulesh, please refer to https://asc.llnl.gov/codes/proxy-apps/lulesh.

If you are interested in any of these C++ parallel programming models or would like a performance report, please feel free to contact us via help.nersc.gov.
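For readers unfamiliar with the stdpar model, the short snippet below illustrates what a C++ standard parallel algorithm looks like. It is a minimal sketch with made-up data, not code from the Lulesh source; with nvc++ -stdpar such loops can run across CPU cores or on the GPU, while the OpenMP, HPX, and Kokkos ports express the same parallelism through their own APIs.

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    // Illustrative element-wise update in the style used by stdpar ports.
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), x.begin(),
                   [](double xi, double yi) { return xi + 0.5 * yi; });

    // A parallel reduction, e.g. for an energy or residual sum.
    double sum = std::reduce(std::execution::par_unseq, x.begin(), x.end(), 0.0);
    return sum > 0.0 ? 0 : 1;
}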

Performance results

CPU-based Performance

Lulesh benchmark with problem size 30

Lulesh benchmark with problem size 60

Lulesh benchmark with problem size 90

GPU-based Performance

Lulesh benchmark with nvhpc gpu (there is no control over the number of threads for the NVC++ -stdpar=gpu version)

Source code used in this study

This study uses the following open-source repositories; build instructions are provided in each repository.

Lulesh OpenMP version

Lulesh-OpenMP

Lulesh HPX version

Lulesh-HPX

Lulesh Kokkos version

Lulesh-Kokkos

Lulesh NVC++ version

Lulesh-nvc stdpar

Notes:

    1. To obtain correct computation results with the NVC++ version, the changes in https://github.com/LLNL/LULESH/pull/24 must be applied to the original source code.
    2. To enable multi-threaded execution with the NVC++ version, the extra C++ flag --gcc-toolchain is needed, for example --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc (see the sketch of compile commands after this list). The NVC++ -stdpar=gpu version does not provide control over the number of threads.
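For reference, the two NVC++ binaries used in the run script below could be built roughly as follows. This is only a sketch: the source file list, the -O3/-std=c++17/-DUSE_MPI=0 flags, and the output names are assumptions made for illustration; the authoritative build instructions are in the Lulesh-nvc stdpar repository linked above.

# Multicore (CPU) build; --gcc-toolchain points nvc++ at the GNU toolchain from Note 2.
nvc++ -O3 -std=c++17 -DUSE_MPI=0 -stdpar=multicore \
    --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc \
    lulesh*.cc -o multicoreLulesh2.0

# GPU build; -stdpar=gpu offloads the parallel algorithms (no thread-count control).
nvc++ -O3 -std=c++17 -DUSE_MPI=0 -stdpar=gpu lulesh*.cc -o gpuLulesh2.0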

Example Run Scripts

#!/bin/bash

# Replace <project_id> with your project account (#SBATCH lines are not shell-expanded)
#SBATCH -A <project_id>

#SBATCH -C gpu
#SBATCH -t 10:00:00
#SBATCH -q regular
#SBATCH -N 1
#SBATCH --ntasks-per-node=1

#SBATCH -o lulesh.out
#SBATCH -e lulesh.err

# Sweep all problem sizes and OpenMP thread counts for each build.
# OMP_PLACES=threads defines one place per hardware thread and OMP_PROC_BIND=spread
# distributes the OpenMP threads across those places, so the OpenMP, Kokkos, and
# nvc++ multicore runs get comparable thread placement.
for SIZE in 30 60 90 
do
    for NUM_THREADS in 1 2 4 8 16 32 64 128 
    do
        echo "running ref_gcc_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./ref_gcc_openmp -s $SIZE 

        echo "running ref_clang_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./ref_clang_openmp -s $SIZE  

        echo "running hpx_gcc with $SIZE workload and $NUM_THREADS" threads
        ./hpx_gcc -s $SIZE --hpx:threads=$NUM_THREADS

        echo "running hpx_clang with $SIZE workload and $NUM_THREADS" threads
        ./hpx_clang -s $SIZE  --hpx:threads=$NUM_THREADS

        echo "running kokkos_gcc_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./kokkos_gcc_openmp -s $SIZE  

        echo "running kokkos_clang_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./kokkos_clang_openmp -s $SIZE  

        echo "running lulesh nvc++ multicore with $NUM_THREADS threads and workload $SIZE"
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread  OMP_PLACES=threads ./multicoreLulesh2.0 -s $SIZE 
        echo ""
    done

    echo "running lulesh nvc++ gpu with workload $SIZE"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread  OMP_PLACES=threads ./gpuLulesh2.0 -s $SIZE 
    echo ""

    echo "finished running $SIZE workload size"
done