
Performance of C++ Parallel Programming Models on Perlmutter using Lulesh

In this study, we evaluate Lulesh performance on Perlmutter with several C++ parallel programming models: OpenMP, HPX, Kokkos, and NVC++ stdpar. The applications are compiled with gcc@11.2.0, clang@16.0.0, and nvhpc@22.9.

Lulesh (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) is a widely used proxy application that assesses how efficiently parallel computing architectures solve the partial differential equations of explicit shock hydrodynamics. For further details about Lulesh, please refer to https://asc.llnl.gov/codes/proxy-apps/lulesh.

If you are interested in any of these C++ parallel programming models or would like a performance report, please feel free to contact us via help.nersc.gov.
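For readers unfamiliar with the stdpar model, the short snippet below illustrates what a C++ standard parallel algorithm looks like. It is a minimal sketch with made-up data, not code from the Lulesh source; with nvc++ -stdpar such loops can run across CPU cores or on the GPU, while the OpenMP, HPX, and Kokkos ports express the same parallelism through their own APIs.

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    // Illustrative element-wise update in the style used by stdpar ports.
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), x.begin(),
                   [](double xi, double yi) { return xi + 0.5 * yi; });

    // A parallel reduction, e.g. for an energy or residual sum.
    double sum = std::reduce(std::execution::par_unseq, x.begin(), x.end(), 0.0);
    return sum > 0.0 ? 0 : 1;
}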

Performance results

CPU-based Performance

Lulesh benchmark with problem size 30

Lulesh benchmark with problem size 60

Lulesh benchmark with problem size 90

GPU-based Performance

Lulesh benchmark with nvhpc gpu (there is no control over the number of threads for the NVC++ -stdpar=gpu version)

Source code used in this study

This study uses the following open-source repositories; build instructions are provided in each repository.

Lulesh OpenMP version

Lulesh-OpenMP

Lulesh HPX version

Lulesh-HPX

Lulesh Kokkos version

Lulesh-Kokkos

Lulesh NVC++ version

Lulesh-nvc stdpar

Notes:

    1. To obtain correct computation results with the NVC++ version, the changes in https://github.com/LLNL/LULESH/pull/24 must be applied to the original source code.
    2. To enable multi-threaded execution with the NVC++ version, the extra C++ flag --gcc-toolchain is needed, for example --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc (see the sketch of compile commands after this list). The NVC++ -stdpar=gpu version does not provide control over the number of threads.
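For reference, the two NVC++ binaries used in the run script below could be built roughly as follows. This is only a sketch: the source file list, the -O3/-std=c++17/-DUSE_MPI=0 flags, and the output names are assumptions made for illustration; the authoritative build instructions are in the Lulesh-nvc stdpar repository linked above.

# Multicore (CPU) build; --gcc-toolchain points nvc++ at the GNU toolchain from Note 2.
nvc++ -O3 -std=c++17 -DUSE_MPI=0 -stdpar=multicore \
    --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc \
    lulesh*.cc -o multicoreLulesh2.0

# GPU build; -stdpar=gpu offloads the parallel algorithms (no thread-count control).
nvc++ -O3 -std=c++17 -DUSE_MPI=0 -stdpar=gpu lulesh*.cc -o gpuLulesh2.0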

Example Run Scripts

#!/bin/bash

# Replace <project_id> with your project account (#SBATCH lines are not shell-expanded)
#SBATCH -A <project_id>

#SBATCH -C gpu
#SBATCH -t 10:00:00
#SBATCH -q regular
#SBATCH -N 1
#SBATCH --ntasks-per-node=1

#SBATCH -o lulesh.out
#SBATCH -e lulesh.err

# Sweep all problem sizes and OpenMP thread counts for each build.
# OMP_PLACES=threads defines one place per hardware thread and OMP_PROC_BIND=spread
# distributes the OpenMP threads across those places, so the OpenMP, Kokkos, and
# nvc++ multicore runs get comparable thread placement.
for SIZE in 30 60 90 
do
    for NUM_THREADS in 1 2 4 8 16 32 64 128 
    do
        echo "running ref_gcc_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./ref_gcc_openmp -s $SIZE 

        echo "running ref_clang_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./ref_clang_openmp -s $SIZE  

        echo "running hpx_gcc with $SIZE workload and $NUM_THREADS" threads
        ./hpx_gcc -s $SIZE --hpx:threads=$NUM_THREADS

        echo "running hpx_clang with $SIZE workload and $NUM_THREADS" threads
        ./hpx_clang -s $SIZE  --hpx:threads=$NUM_THREADS

        echo "running kokkos_gcc_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./kokkos_gcc_openmp -s $SIZE  

        echo "running kokkos_clang_openmp with $SIZE workload and $NUM_THREADS" threads
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./kokkos_clang_openmp -s $SIZE  

        echo "running lulesh nvc++ multicore with $NUM_THREADS threads and workload $SIZE"
        OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread  OMP_PLACES=threads ./multicoreLulesh2.0 -s $SIZE 
        echo ""
    done

    echo "running lulesh nvc++ gpu with workload $SIZE"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread  OMP_PLACES=threads ./gpuLulesh2.0 -s $SIZE 
    echo ""

    echo "finished running $SIZE workload size"
done