ISO C++ Parallel STL Benchmark on Perlmutter

This is a brief analysis of C++ parallel algorithms on Perlmutter. This page provides a performance summary and sample code. For details (source code and instructions for building and running), please refer to the GitHub repository: https://github.com/weilewei/parSTL.

This benchmark focuses on the C++ parallel transform and sort algorithms, which are available in several parallel frameworks and toolchains, such as HPX, Kokkos, Intel TBB, GNU, and NVHPC. To conduct the benchmark, a vector of random numbers is first allocated and then processed by each parallel algorithm. We demonstrate how parallel Standard Template Library (STL) algorithms can be used on Perlmutter and how well the different implementations perform.
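
The setup described above (allocate a vector of random numbers, then time an algorithm over it) could be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the helper names `make_random_vector`, `time_ms`, and `time_sequential_transform`, the seed, and the value range are all assumptions made for this example, and the timed call is the plain sequential `std::transform` so that the sketch builds without any parallel backend.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill a vector with uniformly distributed random doubles.
// The seed and the [0, 1) range are assumptions for this sketch.
std::vector<double> make_random_vector(std::size_t length, unsigned seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> v(length);
    std::generate(v.begin(), v.end(), [&] { return dist(rng); });
    return v;
}

// Hypothetical helper: run `algorithm` once and return the elapsed
// wall-clock time in milliseconds.
template <typename Algorithm>
double time_ms(Algorithm&& algorithm) {
    const auto start = std::chrono::steady_clock::now();
    algorithm();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Sequential baseline: apply std::tan to every element in place and time it.
double time_sequential_transform(std::vector<double>& workVec) {
    assert(!workVec.empty());
    return time_ms([&] {
        std::transform(workVec.begin(), workVec.end(), workVec.begin(),
                       [](double arg) { return std::tan(arg); });
    });
}
```

To time one of the parallel variants instead, the same `time_ms` helper could wrap a call that passes an execution policy such as `std::execution::par` (from `<execution>`) as the first argument of `std::transform`.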

If you're interested in any pSTL algorithm or need a performance report, feel free to contact us via help.nersc.gov.

Parallel Transform and Sort on Perlmutter

Parallel Transform with gcc/clang/nvhpc/gnu on Perlmutter

[Figure: Transform with GCC@11.2.0]

[Figure: Transform with Clang@16.0.0]

[Figure: Transform with NVC++@22.7 Multicore]

Parallel Sort with gcc/clang/nvhpc/gnu on Perlmutter

[Figure: Sort with GCC@11.2.0]

[Figure: Sort with Clang@16.0.0]

[Figure: Sort with NVC++@22.7 Multicore]

Transform and sort for applications that have no control over thread count

[Figure: Transform with NVC++ GPU, standard C++ with TBB, and gcc/clang]

[Figure: Sort with NVC++ GPU, standard C++ with TBB, and gcc/clang]

Example code

  • Standard C++ parallel transform and NVC++ parallel transform
// Requires <algorithm>, <cmath>, and <execution> (C++17).
// Sequential execution
std::transform(std::execution::seq, workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });

// Execution may be parallelized
std::transform(std::execution::par, workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });

// Execution may be parallelized and vectorized
std::transform(std::execution::par_unseq, workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });
  • HPX parallel transform
hpx::transform(hpx::execution::seq, workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });
hpx::transform(hpx::execution::par, workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });
hpx::transform(hpx::execution::par_unseq, workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });
  • Kokkos parallel transform
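// workVec is assumed to be a one-dimensional Kokkos::View here (indexed with parentheses), not a std::vector.
Kokkos::parallel_for("kokkos::parallel_for transform optimized version",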
  Kokkos::RangePolicy<Kokkos::IndexType<int>, Kokkos::Schedule<Kokkos::Dynamic>>
  (0, length), KOKKOS_LAMBDA (const int& i) {
    workVec(i) = std::tan(workVec(i));
});
  • GNU parallel transform
// Requires <parallel/algorithm> and compiling with -fopenmp.
__gnu_parallel::transform(workVec.begin(), workVec.end(),
    workVec.begin(), [](double arg){ return std::tan(arg); });
  • Taskflow parallel transform
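tf::Executor executor(num_threads);

tf::Taskflow t1;
t1.for_each(workVec.begin(), workVec.end(), [](double& arg) {
    arg = std::tan(arg); });
// Building the task graph does not execute it; submit it and wait for completion.
executor.run(t1).wait();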
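
The snippets above cover only transform. For the sort side of the benchmark, one way to check that a given implementation produces a correct result is to compare it against a sequential reference. The sketch below is an illustration under stated assumptions, not code from the repository: `sort_and_check` is a hypothetical helper name, and the sorter passed to it stands in for whichever sort implementation is being benchmarked.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: sort a copy of the input with the supplied sorter
// and report whether the result matches a sequential std::sort reference.
// `sorter` is any callable taking std::vector<double>& and sorting it in place.
template <typename Sorter>
bool sort_and_check(std::vector<double> data, Sorter sorter) {
    assert(!data.empty());
    std::vector<double> reference = data;
    std::sort(reference.begin(), reference.end());  // sequential reference
    sorter(data);                                   // implementation under test
    return data == reference;
}
```

A parallel variant could be checked by passing a lambda that calls `std::sort(std::execution::par, v.begin(), v.end())`; with GCC's libstdc++ the parallel policies typically require linking against TBB.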