Performance variability

There are many potential sources of variability on an HPC system and NERSC has identified the following best practices to mitigate variability and improve application performance.

Hugepages

Use of hugepages can reduce the cost of accessing memory, especially in the case of many MPI_Alltoall operations.

  1. Load the hugepages module (module load craype-hugepages2M).
  2. Recompile your code.
  3. Add module load craype-hugepages2M to batch scripts.
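
For example, a minimal build-and-run workflow could look like the following (my_program.c and my_program.x are placeholder names; use the Cray compiler wrapper appropriate for your code):

# load the hugepages module before compiling; the compiler wrappers link against it
module load craype-hugepages2M
cc -o my_program.x my_program.c

# in the batch script, load the same module before launching the application
module load craype-hugepages2M
srun ./my_program.x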

Note

Consider adding module load craype-hugepages2M to ~/.bashrc.

For more details see the manual pages (man intro_hugepages).

Location of executables

Compilation of executables should be done in $HOME or /tmp. Executables can be copied into the compute node memory at the start of a job with sbcast to greatly improve job startup times and reduce run-time variability in some cases.
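
For example, a job script could stage the executable into each compute node's /tmp before launching it (my_program.x is a placeholder name; the combined example at the end of this page shows the full context):

sbcast --compress ./my_program.x /tmp/my_program.x
srun /tmp/my_program.x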

For applications with dynamically linked executables and many libraries (especially Python-based applications), use Shifter.
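
A minimal Shifter sketch, assuming an image such as myrepo/myapp:latest has already been pulled with shifterimg (the image name and script are placeholders):

#SBATCH --image=docker:myrepo/myapp:latest
srun -n 32 shifter python3 ./myscript.py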

Network Congestion

Other communication-intensive workloads running at the same time as your job can cause variation in the time your application spends on communication. Cray MPI environment variables can be set to change the strategy the system uses to route messages in your job; the Network page provides more details on these environment variables.

Affinity

Running with correct affinity and binding options can greatly affect variability.

  • use at least 8 ranks per node (1 rank per node cannot utilize the full network bandwidth)
  • read man intro_mpi for additional options
  • check the job script generator to get correct binding options
  • use check-mpi.<compiler>.pm and check-hybrid.<compiler>.pm, where <compiler> can be gnu, nvidia, or cce, to check affinity settings, as in the example below
elvis@perlmutter$ salloc -N 2 -C cpu -q interactive -t 10:00
salloc: Granted job allocation 9887582
salloc: Waiting for resource configuration
salloc: Nodes nid[004434,005440] are ready for job
elvis@nid004434$ srun -n 8 -c 64 --cpu-bind=cores check-mpi.gnu.pm | sort -nk 4
Hello from rank 0, on nid004434. (core affinity = 0-31,128-159)
Hello from rank 1, on nid004434. (core affinity = 64-95,192-223)
Hello from rank 2, on nid004434. (core affinity = 32-63,160-191)
Hello from rank 3, on nid004434. (core affinity = 96-127,224-255)
Hello from rank 4, on nid005440. (core affinity = 0-31,128-159)
Hello from rank 5, on nid005440. (core affinity = 64-95,192-223)
Hello from rank 6, on nid005440. (core affinity = 32-63,160-191)
Hello from rank 7, on nid005440. (core affinity = 96-127,224-255)
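
For hybrid MPI+OpenMP codes, check-hybrid.gnu.pm can be run the same way. A minimal sketch (the thread count and placement settings below are only illustrative):

export OMP_NUM_THREADS=8
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
srun -n 8 -c 64 --cpu-bind=cores check-hybrid.gnu.pm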

Core specialization

Using core specialization (#SBATCH -S n or #SBATCH --core-spec=n), where n is the number of cores to dedicate to the OS, moves OS functions onto cores not in use by user applications. The flag only works in a batch script with sbatch; it cannot be requested as a flag with salloc for interactive jobs, since salloc is already a wrapper script for srun.

The example shows 1 core per node on Perlmutter CPU reserved for the OS and the other 127 available to the application. Note that, when computing the -c (or --cpus-per-task) value using the formula provided on the affinity page, cores reserved for the OS should be excluded from the numerator. With 32 tasks spread over 2 nodes (16 tasks per node), the -c value is 2 * floor((128 - 1) / (32 / 2)) = 2 * 7 = 14.

#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH -S 1
srun -n 32 -c 14 --cpu-bind=cores /tmp/my_program.x

Combined example

This example is for Perlmutter CPU.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --qos=regular
#SBATCH --time=60
#SBATCH --core-spec=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=14

module load craype-hugepages2M

sbcast -f --compress ./my_program.x /tmp/my_program.x
srun --cpu-bind=cores /tmp/my_program.x