Example job scripts¶
For details of the terminology used on this page, please see our jobs overview. Correct affinity settings are essential for good performance.
The examples on this page focus on Cori's KNL and Haswell architectures.
- For Perlmutter, please see the running jobs on Perlmutter page.
Basic MPI batch script¶
One MPI process per physical core.
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --constraint=haswell
srun check-mpi.intel.cori
Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=68
#SBATCH --constraint=knl
srun check-mpi.intel.cori
Hybrid MPI+OpenMP jobs¶
Warning
In Slurm each hyperthread is considered a "cpu", so the --cpus-per-task
option must be adjusted accordingly. Generally the best performance is obtained with 1 OpenMP thread per physical core. Additional details about affinity settings are available.
Example 1¶
One MPI process per socket and 1 OpenMP thread per physical core
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --constraint=haswell
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=16
srun check-hybrid.intel.cori
Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=272
#SBATCH --constraint=knl
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=68
srun check-hybrid.intel.cori
Example 2¶
28 MPI processes with 8 OpenMP threads per process, where each OpenMP thread has 1 physical core
Note
The addition of --cpu-bind=cores
is useful for getting correct affinity settings.
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=7
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=16
#SBATCH --constraint=haswell
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
srun --cpu-bind=cores check-hybrid.intel.cori
Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=4
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=32
#SBATCH --constraint=knl
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
srun --cpu-bind=cores check-hybrid.intel.cori
Interactive¶
Interactive jobs are launched with the salloc
command.
Tip
Cori: dedicated nodes for interactive work.
Perlmutter: interactive queue has a higher priority than other QOS's.
Cori Haswell
cori$ salloc --qos=interactive -C haswell --time=60 --nodes=2
Cori KNL
cori$ salloc --qos=interactive -C knl --time=60 --nodes=2
Perlmutter
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 4 --account=mxxxx_g
Note
Please see the interactive section for more details on Cori and Perlmutter's interactive QOS.
Multiple Parallel Jobs Sequentially¶
Multiple sruns can be executed one after another in a single batch script. Be sure to specify the total walltime needed to run all jobs.
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=4
#SBATCH --time=10:00
#SBATCH --licenses=cfs,cscratch1
#SBATCH --constraint=haswell
srun -n 128 -c 2 --cpu_bind=cores ./a.out
srun -n 64 -c 4 --cpu_bind=cores ./b.out
srun -n 32 -c 8 --cpu_bind=cores ./c.out
Tip
Workflow tools are another option to help you run multiple parallel sequential jobs.
Multiple Parallel Jobs Simultaneously¶
Multiple sruns can be executed simultaneously in a single batch script.
Tip
Be sure to specify the total number of nodes needed to run all jobs at the same time.
Note
By default, multiple concurrent srun executions cannot share compute nodes under Slurm in the non-shared QOSs.
In the following example, a total of 192 cores are required, which would hypothetically fit on 192 / 32 = 6 Haswell nodes. However, because sruns cannot share nodes by default, we instead have to dedicate:
- 2 nodes to the first execution (44 cores)
- 4 to the second (108 cores)
- 2 to the third (40 cores)
For all three executables the nodes are not fully packed and the number of MPI tasks per node is not a divisor of 64, so both the -c and --cpu-bind flags are used in the srun commands.
Note
The "&
" at the end of each srun
command and the wait
command at the end of the script are very important to ensure the jobs are run in parallel and the batch job will not exit before all the simultaneous sruns are completed.
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=8
#SBATCH --time=30:00
#SBATCH --licenses=cscratch1
#SBATCH --constraint=haswell
srun -N 2 -n 44 -c 2 --cpu_bind=cores ./a.out &
srun -N 4 -n 108 -c 2 --cpu_bind=cores ./b.out &
srun -N 2 -n 40 -c 2 --cpu_bind=cores ./c.out &
wait
Tip
Workflow tools are another option to help you run multiple parallel simultaneous jobs.
Job Arrays¶
Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.
This example submits 3 jobs. Each job uses 1 node and has the same time limit and QOS. The SLURM_ARRAY_TASK_ID
environment variable is set to the array index value.
Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=knl
#SBATCH --time=2
#SBATCH --array=0-2
echo $SLURM_ARRAY_TASK_ID
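As a sketch of how the array index is typically used, the hypothetical script below selects a different input file for each array task (the input_0.in, input_1.in, ... file names and the a.out executable are placeholders for illustration):
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=knl
#SBATCH --time=10
#SBATCH --array=0-2
# each array task reads its own input file, selected by the array index
srun -n 68 ./a.out input_${SLURM_ARRAY_TASK_ID}.in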
Additional examples and details
- Slurm job array documentation
- Manual pages via
man sbatch
on NERSC systems
Tip
In many use cases, GNU Parallel is a superior solution to job arrays. This is because the Slurm scheduler prioritizes fewer jobs requesting many nodes ahead of many jobs requesting fewer nodes (array tasks are considered individual jobs). Other workflow tools are available as well.
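As a minimal sketch of this approach (assumptions: GNU Parallel is available, e.g. via module load parallel, and run_case.sh is a hypothetical per-task script), the three array tasks above could instead run concurrently inside a single one-node job:
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=knl
#SBATCH --time=10
module load parallel
# run the three task indices concurrently on one node,
# passing each index as the only argument to the task script
seq 0 2 | parallel -j 3 ./run_case.sh {}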
Dependencies¶
Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.
Note
The --parsable
option to sbatch
can simplify working with job dependencies.
Example
jobid=$(sbatch --parsable first_job.sh)
sbatch --dependency=afterok:$jobid second_job.sh
Example
jobid1=$(sbatch --parsable first_job.sh)
jobid2=$(sbatch --parsable --dependency=afterok:$jobid1 second_job.sh)
jobid3=$(sbatch --parsable --dependency=afterok:$jobid1 third_job.sh)
sbatch --dependency=afterok:$jobid2,afterok:$jobid3 last_job.sh
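The same pattern can be extended to chain an arbitrary number of runs. A sketch only (the loop count and the next_job.sh script name are placeholders):
jobid=$(sbatch --parsable first_job.sh)
# chain four more runs, each starting only after the previous one succeeds
for i in $(seq 2 5); do
    jobid=$(sbatch --parsable --dependency=afterok:$jobid next_job.sh)
done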
Note
A job that is dependent on another job does not accumulate eligible queue wait time before the dependency is satisfied.
Tip
Workflow tools are another option to help you manage job dependencies.
Shared¶
Unlike other QOSes, in the shared QOS a single node can be shared by multiple users or jobs. Jobs in the shared QOS are charged for each physical core allocated to the job.
Tip
In many use cases, GNU Parallel is a superior solution to using a shared QOS. This is because the Slurm scheduler prioritizes fewer jobs requesting many nodes ahead of many jobs requesting fewer nodes.
The number of physical cores allocated to a job by Slurm is controlled by three parameters:
- -n (--ntasks)
- -c (--cpus-per-task)
- --mem - Total memory available to the job (MemoryRequested)
Note
In Slurm a "cpu" corresponds to a hyperthread. So there are 2 cpus per physical core.
The memory on a node is divided evenly among the "cpus" (or hyperthreads):
| System | MemoryPerCpu (megabytes) |
|--------|--------------------------|
| Cori   | 1952                     |
The number of physical cores used by a job follows from these values: the job is allocated max( ntasks × cpus-per-task, MemoryRequested / MemoryPerCpu ) "cpus", which corresponds to half that many physical cores (rounded up). For example, the MPI/OpenMP script below requests 2 × 4 = 8 "cpus" and is therefore charged for 4 physical cores.
Cori-Haswell MPI
A two rank MPI job which utilizes 2 physical cores (and 4 hyperthreads) of a Haswell node.
#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
srun --cpu-bind=cores ./a.out
Cori-Haswell MPI/OpenMP
A two rank MPI job which utilizes 4 physical cores (and 8 hyperthreads) of a Haswell node.
#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=2
srun --cpu-bind=cores ./a.out
Cori-Haswell OpenMP
An OpenMP only code which utilizes 6 physical cores.
#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
export OMP_NUM_THREADS=6
./my_openmp_code.exe
Cori-Haswell serial
A serial job should start by requesting a single slot, increasing the requested memory only as needed, in order to maximize throughput and minimize charge and wait time.
#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1GB
./serial.exe
Intel MPI¶
Applications built with Intel MPI can be launched via srun in the Slurm batch script on Cori compute nodes. The module impi
must be loaded, and the application should be built using the mpiicc
(for C codes), mpiifort (for Fortran codes), or mpiicpc (for C++ codes) commands.
Cori Haswell
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --time=03:00:00
#SBATCH --nodes=8
#SBATCH --constraint=haswell
module load impi
mpiicc -qopenmp -o mycode.exe mycode.c
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
srun -n 32 -c 16 --cpu-bind=cores ./mycode.exe
Open MPI¶
On Cori, applications built with Open MPI can be launched via srun or Open MPI's mpirun command. The module openmpi
needs to be loaded to build an application against Open MPI. Typically one builds the application using the mpicc
(for C codes), mpifort
(for Fortran codes), or mpiCC
(for C++ codes) commands. Alternatively, Open MPI supports use of pkg-config
to obtain the include and library paths. For example, pkg-config --cflags --libs ompi-c
returns the flags that must be passed to the backend c
compiler (e.g. gcc, gfortran, icc, ifort) to build against Open MPI. Open MPI also supports Java MPI bindings. Use mpijavac
to compile Java codes that use the Java MPI bindings. For Java MPI, it is highly recommended to launch jobs using Open MPI's mpirun command. Note the Open MPI packages at NERSC do not support static linking.
See Open MPI for more information about using Open MPI on NERSC systems.
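For example, a minimal sketch of building a C source file (such as the ring_c.c example below) directly with a backend compiler via pkg-config, assuming the openmpi module has been loaded so that the ompi-c package is found:
module load openmpi
# pkg-config supplies the Open MPI include and library flags to the backend compiler
gcc $(pkg-config --cflags ompi-c) -o ring_c ring_c.c $(pkg-config --libs ompi-c)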
Cori Haswell Open MPI
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --constraint=haswell
module load openmpi
/bin/cat <<EOM > ring_c.c
/*
* Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
*
* Simple ring test program in C.
*/
#include <stdio.h>
#include "mpi.h"
int main(int argc, char *argv[])
{
int rank, size, next, prev, message, tag = 201;
/* Start up MPI */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* Calculate the rank of the next process in the ring. Use the
modulus operator so that the last process "wraps around" to
rank zero. */
next = (rank + 1) % size;
prev = (rank + size - 1) % size;
/* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
put the number of times to go around the ring in the
message. */
if (0 == rank) {
message = 10;
printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
message, next, tag, size);
MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
printf("Process 0 sent to %d\n", next);
}
/* Pass the message around the ring. The exit mechanism works as
follows: the message (a positive integer) is passed around the
ring. Each time it passes rank 0, it is decremented. When
each processes receives a message containing a 0 value, it
passes the message on to the next process and then quits. By
passing the 0 message first, every process gets the 0 message
and can quit normally. */
while (1) {
MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
if (0 == rank) {
--message;
printf("Process 0 decremented value: %d\n", message);
}
MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
if (0 == message) {
printf("Process %d exiting\n", rank);
break;
}
}
/* The last process does one extra send to process 0, which needs
to be received before the program can exit */
if (0 == rank) {
MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
}
/* All done */
MPI_Finalize();
return 0;
}
EOM
mpicc -o ring_c ring_c.c
mpirun ring_c
#
# run again with srun
#
srun ring_c
Cori KNL Open MPI
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=68
#SBATCH --constraint=knl
module load openmpi
/bin/cat <<EOM > ring_c.c
/*
* Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2006 Cisco Systems, Inc. All rights reserved.
*
* Simple ring test program in C.
*/
#include <stdio.h>
#include "mpi.h"
int main(int argc, char *argv[])
{
int rank, size, next, prev, message, tag = 201;
/* Start up MPI */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* Calculate the rank of the next process in the ring. Use the
modulus operator so that the last process "wraps around" to
rank zero. */
next = (rank + 1) % size;
prev = (rank + size - 1) % size;
/* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
put the number of times to go around the ring in the
message. */
if (0 == rank) {
message = 10;
printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
message, next, tag, size);
MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
printf("Process 0 sent to %d\n", next);
}
/* Pass the message around the ring. The exit mechanism works as
follows: the message (a positive integer) is passed around the
ring. Each time it passes rank 0, it is decremented. When
each processes receives a message containing a 0 value, it
passes the message on to the next process and then quits. By
passing the 0 message first, every process gets the 0 message
and can quit normally. */
while (1) {
MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
if (0 == rank) {
--message;
printf("Process 0 decremented value: %d\n", message);
}
MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
if (0 == message) {
printf("Process %d exiting\n", rank);
break;
}
}
/* The last process does one extra send to process 0, which needs
to be received before the program can exit */
if (0 == rank) {
MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
}
/* All done */
MPI_Finalize();
return 0;
}
EOM
mpicc -o ring_c ring_c.c
mpirun ring_c
#
# run again with srun
#
srun ring_c
On Perlmutter, only the mpirun-based approach to launching applications compiled against Open MPI is available.
Xfer queue¶
The intended use of the xfer queue is to transfer data between compute systems and HPSS. xfer jobs run on one of the system login nodes and are free of charge. If you want to transfer data to the HPSS archive system at the end of a regular job, you can submit an xfer job at the end of your batch job script. On Cori, this is done via module load esslurm; sbatch -q xfer hsi put <my_files>
. On Perlmutter, this can simply be done with sbatch -q xfer hsi put <my_files>
. xfer jobs can be monitored via module load esslurm; squeue
on Cori and via squeue
on Perlmutter. On either system, the number of running jobs for each user is limited to the number of concurrent HPSS sessions (15).
Tip
On Cori, you must load the esslurm module to access the xfer QOS. xfer jobs on Perlmutter do not require any additional modules.
Warning
Do not run computational jobs in the xfer queue.
Xfer transfer job
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=my_transfer
#SBATCH --licenses=SCRATCH
#Archive run01 to HPSS
htar -cvf run01.tar run01
xfer jobs specifying -N nodes
will be rejected at submission time. When submitting an xfer job, the -C
argument is not needed since the job does not run on compute nodes. By default, xfer jobs get 2GB of memory allocated. The memory footprint scales somewhat with the size of the file, so if you're archiving larger files, you'll need to request more memory. You can do this by adding #SBATCH --mem=XGB
to the above script, where an X in the range of 5-10 GB is a good starting point for large files.
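For example, a minimal sketch of the same archive job requesting extra memory for a large file (the 8GB value and the big_run directory name are only illustrative):
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=my_large_transfer
#SBATCH --licenses=SCRATCH
#SBATCH --mem=8GB
#Archive a large run directory to HPSS
htar -cvf big_run.tar big_run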
Variable-time jobs¶
After scheduling jobs earlier in the queue, Slurm attempts to fill gaps in the schedule by scanning the remainder of the queue for jobs that can fit in those gaps. If your job can be flexible about its required runtime, you can add a --time-min flag and Slurm will start the job in the first gap larger than the specified --time-min, thus reducing queue wait time. Slurm will set the time limit for the job to either the maximum requested time (--time) or the size of the gap, whichever is smaller.
Tip
Jobs that are capable of checkpoint/restart are ideal candidates for --time-min
.
Pre-terminated jobs can be requeued (or resubmitted) by using the scontrol requeue
command (or sbatch) to resume from where the previous executions left off, until the cumulative execution time reaches the desired time limit or the job completes.
When combined with checkpointing, this allows jobs to accumulate more than the usual 48-hour wallclock limit.
For example, a job that requires 6 hours of wallclock time cannot be used to fill a 3-hour gap in the schedule, but the same job submitted with a --time-min of 2 hours can backfill that gap, checkpoint when the shortened time limit is reached, and requeue to continue later.
Variable-time jobs are jobs submitted with a minimum time, #SBATCH --time-min
, in addition to the maximum time (#SBATCH --time
). Here is an example job script for variable-time jobs:
Sample job script with --time-min
#!/bin/bash
#SBATCH -J test
#SBATCH -q flex
#SBATCH -C knl
#SBATCH -N 1
#SBATCH --time=48:00:00 #the max walltime allowed for flex QOS jobs
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#this is an example to run an MPI+OpenMP job:
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=8
srun -n8 -c32 --cpu_bind=cores ./a.out
Using the flex QOS for charging discount for variable-time jobs¶
You can access the flex queue (and a substantial discount) by submitting with -q flex
. You must specify a minimum running time for this job of 2 hours or less with the --time-min
flag. Jobs submitted without the --time-min
flag will be automatically rejected by the batch system. The maximum wall time request limit (requested via --time
or -t
flag) for flex jobs must be greater than 2 hours and cannot exceed 48 hours.
Example
A flex job requesting a minimum time of 1.5 hours, and max wall time of 10 hrs:
sbatch -q flex --time-min=01:30:00 --time=10:00:00 my_batch_script.sl
Tip
Variable-time jobs, by specifying a shorter minimum amount of time that the job can run, increase backfill opportunities, meaning users will see better queue turnaround. In addition, the process of resubmitting the job can be automated, so users can run a long job in multiple shorter chunks with a single job script (see the automated job script sample below). However, variable-time jobs incur checkpoint/restart overheads from splitting a longer job into multiple shorter ones. The flex QOS discount aims to compensate for these checkpoint/restart overheads.
Note
- The flex QOS has a 75% charging discount on KNL and 50% discount on Haswell. The discount rate is subject to change.
- Variable-time jobs work with any QOS on Cori, but the charging discount is only available with the flex QOS.
Annotated example - automated variable-time jobs¶
A sample job script for variable-time jobs, which automates the process of executing, pre-terminating, requeuing and restarting the job repeatedly until it runs for the desired amount of time or the job completes.
Cori Haswell
#!/bin/bash
#SBATCH -J vtj
#SBATCH -q regular
#SBATCH -C haswell
#SBATCH -N 2
#SBATCH --time=48:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00 #desired timelimit
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append
# specify the command to run to checkpoint your job if any (leave blank if none)
ckpt_command=
# requeue the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#
# user setting goes here
# srun must execute in the background and catch the signal USR1 on the wait command
srun -n64 -c2 --cpu_bind=cores ./a.out &
wait
Cori KNL
#!/bin/bash
#SBATCH -J vtj
#SBATCH -q flex
#SBATCH -C knl
#SBATCH -N 2
#SBATCH --time=48:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --error=%x-%j.err
#SBATCH --output=%x-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00 #desired time limit
#SBATCH --signal=B:USR1@60 #sig_time (60 seconds) should match your checkpoint overhead time
#SBATCH --requeue
#SBATCH --open-mode=append
# specify the command to use to checkpoint your job if any (leave blank if none)
ckpt_command=
# requeue the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#
# user setting goes here
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=4
#srun must execute in the background and catch the signal USR1 on the wait command
srun -n32 -c16 --cpu_bind=cores ./a.out &
wait
The --comment
option is used to enter the user’s desired maximum wall-clock time, which could be longer than the maximum time limit allowed by the batch system (96 hours in this example). In addition to the time limit (--time
), the --time-min
option is used to specify the minimum amount of time the job should run (2 hours).
The script setup.sh
defines a few bash functions (e.g., requeue_job
, func_trap
) that are used to automate the process. The requeue_job func_trap USR1
command executes the func_trap
function, which contains a list of actions to checkpoint and requeue the job upon trapping the USR1
signal. Users may want to modify the scripts (get a copy) as needed, although they should work for most applications as they are now.
The job script works as follows:
- User submits the above job script.
- The batch system looks for a backfill opportunity for the job. If it can allocate the requested number of nodes for this job for any duration (e.g., 3 hours) between the specified minimum time (2 hours) and the time limit (48 hours) before those nodes are used for other higher priority jobs, the job starts execution.
- The job runs until it receives the USR1 signal (--signal=B:USR1@<sig_time>) 60 seconds (sig_time=60 in this example) before it hits the allocated time limit (3 hours). The sig_time should match the amount of time (in seconds) needed for checkpointing.
- Upon receiving the signal, the job checkpoints and requeues itself with the remaining maximum time limit before it gets terminated.
- Steps 2-4 repeat until the job runs for the desired amount of time (96 hours) or the job completes.
Note
- If your application requires external triggers or commands to do checkpointing, you need to provide the checkpoint commands using the ckpt_command variable. It could be a script containing several commands to be executed within the specified checkpoint overhead time.
- Additionally, if you need to change the job input files to resume the job, you can do so within ckpt_command.
- If your application does checkpointing periodically, like most molecular dynamics codes do, you don't need to specify ckpt_command (just leave it blank).
- You can send the USR1 signal outside the job script at any time using the scancel -b -s USR1 <jobid> command to terminate the currently running job. The job still checkpoints and requeues itself before it gets terminated.
- The srun command must execute in the background (notice the & at the end of the srun command line and the wait command at the end of the job script) so that the USR1 signal is caught by the wait command instead of by srun; this allows srun to keep running for a bit longer (up to sig_time seconds) to complete the checkpointing.
VASP example¶
VASP atomic relaxation jobs for Cori KNL
#!/bin/bash
#SBATCH -J vt_vasp
#SBATCH -q regular
#SBATCH -C knl
#SBATCH -N 2
#SBATCH --time=48:0:00
#SBATCH --error=%x%j.err
#SBATCH --output=%x%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00
#SBATCH --time-min=02:0:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
#SBATCH --open-mode=append
# user setting
module load vasp/20181030-knl
export OMP_NUM_THREADS=4
#srun must execute in background and catch signal on wait command
srun -n 32 -c16 --cpu_bind=cores vasp_std &
# put any commands that need to run to prepare for the next job here
ckpt_vasp() {
restarts=`squeue -h -O restartcnt -j $SLURM_JOB_ID`
echo checkpointing the ${restarts}-th job >&2
#to terminate VASP at the next ionic step
echo LSTOP = .TRUE. > STOPCAR
#wait for VASP to complete the current ionic step, write out the WAVECAR file, and quit
srun_pid=`ps -fle|grep srun|head -1|awk '{print $4}'`
echo srun pid is $srun_pid >&2
wait $srun_pid
#copy CONTCAR to POSCAR
cp -p CONTCAR POSCAR
}
ckpt_command=ckpt_vasp
# requeueing the job if remaining time >0
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
wait
Preemptible Jobs¶
If your application is capable of checkpointing, you may consider using the preemptible queue. The preemptible queue may provide you with results in less overall wallclock time than a series of longer jobs in the regular queue. Jobs submitted to this queue may start faster than ones submitted to the regular queue, since they can be preempted in favor of a higher priority job after a set period of time. The preemptible queue is also typically deeply discounted relative to the regular queue. See QOS limits and charges for the current preemption time and charge factor.
Use the --requeue
flag to tell Slurm to reschedule your job automatically after preemption occurs. Here is an example of a job script that will run a job for a maximum of 96 hours in up to 24 hour increments, with the --requeue
flag set to resubmit the job every time it is preempted:
Perlmutter GPU
#!/bin/bash
#SBATCH -q preempt
#SBATCH -C gpu
#SBATCH -N 1
#SBATCH --time=24:00:00
#SBATCH --error=%x-%j.err
#SBATCH --output=%x-%j.out
#SBATCH --comment=96:00:00 #desired time limit
#SBATCH --signal=B:USR1@60 #sig_time (60 seconds) should match your checkpoint overhead time
#SBATCH --requeue
#SBATCH --open-mode=append
# specify the command to use to checkpoint your job if any (leave blank if none)
ckpt_command=
# user setting and executables go here
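# A sketch only of a typical user section, reusing the requeue helper from
# the variable-time examples above (assumptions: the helper script is
# available at the same path on your system, and ./a.out is a placeholder
# for your checkpointable executable):
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
# srun runs in the background so the USR1 signal is caught at the wait command
srun ./a.out &
wait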
When checking on a job submitted to the preemptible queue with sacct
, include the --duplicates
option, since each job execution shares the same job ID.
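For example (the job ID shown is a placeholder):
sacct --duplicates -j 123456 --format=JobID,State,Start,End,Elapsed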
MPMD (Multiple Program Multiple Data) jobs¶
Run a job with different programs and different arguments for each task. To run MPMD jobs under Slurm use --multi-prog <config_file_name>
.
srun -n 8 --multi-prog myrun.conf
Configuration file format¶
- Task rank: One or more task ranks to use this configuration. Multiple values may be comma separated. Ranges may be indicated with two numbers separated with a - with the smaller number first (e.g. 0-4 and not 4-0). To indicate all tasks not otherwise specified, specify a rank of * as the last line of the file. If an attempt is made to initiate a task for which no executable program is defined, the following error message will be produced: No executable program specified for this task.
- Executable: The name of the program to execute. May be a fully qualified pathname if desired.
- Arguments: Program arguments. The expression %t will be replaced with the task's number. The expression %o will be replaced with the task's offset within this range (e.g. a configured task rank value of 1-5 would have offset values of 0-4). Single quotes may be used to avoid having the enclosed values interpreted. This field is optional. Any arguments for the program entered on the command line will be added to the arguments specified in the configuration file.
Example¶
Sample job script for MPMD jobs. You need to create a configuration file in the format described above and a batch script that passes this configuration file via the --multi-prog flag in the srun command.
Cori-Haswell
cori$ cat mpmd.conf
0-35 ./a.out
36-96 ./b.out
cori$ cat batch_script.sh
#!/bin/bash
#SBATCH -q regular
#SBATCH -N 5
#SBATCH -n 97 # total of 97 tasks
#SBATCH -t 02:00:00
#SBATCH -C haswell
srun --multi-prog ./mpmd.conf
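The substitution expressions described above can be combined with the catch-all rank. For instance, a hypothetical configuration (master and worker are placeholder executables) that runs one program on rank 0 and passes every other task its own rank number as an argument:
cori$ cat mpmd_ranked.conf
0 ./master
* ./worker %t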
Burst buffer¶
All examples for the burst buffer are shown with Cori Haswell nodes, but the burst buffer can also be used with KNL nodes.
More details about the Burst Buffer are available on its dedicated page.
Check the DataWarp limitations
Please note that support for DataWarp has been reduced. The Burst Buffer is not persistent storage, and a reservation can become unavailable if hardware is unstable. A user reported a data corruption event, detailed in the known issues section of the Burst Buffer documentation page. We invite users to consider using the Cori SCRATCH file system whenever possible. DataWarp is still available for those who benefit from it and recognize the possible risks.
Make sure you understand all the limitations of the Burst Buffer reported on its documentation page, to avoid losing data and wasting precious compute hours.
Scratch¶
Use the burst buffer as a scratch space to store temporary data during the execution of I/O intensive codes. In this mode all data from the burst buffer allocation will be removed automatically at the end of the job.
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output.txt
ls ${DW_JOB_STRIPED}
cat ${DW_JOB_STRIPED}/output.txt
Stage in/out¶
Copy the named file or directory into the Burst Buffer, which can then be accessed using $DW_JOB_STRIPED
.
Note
- Only files on the Cori $SCRATCH file system can be staged in, and stage out only works on Cori $SCRATCH; if a destination file not on $SCRATCH is used, the files will be lost
- A full path to the file must be used
- You must have permissions to access the file
- The job start may be delayed until the transfer is complete
- Stage out occurs after the job is completed, so there is no charge
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/dwtest-file destination=$DW_JOB_STRIPED/dwtest-file type=file
srun ls ${DW_JOB_STRIPED}/dwtest-file
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_out source=$DW_JOB_STRIPED/output destination=/global/cscratch1/sd/username/output type=directory
mkdir $DW_JOB_STRIPED/output
srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output/output.txt
Persistent Reservations¶
Persistent reservations are useful when multiple jobs need access to the same files.
Warning
- Reservations must be deleted when no longer in use.
- There are no guarantees of data integrity over long periods of time.
- In Q4 2021 the Burst Buffer suffered several power losses, which caused jobs to get stuck in the queue waiting for Persistent Reservations to be configured; in this case, please create a new PR and resubmit your job.
- If you have multiple jobs writing to the same directory in a Persistent Reservation, you will run into race conditions due to DataWarp caching. The second job will likely fail with Permission denied or No such file or directory messages.
- See the Burst Buffer dedicated page for more details about known issues.
Note
Each persistent reservation must have a unique name. Check the existing PRs with scontrol show burst
.
Create¶
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB create_persistent name=PRname capacity=100GB access_mode=striped type=scratch
Use¶
Take care if multiple jobs will be using the reservation to not overwrite data.
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW persistentdw name=PRname
ls $DW_PERSISTENT_STRIPED_PRname/
Destroy¶
Any data on the reservation at the time the script executes will be removed.
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB destroy_persistent name=PRname
Interactive¶
The burst buffer is also available in interactive sessions. It is recommended to use a configuration file for the burst buffer directives:
cori$ cat bbf.conf
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file
cori$ salloc --qos=interactive -C haswell -t 00:30:00 --bbf=bbf.conf
Large Memory¶
There are two nodes on Cori, cori22 and cori23, each with 750 GB of memory, that can be used for jobs requiring very high memory per node. Because there are only two such nodes, this resource is limited and should only be used for jobs that genuinely require high memory.
Cori Example
A sample bigmem job which needs only one core.
#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=bigmem
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=my_big_job
#SBATCH --licenses=SCRATCH
srun -n 1 ./my_big_executable
Realtime¶
The "realtime" QOS is used for running jobs with the need of getting realtime turnaround time. This is only intended for jobs that are connected with an external realtime component (e.g. live beamline runs, telescope time, etc.).
Note
Use of this QOS requires special approval, and is only intended for use with a live, external realtime component that needs on-demand resources. There are limited resources available on this queue. It is not intended to provide faster batch turnaround for regular jobs.
The realtime QOS is a user-selective shared QOS, meaning you can request either exclusive node access (with the #SBATCH --exclusive
flag) or allow multiple applications to share a node (with the #SBATCH --share
flag).
Tip
It is recommended to allow sharing the nodes so more jobs can be scheduled in the allocated nodes. Sharing a node is the default setting, and using #SBATCH --share
is optional.
Example
Uses two full nodes
#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job
#SBATCH --licenses=cfs
#SBATCH --exclusive
srun --cpu-bind=cores ./mycode.exe # pure MPI, 64 MPI tasks
If you are requesting only a portion of a single node, please add --gres=craynetwork:0 as follows to allow more jobs on the node. Similar to using the "shared" QOS, you can request a number of slots on the node (out of a total of 64 CPUs, or 64 slots) by specifying --ntasks and/or --mem. The rules are the same as for the shared QOS.
Example
Two MPI ranks running with 4 OpenMP threads each. The job uses a total of 8 physical cores (8 "cpus", or hyperthreads, per task) and 10GB of memory.
#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=craynetwork:0
#SBATCH --cpus-per-task=8
#SBATCH --mem=10GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job2
#SBATCH --licenses=cfs
#SBATCH --share
export OMP_NUM_THREADS=4
srun --cpu-bind=cores ./mycode.exe
Example
OpenMP only code running with 6 threads. Note that srun
is not required in this case.
#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=craynetwork:0
#SBATCH --cpus-per-task=12
#SBATCH --mem=16GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job3
#SBATCH --licenses=cfs,SCRATCH
#SBATCH --share
export OMP_NUM_THREADS=6
./mycode.exe
Multiple Parallel Jobs While Sharing Nodes¶
Under certain scenarios, you might want two or more independent applications running simultaneously on each compute node allocated to your job. For example, a pair of applications might interact in a client-server fashion via some on-node IPC mechanism (e.g. shared memory) but must be launched with distinct MPI communicators.
This latter constraint means that MPMD mode (described above) is not an appropriate solution, since although MPMD can allow multiple executables to share compute nodes, the executables will also share an MPI_COMM_WORLD at launch.
Slurm can allow multiple executables launched with concurrent srun calls to share compute nodes as long as the sum of the resources assigned to each application does not exceed the node resources requested for the job. Importantly, you cannot over-allocate the CPU, memory, or "craynetwork" resource. While the former two are self-explanatory, the latter refers to limitations imposed on the number of applications per node that can simultaneously use the Aries interconnect, which is currently limited to 4.
Here is an example of an sbatch script that uses two compute nodes and runs two applications concurrently. One application uses 8 cores on each node, while the other uses 24 on each node. The number of tasks per node is controlled with the -n
and -N
flags and the amount of memory per node with the --mem
flag. To specify the "craynetwork" resource, we use the --gres
flag available in both sbatch
and srun
. The --overlap
flag is needed to allow overlap on the assigned resources with other job steps.
Cori Haswell
#!/bin/bash
#SBATCH -q regular
#SBATCH -N 2
#SBATCH -t 12:00:00
#SBATCH --gres=craynetwork:2
#SBATCH -L SCRATCH
#SBATCH -C haswell
srun -N 2 -n 16 -c 2 --mem=51200 --gres=craynetwork:1 --overlap ./exec_a &
srun -N 2 -n 48 -c 2 --mem=61440 --gres=craynetwork:1 --overlap ./exec_b &
wait
This example is quite similar to the multiple srun jobs shown for running simultaneous parallel jobs, with the following exceptions:
- For our sbatch job, we have requested --gres=craynetwork:2, which will allow us to run up to two applications simultaneously per compute node.
- In our srun calls, we have explicitly defined the maximum amount of memory available to each application per node with --mem (in this example 50 and 60 GB, respectively) such that the sum is less than the resource limit per node (roughly 122 GB).
- In our srun calls, we have also explicitly used one of the two requested craynetwork resources per call.
- In our srun calls, we need to use the --overlap flag to allow multiple sruns to share resources on the same nodes with other job steps.
Using this combination of resource requests, we are able to run multiple parallel applications per compute node.
Note
It is permitted to specify srun --gres=craynetwork:0
which will not count against the craynetwork resource. This is useful when, for example, launching a bash script or other application that does not use the interconnect. We don't currently anticipate this being a common use case, but if your application(s) do employ this mode of operation it would be appreciated if you let us know.
Tip
Workflow tools are another option to help you run multiple parallel jobs while sharing nodes.
Compile¶
The compile
QOS is intended for compiling codes and should be used in workflows that regularly build from source code. These jobs are submitted to a special queue that can be accessed by loading the esslurm module or by passing the -M escori flag on the command line.
Note
All compile queue jobs run on a single Haswell node, so please be mindful of the resources requested for a job.
Example
A sample compile job.
#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=compile
#SBATCH --job-name=my_compile_job
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8GB
make -j 4
Heterogeneous Jobs¶
Slurm is able to submit and manage a single job which contains several components with different job options. The individual components of a heterogeneous job can use almost all of the Slurm job options. Heterogeneous jobs can be useful if parts of a job have different requirements. For example, part of a job might require 4 GPUs whilst the other part of the job requires 256 CPU cores. Likewise, parts of a job may have different memory-per-CPU requirements and therefore benefit from a heterogeneous job.
Example
A sample heterogeneous Perlmutter job utilizing both CPU and GPU compute nodes.
#!/bin/bash
#SBATCH -A <account>
#SBATCH --qos=regular
#SBATCH --time=05:00:00
#SBATCH --constraint=cpu
#SBATCH --nodes=2
#SBATCH hetjob
#SBATCH --constraint=gpu
#SBATCH --nodes=1
srun --het-group=0 cpu_script.sh
srun -G 4 --het-group=1 gpu_script.sh
Each component of the job should be separated by the #SBATCH hetjob
line in the slurm script (as shown above). The --het-group
option in srun
defines which component(s) are to have applications launched for them. Slurm heterogeneous jobs do support multiple components and each component will appear in squeue
.
There is also syntax for salloc
, sbatch
and srun
commands. The character :
is used to separate each component request. See example below:
sbatch --cpus-per-task=4 --ntasks=128 : \
--cpus-per-task=1 --ntasks=1 my_batch_script.sl
For more information on heterogeneous Slurm jobs, visit the Slurm support documentation page.
Projects that have exhausted their allocation¶
A project with zero or negative NERSC-hours balance can submit to the overrun queue.
If you meet the overrun criteria, you can access the overrun queue by submitting with -q overrun
(-q shared_overrun
for the shared queue). In addition, on Cori, you must specify a minimum running time for this job of 4 hours or less with the --time-min
flag. Jobs submitted without these flags will be automatically rejected by the batch system. On Perlmutter, all overrun jobs are subject to preemption by higher priority workloads under certain circumstances.
Tip
We recommend you implement checkpoint/restart in your overrun jobs to save your progress.
Example
A job requesting a minimum time of 1.5 hours:
sbatch -q overrun --time-min=01:30:00 my_batch_script.sl
Additional information¶
- sbatch documentation
- Manual pages (
man sbatch
on NERSC systems)