Interactive Jobs¶
Allocation¶
salloc is used to allocate resources in real time to run an interactive batch job. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
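As a rough sketch of that pattern (the constraint, account, task count, and executable name ./my_app below are placeholders; the exact flags for each system are given in the sections that follow):
salloc --nodes 1 --qos interactive --time 00:30:00 --constraint <arch> --account=mxxxx
srun --ntasks 32 ./my_app   # run inside the shell that salloc spawns
exit                        # release the nodes when you are done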
"interactive" QOS on Perlmutter and Cori¶
Perlmutter and Cori have a dedicated interactive QOS to support medium-length interactive work. This queue is intended to deliver nodes for interactive use within 6 minutes of the job request.
Warning
On Cori, if you also have access to Cori GPU nodes, ensure that the cgpu module is not loaded before submitting interactive jobs to Cori Haswell or KNL nodes. Otherwise, the salloc command will fail with the following error message:
salloc: error: Job request does not match any supported policy.
salloc: error: Job submit/allocate failed: Unspecified error
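If you are unsure whether the module is loaded, you can check and unload it before requesting the allocation, for example:
module list           # shows the currently loaded modules
module unload cgpu    # unload the Cori GPU module if present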
Warning
On Perlmutter, if you have not set a default account, salloc may fail with the following error message:
salloc: error: Job request does not match any supported policy.
salloc: error: Job submit/allocate failed: Unspecified error
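Until a default account is set, one possible workaround is to specify the project explicitly with the --account flag (mxxxx is a placeholder project name), for example:
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=mxxxx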
Perlmutter GPU nodes¶
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 4 --account=mxxxx_g
When using srun, you must explicitly request GPU resources
One must use the --gpus (-G), --gpus-per-node, or --gpus-per-task flag to make the allocated node's GPUs visible to your srun command.
Otherwise, you will see errors similar to:
no CUDA-capable device is detected
No Cuda device found
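For example, a sketch of launching four tasks with one GPU each inside the allocation above (./my_gpu_app is a placeholder executable):
srun --ntasks 4 --gpus-per-task 1 ./my_gpu_app
# or, equivalently, make all four GPUs visible to the job step:
srun --ntasks 4 --gpus 4 ./my_gpu_app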
When requesting an interactive node on the Perlmutter GPU compute nodes
One must use a project name that ends in _g (e.g., mxxxx_g) to submit any jobs to run on the Perlmutter GPU nodes. The -C (constraint) flag must also be set to gpu for any interactive jobs (-C gpu or --constraint gpu).
Otherwise, you will notice errors such as:
sbatch: error: Job request does not match any supported policy.
sbatch: error: Batch job submission failed: Unspecified error
Perlmutter CPU nodes¶
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=mxxxx
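Once the allocation starts, launch parallel tasks with srun from the resulting shell; for example, a sketch with a placeholder task count and executable:
srun --ntasks 64 --cpu-bind=cores ./my_cpu_app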
Cori Haswell¶
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell
Cori KNL¶
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint knl
Limits¶
On Cori, users in this queue are limited to two running jobs on as many as 64 nodes for up to 4 hours. Additionally, each NERSC project is limited to a total of 64 nodes across all of its interactive jobs (KNL or Haswell). This means that if UserA in project m9999 has a job using 1 Haswell node, UserB (who is also in project m9999) can have a simultaneous job using 63 Haswell nodes or 63 KNL nodes, but not 64 nodes. Since this queue is intended for interactive work, each user can submit only two jobs at a time (either KNL or Haswell). KNL nodes are currently limited to quad,cache mode only. You can run only full-node jobs; sub-node jobs like those in the shared queue are not possible.
We have configured this queue to reject a job if it cannot be scheduled within a few minutes. This could be because the job violates the two-jobs-per-user limit or the total-nodes-per-NERSC-allocation limit, or because there are not enough nodes available to satisfy the request. In some rare cases, jobs may also be rejected because the batch system is overloaded and wasn't able to process your job in time. If that happens, please resubmit.
Since there is a limit on the number of nodes used per allocation on Cori, you may be unable to run a job because other users who share your allocation are using it. To see who in your allocation is using the interactive queue on Cori, you can use:
squeue --qos=interactive --account=<project name> -O jobid,username,starttime,timelimit,maxnodes,account
Please coordinate with your group members on provisioning interactive resources for your project if you find that others' usage precludes yours.
You can see the number of nodes that are in use (A, for allocated) or idle (I) using this command:
$ sinfo -p interactive --format="%15b %8D %A"
ACTIVE_FEATURES NODES NODES(A/I)
knl 2 0/0
haswell 192 191/1
knl,cache,quad 190 65/124
Cori "debug" QOS¶
A number of Haswell and KNL compute nodes on Cori are reserved for the "debug" QOS, which is designed for quick turnaround: the wall time limit is 30 minutes, the node limits are low (64 on Haswell and 512 on KNL), and the run limit is low (2 jobs each on Haswell and KNL per user).
The "debug" QOS is intended for code development, testing, and debugging. One common application is to submit a batch interactive job via the salloc
command, such as:
salloc -N 32 -C haswell -q debug -t 20:00
When the requested debug nodes are available, the job will land on the head compute node of the pool of allocated compute nodes, and the srun command can be used to launch parallel tasks interactively.
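For instance, inside the 32-node Haswell allocation above you could launch one task per physical core (32 cores per Haswell node; ./my_app is a placeholder executable):
srun --ntasks 1024 ./my_app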
To run debug jobs on KNL nodes, use -C knl instead of -C haswell.
You can run only full node jobs in the debug QOS; sub-node jobs like those in the shared queue are not possible.