Interactive Jobs

Allocation

salloc is used to allocate resources in real time for an interactive job. Typically it is used to allocate resources and spawn a shell; the shell is then used to execute srun commands to launch parallel tasks.
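For example, a minimal interactive session might look like the following (node count, time, and account name are illustrative):

```shell
# Request one node for 30 minutes in the interactive QOS (illustrative values)
salloc --nodes 1 --qos interactive --time 00:30:00 --constraint cpu --account=mxxxx

# Once the allocation is granted, a shell opens on the head compute node.
# From that shell, launch a parallel task across the allocation with srun:
srun --ntasks 4 hostname
```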

"interactive" QOS on Perlmutter and Cori

Perlmutter and Cori have a dedicated interactive QOS to support medium-length interactive work. This QOS is intended to deliver nodes for interactive use within 6 minutes of the job request.

Warning

On Cori, if you also have access to Cori GPU nodes, ensure that the cgpu module is not loaded before submitting interactive jobs to Cori Haswell or KNL nodes. Otherwise, the salloc command will fail with the following error message:

salloc: error: Job request does not match any supported policy.
salloc: error: Job submit/allocate failed: Unspecified error

Warning

On Perlmutter, if you have not set a gpu-compatible default account, salloc will fail with the following error message:

salloc: error: Job request does not match any supported policy.
salloc: error: Job submit/allocate failed: Unspecified error

Perlmutter GPU nodes

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 4 --account=mxxxx_g

When using srun, you must explicitly request GPU resources

You must use the --gpus (-G), --gpus-per-node, or --gpus-per-task flag to make the allocated node's GPUs visible to your srun command.

Otherwise, you will see errors / complaints similar to:

 no CUDA-capable device is detected

 No Cuda device found
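For example, inside the interactive GPU allocation above, a launch that makes the GPUs visible might look like this (the task count is illustrative and ./my_gpu_app is a placeholder executable):

```shell
# Give each of the 4 tasks its own GPU (illustrative values; ./my_gpu_app is a placeholder)
srun --ntasks 4 --gpus-per-task 1 ./my_gpu_app
```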

When requesting an interactive node on the Perlmutter GPU compute nodes

You must use a project name that ends in _g (e.g., mxxxx_g) to submit any jobs to run on the Perlmutter GPU nodes. The -C (constraint) flag must also be set to gpu for any interactive job (-C gpu or --constraint gpu).

Otherwise, you will notice errors such as:

sbatch: error: Job request does not match any supported policy.
sbatch: error: Batch job submission failed: Unspecified error

Perlmutter CPU nodes

Note that the account for CPU nodes does not have the trailing _g:

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=mxxxx

Cori Haswell

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell

Cori KNL

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint knl

Limits

On Cori, users in this QOS are limited to two running jobs, using up to 64 nodes in total, for up to 4 hours. Additionally, each NERSC project is limited to a total of 64 nodes across all of its interactive jobs (KNL or Haswell). This means that if UserA in project m9999 has a job using 1 Haswell node, UserB (who is also in project m9999) can have a simultaneous job using 63 Haswell nodes or 63 KNL nodes, but not 64 nodes. Since this QOS is intended for interactive work, each user can submit only two jobs at a time (KNL, Haswell, or one of each). KNL nodes are currently limited to quad,cache mode only. You can run only full-node jobs; sub-node jobs like those in the shared QOS are not possible.

We have configured this QOS to reject a job if it cannot be scheduled within a few minutes. This could be because the job violates the two-jobs-per-user limit or the 64-node-per-project limit, or because there are not enough nodes available to satisfy the request. In some rare cases, jobs may also be rejected because the batch system is overloaded and was not able to process your job in time. If that happens, please resubmit.

Since there is a per-project limit on the number of nodes in use on Cori, you may be unable to run a job because other users who share your allocation are using it. To see who in your project is using the interactive QOS on Cori, you can use:

squeue --qos=interactive --account=<project name> -O jobid,username,starttime,timelimit,maxnodes,account

Please coordinate with your group members on provisioning interactive resources for your project if you find that others' usage precludes yours.

You can see the number of nodes that are in use (A for allocated) or idle (I) with this command:

$ sinfo -p interactive --format="%15b %8D %A"
ACTIVE_FEATURES NODES    NODES(A/I)
knl             2        0/0
haswell         192      191/1
knl,cache,quad  190      65/124

Cori "debug" QOS

A number of Haswell and KNL compute nodes on Cori are reserved for the "debug" QOS, which is designed for quick turnaround: the wall time limit is 30 minutes, the node limits are low (64 on Haswell and 512 on KNL), and the run limit is low (2 jobs each on Haswell and KNL per user).

The "debug" QOS is intended for code development, testing, and debugging. One common application is to submit a batch interactive job via the salloc command, such as:

salloc -N 32 -C haswell -q debug -t 20:00

When the requested debug nodes become available, the job will land on the head node of the pool of allocated compute nodes, and the srun command can be used to launch parallel tasks interactively.

To run debug jobs on KNL nodes, use -C knl instead of -C haswell.
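For example, once the salloc above returns a shell, launching tasks across the allocation might look like this (task counts are illustrative for 32-core Haswell nodes, and ./my_app is a placeholder executable):

```shell
# One task per core across 32 Haswell nodes (illustrative; ./my_app is a placeholder)
srun -N 32 -n 1024 ./my_app
```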

You can run only full node jobs in the debug QOS; sub-node jobs like those in the shared queue are not possible.