
Local temporary file system in memory

The memory of compute nodes can also be used to store data through a file-system-like interface. Every Linux OS mounts part of the system memory under the path /dev/shm/, where any user can write, similarly to the /tmp/ directory. This kind of file system can help when the same files are accessed multiple times or when dealing with very small files, which are usually troublesome for parallel file systems.

Warning

Writing to /dev/shm reduces the memory available to the OS and may cause the compute node to go Out Of Memory (OOM), which will kill your processes, interrupt your job, and/or crash the compute node itself.

Each compute node architecture at NERSC ships with a different memory layout, so the advice is to first inspect the memory each architecture reserves for /dev/shm in an interactive session using df -h /dev/shm (by default the storage space reserved for tmpfs is half the installed physical RAM). Note that /dev/shm is a file system local to each node, so no shared file access is possible across multiple nodes of a job.
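For example (a minimal sketch: the interactive QOS shown here and any account or constraint options depend on your allocation, so adjust them as needed):

salloc --nodes 1 --qos interactive --time 00:10:00   # request a short interactive allocation
df -h /dev/shm                                       # size and current usage of the tmpfs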

Since data is purged after every job completes, users cannot expect their data to persist across jobs: they have to manually stage their data into /dev/shm at the start of every execution and stage results out before the job ends. If this data movement involves many small files, the best approach is to create an archive containing all the files beforehand (e.g. on a DTN node, to avoid wasting precious compute time), and then, inside the job, extract the data from the archive into /dev/shm: this minimizes the number of accesses to small files on the parallel file systems and produces large contiguous file accesses instead.

Example stage-in

For example, let's assume several small input files are needed to bootstrap your jobs, and that they are stored in your scratch directory at $SCRATCH/files/. Here's how you could produce a compressed archive input.tar.gz (note that the $SCRATCH variable is not expanded on the DTN nodes, hence the explicit path below):

ssh dtn03.nersc.gov

cd /global/cscratch1/sd/$USER/
tar -czf input.tar.gz files/

Now you can unarchive it in /dev/shm when inside a job.

Note that there may already be some system directories in /dev/shm which could cause your process to misbehave: for this reason you may want to create a subdirectory of your own and unarchive your files there from inside your job:

mkdir /dev/shm/$USER
tar -C /dev/shm/$USER -xf $SCRATCH/input.tar.gz

Example stage-out

A similar approach to the stage-in needs to be taken before the job completes, in order to save important files created by the job. For example, if a job created files in /dev/shm/$USER/, we may want to archive and compress them into a single file with:

cd /dev/shm/$USER/
tar -czf $SCRATCH/output_collection/output.tar.gz .
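For reference, the stage-in and stage-out steps can be combined in a single-node batch script along the lines of the following sketch; ./my_app and its command-line options are placeholders for your actual application, and the #SBATCH options are omitted as in the other examples:

#!/bin/bash
#SBATCH ...  # here go all the slurm configuration options of your application
set -e       # Exit on first error

# Stage in: extract the pre-built archive into a private subdirectory of /dev/shm
mkdir -p /dev/shm/$USER/results
tar -C /dev/shm/$USER -xf $SCRATCH/input.tar.gz

# Run the application (placeholder), reading and writing inside /dev/shm
./my_app --input /dev/shm/$USER/files --output /dev/shm/$USER/results

# Stage out: archive the results back to the parallel file system before the job ends
mkdir -p $SCRATCH/output_collection
tar -czf $SCRATCH/output_collection/output.tar.gz -C /dev/shm/$USER/results .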

MPI example jobs

When dealing with multiple nodes using MPI, only one process per node should create directories or archives, to avoid collisions or data corruption.

The following example creates a directory /dev/shm/$SLURM_JOBID on each node, runs a mock application that generates multiple files in /dev/shm/$SLURM_JOBID/ and finally creates a tarball archive of the data from each node, storing it in $SCRATCH/outputs/:

#!/bin/bash
#SBATCH ...  # here go all the slurm configuration options of your application
set -e       # Exit on first error

export OUTDIR="$SCRATCH/outputs/$SLURM_JOBID"
export LOCALDIR="/dev/shm/$SLURM_JOBID"
export CLEANDIR="$SCRATCH/cleaned_outputs/"
mkdir -p "$OUTDIR" "$CLEANDIR"

# Create the local directory in /dev/shm, using one process per node
srun --ntasks $SLURM_NNODES --ntasks-per-node 1 mkdir -p "$LOCALDIR"

# The following is just an example: each process creates one small file, named $LOCALDIR/$RANDOM
# Substitute with your application, and make it create files in $LOCALDIR
srun bash -c 'hostname >$LOCALDIR/$RANDOM'

# And finally send one "collecting" process to archive all local directories into separate archives
# We have to use 'bash -c' because 'hostname' needs to be interpreted on each node separately
srun --ntasks $SLURM_NNODES --ntasks-per-node 1 bash -c 'tar -cf "$OUTDIR/output_$(hostname).tar" -C "$LOCALDIR" .'

You may also want to concatenate these archives into a single one for easier analysis (note that only uncompressed archives can be concatenated). To do so you can add this line after the last 'srun' above:

cd "$OUTDIR" && tar -Af "$(/usr/bin/ls -1 *.tar | head -1)" $(/usr/bin/ls -1 *.tar | tail -n +2) && cp -a "$(/usr/bin/ls -1 *.tar | head -1)" "$CLEANDIR/$SLURM_JOBID.tar"

The line above will make all the nodes of your job wait for this single process, therefore "wasting" compute hours.

Alternatively you can run the aggregation in a separate job (e.g. a shared job using a single core) or manually on the data transfer nodes, so as not to waste compute resources, especially if the archives are large. Here's a separate "aggregator" script:

#!/bin/bash
#SBATCH ...  # here go the slurm configuration options of the aggregation job (e.g. a single core in a shared QOS)
set -e       # Exit on first error

# Get the directory containing the archives to be merged (passed as the only argument)
if [[ $# -ne 1 ]]; then echo "Error. Missing input arg: DIRECTORY" >&2; exit 1; fi
cd "$1"

export CLEANDIR="$SCRATCH/cleaned_outputs/"
mkdir -p "$CLEANDIR"

# Concatenate all *.tar archives into a single one using the name of the
# current dir (the job id) as the new name.
cat *.tar > "$CLEANDIR/$(basename "$PWD").tar"

This "aggregator" job should be submitted after the first script has completed, or you can use a for loop to iterate over all the output directories, like this:

for d in $SCRATCH/outputs/*; do sbatch aggregator.slurm "$d"; done
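Since the aggregated archive is produced by concatenating tar files with cat, it contains multiple end-of-archive markers, and GNU tar needs the -i/--ignore-zeros option to read past them when listing or extracting (archives merged with tar -A, as in the previous section, do not need this). For example, with <jobid> standing for the job whose outputs were merged and <destination_dir> for an extraction directory of your choice:

tar --ignore-zeros -tvf $SCRATCH/cleaned_outputs/<jobid>.tar                       # list the contents
tar --ignore-zeros -xf $SCRATCH/cleaned_outputs/<jobid>.tar -C <destination_dir>   # extract

You can also let Slurm handle the ordering by submitting the aggregator with a job dependency, so that it starts only after the compute job has completed successfully:

sbatch --dependency=afterok:<jobid> aggregator.slurm "$SCRATCH/outputs/<jobid>"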

Final notes

Users need to pay attention to the memory usage on the node: storing too much data in a memory-backed tmpfs may force the kernel to kill running processes and/or cause the node to crash if not enough memory is left available.
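A simple way to keep an eye on this from inside a job is to periodically check the free memory and the tmpfs usage on each node, for example with one process per node (a minimal sketch, following the same pattern as the MPI example above):

srun --ntasks $SLURM_NNODES --ntasks-per-node 1 bash -c 'echo "== $(hostname)"; free -h; df -h /dev/shm'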

Also important to note is that /dev/shm, being volatile memory, does not offer any fault tolerance: a node crash will cause the data to be lost. See also our documentation on Checkpointing for possible solutions.

If you're creating large archives (on the order of a GB or larger), please consider striping the scratch directory where you will create the archive.
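As a minimal sketch using the standard Lustre tools (the stripe count below is purely illustrative; see the NERSC file system documentation for the recommended settings):

lfs setstripe -c 8 $SCRATCH/output_collection   # new files created in this directory will be striped across 8 OSTs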

A similar solution is to use temporary XFS file systems on top of Lustre when using Shifter containers.