Checkpointing is the action of saving the state of a running process to a checkpoint image file. The process can later be restarted from the checkpoint file and continue from where it left off, possibly on a different computer.
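The save-and-resume idea can be illustrated with a minimal application-level sketch. It is only an analogy: transparent tools such as DMTCP capture the whole process image without changes to the application, whereas this example saves one small dictionary by hand. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the file format, and the toy workload are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, stop_after=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    `stop_after` simulates an interruption (for example, hitting a
    walltime limit) after that many steps in the current run.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return state  # simulated interruption
        state["total"] += state["step"]
        state["step"] += 1
        steps_this_run += 1
        save_checkpoint(path, state)
    return state
```

Calling `run` a second time with the same checkpoint path picks up at the saved step instead of starting over, which is the behavior a C/R tool provides for an entire process.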
Checkpoint/Restart (C/R) is critical to fault-tolerant computing, and is especially desirable for HPC centers like NERSC. From the user's perspective, C/R enables jobs to run longer than the walltime limit (e.g., 48 hours on Cori), and improves job throughput by splitting a long-running job into multiple shorter ones that better exploit holes in the job schedule created by Slurm. From NERSC's perspective, it offers flexibility in scheduling jobs and system maintenance, enables preemption of running jobs in favor of time-sensitive ones (e.g., real-time data processing for experimental facilities), and allows better backfill when draining the system for large jobs, increasing system utilization.
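As a rough illustration of the walltime argument, the number of job segments a long computation needs can be estimated as below. The 48-hour limit comes from the text above; the 120-hour workload, the per-segment overhead parameter, and the function name `segments_needed` are assumptions made for this sketch.

```python
import math


def segments_needed(total_hours, walltime_limit_hours, overhead_hours=0.0):
    """Estimate how many checkpointed segments a computation needs.

    `overhead_hours` is a hypothetical per-segment cost for writing the
    checkpoint and restarting; it reduces the useful time in each segment.
    """
    usable = walltime_limit_hours - overhead_hours
    if usable <= 0:
        raise ValueError("overhead must be smaller than the walltime limit")
    return math.ceil(total_hours / usable)


# A hypothetical 120-hour computation under a 48-hour walltime limit.
print(segments_needed(120, 48))
```

Shorter segments are also easier for the scheduler to place in backfill gaps, at the cost of more checkpoint and restart overhead in total.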
Creating a C/R tool for HPC applications that is transparent to users, however, is challenging. It requires extensive development and maintenance effort because HPC systems change constantly and production workloads are diverse and run at all scales. MPI support is especially challenging: the combination of MPI implementations (e.g., MPICH, Open MPI, Cray MPICH) and networks (e.g., TCP/IP, InfiniBand, Cray Aries) could require maintaining multiple versions of the C/R code. In addition, to enable transparent checkpointing/restarting for users, C/R tools often require cooperation among MPI, OS kernel, and batch system developers, which has proven hard to sustain over time. As a result, there are no ready-to-use C/R tools for users who work with cutting-edge HPC computers, which often deploy new networks and hardware.
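The growth in maintenance burden from combining MPI implementations with networks can be made concrete with a small enumeration. The lists hold only the examples named above. Treating every pairing as a separate variant is a simplifying assumption, since not every MPI implementation runs on every network in practice.

```python
from itertools import product

# Examples named in the text; real deployments support only some pairings.
mpi_implementations = ["MPICH", "Open MPI", "Cray MPICH"]
networks = ["TCP/IP", "InfiniBand", "Cray Aries"]

# Worst case: one C/R code variant per (MPI implementation, network) pair.
variants = list(product(mpi_implementations, networks))

for mpi, network in variants:
    print(f"{mpi} over {network}")

print(f"{len(mpi_implementations)} x {len(networks)} = {len(variants)} potential variants")
```

Adding one new network to the list adds a whole row of pairings, which is why a tool that does not depend on either dimension is attractive.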
Distributed MultiThreaded Checkpointing (DMTCP) takes a different approach and lives completely in user space. No OS kernel modifications or hooks into MPI libraries are required. MANA (MPI-Agnostic Network-Agnostic Transparent Checkpointing), a new implementation of DMTCP for MPI, addresses the MxN maintenance problem (M MPI implementations times N networks) and has been shown to scale to a large number of processes. Although MANA may still need separate code bases to be developed and maintained for emerging hardware, it is a huge step toward ready-to-use C/R tools on future HPC platforms!
Both DMTCP and MANA are available on Cori. You are encouraged to checkpoint/restart your MPI jobs with MANA. If you run serial or threaded applications, we recommend that you use DMTCP (the traditional implementation, which does not support Cray MPICH over the Aries network) to checkpoint your jobs. See the MANA and DMTCP pages for more information about using them at NERSC.
NERSC has been collaborating closely with the DMTCP/MANA team to make DMTCP/MANA work reliably with production workloads at NERSC. We are also currently working to enable MANA on our next flagship system, Perlmutter. Please report any issues you encounter with DMTCP/MANA to the NERSC Help Desk.