Checkpointing is the action of saving the state of a running process to a checkpoint image file. The process can later be restarted from the checkpoint file and continue from where it left off, possibly on a different computer.
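To make the idea concrete, here is a minimal sketch of checkpoint/restart done by hand inside a Python program. It is an application-level illustration only, not how a transparent C/R tool works: a transparent tool saves the state of the whole process without changes to the application. The function names, the `state` dictionary, and the checkpoint path are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to `path` atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, path, max_steps_this_run):
    """Do at most `max_steps_this_run` steps of work, then checkpoint.

    The "work" is a toy running sum; `max_steps_this_run` stands in for
    a walltime limit that cuts the job short.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    steps_done = 0
    while state["step"] < total_steps and steps_done < max_steps_this_run:
        state["total"] += state["step"]
        state["step"] += 1
        steps_done += 1
    save_checkpoint(state, path)
    return state
```

Calling `run` repeatedly with the same checkpoint path resumes from the saved step each time, so several short runs produce the same final state as one uninterrupted run.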
Checkpoint/Restart (C/R) is critical in fault-tolerant computing, and is especially desirable for HPC computing centers like NERSC. From the user's perspective, C/R enables jobs to run longer than the walltime limit (e.g., 48 hours on Cori), and improves job throughput by splitting a long-running job into multiple shorter ones that can better exploit holes in the Slurm schedule. From NERSC's perspective, it offers flexibility in scheduling jobs and system maintenance, enables preemption for time-sensitive jobs (e.g., real-time data processing for experimental facilities), and allows better backfill when draining the system for large jobs, increasing system utilization.
Creating a C/R tool for HPC applications that is transparent to users, however, is challenging and requires extensive development and maintenance effort. MPI support is especially difficult: each combination of MPI implementation (e.g., MPICH, OpenMPI, Cray MPICH) and network (e.g., TCP/IP, InfiniBand, Cray Aries) requires maintaining a separate version of the code (the MxN problem). As a result, there are often no ready-to-use C/R tools for users who work on cutting-edge HPC systems, which frequently deploy new networks. In addition, to make checkpointing and restarting transparent to users, C/R tools often require cooperation among MPI, OS kernel, and batch system developers, which has proven hard to sustain over time.
Distributed MultiThreaded Checkpointing (DMTCP) takes a different approach and lives completely in user space. No OS kernel modifications or hooks into MPI libraries are required. A new implementation of DMTCP, MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing, addresses the MxN maintainability issue for MPI and has been shown to scale to a large number of processes. This version will be available to NERSC users soon (target date: Feb 2020).