Checkpoint / Restart¶
Checkpointing is the act of saving the state of a running process to a checkpoint image file. The process can later be restarted from that image, resuming execution from where it left off, even on a different computer.
Checkpoint/Restart (C/R) is critical to fault-tolerant computing, and is especially desirable at HPC centers like NERSC. From the user's perspective, C/R enables jobs to run longer than the walltime limit (e.g., 48 hours on Cori), and improves job throughput: a long-running job can be split into multiple shorter ones that better exploit holes in the schedule created by Slurm. From NERSC's perspective, C/R offers flexibility in scheduling jobs and system maintenance, enables preemption of running jobs in favor of time-sensitive ones (e.g., real-time data processing for experimental facilities), and allows better backfill when draining the system for large jobs, increasing overall system utilization.
Creating a C/R tool that is transparent to users of HPC applications, however, is challenging, requiring extensive development and maintenance effort. MPI support is especially difficult: the combinations of MPI implementations (e.g., MPICH, Open MPI, Cray MPICH) and networks (e.g., TCP/IP, InfiniBand, Cray Aries) can require maintaining many versions of the code (the MxN problem). In addition, transparent checkpoint/restart often requires cooperation among MPI, OS kernel, and batch-system developers, which has proven hard to sustain over time. As a result, there are no ready-to-use C/R tools for users of cutting-edge HPC systems, which often deploy new networks.
Distributed MultiThreaded Checkpointing (DMTCP) takes a different approach and lives entirely in user space: no OS kernel modifications or hooks into MPI libraries are required. A new implementation built on DMTCP, MANA (MPI-Agnostic Network-Agnostic Transparent Checkpointing), addresses the MxN maintainability problem for MPI and has been shown to scale to large numbers of processes. This version is nearly complete and will be available to NERSC users soon.
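To make the user-space workflow concrete, a minimal DMTCP session might look like the following sketch. The application name `./my_app` and the port number are placeholders, and exact option spellings can vary between DMTCP versions; check `dmtcp_launch --help` on your system.

```shell
# Illustrative DMTCP workflow (assumes DMTCP is installed; no kernel
# modules or MPI-library changes are needed).

# 1. Start a coordinator, which tracks processes and relays checkpoint
#    requests (runs in the background on the given port):
dmtcp_coordinator --daemon --coord-port 7779

# 2. Launch the application under DMTCP's control:
dmtcp_launch --coord-port 7779 ./my_app

# 3. From another shell, request a checkpoint; image files
#    (ckpt_*.dmtcp) are written to the working directory:
dmtcp_command --coord-port 7779 --checkpoint

# 4. Later (possibly in a new batch job), restart from the images using
#    the convenience script DMTCP generates at checkpoint time:
./dmtcp_restart_script.sh
```

In a batch setting, steps 3 and 4 are typically scripted: the job requests a checkpoint shortly before its walltime expires and resubmits itself to restart from the latest images.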