Workflow Management Tools¶
Supporting data-centric science involves the movement of data, multi-stage processing, and visualization at scales where manual control becomes prohibitive and automation is needed. Workflow technologies can improve the productivity and efficiency of data-centric science by orchestrating and automating these steps.
Let us help you find the right tool!
Do you have questions about how to choose the right workflow tool for your application? Are you unsure about which tools will work on NERSC systems? Please open a ticket at help.nersc.gov, explain you would like help choosing a workflow tool, and your ticket will be routed to experts who can help you.
A NERSC working group review and refresh of this content is currently in progress; we are actively updating these docs pages with new information as we evaluate new tools. In the meantime we request the following of users considering workflow management solutions:
- Before you begin developing a codebase which requires a particular workflow manager, please contact NERSC consultants via help.nersc.gov to confirm it can be effectively used at NERSC. Some tools have infrastructure needs or operate in a manner which is fundamentally incompatible with NERSC systems and we'd like to protect users from wasting effort if we can.
- Please do not write your own workflow manager. More than 200 such solutions already exist and almost certainly one of them can be found which will fit your needs and our infrastructure.
- Please don't do this:

      for i in $(seq 1 10000); do
          srun -n 1 a.out
      done

  Launching thousands of sruns in a short period of time really stresses our Slurm scheduler. It will degrade not only your own job's performance but also performance for all other NERSC users. If this is what your application needs, please consider a workflow tool. This is what they were designed to do!
GNU Parallel is a shell tool for executing commands in parallel and in sequence on a single node. Parallel is a very usable and effective tool for running High Throughput Computing workloads without data dependencies at NERSC. Following simple Slurm command patterns allows parallel to scale up to running tasks in job allocations with multiple nodes.
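As a rough sketch of the pattern, the tasks can be written to a plain list, one command per line, which GNU Parallel then works through inside a single job. The names `tasks.txt`, `a.out`, and the `input_N.dat` files are placeholders, and the job count of 8 is an arbitrary choice for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names: tasks.txt, a.out, and input_N.dat are placeholders.
task_file="tasks.txt"
n_tasks=100

# Write one independent command per line instead of launching one srun per task.
: > "$task_file"
for i in $(seq 1 "$n_tasks"); do
    echo "./a.out input_${i}.dat > output_${i}.log" >> "$task_file"
done

echo "Wrote $(grep -c '' "$task_file") tasks to $task_file"

# Inside a single-node batch job, GNU Parallel could then run the list,
# keeping at most 8 tasks active at a time (8 is an assumption):
#   parallel --jobs 8 < tasks.txt
```

The script only builds the task list; the `parallel` invocation is left as a comment so the sketch can be tried on a machine where GNU Parallel is not installed.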
TaskFarmer is a utility developed at NERSC to distribute single-node tasks across a set of compute nodes - these can be single- or multi-core tasks. TaskFarmer tracks which tasks have completed successfully, and allows straightforward re-submission of failed or un-run jobs from a checkpoint file.
FireWorks is a free, open-source code for defining, managing, and executing scientific workflows. It can be used to automate calculations over arbitrary computing resources, including those that have a queueing system. Some features that distinguish FireWorks are dynamic workflows, failure-detection routines, and built-in tools and execution modes for running high-throughput computations at large computing centers. It uses a centralized server model, where the server manages the workflows and workers run the jobs.
Papermill is a tool that allows users to run Jupyter notebooks 1) via the command line and 2) in an easily parameterizable way. Papermill is best suited for Jupyter users who would like to run the same notebook with different input values.
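A parameter sweep over one notebook might look like the dry-run sketch below. The notebook name `analysis.ipynb` and the parameter `temperature` are hypothetical; the sketch relies only on Papermill's `-p NAME VALUE` option and its input/output notebook arguments, and it prints the commands rather than executing them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical notebook and parameter names, for illustration only.
notebook="analysis.ipynb"
param_name="temperature"

# Print the papermill command for one parameter value.
# Drop the leading "echo" to execute the command instead of printing it.
papermill_cmd() {
    local value=$1
    echo papermill "$notebook" "analysis_${param_name}_${value}.ipynb" -p "$param_name" "$value"
}

# One output notebook per input value.
for value in 100 200 300; do
    papermill_cmd "$value"
done
```

Writing each run to its own output notebook keeps the executed results for every parameter value side by side.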
Parsl is a Python library for programming and executing data-oriented workflows in parallel. It lets you express complicated workflows with task and data dependencies in a single Python script. Parsl is made with HPC in mind, scales well, and runs on many HPC platforms. Under the hood, Parsl uses a driver or master process to orchestrate the work. Data and tasks are serialized and communicated bidirectionally with worker processes over ZeroMQ sockets. The workers are organized in worker pools and launched on the compute infrastructure.
Snakemake is a tool that combines the power of Python with shell scripting. It allows users to define workflows with complex dependencies; users can easily visualize the job dependency graph and track which tasks have completed and which are still pending. Snakemake works best at NERSC for single-node jobs.
Cron jobs have long been the basic building block of most workflows. On Perlmutter, cron jobs have been replaced with scrontab, which runs jobs at your chosen periodicity via our Slurm batch system. This combines the functionality of cron with the resiliency of the batch system. Jobs are run on a pool of nodes, so unlike with regular cron, a single node going down won't keep your scrontab job from running. You can also find and modify your scrontab job on any login node.

You can edit your scrontab script with scrontab -e. Once you save your script, it will automatically be scheduled by the batch system. By default, vi is the editor for scrontab; if you desire a different editor, you can set the EDITOR environment variable before running the command.

You can view your existing scripts with scrontab -l.
Example Scrontab Script¶
Each script should include traditional Slurm flags like -t. Here's an example scrontab script that will run every three hours (note the #SCRON --open-mode=append line, which tells Slurm to append any new output to the output file):

    #SCRON -p cron
    #SCRON -A <account>_g
    #SCRON -t 00:30:00
    #SCRON -o output-%j.out
    #SCRON --open-mode=append
    0 */3 * * * <full_path_to_your_script>
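For illustration, the script named on the schedule line could be as small as the sketch below. Everything in it is hypothetical: the `log_heartbeat` function, the `heartbeat.log` file name, and the fallback to the current directory when `SCRATCH` is unset. It simply appends a UTC timestamp each time it runs, which makes it easy to confirm that the schedule is firing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical log location; falls back to the current directory if SCRATCH is unset.
log_file="${SCRATCH:-.}/heartbeat.log"

# Append one line with the current UTC time and the Slurm job ID (if any).
log_heartbeat() {
    local stamp
    stamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "${stamp} job=${SLURM_JOB_ID:-none}" >> "$log_file"
}

log_heartbeat
```

Because the log is opened in append mode, repeated runs accumulate one line each, mirroring what `--open-mode=append` does for the job's own output file.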
Scrontab times are in UTC
Currently scrontab times on Perlmutter are in UTC.
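To see what a UTC schedule means in local time, the run hours can be shifted by a fixed offset. The sketch below assumes a local zone of UTC-8 (Pacific Standard Time) and ignores daylight saving time; `utc_offset` and `utc_to_local_hour` are illustrative names, not part of scrontab.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: local time is UTC-8 (Pacific Standard Time); adjust for your zone.
utc_offset=-8

# Convert an hour of the day in UTC to the corresponding local hour (0-23).
utc_to_local_hour() {
    local utc_hour=$1
    echo $(( (utc_hour + utc_offset + 24) % 24 ))
}

# A schedule of "0 */3 * * *" fires at minute 0 of every hour divisible by 3.
for utc_hour in $(seq 0 3 21); do
    printf '%02d:00 UTC -> %02d:00 local\n' "$utc_hour" "$(utc_to_local_hour "$utc_hour")"
done
```

With this offset, the 00:00 UTC run lands at 16:00 local time on the previous calendar day, which is worth keeping in mind when choosing the hour field.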
Monitoring Your Scrontab Jobs¶
You can monitor your scrontab jobs with

    squeue --me -p cron -O JobID,EligibleTime

This will show the next time the batch system will run your job. If the scrontab job is set to repeat, the system will automatically reschedule the next job. Additionally, if you modify your scrontab job, Slurm will automatically cancel the old job and resubmit a new one.
Other Workflow Tools¶
If you find that these tools don't meet your needs, you can check out some of the other workflow tools we currently support.