# Case studies for Superfacility
Here we document some of the success stories from our science engagements in the Superfacility project. A comprehensive report on that project was published at https://arxiv.org/abs/2206.11992. Further information about the Superfacility project is provided at https://www.nersc.gov/research-and-development/superfacility/, where you will also find demos and publications.
We are reposting these case studies about our science engagements to provide easy cross-references to the rest of NERSC's documentation and to go into more technical detail than the report. If your workflow or science case resembles any of the case studies described here, feel free to reach out (via ticket) to request further information and guidance.
## Advanced Light Source
The Advanced Light Source (ALS), a synchrotron radiation facility situated at LBNL, is one of DOE’s five large light sources. It comprises about 40 beamlines with numerous experimental endstations, where scientists from around the world (“users”) conduct research in a wide variety of fields, including materials science, biology, chemistry, physics, and the environmental sciences. The ALS serves roughly 2,000 users per year. Like other light sources, it faces the challenge that current and future upgrades to its storage rings will vastly increase the amount of data generated. This flood of data will make local data storage and computing infeasible in the near future.
ALS became a partner in the Superfacility project to address this challenge, with a focus on:
- GPU-enabled analysis code via NESAP
- Modernizing data management, movement, access and archiving, including use of Spin and Federated ID
- Using HPC for near-real-time feedback for their experiments, including interactive data analysis via Jupyter and resilience to operate when NERSC is unavailable
- Empowering users to independently analyze their data even after their experiments are over (hand-off).
Key Superfacility needs: NESAP, Policies, Jupyter, Scheduling, Resiliency, Federated ID, API, Spin, Self-managed Systems, Data movement, Data management.
The ALS has deployed a number of development and production projects in Spin. First, a data portal, https://dataportal.als.lbl.gov/, was deployed alongside databases, an app server, and several other services supporting workflows for ingesting ALS data. For example, these services access tomography data that is moved from the ALS microtomography beamline to NERSC's community file system (CFS). This data movement service, in turn, leverages improvements to NERSC's Globus infrastructure that allow writing to NERSC's file systems using a collaboration (i.e. "machine") account. Once the data lands on CFS, ALS users can simply search and browse their data in the portal based on its metadata.
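To give a flavor of this pattern, the following is a minimal sketch (not ALS's actual code) of an automated transfer driven by a collaboration account, using the `globus-sdk` Python package. The client credentials, endpoint UUIDs, and paths are placeholders for illustration only.

```python
import globus_sdk

# Placeholder values: a real deployment would use the collaboration
# account's registered Globus app credentials and the actual
# collection UUIDs for the beamline data server and NERSC.
CLIENT_ID = "..."       # Globus app registered for the collaboration account
CLIENT_SECRET = "..."   # kept in a secret store, never hard-coded
ALS_ENDPOINT = "..."    # beamline data transfer collection UUID (hypothetical)
NERSC_ENDPOINT = "..."  # NERSC DTN / CFS collection UUID (hypothetical)

# Authenticate as the collaboration ("machine") account rather than as an
# individual user, so transfers keep working between experiments.
auth = globus_sdk.ConfidentialAppAuthClient(CLIENT_ID, CLIENT_SECRET)
tokens = auth.oauth2_client_credentials_tokens(
    requested_scopes=globus_sdk.scopes.TransferScopes.all
)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Copy a freshly collected tomography scan into the project's CFS space.
tdata = globus_sdk.TransferData(
    tc, ALS_ENDPOINT, NERSC_ENDPOINT, label="ALS microtomography ingest"
)
tdata.add_item(
    "/raw/20230101_scan_0001/",                     # hypothetical beamline path
    "/global/cfs/cdirs/myproject/als/scan_0001/",   # hypothetical CFS path
    recursive=True,
)
result = tc.submit_transfer(tdata)
print("Globus task:", result["task_id"])
```

In practice, services like this run continuously in Spin and trigger such transfers as new scans appear at the beamline.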
Second, a service, http://alsshare.lbl.gov, was launched that streamlines data sharing based on a NERSC Globus share endpoint and integrates with the ALS user portal, ALSHub.
The ALS Share service provides a workflow for both beamline scientists and beamline users. The user simply registers their ORCID with Globus, while the beamline scientist creates a Globus share that is then automatically populated with the collaborators/users of that experiment by pulling the matching data from ALSHub. This service allows data to be shared even if the ALS beamline user does not have any NERSC credentials. More details (including the workflow schematic) can be found in this publication.
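The sketch below illustrates (with `globus-sdk`, and again not ALS's actual code) how a share's access list could be populated programmatically for an ORCID-linked Globus identity. The share UUID, ORCID, and path are hypothetical, and the clients are assumed to be authenticated as the service, as in the previous sketch.

```python
import globus_sdk

# Placeholders: the share UUID would come from the ALS Share service, and
# the ORCID from ALSHub's record of the experiment's collaborators.
SHARE_ENDPOINT = "..."                      # Globus shared endpoint UUID (hypothetical)
COLLABORATOR_ORCID = "0000-0000-0000-0000"  # hypothetical ORCID iD


def grant_read_access(tc: globus_sdk.TransferClient,
                      ac: globus_sdk.AuthClient,
                      orcid: str, path: str = "/") -> None:
    """Give the ORCID-linked Globus identity read access to the share."""
    # Resolve the ORCID login to a Globus identity ID.
    identities = ac.get_identities(usernames=f"{orcid}@orcid.org")["identities"]
    if not identities:
        raise RuntimeError(f"No Globus identity found for ORCID {orcid}")
    identity_id = identities[0]["id"]

    # Add a read-only ACL rule on the shared endpoint for that identity.
    tc.add_endpoint_acl_rule(
        SHARE_ENDPOINT,
        {
            "DATA_TYPE": "access",
            "principal_type": "identity",
            "principal": identity_id,
            "path": path,
            "permissions": "r",
        },
    )
```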
Third, ALS created a project called “AI/ML for Multi-Modal (AIMM)” in collaboration with BNL and ANL that supports data access and data labeling/tagging services on Spin.
### Future plans
With some beamlines at the ALS now automatically transferring data sets to NERSC as they are collected, an upcoming development is to set up an ALS Share directory automatically in advance of data collection; new data sets will be routed to that directory so that a user and the assigned collaborators can access the data very soon after it is collected. Furthermore, ALS was a key engagement in developing the functionality of the Superfacility API and is currently incorporating the API into its services to kick off standardized computing jobs (or other workloads) once data has reached NERSC file systems. Finally, ALS envisions that all of its users will be able to repeat at NERSC the same analysis that they used during an experiment; ALS intends to use customized Jupyter notebooks and even a customized JupyterLab environment for that purpose.
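As a rough illustration of this API-driven pattern, the sketch below submits a pre-staged analysis script through the Superfacility API once data has landed. The endpoint paths, machine name, and script location are assumptions based on the public SF API documentation at https://api.nersc.gov; a real client would first obtain an OAuth access token for its registered SF API client.

```python
import requests

SFAPI = "https://api.nersc.gov/api/v1.2"   # Superfacility API base URL
TOKEN = "..."                              # OAuth access token for an SF API client
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Hypothetical standardized analysis script already staged at NERSC.
JOB_SCRIPT = "/global/cfs/cdirs/myproject/als/bin/reconstruct.sh"


def submit_standard_job(machine: str = "perlmutter") -> str:
    """Submit the staged batch script and return the SF API task ID."""
    resp = requests.post(
        f"{SFAPI}/compute/jobs/{machine}",
        headers=HEADERS,
        data={"job": JOB_SCRIPT, "isPath": "true"},
    )
    resp.raise_for_status()
    return resp.json()["task_id"]


def task_status(task_id: str) -> dict:
    """Poll the asynchronous task created by the submission."""
    resp = requests.get(f"{SFAPI}/tasks/{task_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    print(task_status(submit_standard_job()))
```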
## National Center for Electron Microscopy
The National Center for Electron Microscopy (NCEM) facility within the Molecular Foundry at Berkeley Lab recently installed its 4D Camera, which outputs data at 480 Gbit/s, producing single data sets of 700 GB acquired in about 15 seconds. These are orders of magnitude larger than current data set sizes at the center, and analyzing or storing these data was difficult or impossible using local resources. The Superfacility capabilities implemented by NERSC provided a way for this user center to utilize HPC resources for a data-reduction pipeline for this camera.
NCEM’s requirements are based around enabling near-real-time analysis of large datasets:
- In early stages, NCEM was streaming datasets directly to compute node memory, using software-defined networking (SDN) and an extension of the NERSC network directly to the NCEM instrument. This was a valuable experiment, but ultimately an unsustainable option from the security perspective.
- Now, using SDN, NCEM is transferring datasets directly to the Cori burst buffer (SSD storage layer) for analysis by compute nodes.
- Automation of data movement and management via the API
- Subsequent analysis of datasets via Jupyter notebooks with specialized HPC backends.
Key Superfacility needs: Policies, Jupyter, Scheduling, Resiliency, Federated ID, API, Spin, SDN, Data movement.
NCEM’s 4D Camera produces so much data that typical workstations can no longer be used for storage and analysis. NCEM has worked with the Superfacility project to provide direct data reduction and analysis support to users at the microscope. Using a 100 Gbit fiber connection between the detector acquisition system and NERSC, they were able to cut data reduction time in half (from 8 minutes to 4 minutes). This provides near-real-time feedback and also frees up local system resources to acquire more data than ever before. Further, NCEM utilized Spin and other resources to provide a convenient web application frontend that captures metadata and provides live feedback to the user at the microscope. The distiller app leverages the NERSC Superfacility API to submit and monitor data reduction jobs on the real-time queue. NCEM’s workflow is unique in that data is pulled directly from NCEM’s data server into the compute allocation, which allows it to capitalize on fast job-local data storage solutions such as Cori's DataWarp and Perlmutter’s all-flash scratch file system. On Cori, NCEM's workflow required the ability to allocate load-balanced compute nodes on the Aries interconnect fabric to optimize data transfer paths to the compute nodes.
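To make the job-local storage idea concrete, here is a hedged sketch of what a Cori-era data-reduction batch script with a DataWarp (burst buffer) allocation on the real-time queue might look like. The reduction commands, capacity, and account name are placeholders, not NCEM's actual tooling; a distiller-style service would pass such a script inline to the Superfacility API (with `isPath` set to false) as in the earlier sketch.

```python
# Hypothetical batch script for a Cori real-time data-reduction job.
# The #DW directive requests a job-scoped DataWarp (burst buffer)
# allocation; Slurm exposes its mount point as $DW_JOB_STRIPED.
REDUCTION_SCRIPT = """#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=4
#SBATCH --time=00:30:00
#SBATCH --account=myproject
#DW jobdw capacity=1TB access_mode=striped type=scratch

# Pull the raw 4D Camera dataset from NCEM's data server into the
# burst buffer (transfer tool and paths are placeholders).
pull_dataset --source ncem-data-server --dest "$DW_JOB_STRIPED/raw"

# Run the (hypothetical) parallel reduction and write counted data to CFS.
srun reduce_4dcamera "$DW_JOB_STRIPED/raw" \\
     --output /global/cfs/cdirs/myproject/ncem/reduced/
"""

# Submitted through the Superfacility API much like the ALS sketch above,
# but with the script passed inline rather than as a path:
#   requests.post(f"{SFAPI}/compute/jobs/cori", headers=HEADERS,
#                 data={"job": REDUCTION_SCRIPT, "isPath": "false"})
```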
NCEM also uses Jupyter notebooks to provide interactive data analysis of the reduced data output. Users who previously needed to be familiar with ssh and command line tools are now able to process their data in real time using common workflows deployed on HPC infrastructure. The system can also be used during post-processing.
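For a flavor of that notebook-based analysis, a minimal sketch is shown below; the file path and dataset layout are hypothetical and do not reflect NCEM's actual output format.

```python
# Typical first cell of an analysis notebook at NERSC: open the reduced
# (electron-counted) output from CFS and form a quick virtual image.
# The file name and dataset key are hypothetical.
import h5py
import matplotlib.pyplot as plt

with h5py.File("/global/cfs/cdirs/myproject/ncem/reduced/scan_0001.h5", "r") as f:
    frames = f["electron_counts"][...]   # hypothetical (scan_y, scan_x, ky, kx) array

# Sum over the detector dimensions to get a virtual bright-field image.
bf_image = frames.sum(axis=(-2, -1))
plt.imshow(bf_image, cmap="gray")
plt.title("Virtual bright-field image")
plt.show()
```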
### Future plans
The data reduction step can be further improved by using Perlmutter's GPUs for computation; NCEM plans to utilize them once testing of data retrieval and reduction has been fully completed. A real-time queue for this system will be absolutely necessary. NCEM also plans to implement full workflows from data generation to final output once suitable workflows have been identified. Another goal is to deploy the same workflow for other large data-generation systems at NCEM to better incorporate live processing of 10-100 GB datasets using image processing and AI/ML. A new detector with a ~10x larger data rate is planned for installation in the coming years, requiring even more computation. Finally, NCEM plans to implement an automated data acquisition system that can acquire terabytes of data autonomously, with live processing done at NERSC. The Superfacility project is essential to their future plans and workflows in order to deal with exponentially increasing data generation.