Perlmutter Timeline¶
This page records a brief timeline of significant events and user environment changes on Perlmutter. Please see our current known issues page for a list of known issues on Perlmutter.
October 3, 2024¶
Podman, NERSC’s new container runtime, has been re-enabled on login nodes. Jobs using containers on compute nodes will still need to use Shifter. Additional documentation on porting between podman-hpc and shifter is available.
September 23, 2024¶
Podman, NERSC’s new container runtime, has been temporarily disabled to address a functionality issue. This affects Apptainer as well. In the meantime, please use Shifter as your container runtime.
September 12, 2024¶
- Neo, the file system control software for Perlmutter scratch, was updated to 6.7-022. This is intended to improve system stability.
- Fixed an issue that was preventing podman-hpc jobs from running via Slurm on login nodes.
- Firmware and hardware updates. This is intended to improve system stability.
July 31, 2024¶
- Maximum job walltime limit increased from 24 to 48 hours.
- Upgrade Slurm to version 23.11.
- Changed the underlying storage for the Slurm control daemon and its associated database cluster. This is intended to improve performance.
- NERSC module updates:
- berkeleygw: added 4.0-gpu
- darshan: updated 3.4.4, 3.4.4-cpe-23.03
- e4s: removed 22.11
- forge: added 24.0 (new default), removed 23.0.2
- julia: added 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, updated 1.8.5, 1.9.4, removed 1.9.2, 1.9.3
- mpich: updated 4.2.0
- nvidia: removed 22.2
- nvshmem: updated 2.11.0
- PrgEnv-llvm: added 1.0
- spack: removed e4s-22.11
- sqs: updated 1.0
- vasp: added 6.4.3-cpu, 6.4.3-gpu, updated 5.4.1-cpu, 5.4.4-cpu, 6.3.2-cpu, 6.4.1-cpu, 6.4.2-cpu, 6.2.1-gpu, 6.3.2-gpu, 6.4.1-gpu, 6.4.2-gpu
- vasp-tpc: updated 5.4.4-cpu, 6.3.2-cpu, 6.4.2-cpu
June 26, 2024¶
- Additional 9PB of capacity added to Perlmutter's scratch file system. This capacity will be available to users after testing completes.
- Update Slurm configuration to align with changes in hardware discovery.
- Hardware and configuration work to prepare for future capacity enhancements.
- Hardware and cable replacements. This is intended to increase resiliency.
- NERSC module updates:
- New Features and Improvements:
- Added cudatoolkit 12.4
- Added nvhpc 24.5
- Added PyTorch 2.3.1
- Added NCCL 2.21.5
- Added cuDNN 9.1.0
- Added new version of Codee
- Added Intel 2024.1.0 with NVIDIA GPU support
- Added OpenMPI 5.0.3
- Updates and Changes:
- Updated PyTorch 2.1.0-cu12 dependencies
- Updated TensorFlow install locations and modified 2.12 for conda cudatoolkit
- Moved several cuDNN, NCCL, and PyTorch packages to new paths
- Changed Spack modulefile to version 0.22 and set as default
- Deprecations and Removals:
- Deprecated intel-llvm
- Removed gpu-test module
- Removed TensorFlow 2.6
- Deprecated E4S 22.11
- Deprecated NVIDIA 22.2
- Removed cuDNN 8.2.0
May 15, 2024¶
- GPU jobs using at least 128 nodes and CPU jobs using at least 256 nodes now receive a 50% discount.
- Purged job data older than 6 months from the Slurm accounting DB. This is intended to shrink the accounting DB so that Slurm can be updated in a future maintenance.
- Increase DVS network credits. This is intended to improve metadata performance.
- Adjusted slurm prolog script to avoid start up delays when jobs are submitted from a DVS file system (homes, global common, or CFS).
- Modified cooling system settings to improve resiliency and decrease performance variability.
- Hardware and cable replacements. This is intended to increase resiliency.
- NERSC module updates:
- codee: default changed from 2023.1.7 to 2024.2.2
- darshan: deleted 3.3.1, 3.4.0
- kokkos: added 4.3.00
- llvm: added 18.1.0
- mpich: added 4.2
- nsight-compute: deleted 2022.1.1
- nsight-system: deleted 2022.1.1
- nvshmem: added 2.11
- paraview: added 5.12.0, deleted 5.11.1
- qchem: added 6.0.0
- spack: deleted 0.19.0, 0.19.2
- total-view: added 2024.1.21, deleted 2023.1.6, default now 2024.1.21
- vasp: added 6.4.2, deprecated 6.2.1, 6.3.2, 5.4.1
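As a rough illustration of the large-job discount listed in this entry, the rule can be written as a multiplier on a job's charge. The function name discount_multiplier is hypothetical. How the undiscounted charge is computed, and whether other conditions apply, is not covered here. This is only a sketch of the node-count thresholds stated above.

```shell
# Hypothetical helper: print the charge multiplier implied by the
# May 15, 2024 discount (0.5 = 50% discount, 1.0 = no discount).
discount_multiplier() {
    arch="$1"    # "gpu" or "cpu"
    nodes="$2"   # number of nodes the job uses

    case "$arch" in
        gpu) threshold=128 ;;   # GPU jobs: at least 128 nodes
        cpu) threshold=256 ;;   # CPU jobs: at least 256 nodes
        *)   echo "unknown architecture: $arch" >&2; return 1 ;;
    esac

    if [ "$nodes" -ge "$threshold" ]; then
        echo "0.5"
    else
        echo "1.0"
    fi
}

discount_multiplier gpu 128   # meets the GPU threshold
discount_multiplier cpu 128   # below the CPU threshold
```

The two calls at the end show that the same node count can qualify on GPU nodes but not on CPU nodes.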
May 1, 2024¶
- Rolling reboot, login nodes and the scratch file system remained available
- Adjusted DVS ksocklnd network driver parameters (credits/peer_credits/conns_per_peer) and returned to default CPU Partitions (CPT). This is intended to improve performance for reads and writes for file systems that use DVS (CFS, homes, and global common).
- Rebuilt slurm with a more recent GCC and deployed with a newer container image. This is intended to increase system stability.
April 17, 2024¶
- Changed the DVS service network drivers (from kfilnd to ksocklnd) to avoid a bug that could cause data corruption. This is expected to reduce performance for large scale I/O by as much as a factor of two.
- Changes to the shifter configuration to allow updates without downtime.
- Upgrades to some network configurations to improve stability and better identify components that needed attention.
- Chassis controller and node controller firmware updates. This is intended to fix a number of issues including an issue with energy usage reporting on the nodes.
- NERSC module updates:
- darshan: added 3.4.4 and 3.4.4-cpe-23.03
- arm-forge / forge: added 23.1.2 as the new default
March 20, 2024¶
- Slingshot Host Software upgraded to version 2.1.2. This is intended to improve network performance and stability.
- NERSC module updates:
- hip: deleted 5.4.3
- libxc: deleted 4.3.4 and 6.2.2
- nccl: added 2.19.4 as the new default
- Update to kernel parameters on the DVS nodes. This is intended to improve performance accessing CFS via DVS.
- Numerous replacements to address minor hardware issues that occurred since the last maintenance.
February 22, 2024¶
- Adjusted network settings to decrease effects of transient network interruptions (aka link flaps).
- Update to DVS software to address an issue where nodes were occasionally failing with a “dropped messages” error.
- Adjusted GPFS parameters on the gateway nodes to try to improve performance under heavy load.
- Added kdreg2 memory registration monitor. Users can test this by setting FI_MR_CACHE_MONITOR=kdreg2 at the beginning of their job.
- NERSC module updates:
- codee: default changed from 2023.1.7 to 2024.1.1
- forge: default changed from 23.1 to 23.1.1
- fpm: added 0.10.0
- gsl: deleted 2.7
- lammps: deleted 2022.11.03
- llvm: deleted 17, nightly updated from v16 to v18
- matlab: added R2023b
- matlab-mcr: added R2023b
- PrgEnv-llvm: added 0.5
- qchem: deleted 5.4.2-cpu and 6.0.0-cpu
- tensorflow: added 2.15.0
- valgrind: added 3.22.0
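Testing the kdreg2 memory registration monitor mentioned in this entry amounts to exporting one environment variable before the application starts. A minimal sketch, assuming the variable is set near the top of a job script. The application launch itself is omitted because it depends on the job.

```shell
# Opt in to the kdreg2 memory registration monitor for this job only.
# The variable must be exported before the application is launched
# so that the launched processes inherit it.
export FI_MR_CACHE_MONITOR=kdreg2

# Confirm the setting that child processes will inherit.
echo "FI_MR_CACHE_MONITOR=${FI_MR_CACHE_MONITOR}"

# The application launch (for example, an srun line) would follow here.
```

Because the variable is only exported in the job's own shell, other jobs keep the system default monitor.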
January 30, 2024¶
Rolling reboot:
- Podman, NERSC’s new container runtime, re-enabled.
- NERSC module updates:
- pytorch: 1.13.1 and 2.0.1 updated to work with cudatoolkit/11.7
- tensorflow: 2.6.0 and 2.12.0 updated to work with cudatoolkit/11.7
January 17, 2024¶
- Major changes to user programming environment. We expect most users should not need to re-compile their code, though we recommend testing existing applications in the new environment:
- CPE upgraded to 23.12
- Default cudatoolkit module version updated to cudatoolkit/12.2
- gcc compiler will now be provided by the underlying SLES OS
- Users wishing to access the older programming environment can load the cpe/23.03 module
- NERSC module updates:
- openmpi: added 5.0.0
- llvm: added 17.0.6, removed 17.0.2
- conda: added Miniconda3-py311_23.11.0-2
- python: updated environments for compatibility with cpe/23.12
- Update BIOS on GPU nodes to address issues where nodes crashed during some jobs with a “DPC Containment” error
- Network tuning parameters configured to automatically retune. This is intended to reduce node failures and timeouts during some user jobs.
More details for the above changes can be found in the January 17, 2024 user announcement email.
December 20, 2023¶
- Added 3 PB capacity to the Perlmutter Scratch File System. This increases the total usable capacity to 36 PB.
- COS updated to 2.5.143, intended to improve system stability.
- Slingshot Host Software upgraded to version 2.1.1. This is intended to improve network performance and stability.
- Slurm updated to 23.02.7 and PMIx plugin updated to v4.2.7.
- NERSC module updates:
- forge: add 23.1-linux-x86_64, remove 23.0-linux-x86_64
- tensorflow/2.9.0: update XLA_FLAGS to find libdevice
- totalview: add 2023.4.16, remove 2022.1.11 and 2022.4.27
- qchem: add deprecation notice, to be retired Jan 16 2024.
December 14, 2023¶
Podman, NERSC’s new container runtime, has been temporarily disabled to address a functionality issue.
November 29, 2023¶
- Upgrade to DVS service to improve stability accessing Global Homes, Global Common, and CFS (COS 2.5.139)
- Upgrade to Slurm to improve sacct query efficiency
- DNA file system mounted read-only on login nodes
- Improved experience for Jupyter users during rolling reboots
- Removed an upstream issue that could cause node reboots to time out and fail
- Numerous replacements to address minor hardware issues that occurred since the last maintenance.
- NERSC module updates:
- cudatoolkit: 11.4 and 11.5 retired, 12.2 added as experimental
- nvhpc: 21.9 and 21.11 retired, 23.9 added
- nccl: 2.11.4 retired
- pytorch: 1.9.0 and 1.10.0 retired, 2.1.0-cu12 added
- julia: removed many older versions and added many new ones
- e4s: 23.08 added
- idl: 8.9 added
November 17, 2023¶
- Jobs requesting GPU nodes (-C gpu) in the interactive, realtime, and Jupyter QOSes will preferentially run on nodes with twice as much memory (512 GB of RAM and 80 GB of GPU memory per GPU).
- Upgrade to Slurm to reduce effects of large numbers of queries.
November 16, 2023¶
- Checksums enabled for Perlmutter scratch to protect data integrity from a bug in the underlying Slingshot network driver. This is expected to reduce scratch performance by 17% for large I/O.
October 10, 2023¶
- Numerous updates to system software intended to improve stability.
- Upgrade Slurm to version 23.02.5.
- Fix podman-hpc additionalimagestore bug introduced on September 28.
- BIOS update for all GPU nodes. This is intended to improve performance and reliability.
- NERSC module updates:
- Add gsl/2.7 module to fix an issue with the lammps module. Note this module is deprecated and will be removed on January 16, 2024. We suggest using the GSL library in the directory /usr/lib64 instead.
- Fix lammps module. Note this module is deprecated and will be removed on January 16, 2024.
- Fix conda module issue to allow swapping versions of dependent modules.
- Update NERSC python envs with latest ipympl and matplotlib to be compatible with an upcoming update to NERSC JupyterLab.
September 28, 2023¶
- Slingshot Host Software upgraded to version 2.1. This is intended to improve network performance and address issues some codes were seeing at scale.
- BIOS updated for all CPU nodes to 1.7.1. This is intended to address performance issues seen by some user codes.
- Numerous improvements to NERSC file systems:
- Gateway nodes configuration changed to optimize DVS thread placement to improve DVS performance for Global Common and CFS.
- Updated Lustre configuration to further optimize settings for using SSDs (increasing trim frequency, changing mb_last_group settings to improve the algorithm looking for free space, and disabling write back throttling).
- Perlmutter network expanded so that additional nodes and file system capacity can be deployed on demand.
- CPE 23.09 deployed as a non-default, experimental environment. To access this, users will need to load the cpe/23.09 module file. We recommend users wishing to use this experimental CPE recompile their codes. Users will also need to manually append -lcudart -lcuda to LDFLAGS when compiling. This is intended to give users early access to new versions of software (including cuda12 with cray-mpich integration). It is not expected that these steps will be necessary when the default CPE is upgraded at a future date.
- Podman-hpc has been upgraded to 1.0.3 to address several issues related to pulling and building. For now users must use podman-hpc images --storage-opt additionalimagestore=$SCRATCH/storage to display migrated images. This is scheduled to be fixed in the next maintenance.
- Jobs submitted to the preempt QOS for both CPU and GPU architectures must now request a minimum time of 2 hours. Additionally, all jobs submitted to the preempt QOS will now be charged for a minimum of 2 hours of walltime.
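The CPE 23.09 steps described in this entry (load the module, then append the CUDA libraries to LDFLAGS) could look roughly like the following. This is a sketch, not an official recipe. The guard around the module command is an assumption added so the snippet also runs on machines without the module system, and any flags already in LDFLAGS are preserved.

```shell
# Load the experimental CPE only where the module command exists
# (elsewhere this step is skipped).
if command -v module >/dev/null 2>&1; then
    module load cpe/23.09
fi

# Append the CUDA libraries to LDFLAGS, keeping any existing flags.
# The ${LDFLAGS:+...} expansion avoids a leading space when LDFLAGS is empty.
export LDFLAGS="${LDFLAGS:+${LDFLAGS} }-lcudart -lcuda"
echo "LDFLAGS=${LDFLAGS}"
```

Codes would then be recompiled in the same shell so that the build picks up the updated LDFLAGS.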
August 23, 2023¶
- Rolling update to address an issue csh users encountered when accessing modules.
August 16, 2023¶
- Numerous improvements to DVS, the I/O forwarding system that delivers CFS and Global Common:
- Modified DVS scheduling algorithm to use a "fairness" approach so it will serve I/O requests more equally across users (previously this was a FIFO-like algorithm). This is intended to reduce hangs experienced by all users when a single user overloads a DVS gateway node. Please see our DVS page for best practices for I/O.
- Optimized DVS monitoring. This is intended to improve stability of the gateway nodes, which are a set of 24 nodes that serve the I/O requests for DVS.
- Updated the firmware to more optimally communicate over the network (NPS=1). This is intended to improve performance by allowing DVS and GPFS processes to exchange data more efficiently.
- Neo, the file system control software for Perlmutter scratch, was updated to 6.4-020. This is intended to improve system stability.
- Shared QOSes for interactive and debug QOS are now available. There's also a debug_preempt QOS that can be used for testing preemptive jobs. See our QOSes and Changes page for details.
- The default libfabric memory registration monitor was changed to use userfaultfd instead of memhooks. This is intended to mitigate some intermittent hangs running codes at large scale.
- Upgraded Slurm to version 23.02.4. This includes a number of bugfixes intended to increase system stability and functionality as well as a small change to memory usage accounting to more accurately reflect actual usage.
- SSH software updated. This is intended to improve system stability.
August 1, 2023¶
Live update:
- Extended max timelimit for regular, shared, and overrun QOSes to 24 hours.
July 13, 2023¶
Reboot, logins remained available (with two brief interruptions)
- Updated memory limits on login nodes from 128GiB per user to 64GiB. This is intended to improve login node responsiveness. Please run jobs that require a large amount of memory or cores via the batch system, either in the interactive QOS or as a script.
- Updated memory registration monitor to address a problem at scale for NCCL based codes.
- Fixed a bug in the shifter gateway that was causing issues with pulling images from registry.nersc.gov
June 22, 2023¶
Partial rolling reboot:
- Updated DVS health monitoring code. This is intended to improve system stability.
- Updated Slingshot software to address a bug that was causing nodes to crash.
- Mounted the DNA file system read only on all compute nodes. This is intended to improve file system performance.
June 16, 2023¶
Rolling reboot:
- Updated DVS health monitoring code. This is intended to improve system stability.
- Increased block size for DVS mounts of CFS, Global Common, and Global Homes. This is intended to improve I/O performance.
- Added monitoring "heartbeat" functionality. This is intended to improve system monitoring and stability.
- Upgraded podman-hpc to version 1.0.2. This is intended to address several bugs identified in early testing.
June 14, 2023¶
Neo, the file system control software for Perlmutter scratch, was rolled back to a custom version based on version 6.3 to avoid a bug that was causing frequent Lustre crashes. This version also included some changes intended to improve performance.
June 8, 2023¶
- The Community, Global Common, and Global Homes File Systems moved from native clients to DVS, an HPE proprietary I/O forwarder. This is intended to improve file system stability, reduce hangs when accessing files, and reduce compute node crashes.
- To facilitate faster load times, Global Common was mounted read-only on the compute nodes. It remains writable on the login nodes.
- Neo, the file system control software for Perlmutter scratch, was updated to 6.4-010. This is intended to improve system stability.
- Updated the Linux OS to address an RDMA bug that was causing Spectrum Scale to crash intermittently on the login nodes.
- Updated GPUDirect software that is intended to reduce the number of GPU node crashes.
- Updated ulimits to better manage system resources.
May 25, 2023¶
- Updated the firmware on network switches. This is intended to address an issue that is causing intermittent failures for large jobs.
- The connection between Perlmutter and NGF (the Homes, Common, and Community File Systems) was updated to use publicly routable IP addresses. This is intended to reduce hangs when listing and modifying files on NGF.
- Updated the Lustre Network configuration. This is intended to improve file system performance for Perlmutter scratch on the CPU nodes.
- Added some bug fixes to slurm to address issues with VNIs in heterogeneous jobs and PMIx functionality.
- Performed critical security updates.
May 18, 2023¶
Rolling reboot:
- Slurm updated to version 23.02.2. This includes a number of bugfixes intended to increase system stability and functionality.
- Shifter configuration updated to address an issue with mounting the homes filesystem.
- Slingshot software updated to address an issue with the userfaultfd memory registration monitor.
- Added a bugfix to address an issue where users would occasionally see I/O errors when accessing files on Perlmutter scratch.
- Improved network resiliency to login nodes.
- Performed critical security updates.
May 12, 2023¶
Quota enforcement was re-enabled for Perlmutter scratch (see our File Systems Quota and Purging page for details on the Perlmutter scratch quota).
May 4, 2023¶
- Updated slurm to allow support for advanced slingshot functionality
- Deployed a bugfix designed to reduce the frequency of large MPI jobs failing with an UNDELIVERABLE error
- Adjusted ssh configuration to fix an issue where users couldn't ssh to compute nodes where they have running jobs
- Updated DNS configurations to enable forward and reverse lookups for Perlmutter login and compute nodes from outside the NERSC network. This was intended to address an issue that was greatly slowing ssh connections to some external hosts
- Updated DVS version intended to improve performance over DVS
- Performed a critical security update for Linux OS
- Updated switch configuration to improve security and reliability
April 27, 2023¶
- Job submission and running jobs were paused to upgrade slurm to 23.02. This is intended to increase system stability and introduce new batch system functionality.
- Neo, the file system control software for Perlmutter scratch, was updated to 6.3-023. This is intended to improve system stability.
- A reoptimized boot schema was also deployed. This is intended to reduce boot time and improve system stability.
Both the Neo update and boot schema update were applied as a rolling update with jobs running and logins available. Login nodes also remained available for the slurm upgrade.
April 17, 2023¶
- Improved the cgroups process control algorithm on the login nodes. This is intended to reduce the rate that user processes have been unexpectedly terminated on the login nodes (this primarily manifested as ssh sessions getting terminated mid-session).
- Added experimental DVS mounts of the CFS file system for testing and evaluation.
April 11, 2023¶
- Changes to the User Environment:
- The programming environment has been updated to CPE 23.03. A full list of changes is in the HPE changelog, but notable changes include:
- Cray MPICH 8.1.25 from 8.1.24
- Cray PMI 6.1.10 from 6.1.9
- Upgraded the GPU driver to 525.105.17 (open source)
- Added HPC SDK 23.1. This adds both the nvidia/23.1 and cudatoolkit/12.0 modules. These modules are marked experimental because they are not yet fully integrated with the system MPI so cuda aware MPI will not work with them.
- Removed HPC SDK 22.5 and 22.9 (nvidia/22.5 and nvidia/22.9 modules).
- PrgEnv-Intel deployed
- Batch System Changes:
- Shared QOS enabled for GPU nodes
- Slurm has been changed to prevent accidental cancellation of scrontab jobs which would lead to the entire scrontab entry being disabled.
- Changes to address file system stability and performance:
- Switched to fair queueing to address connection timeouts with the NGF file system and increased max receiver threads. These are intended to address stability issues with CFS.
- Optimized routing between login nodes and NGF servers. This is intended to reduce pauses when accessing, creating, or updating files on CFS, global common, and homes.
- Updated Neo, the file system control software for Perlmutter scratch, to 6.3-022. This is intended to increase file system stability and performance.
- Numerous changes intended to improve system resiliency and stability
- Retooled the internal boot logic to dramatically reduce time for node reboots. In addition to shortening maintenance times, this lays the groundwork for improved health checks to more efficiently remove problem nodes from the batch system.
- Mapped power paths to critical infrastructure. No changes were made, but this involved physically interacting with the cables and plugs so it was done during maintenance to avoid disruption. This is intended to increase system resiliency and stability.
March 20, 2023¶
- The SS11 feature for performant GPU-RDMA has been re-enabled for all new jobs starting as of 9:30am PDT. NERSC worked with the vendors to obtain a workaround for the issue and deployed it this morning on the compute nodes using a non-disruptive rolling reboot. We expect to deploy a full fix during the next scheduled maintenance on March 22, 2023.
March 15, 2023¶
- A SS11 feature for performant GPU-RDMA has been temporarily disabled to mitigate a critical issue leading to node failures. This is expected to substantially affect performance of applications using GPU-RDMA capabilities for inter-node communication (such as CUDA-Aware MPI or GASNet), but will allow jobs that were previously crashing to run.
This mitigation was removed on March 20, 2023. It had originally been expected to be removed during the scheduled maintenance on March 22, 2023.
March 8, 2023¶
- The programming environment has been updated to CPE 23.02. A full list of changes is in the HPE changelog, but notable changes include:
- Cray MPICH 8.1.24 from 8.1.22
- Cray PMI 6.1.9 from 6.1.7
- HDF5 1.12.2.3 from 1.12.2.1
- Parallel NetCDF 1.12.3.3 from 1.12.3.1
- Moved to using the open NVIDIA driver instead of the proprietary driver (keeping the version the same at 515.65.01) to include new functionality that will enable sharing GPU nodes.
- Moving the gateway nodes (nodes used for communication with external resources like NGF) to the “stock” SLES kernel to be able to leverage the kernel fastpath feature for packet forwarding. This is intended to improve read and write rates to NGF file systems.
- General software and cabling updates to increase system stability and performance.
- Recable the internal network to minimize the impact of a switch failure on the high-speed network. By ensuring that the fabric manager (the software that controls Perlmutter’s interior network) maintains connectivity to each high-speed network group, this will provide additional resiliency system-wide.
- Upgrade the Slingshot software to add improvements in the retry handler algorithm (this governs how missed packets are handled) to reduce amplification of messages about inaccessible nodes and to fix a bug that caused certain user codes to crash. This is intended to increase network and compute resiliency.
- Update the COS software (a software layer that contains the underlying SLES OS as well as interfaces to other things like the file system) to upstream a SLES fix to address a readahead issue that was crashing nodes. This is intended to improve system stability.
- Updated Neo, the file system control software, to 6.2-015. This is intended to increase file system stability and performance.
February 23, 2023¶
- Numerous changes intended to improve network stability and file system access
- Further changes to the TCP queue discipline to more fairly allocate bandwidth on our gateway nodes (nodes used for communication with external resources like NGF) between all the TCP streams from the computes while keeping latencies at optimal levels. This is intended to improve access to file systems like CFS, global common and global homes for batch jobs.
- Connected gateway nodes to new switches. This is intended to improve network performance and resiliency.
- Multiple cable adjustments to bring system into compliance with its cabling plan. This is intended to simplify maintenance and improve resiliency.
February 19, 2023¶
- Multi-connection over TCP set back to 1 (from 2) for CFS to mitigate a bug.
February 15, 2023¶
- Numerous changes intended to improve network stability and file system access.
- General Network Issues:
- Added static ARP entries for the compute nodes. This is intended to fix the issue where a large number of jobs either failed to launch or started in slurm but produced no output.
- Increased sweep frequency for the fabric manager, the software that controls Perlmutter’s interior network, and added shorter timeouts throughout the system. These changes are intended to improve general network robustness and shorten recovery times for component failures.
- Replaced a defective network switch.
- Added code to filter bad unicast traffic from the network. This is intended to improve network stability.
- Filesystem hangs and slowness:
- Adjusted port policy for Lustre file system and debugged long recovery times for single component failure in Lustre. This is intended to inform and simplify future work to improve Lustre reliability and responsiveness.
February 9, 2023¶
- Numerous changes intended to improve network stability and file system access.
- General Network Issues:
- Change default TCP queue discipline to fq_codel to address missing patch in SLES 15SP4. This is intended to address many of the communication failures larger scale jobs are experiencing.
- Recabling (replacing defective equipment and correcting misconnected cables). This is a lengthy physical process and is intended to increase network stability.
- Updating firmware to improve network link stability and reliability of key I/O nodes.
- Filesystem hangs, slowness, and I/O errors:
- Numerous network tuning and changes (Increase Multi-connection over TCP to 2 to maximize connectivity to Spectrum Scale servers, ipoib parameter changes, etc.). This is intended to reduce stale file handles and job failures from nodes getting expelled from the Spectrum Scale cluster.
- Components that use DVS (an I/O forwarding service within Perlmutter) have been converted to using other delivery methods. This is intended to reduce the number of software components in use in order to simplify debugging of the network issues. Most users will not be affected by this change.
- Users using read-only mounts of CFS may see slower or more-variable performance while the system is in this configuration.
- cvmfs is now delivered using native clients and loop mounted file systems for caching following the recommended HPC recipe.
- Podman deployed on the system.
February 1, 2023¶
- Network updates intended to improve stability. This maintenance was appended to the unscheduled outage to minimize disruption to users.
- Automatically reboot compute nodes that are in a particular fail mode (softlockup). These nodes cause instability in our Spectrum Scale file systems (CFS, homes, and global common) and rebooting them is intended to reduce file system hangs and outages.
- Changed protocol for communication to the Spectrum Scale cluster to RDMA on Perlmutter login nodes. This is intended to reduce issues accessing the file system.
January 25, 2023¶
- Updates intended to improve network stability and access of cvmfs.
January 19, 2023¶
- Network update intended to improve stability and address Lustre performance issues.
- Preemptible jobs on Perlmutter are free until February 19, 2023.
- Darshan module removed from list of default modules loaded at start up.
December 21, 2022¶
- Major hardware issues that were impacting network performance have been addressed, and Perlmutter has undergone extensive full-scale stress testing.
- The Slingshot software stack has been updated to a new version that is expected to be more robust.
- 256 GPU nodes with double the GPU-attached memory have been added to the system. Please see our jobs policy page for instructions on how to access them.
- 1536 new CPU nodes have been added to the system.
- A new NVIDIA NCCL plugin has been installed that more efficiently uses the Slingshot network. This has been integrated into NERSC’s machine learning and vasp modules so if you use these modules no further action is needed. However, other workflows will need some adjusting:
- You will now need to use --module=nccl-2.15 to get access to the new NCCL plugin in shifter. Please see our shifter documentation for instructions.
- If you install your own software that depends on NCCL, please use the new nccl module to get access to the new NCCL plugin libraries
- Some machine learning workloads running in older NGC containers (versions from before 2022) may encounter performance variability. These issues can be fixed by upgrading the container to a more recent version.
- The OS has been updated to SLES SP4 and the programming environment has been updated to CPE 22.11.
- Due to changes in the SLES SP4 system libraries, changes may be required for conda environments built or invoked without using the NERSC provided python module. Users may see errors like ImportError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d. Please see our Perlmutter python documentation for more information.
- The default version of the NVIDIA HPC SDK compiler was upgraded to 22.7 (from 22.5)
- Cray MPICH upgraded to 8.1.22 (from 8.1.17)
- GCC v12 now available
October 28, 2022¶
Charging for all jobs began.
October 26, 2022¶
- Slurm updated to 22.05
- The 128-node limit on the regular QOS for GPU nodes has been removed. Regular can now accept jobs of all sizes.
- The early_science QOS has been removed. Please use regular instead. All queued jobs in the early_science QOS have been moved to regular.
- Numerous updates intended to improve system stability and networking
October 11, 2022¶
- Major changes to the internal network and file system to get Perlmutter into its final configuration. Some tuning and changes are still required and will be applied over the next few weeks
September 15, 2022¶
- Perlmutter scratch is now available, but it is still undergoing physical maintenance. We expect scratch performance to be degraded and single-component failures could cause the filesystem to become unavailable during this physical maintenance. We estimate a 20% chance that this will occur in the next month. Please hold any jobs with scratch licenses that you don't want to run by noon on Friday (9/16) with scontrol hold <jobid>.
- Numerous updates intended to improve system stability and Community and Home File System access.
September 7, 2022¶
- The software environment has been retooled to better focus on GPU usage. These changes should be transparent to the vast majority of both GPU and CPU codes and will remove the toil of reloading the same modules in every script for GPU-based codes. As our experience with the system grows, we expect to add more settings that are globally beneficial.
  - New `gpu` module added as a default module loaded at login. It includes:
    - `module load cudatoolkit`
    - `module load craype-accel-nvidia80`
    - Sets `MPICH_GPU_SUPPORT_ENABLED=1` to enable access to CUDA-aware Cray MPICH at runtime
  - A companion `cpu` module
    - This module is mutually exclusive to the `gpu` module; if one is loaded, the other will be unloaded
    - In the future we may add any modules or environment settings we find to be generally beneficial to CPU codes, but for now it is empty
    - Given the current contents, CPU users should be able to run their codes with the `gpu` module. But if there are any problems, users can `module load cpu` to revert the `gpu` module
  - Shifter users who want CUDA-aware Cray MPICH at runtime will need to use the cuda-mpich shifter module
- Long-lived scrontab capabilities added to better support workflows
- A number of performance counters (e.g., CPU, Memory) that are used by NERSC supported performance profiling tools have been re-enabled on the system
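Based on the description of the `gpu` module above, its effect can be approximated by hand as in the sketch below. This is an illustration built from the changelog text, not the module's actual contents; the function name `setup_gpu_env` is hypothetical, and the `module` calls are skipped on machines where no module system is present.

```shell
# Hypothetical manual equivalent of the settings the changelog lists for
# the gpu module. Not the module's real implementation.
setup_gpu_env() {
    # `module` exists only where Lmod or Environment Modules is installed.
    if command -v module >/dev/null 2>&1; then
        module load cudatoolkit
        module load craype-accel-nvidia80
    fi
    # Enables CUDA-aware Cray MPICH at runtime, per the entry above.
    export MPICH_GPU_SUPPORT_ENABLED=1
}

setup_gpu_env
echo "MPICH_GPU_SUPPORT_ENABLED=$MPICH_GPU_SUPPORT_ENABLED"
```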
August 24, 2022¶
- Perlmutter Scratch file system unmounted for upgrading. All data on Perlmutter Scratch will be unavailable. Jobs already in the queue that were submitted from Perlmutter Scratch will be automatically held. If you submitted a job that depends on scratch from another file system, you can add a scratch license with `scontrol update job=<job id> Licenses=scratch[,<other existing licenses>...]` to have your job held until scratch is available.
- Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
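The license update above keeps any licenses the job already has. A minimal sketch of building that `Licenses=` value is shown below; the function name, the job ID `12345`, and the `cfs` license are hypothetical placeholders, and the final command is only printed, not executed.

```shell
# Hypothetical helper: put "scratch" in front of an existing comma-separated
# license list, leaving the list unchanged if scratch is already there.
with_scratch_license() {
    existing=$1
    case ",$existing," in
        *,scratch,*|*,scratch:*) printf '%s\n' "$existing" ;;
        ,,)                      printf 'scratch\n' ;;
        *)                       printf 'scratch,%s\n' "$existing" ;;
    esac
}

# Dry run with placeholder values: print the command instead of running it.
jobid=12345
current_licenses=cfs
echo "scontrol update job=$jobid Licenses=$(with_scratch_license "$current_licenses")"
```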
August 15, 2022¶
- All Slingshot10 GPU nodes are removed from the system along with their corresponding QOSes (e.g., `regular_ss10`)
  - Any queued jobs in the Slingshot10 QOSes were moved to their corresponding Slingshot11 QOSes
- Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
August 8, 2022¶
- Added NVIDIA HPC SDK Version 22.7
  - To use: `module load PrgEnv-nvidia nvidia/22.7`
- Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
August 1, 2022¶
- Default switched to Slingshot11 for GPU nodes.
  - Default QOS switched from GPU nodes using the Slingshot10 interconnect to nodes using the Slingshot11 interconnect. If you still wish to run on the Slingshot10 GPU nodes, you can add `_ss10` to the QOS on your job submission line (e.g., `-q regular_ss10 -C gpu`). All queued jobs will run in the QOS that was active when they were submitted.
  - Use `squeue --me -O JobID,Name,QOS,Partition` to check which QOS and partition your jobs are in.
  - Login nodes now use the Slingshot11 interconnect.
- CUDA driver upgraded to version 515.48.07
- NVIDIA HPC SDK (`PrgEnv-nvidia`) and CUDA Toolkit (`cudatoolkit`) module defaults upgraded to 22.5 and 11.7 respectively. The previous versions are still available.
  - CUDA compatibility libraries are no longer needed, so any workarounds you were using to remove them can be dropped.
- Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
June 20, 2022¶
Default striping of all user scratch directories set to stripe across a single OST because of a bug in the Progressive File Layout striping schema. If you are reading or writing files larger than 1GB please see our recommendations for Lustre file striping.
July 18, 2022¶
- The second set of GPU nodes has been upgraded to Slingshot11 and added to the `regular_ss11` QOS (see the July 11, 2022 entry).
  - We expect the number of Slingshot11 GPU nodes to be changing over the next few weeks, so we recommend you use `sinfo` to track the number of nodes in each partition. You can use `sinfo --format="%.15b %.8D"` for a concise summary of nodes or `sinfo -o "%.20P %.5a %.10D %.16F"` for more verbose output.
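To pull a single number out of the `sinfo` output described above, the `%F` field (nodes as allocated/idle/other/total) can be split with `awk`. The sketch below assumes input lines of the form produced by `sinfo -h -o "%P %F"`; the function name and the sample partition names and counts are made up for illustration.

```shell
# Hypothetical helper: reads "<partition> <alloc/idle/other/total>" lines
# and prints the idle and total node counts for each partition.
idle_nodes_by_partition() {
    awk '{ split($2, n, "/"); print $1, "idle=" n[2], "total=" n[4] }'
}

# Made-up sample input; on a Slurm system use: sinfo -h -o "%P %F"
printf '%s\n' 'gpu_ss11 200/50/6/256' 'gpu_ss10 900/100/24/1024' | idle_nodes_by_partition
```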
July 11, 2022¶
- First GPU nodes are upgraded to use the Slingshot11 interconnect. These nodes have upgraded software and 4x25GB/s NICs (previously they had 2x12.5GB/s NICs). Jobs will need to explicitly request these nodes by adding `_ss11` to the QOS, e.g., `-C gpu -q regular_ss11`.
  - There are currently 256 nodes converted to Slingshot11. We expect this number of nodes to be changing over the next few weeks, so we recommend you use `sinfo` to track the number of nodes in each partition. You can use `sinfo --format="%.15b %.8D"` for a concise summary of nodes or `sinfo -o "%.20P %.5a %.10D %.16F"` for more verbose output.
- CPE default updated to 22.06. Notable changes:
- PrgEnv-cray (cce/14) now supports OpenMP offload and OpenACC on Perlmutter GPUs
- cray mpich upgraded to 8.1.17 (from 8.1.15)
- NVIDIA compiler version 22.5 and cudatoolkit SDK version 11.7 now available on the system. These will become the defaults soon.
- Shared QOS now available on the CPU nodes
- Numerous internal updates to the software and network to prepare the Phase-2 integration of Perlmutter and make cvmfs more stable
June 6, 2022¶
- Changes to the batch system
  - Users can now use just `-A <account name>` (i.e., the extra `_g` is no longer needed) for jobs requesting GPU resources.
  - Xfer QOS added for data transfers
  - Debug QOS now the default
- The cuda compatibility libraries were removed from the PrgEnv-nvidia module (specifically the `nvidia` module). The cuda compatibility libraries are now exclusively in the `cudatoolkit` module and users are reminded to load this module if they are compiling code for the GPUs.
- Second set of CPU nodes are now available to users.
- Numerous internal updates to the software and network to prepare the Phase-2 integration of Perlmutter
June, 2022¶
Achieving 70.9 Pflop/s (FP64 Tensor Core) using 1,520 compute nodes, Perlmutter is ranked 7th in the Top500 list.
May 25, 2022¶
- Maximum job walltime for `regular` (CPU and GPU nodes) and `early_science` (GPU nodes) QOSes increased to 12 hours
May 17, 2022¶
- Perlmutter opened to all NERSC Users!
- The default Programming Environment is changed to PrgEnv-gnu
- Shifter MPI now working on CPU nodes
- PrgEnv-aocc now working
- Numerous internal updates to the software and network to prepare the Phase-2 integration of Perlmutter
May 11, 2022¶
- The first set of Slingshot11 CPU nodes are now available for user jobs. Please see the Perlmutter QOS policy page for QOS details.
April 29, 2022¶
- CPE default updated to 22.04. You may choose to load an older CPE but the behavior is not guaranteed.
- Notable changes: cray mpich upgraded to 8.1.15
- Nvidia driver has been updated to 470.103.01
- Removed `nvidia/21.9` (nvhpc sdk 21.9) from the system
- `cudatoolkit/11.0` and `cudatoolkit/11.4` dropped as available modules.
  - You can continue to compile using older cuda versions with CUDA compatibility libraries
- Numerous internal upgrades (software and network stack) to prepare the Phase-2 integration of Perlmutter
- Re-compile is not needed, but if you’re having issues please do try recompiling your application first.
April 21, 2022¶
- Node limit restriction for `early_science` qos has been lifted. See the Perlmutter QOS policy page.
April 7, 2022¶
- NVIDIA Data Center GPU Manager (dcgm) enabled on all nodes. Users will need to disable dcgm before running profiler tools that require access to hardware counters.
- Newest versions (2022.x) of nsight-compute and nsight-system removed pending vendor bug fixes
- Numerous internal updates to improve configuration, reliability, and performance
March 25, 2022¶
- Numerous internal updates to improve configuration, reliability, and performance
March 10, 2022¶
- Nvidia HPC SDK v21.11 now default
- Older cudatoolkit modules removed
- Slurm upgraded to 21.08; codes that use gpu-binding will need to be reworked
- CPE 21.11 has been retired
  - There will be no support for `gcc/9.3`
  - nvcc v11.0 (`cudatoolkit/11.0`) retired, will no longer be supported
- Numerous internal updates to improve configuration, reliability, and performance
February 24, 2022¶
- Cudatoolkit modules simplified
- New modules with shorter names point to the most recent releases available
- Old modules will remain on the system for a short time to allow time to switch over
- Nvidia HPC SDK v21.11 now available
- Default will remain 21.9 for a short time to allow time for testing
- `nvidia/21.9` does not support Milan, so the Cray compiler wrappers will build for Rome instead. We recommend that users switch to `nvidia/21.11`.
- Upgraded to CPE 22.02. Major changes include:
- MPICH 8.1.12 to 8.1.13
- PMI 6.0.15 to 6.0.17
- hdf5 1.12.0.7 to 1.12.1.1
- netcdf 4.7.4 to 4.8.1.1
- Change to sshproxy to support broader kinds of logins
- Realtime qos functionality added
- Numerous internal updates to improve configuration, reliability, and performance
February 10, 2022¶
- Node limit for all jobs temporarily lowered to 128 nodes
- QOS priority modified to encourage wider job variety
January 25, 2022¶
- Cudatoolkit modules now link to correct math libraries (fixes Known Issue "Users will encounter problems linking CUDA math libraries").
- Update to DVS configuration to support CVMFS.
- The latest Nsight systems and Nsight compute performance tools are now available.
- Numerous internal upgrades to improve configuration and performance.
January 11, 2022¶
- Upgraded to CPE 21.12. Major changes include:
- MPICH upgraded to v8.1.12 (from 8.1.11)
- The previous programming environment can now be accessed using the `cpe` module.
- Numerous internal upgrades to improve configuration and performance.
December 21, 2021¶
- GPUs are back in "Default" mode (fixes Known Issue "GPUs are in "Exclusive_Process" instead of "Default" mode")
- User access to hardware counters restored (fixes Known Issue "Nsight Compute or any performance profiling tool requesting access to h/w counters will not work")
- Cuda 11.5 compatibility libraries installed and incorporated into Shifter
- QOS priority modified to encourage wider job variety
- Numerous internal upgrades
December 6, 2021¶
- Major changes to the user environment. All users should recompile their code following our compile instructions
- The `cuda`, `cray-pmi`, and `cray-pmi-lib` modules have been removed from the default environment
- The `darshan` v3.3.1 module has been added to the default environment
- Default NVIDIA compiler upgraded to v21.9
  - Users must load a `cudatoolkit` module to compile GPU codes
- Upgraded to CPE 21.11
- MPICH upgraded to v8.1.11 (from 8.1.10)
- PMI upgraded to v6.0.16 (from 6.0.14)
- FFTW upgraded to 3.3.8.12 (from 3.3.8.11)
- Python upgraded to 3.9 (from 3.8)
- Upgrade to SLES15sp2 OS
- Numerous internal upgrades
November 30, 2021¶
- Upgraded Slingshot (internal high speed network) to v1.6
- Upgraded Lustre server
- Internal configuration upgrades
November 16, 2021¶
This was a rolling update where the whole system was updated with minimal interruptions to users.
- Set `MPICH_ALLGATHERV_PIPELINE_MSG_SIZE=0` to improve MPI communication speed for large buffer sizes.
- Added `gpu` and `cuda-mpich` Shifter modules to better support Shifter GPU jobs
- Deployed fix for `CUDA Unknown Error` errors that occasionally happen for Shifter jobs using the GPUs
- Changed ssh settings to reduce frequency of dropped ssh connections
- Internal configuration updates
November, 2021¶
Perlmutter achieved 70.9 Pflop/s (FP64 Tensor Core) using 1,520 compute nodes, putting the system at No. 5 in the Top500 list.
November 2, 2021¶
- Updated to CPE 21.10. A recompile is recommended but not required. See the documentation of CPE changes from HPE for a full list of changes. Major changes of note include:
- Upgrade MPICH to 8.1.10 (from 8.1.9)
- Upgrade DSMML to 0.2.2 (from 0.2.1)
- Upgraded PMI to 6.0.14 (from 6.0.13)
- Adjusted QOS configurations to facilitate Jupyter notebook job scheduling.
- Added `preempt` QOS. Jobs submitted to this QOS may get preempted after two hours, but may start more quickly. Please see our instructions for running preemptible jobs for details.
October 20, 2021¶
External ssh access enabled for Perlmutter login nodes.
October 18, 2021¶
- Updated slurm job priorities to more efficiently utilize the system and improve the diversity of running jobs.
October 14, 2021¶
- Updated NVIDIA driver (to 450.162). This is not expected to have any user impact.
- Upgraded internal management framework.
October 9, 2021¶
- `screen` and `tmux` installed
- Installed boost v1.66
- Upgraded nv_peer_mem driver to 1.2 (not expected to have any user impact)
October 5, 2021¶
Deployed `sparewarmer` QOS to assist with node-level testing. This is not expected to have any user impact.
October 4, 2021¶
Limited the wall time of batch jobs to 6 hours to allow a variety of jobs to run during testing. If you need to run jobs for longer than 6 hours, please open a ticket.
September 29, 2021¶
- Numerous internal network and management upgrades.
New batch system structure deployed¶
- Users will need to specify a QOS (with `-q regular`, `debug`, `interactive`, etc.) as well as a project GPU allocation account name which ends in `_g` (e.g., `-A m9999_g`)
  - We have some instructions for setting your default allocation account in our Slurm Defaults Section
- Please see our Running Jobs Section for examples and an explanation of new queue policies
September 24, 2021¶
- Upgraded internal management software
- Upgraded system I/O forwarding software and moved it to a more performant network
- Fixed csh environment
- Performance profiling tools that request access to hardware counters (such as Nsight Compute) should work now
September 16, 2021¶
- Deployed numerous network upgrades and changes intended to increase responsiveness and performance
- Increased robustness for login node load balancing
September 10, 2021¶
- Updated to CPE 21.09. A recompile is recommended but not required. Major changes of note include:
- Upgrade MPICH to 8.1.9 (from 8.1.8)
- Upgrade DSMML to 0.2.1 (from 0.2.0)
- Upgrade PALS to 1.0.17 (from 1.0.14)
- Upgrade OpenSHMEMX to 11.3.3 (from 11.3.2)
- Upgrade craype to 2.7.10 (from 2.7.9)
- Upgrade CCE to 12.0.3 (from 12.0.2)
- Upgrade HDF5 to 1.12.0.7 (from 1.12.0.6)
- GCC 11.2.0 added
- Added `cuda` module to the list of default modules loaded at startup
- Set BASH_ENV to Lmod setup file
- Deployed numerous network upgrades and changes intended to increase responsiveness and performance
- Performed kernel upgrades to login nodes for better fail over support
- Added latest CMake release as `cmake/git-20210830` and set it as the default `cmake` on the system
September 2, 2021¶
- Updated NVIDIA driver (to `nvidia-gfxG04-kmp-default-450.142.00_k4.12.14_150.47-0.x86_64`). This is not expected to have any user impact.
August 30, 2021¶
Numerous changes to the NVIDIA programming environment¶
- Changed default NVIDIA compiler from 20.9 to 21.7
- Installed needed CUDA compatibility libraries
- Added support for multi-CUDA HPC SDK
- Removed the `cudatoolkit` and `craype-accel-nvidia80` modules from default
Tips for users:
- Please use `module load cuda` and `module av cuda` to get the CUDA Toolkit, including the CUDA C compiler `nvcc`, and associated libraries and tools.
- CMake may have trouble picking up the correct mpich include files. If it does, you can use `set ( CMAKE_CUDA_FLAGS "-I/opt/cray/pe/mpich/8.1.8/ofi/nvidia/20.7/include")` to force it to pick up the correct one.
June, 2021¶
Perlmutter achieved 64.6 Pflop/s (FP64 Tensor Core) using 1,424 compute nodes, putting it at No. 5 in the Top500 list.
May 27, 2021¶
Perlmutter supercomputer dedication.
November, 2020 - March, 2021¶
Perlmutter Phase 1 delivered.