
Perlmutter Architecture


Perlmutter is an HPE (Hewlett Packard Enterprise) Cray EX supercomputer, named in honor of Saul Perlmutter, an astrophysicist at Berkeley Lab who shared the 2011 Nobel Prize in Physics for his contributions to research showing that the expansion of the universe is accelerating. Dr. Perlmutter has been a NERSC user for many years, and part of his Nobel Prize-winning work was carried out on NERSC machines. The system name reflects and highlights NERSC's commitment to advancing scientific research.

Perlmutter, based on the HPE Cray Shasta platform, is a heterogeneous system comprising both CPU-only and GPU-accelerated nodes, delivering roughly three to four times the performance of its predecessor, Cori.

System Specifications

| Partition    | # of nodes | CPU              | GPU                   | NIC                 |
|--------------|------------|------------------|-----------------------|---------------------|
| GPU          | 1536       | 1x AMD EPYC 7763 | 4x NVIDIA A100 (40GB) | 4x HPE Slingshot 11 |
| CPU          | 3072       | 2x AMD EPYC 7763 | -                     | 1x HPE Slingshot 11 |
| Login        | 40         | 1x AMD EPYC 7713 | 1x NVIDIA A100 (40GB) | -                   |
| Large Memory | 4          | 1x AMD EPYC 7713 | 1x NVIDIA A100 (40GB) | 1x HPE Slingshot 11 |

Not all nodes are available yet

The CPU and GPU partition configurations show the target number of nodes, not the number available today.

System Performance

| Partition | Type | Aggregate Peak FP64 (PFLOPS) | Aggregate Memory (TB) |
|-----------|------|------------------------------|-----------------------|
| GPU       | CPU  | 3.9                          | 384                   |
| GPU       | GPU  | 59.9 (tensor: 119.8)         | 240                   |
| CPU       | CPU  | 7.7                          | 1536                  |
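As a quick sanity check, the aggregate memory figures follow exactly from the node counts and per-node memory listed elsewhere on this page when "TB" is read as TiB (GiB / 1024); the GPU FP64 peak is approximate. A sketch using only numbers from this page:

```python
# Cross-check aggregate figures from per-node specs listed on this page.
gpu_nodes, cpu_nodes = 1536, 3072

# Aggregate memory (the table's "TB" behaves as TiB: GiB / 1024).
gpu_partition_cpu_mem = gpu_nodes * 256 / 1024        # 384  (256 GB DDR4 per GPU node)
gpu_partition_gpu_mem = gpu_nodes * 4 * 40 / 1024     # 240  (4x 40 GB HBM per GPU node)
cpu_partition_mem     = cpu_nodes * 512 / 1024        # 1536 (512 GB DDR4 per CPU node)

# Aggregate GPU FP64 peak (9.7 TFLOPS per A100, non-tensor).
gpu_fp64_pflops = gpu_nodes * 4 * 9.7 / 1000          # ≈ 59.6; the table lists 59.9

print(gpu_partition_cpu_mem, gpu_partition_gpu_mem, cpu_partition_mem)
print(f"{gpu_fp64_pflops:.1f} PFLOPS")
```

The small gap between 59.6 and the table's 59.9 PFLOPS comes from rounding of the per-GPU peak.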

Interconnect

The network has a 3-hop dragonfly topology.

  • A GPU compute cabinet is segmented into 8 chassis, each containing 8 compute blades and 4 switch blades.
  • A GPU compute blade contains 2 GPU-accelerated nodes.
  • Some GPU-accelerated compute cabinets have Slingshot 10 interconnect fabric with Mellanox NICs while others have Slingshot 11 with HPE Cray's proprietary Cassini NICs. Eventually all cabinets will have Slingshot 11.
  • Each GPU-accelerated compute node in cabinets with Slingshot 10 interconnect fabric is connected to 2 NICs, allowing each node to have 2 injection points into the network. This configuration is sometimes described as dual injection or dual rail. A GPU-accelerated compute node in cabinets with Slingshot 11 fabric is connected to 4 NICs.
  • GPU cabinets contain one Dragonfly group per cabinet, with 32 switch blades, making a total of 24 groups.
  • A CPU-only compute cabinet has 8 chassis, each containing 8 compute blades and 2 switch blades.
  • A CPU-only compute blade contains 4 CPU nodes.
  • Each CPU-only compute node is connected to 1 NIC.
  • CPU-only cabinets contain one Dragonfly group per cabinet.
  • CPU compute cabinets have Slingshot 11 interconnect fabric with Cassini NICs.
  • Unlike Cori, there is no backplane in the chassis providing network connections between the compute blades. Instead, the switch blades at the rear of the cabinet interconnect the compute blades at the front.
  • A full all-to-all electrical network is provided within each group. All switches in a switch group are directly connected to all other switches in the group.
    • Copper Level 0 (L0) cables connect nodes to network switches. L0 cables carry two links and are split to provide two single link node connections. L0 links are called "host links" or "edge links".
    • Copper Level 1 (L1) cables are used to interconnect the switches within a group. The 16-switch groups are internally interconnected with two cables (four links) per switch pair. L1 links are called "group links" or "local links".
  • Optical Level 2 (L2) cables interconnect groups within a subsystem (e.g., the compute subsystem consisting of compute nodes). L2 links are called "global links". Each optical cable carries two links per direction.
  • L2 cables also interconnect subsystems. There are 3 subsystems on an HPE Cray EX system, each with its own dragonfly interconnect topology: compute, I/O, and service.
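The all-to-all group wiring described above implies a fixed cable count per 16-switch group. A back-of-the-envelope sketch based on the L1 description (two cables, four links per switch pair):

```python
from math import comb

switches_per_group = 16
pairs = comb(switches_per_group, 2)   # every switch connects to every other: 120 pairs
l1_cables = pairs * 2                 # two L1 cables per switch pair
l1_links = pairs * 4                  # four links per switch pair

print(pairs, l1_cables, l1_links)     # 120 pairs, 240 cables, 480 links per group
```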

NIC

  • PCIe 4.0 connection to nodes
  • 200G (25 GB/s) bandwidth
  • 1x NIC per node for CPU partition
  • 4x NICs per node for GPU partition
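The per-node injection bandwidth follows directly from these figures; 25 GB/s is simply 200 Gb/s divided by 8 bits per byte:

```python
nic_gbits = 200
nic_gbytes = nic_gbits / 8            # 25 GB/s per NIC

cpu_node_injection = 1 * nic_gbytes   # 25 GB/s  (1 NIC per CPU node)
gpu_node_injection = 4 * nic_gbytes   # 100 GB/s (4 NICs per GPU node)

print(cpu_node_injection, gpu_node_injection)
```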

Node Specifications

GPU nodes

(Figures: Perlmutter GPU node layout; A100 NVLink topology)

  • Single AMD EPYC 7763 (Milan) CPU
  • 64 cores per CPU
  • Four NVIDIA A100 (Ampere) GPUs
  • PCIe 4.0 GPU-CPU connection
  • PCIe 4.0 NIC-CPU connection
  • 4 HPE Slingshot 11 NICs
  • 256 GB of DDR4 DRAM
  • 40 GB of HBM per GPU with 1555.2 GB/s memory bandwidth
  • 204.8 GB/s CPU memory bandwidth
  • 12 third-generation NVLink links between each pair of GPUs
  • 25 GB/s/direction for each link
| Data type     | GPU TFLOPS |
|---------------|------------|
| FP32          | 19.5       |
| FP64          | 9.7        |
| TF32 (tensor) | 155.9      |
| FP16 (tensor) | 311.9      |
| FP64 (tensor) | 19.5       |
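From the per-GPU figures above, a single GPU node's FP64 peak and the NVLink bandwidth between a GPU pair work out as follows (a sketch; all inputs are taken from the list and table above):

```python
gpus_per_node = 4
fp64_tflops = 9.7      # per A100, non-tensor
nvlink_links = 12      # third-generation links between each GPU pair
link_bw = 25           # GB/s per direction per link

node_fp64_peak = gpus_per_node * fp64_tflops   # 38.8 TFLOPS FP64 per node
pair_bw = nvlink_links * link_bw               # 300 GB/s per direction per GPU pair

print(node_fp64_peak, pair_bw)
```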


CPU Nodes

(Figure: Perlmutter CPU node layout)

  • 2x AMD EPYC 7763 (Milan) CPUs
  • 64 cores per CPU
  • AVX2 instruction set
  • 512 GB of DDR4 memory total
  • 204.8 GB/s memory bandwidth per CPU
  • 1x HPE Slingshot 11 NIC
  • PCIe 4.0 NIC-CPU connection
  • 39.2 GFlops per core
  • 2.51 TFlops per socket
  • 4 NUMA domains per socket (NPS=4)
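The per-socket numbers above are consistent with each other: 64 cores at 39.2 GFLOPS each give the quoted socket peak, and the quoted bandwidth matches eight channels of DDR4-3200. The channel count and memory speed are not stated on this page and are assumptions here:

```python
cores = 64
gflops_per_core = 39.2                                # from the list above
socket_peak_tflops = cores * gflops_per_core / 1000   # ≈ 2.51 TFLOPS per socket

# Assumption: 8 memory channels of DDR4-3200 (3200 MT/s), 8 bytes per transfer.
channels, mts, bytes_per_transfer = 8, 3200, 8
mem_bw_gbs = channels * mts * bytes_per_transfer / 1000   # 204.8 GB/s per CPU

print(round(socket_peak_tflops, 2), mem_bw_gbs)
```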


Login nodes

  • 2x AMD EPYC 7713 (Milan) CPUs
  • 512 GB of memory in total
  • 1x NVIDIA A100 GPU with 40 GiB of memory
  • Two NICs connected via PCIe 4.0
  • 960 GB of local SSD scratch
  • 1 NUMA node per socket (NPS=1)

Large Memory nodes

Note

These nodes are not yet available for user jobs.

  • 2x AMD EPYC 7713 (Milan) CPUs
  • 1 TB of memory in total
  • 1x NVIDIA A100 GPU with 40 GiB of memory
  • Two NICs connected via PCIe 4.0
  • 960 GB of local SSD scratch
  • 1 NUMA node per socket (NPS=1)

Storage

The Perlmutter scratch file system is all-flash, with 35 PB of usable space, an aggregate bandwidth of more than 5 TB/s, and 4 million IOPS (4 KiB random). It comprises 16 metadata servers (MDSs), 274 I/O servers (OSSs), and 3,792 dual-ported NVMe SSDs.
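Dividing the aggregate figures evenly gives a rough per-server picture. These are averages only, an illustrative sketch; actual per-OSS performance and drive assignment may vary:

```python
aggregate_bw_tbs = 5.0   # quoted as ">5 TB/s", so treat this as a lower bound
oss_count = 274
ssd_count = 3792

per_oss_gbs = aggregate_bw_tbs * 1000 / oss_count   # ≈ 18 GB/s per OSS, at minimum
ssds_per_oss = ssd_count / oss_count                # ≈ 13.8 SSDs per OSS on average

print(round(per_oss_gbs, 1), round(ssds_per_oss, 1))
```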

See the Perlmutter scratch usage documentation for details.