# Cori Large Memory Nodes (aka cmem Nodes)

Cori has a set of 20 large memory nodes, each with 2 TB of memory and two 3.0 GHz AMD EPYC 7302 (Rome) processors. The nodes are available to high-priority scientific or technical campaigns that have a special need for this hardware. The initial focus is on supporting COVID-19 related research and on preparing for the Perlmutter system, which will have a similar AMD processor.

## Node Features

As stated above, there are twenty large memory nodes: cmem01, ..., cmem20. Each of these nodes contains:

• Two sockets, each populated with one 16-core AMD EPYC 7302 (Rome) processor running at 3.0 GHz
• Theoretical double-precision peak speed of 48 Gflops per core and 1.536 Tflops per node
• 2 TB of RAM
• 3 TB of NVMe SSD local scratch disk, mounted as /tmp
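The peak figures in the list above follow from the clock rate and the core count. The sketch below reproduces that arithmetic under one assumption that this page does not state: each Zen 2 core can complete 16 double-precision floating-point operations per cycle (two 256-bit fused multiply-add units, four doubles each, two operations per element). The constant and function names are illustrative only.

```python
# Theoretical double-precision peak for one large memory node.
CLOCK_GHZ = 3.0          # base clock from the node description
FLOPS_PER_CYCLE = 16     # assumption: 2 FMA units x 4 doubles x 2 ops per core
CORES_PER_SOCKET = 16
SOCKETS_PER_NODE = 2


def peak_gflops_per_core(clock_ghz=CLOCK_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak Gflops of a single core: cycles per ns times flops per cycle."""
    return clock_ghz * flops_per_cycle


def peak_tflops_per_node():
    """Peak Tflops of a two-socket node with all cores counted."""
    cores = CORES_PER_SOCKET * SOCKETS_PER_NODE
    return peak_gflops_per_core() * cores / 1000.0


print(peak_gflops_per_core())  # 48.0
print(peak_tflops_per_node())  # 1.536
```

These are upper bounds at the base clock; sustained application performance is normally well below them.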

AMD EPYC processors use a multi-chip module (MCM) design in which the CPU and I/O components are placed on separate dies for easier scalability. The CPU dies are called CCDs (Core Complex Dies) and the I/O dies are called IODs.

An AMD Zen2 core in the Rome processor can support Simultaneous Multithreading (SMT), allowing 2 execution threads (aka hardware threads) to execute simultaneously per core. Each core has its own 32-KB L1 data and 512-KB L2 caches.

In the general Rome design, four cores share a single 16-MB L3 cache and are grouped as a modular unit called a Core Complex (CCX). In the EPYC 7302, only two of the four cores in each CCX are active, so each L3 cache is in fact shared by just two cores.

A CCD contains two CCXs, as depicted in the diagram below.

The EPYC 7302 processor has four CCDs and one IOD per socket, as shown below. All dies interconnect with each other via AMD's Infinity Fabric, sometimes referred to as the Global Memory Interconnect (GMI).
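The die, core complex, and cache counts described above can be tallied in a few lines. This is only a bookkeeping sketch using the numbers given in the text; the names are illustrative, and the hardware-thread count applies only if SMT is enabled, which this page does not state for the cmem nodes.

```python
# Per-socket layout of the EPYC 7302 as described in the text.
CCDS_PER_SOCKET = 4
CCXS_PER_CCD = 2
ACTIVE_CORES_PER_CCX = 2   # only 2 of the 4 cores in each CCX are active
L3_MB_PER_CCX = 16
THREADS_PER_CORE = 2       # SMT; assumes it is enabled
SOCKETS_PER_NODE = 2


def socket_summary():
    """Return the counts for one socket."""
    ccxs = CCDS_PER_SOCKET * CCXS_PER_CCD
    cores = ccxs * ACTIVE_CORES_PER_CCX
    return {
        "ccxs": ccxs,
        "cores": cores,
        "l3_mb": ccxs * L3_MB_PER_CCX,
        "hw_threads": cores * THREADS_PER_CORE,
    }


def node_summary():
    """Return the same counts for a two-socket node."""
    return {key: value * SOCKETS_PER_NODE for key, value in socket_summary().items()}


print(socket_summary())  # {'ccxs': 8, 'cores': 16, 'l3_mb': 128, 'hw_threads': 32}
print(node_summary())    # {'ccxs': 16, 'cores': 32, 'l3_mb': 256, 'hw_threads': 64}
```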

The CCDs connect to memory, I/O, and each other through the IOD. A Rome processor supports 8 memory controllers. Each memory controller supports 2 DIMMs (DDR4-3200, i.e., 3200 MT/s), for a maximum theoretical memory bandwidth of 204.8 GB/s per socket, or 409.6 GB/s per two-socket node.
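As a check on the bandwidth arithmetic, the sketch below multiplies the number of memory channels by the transfer rate and the channel width. It assumes one 64-bit (8-byte) DDR4 channel per memory controller and a rate of 3200 mega-transfers per second; the function name is illustrative.

```python
def peak_memory_bandwidth_gbs(channels=8, mega_transfers_per_s=3200,
                              bytes_per_transfer=8, sockets=1):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1000 MB)."""
    mb_per_s = sockets * channels * mega_transfers_per_s * bytes_per_transfer
    return mb_per_s / 1000


print(peak_memory_bandwidth_gbs())           # 204.8 (one socket)
print(peak_memory_bandwidth_gbs(sockets=2))  # 409.6 (two-socket node)
```

Populating two DIMMs per channel adds capacity but does not add channels, so it does not raise this peak.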

The IOD can be configured for different NUMA node topologies. In case of the EPYC 7302 processor, it can be configured for 4, 2, and 1 NUMA nodes per socket as well as a single NUMA domain over the entire two sockets. These are denoted by NPS4, NPS2, NPS1, and NPS0, respectively. In addition, there is an option of exposing each L3 cache as a NUMA node, in which case a large memory node would have 16 NUMA nodes.
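The number of NUMA nodes that a two-socket large memory node exposes under each setting can be worked out as follows. This is a sketch of the counting only; the `"L3"` label for the L3-as-NUMA option is a hypothetical name used here for illustration, not a BIOS setting name taken from this page.

```python
SOCKETS_PER_NODE = 2
L3_CACHES_PER_SOCKET = 8  # 4 CCDs x 2 CCXs, one L3 cache per CCX


def numa_nodes_per_node(setting):
    """Return the NUMA node count of a two-socket node for a given setting."""
    if setting == "NPS0":
        return 1  # one NUMA domain spanning both sockets
    if setting == "L3":  # hypothetical label: each L3 cache exposed as a NUMA node
        return SOCKETS_PER_NODE * L3_CACHES_PER_SOCKET
    per_socket = {"NPS1": 1, "NPS2": 2, "NPS4": 4}
    if setting not in per_socket:
        raise ValueError(f"unknown setting: {setting}")
    return SOCKETS_PER_NODE * per_socket[setting]


for name in ("NPS4", "NPS2", "NPS1", "NPS0", "L3"):
    print(name, numa_nodes_per_node(name))
```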

The current configuration for the large memory nodes is NPS1.
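With NPS1 on a two-socket node, one would expect the operating system to report two NUMA nodes. On Linux, the kernel lists NUMA nodes as `node0`, `node1`, ... directories under `/sys/devices/system/node`, each with a `cpulist` file. The sketch below reads that layout; it assumes a Linux system with sysfs mounted, and the function name is illustrative.

```python
import os
import re


def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: cpulist string} for each NUMA node directory found."""
    nodes = {}
    if not os.path.isdir(sysfs_root):
        return nodes  # not Linux, or sysfs not available
    for name in os.listdir(sysfs_root):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue  # skip entries such as "online" or "possible"
        try:
            with open(os.path.join(sysfs_root, name, "cpulist")) as f:
                cpus = f.read().strip()
        except OSError:
            cpus = ""
        nodes[int(match.group(1))] = cpus
    return dict(sorted(nodes.items()))


if __name__ == "__main__":
    for node_id, cpus in list_numa_nodes().items():
        print(f"NUMA node {node_id}: CPUs {cpus}")
```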

## File Systems

The usual Cori file systems are available on the large memory nodes.

Each node also has

• A 3 TB local /tmp SSD partition.

This file system can be used for fast I/O on the input and output files of your runs. As a proxy for its I/O speed, we ran the IOR benchmark; the figures below are MPI-IO rates in GB/s from 32-process runs on one node, with both the transfer size and the block size set to 1 MB. SSF denotes a single shared file used for collective I/O, and FPP denotes a separate file per process.

|  |  |  |  |  |
|---|---|---|---|---|
| 1 GB | 6.4 | 2.4 | 86.1 | 40.1 |
| 32 GB | 5.1 | 2.2 | 171.5 | 56.4 |
| 1 TB | 4.8 | 1.8 | 176.7 | 5.4 |

Applications with different I/O patterns, or those running on a shared node, may see different results.

Warning

Files in /tmp are not persistent; they are removed when your job finishes.
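Because /tmp is wiped at the end of the job, any results written there need to be copied to a persistent file system before the job finishes. One possible pattern is sketched below: stage the input into a private directory under /tmp, run the work there, then copy the result back. The function `run_with_local_scratch` and the `work` callback are hypothetical names for illustration, not part of any NERSC tool.

```python
import os
import shutil
import tempfile


def run_with_local_scratch(input_path, output_dir, work, scratch_root="/tmp"):
    """Stage input to node-local scratch, run work(), and copy its result back.

    work(local_input, workdir) must return the path of the file it produced.
    Returns the path of the copied result inside output_dir.
    """
    workdir = tempfile.mkdtemp(dir=scratch_root)  # private directory on the SSD
    try:
        local_input = shutil.copy(input_path, workdir)
        result_path = work(local_input, workdir)
        os.makedirs(output_dir, exist_ok=True)
        return shutil.copy(result_path, output_dir)  # persist before job ends
    finally:
        shutil.rmtree(workdir, ignore_errors=True)   # tidy up on a shared node
```

The `finally` clause removes the scratch directory even if the work step fails, which matters when the node is shared with other jobs.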

## Network

The large memory nodes are not connected to Cori's Aries high-speed network. Multi-node applications can use Open MPI to communicate over an InfiniBand network.

## Programming Environment

The user environment is similar to that of a Cori login node. However, the large memory nodes have AMD processors, unlike the rest of Cori, which has Intel processors. You will need to (re)compile your codes to run on the AMD hardware.
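A build or launch script that is shared between Intel and AMD nodes may want to check which kind of processor it is running on before choosing a binary. On x86 Linux systems, `/proc/cpuinfo` carries a `vendor_id` field (`AuthenticAMD` or `GenuineIntel`). The sketch below reads it; the function names are illustrative, and how the result is used to select a build is left to the reader.

```python
def cpu_vendor(cpuinfo_text):
    """Return the first vendor_id value found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "vendor_id":
            return value.strip()
    return None


def read_cpu_vendor(path="/proc/cpuinfo"):
    """Read the CPU vendor of the current host, or None if unavailable."""
    try:
        with open(path) as f:
            return cpu_vendor(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    vendor = read_cpu_vendor()
    if vendor == "AuthenticAMD":
        print("AMD processor detected: use a binary built for this node")
    else:
        print(f"CPU vendor: {vendor}")
```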