Message Routing on the Cray Dragonfly Network
Cori employs the "Dragonfly" topology for the interconnection network. This topology is a group of interconnected local routers connected to other similar router groups by high-speed global links. The groups are arranged such that data transfer from one group to another needs to cross only one global link. For more information about the network please see the Cori Interconnect page.
Messages that travel between two nodes on the network are routed over the Cray Aries dragonfly network. Generally, on a low-traffic network, a message will take the shortest path between two nodes. But on a congested network, a message might take a longer route that takes less time than the direct route, taking a detour around a traffic jam.
Network congestion can significantly impact application performance on HPC systems. The Cray Aries dragonfly network implements adaptive routing to provide alternative routes in the presence of congestion. In the default case data traverses the minimal route through the network. However, as congestion is detected in the network, the traffic adjusts to take an alternative path as illustrated in the figure below:
In this figure, congestion (indicated by the lightning bolt) is detected on the red minimal path link between Node A and Node B, causing the data to take the alternate route (indicated in dark bold blue).
The switch from a minimal to a non-minimal path in the network can be configured via several Cray environment variables. MPICH_GNI_ROUTING_MODE controls the routing policy for all Cray MPI traffic except all-to-all collectives, which are controlled by MPICH_GNI_A2A_ROUTING_MODE. These environment variables can be set to the following:
ADAPTIVE_0: Least bias towards minimal; most likely to take alternate route in event of congestion
ADAPTIVE_1: Slight bias towards minimal
ADAPTIVE_2: Moderate bias towards minimal
ADAPTIVE_3: High bias towards minimal; least likely to take alternate route in event of congestion
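As a minimal sketch of how these variables might be set in a job script, the snippet below exports both before the application is launched. The chosen values are placeholders rather than recommendations, and the `srun` line (task count and the `my_app` binary) is hypothetical, so it is left commented out.

```shell
# Routing policy for all Cray MPI traffic except all-to-all collectives.
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_1

# Routing policy for all-to-all collectives only.
export MPICH_GNI_A2A_ROUTING_MODE=ADAPTIVE_0

# Record the settings in the job output so runs can be compared later.
echo "routing: default=${MPICH_GNI_ROUTING_MODE} a2a=${MPICH_GNI_A2A_ROUTING_MODE}"

# Hypothetical launch line; replace with your own application and task count.
# srun -n 64 ./my_app
```

Exporting the variables before the launch line means every MPI rank started by that launch sees the same routing mode.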
Pros and Cons of Minimal and Non-Minimal Routing
Users wanting to try different routing modes should consider the pros and cons of the adaptive settings.
Minimal Bias (ADAPTIVE_2)
- Pros:
  - Lower best-case latency
  - Fewer false positives (end-point congestion can’t be avoided by routing)
- Con:
  - Bisection-bandwidth bound applications will not perform as well
Non-minimal Bias (ADAPTIVE_0)
- Pro:
  - Alternate route to bypass intermediate congestion
- Cons:
  - Switching routes may force a flush of data on the route, incurring delay
  - Double best-case latency
  - If an application is creating congestion for itself, it may just propagate the congestion across more routers by taking the longer route, which in turn slows down other applications
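One way to weigh these trade-offs for a particular application is to time the same run under each mode. The loop below is a rough sketch of that idea. `APP_CMD` is a hypothetical variable holding your launch command, and it defaults to the no-op `true` so the loop can be tried anywhere. Wall-clock seconds are a coarse measure, and because congestion from other jobs varies, a single run per mode is unlikely to be conclusive.

```shell
# Hypothetical launch command; for example APP_CMD="srun -n 64 ./my_app".
APP_CMD=${APP_CMD:-true}

tried=""
for mode in ADAPTIVE_0 ADAPTIVE_1 ADAPTIVE_2 ADAPTIVE_3; do
    start=$(date +%s)
    # Set the routing mode for this one command only.
    env MPICH_GNI_ROUTING_MODE="$mode" $APP_CMD
    end=$(date +%s)
    echo "${mode}: $((end - start)) s"
    # Keep a space-separated record of the modes that were run.
    tried="${tried}${tried:+ }${mode}"
done
```

Using `env` scopes the setting to a single launch, so the surrounding shell environment is left unchanged between iterations.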
These recommendations are for the most common job sizes at NERSC (512 nodes and under). While larger jobs may benefit from them as well, we do not have sufficient data to make a recommendation for full system jobs.
There is no single setting that is universally best for all applications. However, we have characterized the workloads that typically run on NERSC systems and believe that ADAPTIVE_3 (high minimal bias) provides the best experience for the majority of our workloads; it is the default setting on Cori. This is because many of our applications are limited by latency-bound, small-message (e.g. 8-byte) MPI_Allreduce or similar operations that depend on the slowest process. A strong preference for the minimal path favors lower best-case latencies and also reduces the likelihood of triggering a non-minimal route due to incast congestion. MILC is an example of an application that benefits from high minimal bias.
If your application is both bandwidth intensive (this generally means message sizes > 16KiB) and communicates across the bisection of the network (think all-to-all operations, 3D FFTs, transposes), it may benefit from a stronger bias towards non-minimal routing. Many applications will not need to specify an alternative value for MPICH_GNI_ROUTING_MODE, since these bandwidth-intensive operations occur in MPI_Alltoall (which is configured to a non-minimal bias by the separate variable MPICH_GNI_A2A_ROUTING_MODE). In some scenarios, however, these operations may be implemented through point-to-point send/recv operations. In these instances better performance is generally achieved by setting MPICH_GNI_ROUTING_MODE=ADAPTIVE_0. HACC is an example of an application in this category.
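The rule of thumb above can be written down as a small shell function. This is only an illustrative sketch: the function name and its yes/no arguments are hypothetical, the 16 KiB threshold is the approximate figure from the text, and the result is a starting point for experiments rather than a guarantee of better performance.

```shell
# suggest_routing_mode MSG_BYTES CROSSES_BISECTION USES_ALLTOALL
#   MSG_BYTES          typical message size in bytes
#   CROSSES_BISECTION  "yes" if traffic crosses the network bisection
#   USES_ALLTOALL      "yes" if that traffic goes through MPI_Alltoall
# Hypothetical helper, not a NERSC or Cray tool.
suggest_routing_mode() {
    if [ "$1" -gt 16384 ] && [ "$2" = yes ] && [ "$3" = no ]; then
        # Large point-to-point messages across the bisection.
        echo ADAPTIVE_0
    else
        # Latency-bound traffic, or all-to-all traffic that is already
        # governed by MPICH_GNI_A2A_ROUTING_MODE: keep the Cori default.
        echo ADAPTIVE_3
    fi
}

# Example: 64 KiB point-to-point messages in a hand-written transpose.
export MPICH_GNI_ROUTING_MODE=$(suggest_routing_mode 65536 yes no)
echo "MPICH_GNI_ROUTING_MODE=${MPICH_GNI_ROUTING_MODE}"
```

The function only ever chooses between the two ends of the range; the intermediate modes ADAPTIVE_1 and ADAPTIVE_2 are left for manual experimentation.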
More information about these and additional Cray MPI environment variables can be found via man intro_mpi on Cori.