Optimizing I/O on KNL¶
Cori KNL has a different architecture from Cori Haswell. To achieve the best I/O performance, applications need to leverage KNL's new features and avoid using the chip improperly. Based on our studies and interactions with different applications, we have collected the following best practices for KNL I/O.
Core Specialization¶
Core specialization can isolate system overhead to designated cores on a compute node. For example, to reserve 4 cores per node for the system:
#SBATCH -S 4
With core specialization, we increased the 32-node HDF5 collective I/O bandwidth from 6.3 GB/sec to 7.6 GB/sec.
Process Affinity¶
When a node is not fully packed, process affinity is important for balancing the workload and leveraging memory locality. For example, to spread 4 processes evenly across a KNL node:
srun -n 4 -c 64 --cpu-bind=cores <application>
"That does the trick. The time for reading the WRF restart file is now 36 seconds, (was 300 seconds)" --John Michalakes, UCAR
Direct I/O vs. Buffered I/O¶
Direct I/O bypasses the page buffer on the client side, allowing data to be read from disk directly into the user buffer, which in some cases benefits an application's I/O. In our preliminary study, we found that KNL's page buffer management may be slower than Haswell's, possibly due to its lower CPU frequency. The performance gap between KNL and Haswell narrows as I/O buffering layers are removed, as shown in the following plot.
Avoiding the page buffer is reasonable for a few specific I/O patterns, e.g., highly random reads, or large sequential reads/writes whose volume is comparable to or larger than the cache. Specifically, we have the following guidelines:
- Use the default (buffered I/O) setting in most cases, e.g., for small reads, repeated reads and writes, or more complex I/O patterns.
- Consider direct I/O if you want to manage the page buffer yourself, or when the I/O is so large that the benefits of the page buffer and Lustre readahead diminish, e.g., when I/O per node is much larger than 40% (the dirty ratio) of node memory.
Turning on direct I/O requires good knowledge of the I/O interface and the underlying file system: direct I/O imposes strict requirements on memory alignment (512 bytes), I/O transaction size (a multiple of 512 bytes), etc., and a misconfigured request often falls back silently to buffered I/O. Direct I/O can be turned on at different layers, e.g., MPI-IO, POSIX, or HDF5. Among the options we have tested are the POSIX open() call with the O_DIRECT flag, together with a prior

setenv MPIO_DIRECT_READ TRUE

These options span programming interfaces, command-line parameters, and environment variables. In the case of HDF5, we saw as much as an 11% speedup.
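As a concrete illustration of the POSIX route, here is a minimal C sketch that opens a file with O_DIRECT and reads into an aligned buffer. The file name and transfer size are placeholders, and 512 bytes is the commonly required alignment; check your file system's documentation for the actual value. (At the HDF5 layer, the analogous switch is the direct VFD, e.g., H5Pset_fapl_direct(), available when HDF5 is built with direct VFD support.)

```c
/* Minimal sketch: reading a file with POSIX direct I/O.
 * The file name "input.dat" and the 1 MiB transfer size are
 * placeholders, not recommendations. */
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t xfer = 1 << 20;   /* 1 MiB, a multiple of 512 */
    void *buf = NULL;

    /* The user buffer must be aligned for O_DIRECT. */
    if (posix_memalign(&buf, 512, xfer) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* O_DIRECT bypasses the client-side page cache. */
    int fd = open("input.dat", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");   /* fails if the FS rejects O_DIRECT */
        free(buf);
        return 1;
    }

    /* Transfer sizes and file offsets must also be 512-byte aligned. */
    ssize_t n = read(fd, buf, xfer);
    if (n < 0)
        perror("read");

    close(fd);
    free(buf);
    return 0;
}
```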
Tuning Collective I/O¶
Collective I/O is one of the most important I/O optimizations in ROMIO (an MPI-IO implementation). Tuning the collective I/O buffer size helps aggregate small I/O transactions into large contiguous ones and reduces the total number of I/O transactions. Tuning collective I/O is also essential because KNL has higher inter/intra-node communication latency than Haswell.
Better I/O bandwidth can be achieved with a hand-optimized buffer size, which depends on your application; a sketch of setting it through an MPI-IO hint follows.
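The following is a minimal C sketch of supplying the ROMIO collective buffer size as an MPI-IO hint. The 16 MiB value and the file name are assumptions for illustration; the best buffer size has to be found experimentally for your application and stripe settings.

```c
/* Minimal sketch: tuning the ROMIO collective buffer size via an
 * MPI_Info hint. "output.dat" and 16 MiB are placeholder choices. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Size (in bytes) of the intermediate buffer used by the
     * collective I/O aggregators. */
    MPI_Info_set(info, "cb_buffer_size", "16777216");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes, e.g., MPI_File_write_at_all() ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

On Cray systems the same hints can usually be passed without code changes through the MPICH_MPIIO_HINTS environment variable.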
Parallelizing I/O¶
- Multi-process/multi-core (e.g., MPI): given that single-stream I/O on KNL is slower than on Haswell, parallelizing the I/O across multiple cores can increase the bandwidth.
- Multi-threading (e.g., OpenMP): exploiting multiple threads is also beneficial to I/O bandwidth.
The following results are from a multi-threaded read test on KNL conducted by Dr. Elliott Slaughter (Stanford). The benchmark code simulates the I/O of the Stanford Legion program with simple contiguous pread calls. The first plot shows that on Cori KNL (on $SCRATCH), 16 to 32 threads on a single core can saturate the bandwidth (see the red bar); interestingly, when the threads are bound to multiple cores, the I/O on KNL scales well up to a full rack (note that 4 cores are preallocated for core specialization). In the second plot, I/O to the burst buffer from a single KNL core is able to saturate the bandwidth.
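The benchmark itself is not reproduced here; the following is a minimal C sketch of the same pattern, with each thread issuing one contiguous pread() on its own disjoint slice of a shared file. The thread count, slice size, and file name are assumptions for illustration.

```c
/* Minimal sketch: multi-threaded contiguous reads from one shared
 * file, in the spirit of the test described above. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 16
#define SLICE    (16u << 20)    /* 16 MiB per thread */

static int fd;                  /* shared; pread() is thread-safe */

static void *reader(void *arg)
{
    long id = (long)arg;
    char *buf = malloc(SLICE);
    if (buf == NULL)
        return NULL;
    /* Each thread reads a disjoint contiguous region of the file. */
    ssize_t n = pread(fd, buf, SLICE, (off_t)id * SLICE);
    if (n < 0)
        perror("pread");
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    fd = open("input.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, reader, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}
```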
Reference: Understanding the I/O Performance Gap Between Cori KNL and Haswell, CUG 2017