Getting Started and Optimization Strategy¶
There are several important differences between the Cori KNL ("Knights Landing", or Xeon Phi) node architecture and the Xeon (Cori Haswell) node architecture. This page walks you through the high-level steps to prepare an application to perform well on Cori KNL.
KNL vs Haswell¶
Cori KNL is a "many-core" architecture, meaning that instead of a few cores optimized for latency-sensitive code, Cori KNL nodes have many (68) cores optimized for vectorized code. Some key differences are:
| Cori Intel Xeon Phi (KNL) | Cori Haswell (Xeon) |
|---|---|
| 68 physical cores on one socket | 16 physical cores on each of two sockets (32 total) |
| 272 virtual cores per node | 64 virtual cores per node |
| 1.4 GHz clock speed | 2.3 GHz clock speed |
| 8 double-precision operations per cycle | 4 double-precision operations per cycle |
| 96 GB of DDR memory and 16 GB of MCDRAM high-bandwidth memory | 128 GB of DDR memory |
| ~450 GB/sec memory bandwidth (MCDRAM) | |
| 512-bit-wide vector units | 256-bit-wide vector units |
Optimizing for KNL¶
There are three important areas of improvement to consider for Cori KNL:
- Evaluating and improving your Vector Processing Unit (VPU) utilization and efficiency. As shown in the table above, the Cori KNL processors have 512-bit vector units, eight double-precision words wide. This means that if your code produces scalar rather than vector instructions, you miss out on a potential 8x speedup. Vectorization is described in more detail in Vectorization.
- Identifying more node-level parallelism and exposing it in your application. An MPI+X programming approach is encouraged, where MPI provides the inter-node communication layer and X is an explicit intra-node parallelization layer; X could again be MPI, or OpenMP, pthreads, a PGAS model, etc.
- Evaluating and optimizing for your memory bandwidth and latency requirements. Many codes run at NERSC are limited not by CPU clock speed or vector width but by the time spent waiting for memory accesses. The memory hierarchy of Cori KNL nodes differs from that of Haswell nodes, so while memory-bandwidth optimizations benefit both, each architecture benefits from different optimizations to different degrees.