Python on Cori KNL

The many-core Intel Xeon Phi (Knights Landing, KNL) architecture presents new opportunities for science applications to scale within a compute node through parallelism at the thread and vector-register level. For Python applications, however, the KNL architecture poses numerous challenges as well.

How important is it for me to migrate Python code to KNL?

How long do you expect your code to last?

If the performance of your code on Haswell is satisfactory and the life cycle of your project is such that you will be able to continue using that architecture for the next few years, you may not need to worry about KNL right now at all. However, if you plan continued development and use of your code beyond 2020, you may want to think about what KNL and similar architectures mean for your application.

How quickly do you need your results?

Another question to consider is queue wait time. Given the popularity of the Cori Haswell nodes, as well as the considerably larger KNL partition (~4x larger than the Haswell partition), you may find that you can run your jobs with much faster turnaround in the KNL queue.

Won't the developers of Python just fix all these problems for us?

Perhaps, but so far they have not. We suggest that the risk of taking such a cavalier attitude is too great for users of Python in HPC, and in general we always recommend future-proofing code.

  • Python applications that do not already take advantage of on-node parallelism on Cori's Haswell processors can be expected to deliver markedly worse performance on Cori KNL. While the KNL processors are more energy-efficient, their clock rate and the number of instructions they retire per cycle are much lower than on other architectures.
  • Code written in Python that takes advantage of threaded "performance" libraries written in C, C++, or Fortran (along the lines of the Ousterhout Dichotomy) may be able to take advantage of the larger number of CPUs per node. Libraries such as numpy and scipy, built on top of Intel MKL, use OpenMP to deliver thread-level parallelism and include specialized vectorization instructions.
  • However, using threaded and vectorized performance library calls may not be enough. Even when Python code delegates its heavy computation to performance libraries, any remaining time spent in the serial Python interpreter becomes a bottleneck (see Amdahl's Law, and the sketch after this list). It is therefore important that code spend as much time as possible doing computations in threaded/vectorized performance libraries.
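
As a minimal sketch of this interpreter-level bottleneck (the array size and timing method here are arbitrary), compare a reduction written as a pure-Python loop with the same reduction performed in a single numpy call:

import time
import numpy as np

x = np.random.rand(10**7)

# Pure-Python loop: every iteration pays interpreter overhead.
t0 = time.time()
total = 0.0
for v in x:
    total += v
t_loop = time.time() - t0

# Single library call: the loop runs in compiled (and vectorized) code.
t0 = time.time()
total = x.sum()
t_lib = time.time() - t0

print("loop: %.3fs  numpy: %.3fs" % (t_loop, t_lib))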

The above issues (and others) should give Python developers pause. Python is a powerful language for productive programming in the sciences and data analysis, but this increased productivity often comes at the price of decreased performance.

This doesn't mean that you should abandon Python. It does, however, mean that you will need to work to obtain performance on future architectures like KNL. Below we discuss tools, techniques, and skills that Python developers can adopt or learn to help migrate code to Cori KNL. Much of the information presented here comes from input from the Intel Python team and from work done in the NESAP for Data program.

Suggestion 1 - Consider the Intel Distribution for Python

As documented here, NERSC provides software modules for Anaconda Python, including both Python 2 (supported until Jan 1, 2020) and Python 3. The Anaconda Python distribution includes a number of optimized libraries that leverage Intel's expertise, particularly the Intel Math Kernel Library (MKL). These libraries include numpy, scipy, numba, and others. Most importantly, these libraries are threaded and include vector optimizations critical for maximum performance on KNL.
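
If you want to verify which BLAS/LAPACK implementation your numpy build uses, a quick check (this is standard numpy, not NERSC-specific) is:

import numpy as np

# Print the build configuration; MKL-backed builds report mkl libraries.
np.show_config()

The number of threads these MKL-backed libraries use can then typically be controlled through the OMP_NUM_THREADS or MKL_NUM_THREADS environment variables.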

In 2016 Intel released its own distribution of Python. The relationship between Intel Python and Anaconda Python is very close: the two products are not developed in competition; rather, Continuum Analytics (the company behind Anaconda Python) and the Intel Python team work closely together to deliver the maximum performance of Intel hardware to Python users.

The Intel Distribution for Python provides the MKL optimizations described above, and in addition provides TBB (the Threading Building Blocks library) and interfaces to DAAL (the Data Analytics Acceleration Library). TBB in particular enables users to compose threads across threaded library calls and avoid thread oversubscription.
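
As a concrete example of this thread composition, Intel documents running an existing script under TBB's composable scheduler via the tbb module; a sketch, with the script name hypothetical:

python -m tbb my_threaded_script.py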

Users can try the Intel Distribution for Python through a conda environment. At NERSC you can use the following procedure:

module load python/2.7-anaconda-2019.07
conda create -n idp -c intel intelpython2_core python=2
source activate idp
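
To confirm that the Intel build is active in the new environment, you can inspect the interpreter's version banner, where the Intel distribution typically identifies itself:

python -c "import sys; print(sys.version)"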

Suggestion 2 - Add Numba to your code

What does KNL have that Haswell doesn't? Vector units. If you can effectively utilize these vector units, you'll be tapping into one of the major strengths that KNL has to offer. "But", you ask, "how can I do this from Python?"

Fear not, fellow Python programmer! If you are calling optimized libraries like numpy and scipy, they should already be taking advantage of vectorization where possible. In other cases, Numba can help you vectorize with ease. Numba is a library that compiles Python code either ahead of time (AOT) or just in time (JIT). It uses the LLVM compiler infrastructure to compile your selected Python function(s) at runtime and usually delivers anywhere from modest to major speedups. On KNL these speedups often come from the compiled code's ability to use the vector units.

Here is an example that demonstrates the power of vectorization on KNL: a Numba-compiled function whose speedup over the original Python implementation is substantially larger on KNL (~16x) than on Haswell (~5x):

import numba
import numpy as np

@numba.jit(nopython=True, cache=True)
def legval_speedup(x, c):
    """Evaluate a Legendre series at points x for coefficients c
    (assumes len(c) >= 3), following the recurrence used by
    numpy.polynomial.legendre.legval."""
    nd = len(c)
    c0 = c[-2]*np.ones(len(x))
    c1 = c[-1]*np.ones(len(x))
    for i in range(3, len(c) + 1):
        tmp = c0
        nd = nd - 1
        nd_inv = 1.0/nd  # float division (safe under Python 2 semantics)
        c0 = c[-i] - (c1*(nd - 1))*nd_inv
        c1 = tmp + (c1*x*(2*nd - 1))*nd_inv
    return c0 + c1*x
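
A quick way to sanity-check the compiled function (the array sizes and coefficients here are arbitrary) is to compare its output against numpy's reference implementation:

import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1, 1, 1000)  # evaluation points
c = np.random.rand(8)         # Legendre series coefficients (len(c) >= 3)

result = legval_speedup(x, c)  # first call triggers JIT compilation
assert np.allclose(result, legendre.legval(x, c))
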
By adding Numba to a few key functions, you may be surprised at how much your runtime on KNL improves. For more information on profiling and optimizing your Python code, please see our profiling page.

Known KNL Issues

Load imbalance in multiprocessing

In Python multiprocessing, process spawning is delegated to the operating system, and processes are distributed with some load imbalance. When using a multiprocessing Pool on the Haswell architecture this is hardly noticeable; on KNL the load imbalance can be substantial. For more about using multiprocessing at NERSC please see this page.
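
One general mitigation (a standard multiprocessing technique, not specific to NERSC's guidance) is to hand out work in smaller chunks so that faster workers can pick up extra tasks; the task function and sizes below are hypothetical:

import multiprocessing as mp

def work(item):
    # Placeholder for a task whose cost varies from item to item.
    return item ** 2

if __name__ == "__main__":
    pool = mp.Pool(processes=68)  # e.g., one worker per KNL core
    # chunksize=1 distributes tasks one at a time, trading dispatch
    # overhead for better load balance when task durations vary.
    results = pool.map(work, range(10000), chunksize=1)
    pool.close()
    pool.join()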

Environment Variable: KMP_AFFINITY=disabled

For process-level parallelism (e.g., multiprocessing) to work in Python on KNL, users are advised to set the KMP_AFFINITY variable to "disabled" as follows in bash:

export KMP_AFFINITY=disabled
This is especially important if there are any calls to performance libraries with OpenMP regions in them. The reason is that the first OpenMP region creates a CPU affinity mask that later prevents processes (as opposed to OpenMP threads) from being scheduled on any CPU other than the master. The symptom is a total lack of scaling as the number of requested processes increases.
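
If it is more convenient, the same setting can be applied from within Python itself, provided it happens before the OpenMP runtime initializes, i.e., before any threaded library is imported; a sketch under that assumption:

import os

# Must be set before the first OpenMP-backed library (e.g., an
# MKL-linked numpy) is imported, since the runtime reads it at startup.
os.environ["KMP_AFFINITY"] = "disabled"

import numpy as np  # imported after the variable is in place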

This is a known issue in Intel's OpenMP release that should be addressed with the next release of Intel OpenMP later in 2017. Until that release is made and installed at NERSC we advise users to use the setting above.