Many scientists appreciate Python's power for prototyping and developing scientific computing and data-intensive applications. However, creating parallel Python applications that scale well in modern high-performance computing environments can be challenging.
How to achieve parallelism in Python?¶
This is a complex question. There are several general approaches to obtaining parallelism in Python. Here is a high-level overview; a minimal example of process-level parallelism follows the list.
- Process-level parallelism
- Thread-level parallelism
- Higher-level frameworks that may use one or both of process- and thread-level parallelism
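As a minimal illustration of the first approach, process-level parallelism is available directly in the standard library through the multiprocessing module. The example below is only a sketch; the work function and the inputs are placeholders.

```python
# Minimal sketch of process-level parallelism using the standard library.
# The work function and the inputs are placeholders for illustration.
from multiprocessing import Pool

def work(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:          # four worker processes
        results = pool.map(work, range(16))  # distribute inputs across the processes
    print(results)
```

For multi-node runs, process-level parallelism is more commonly expressed with MPI (for example via mpi4py), which is the case the rest of this page is mostly concerned with.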
Importing Python Packages at Scale¶
When a Python application imports a module, the interpreter executes system calls to locate the module on the file system and expose its contents to the application. The number of system calls per "import" statement can be very large, because Python searches many paths for each import target. For instance, import scipy on Cori presently opens, or attempts to open, about 4000 files.
When a large number of Python tasks are running simultaneously, especially if they are launched with MPI, the result is many tasks trying to open the same files at the same time, causing contention and degradation of performance. Python applications running at the scale of a few hundred or a thousand tasks may take an unacceptable amount of time simply starting up.
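One quick way to see this startup cost on a given file system is simply to time a heavy import (a minimal sketch; the figure of roughly 4000 files above comes from tracing system calls, not from wall-clock timing):

```python
# Minimal sketch: measure how long a heavy import takes from the current file system.
import time

start = time.time()
import scipy  # the import being measured; any large package will do
print(f"import scipy took {time.time() - start:.2f} seconds")
```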
To address this problem, NERSC strongly advises users to build a Docker image containing their Python stack and use Shifter to run it. This is the best way to overcome the at-scale import bottleneck at NERSC. Shifter and some alternatives are described below.
Shifter: The Best Way to Run Python at Scale¶
Shifter is a technology developed at NERSC to provide scalable Linux container deployment in a high-performance computing environment. Shifter is very similar to Docker but runs without root privileges, which is a requirement at a shared facility like NERSC.
The idea is to package up an application and its entire software stack into a Linux container and distribute these containers to the compute nodes. This localizes the modules and any associated shared object libraries to the compute nodes, eliminating contention on the shared file system. Using Shifter results in tremendous speed-ups for launching larger process-parallel Python applications.
For more information about how to build and use mpi4py in a Shifter container, please see here.
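As a quick sanity check of such a container, a minimal mpi4py program like the one below (a sketch; it assumes mpi4py and an MPI library are installed in the image) can be launched at scale to confirm that imports and MPI startup complete promptly:

```python
# Minimal sketch: verify that Python imports and MPI initialization work at scale.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print(f"Hello from rank {rank} of {size}")
```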
/global/common¶
The /global/common file system is where NERSC projects share a directory (for example, m1759). It is mounted read-only on the compute nodes with client-side caching enabled; both features help mitigate the import bottleneck. Performance can be almost as good as with Shifter, but Shifter remains the best choice. Users can also install software to this read-optimized file system. Read more about how to do that here.
$SCRATCH or Community¶
The $SCRATCH file system is optimized for large-scale data I/O. It generally performs better than, say, the Community file system, but for importing Python packages it is not as fast as /global/common. Both $SCRATCH and Community also experience variable load. Users may install Python packages to $SCRATCH, but it is subject to the purge policy, so we do not recommend doing this. Community has no purge policy, but Python import performance at scale is worse there than on $SCRATCH, so NERSC does not recommend installing Python stacks to Community either.
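When experimenting with install locations, it can help to confirm which copy of a package is actually being imported. A small sketch, using numpy purely as an example:

```python
# Minimal sketch: report where a package is imported from,
# i.e. whether it lives on /global/common, $SCRATCH, Community, etc.
import numpy
print(numpy.__file__)
```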
There are a few other interventions we are aware of that can help users scale their Python applications at NERSC. One is to bundle up Python and its dependency stack and broadcast it to the compute nodes, where it is placed into /dev/shm. This has been described here.
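A rough sketch of that pattern using mpi4py is shown below. The tarball name and target path are illustrative only, and a real implementation (such as the one linked above) would also handle cleanup and sys.path setup.

```python
# Sketch: stage a pre-built Python stack tarball into node-local /dev/shm.
# Assumes a tarball named pystack.tar.gz already exists; all names are illustrative.
import io
import tarfile
from mpi4py import MPI

world = MPI.COMM_WORLD
# One sub-communicator per node, so a single rank per node does the unpacking.
node = world.Split_type(MPI.COMM_TYPE_SHARED)

payload = None
if world.Get_rank() == 0:
    # Read the tarball once from the shared file system.
    with open("pystack.tar.gz", "rb") as f:
        payload = f.read()
# Distribute the bytes over the interconnect instead of the file system.
payload = world.bcast(payload, root=0)

if node.Get_rank() == 0:
    # One writer per node extracts the stack into memory-backed storage.
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        tar.extractall("/dev/shm/pystack")
world.Barrier()  # wait until every node has a local copy
```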