Frequently Asked Questions and Troubleshooting¶
If you have questions about Python at NERSC, please take a look at this collection of common user questions and problems.
If this information does not help and you still have a problem, please open a ticket at
help.nersc.gov with the following information:
- Are you using a Python module? If so which one?
- Which of our 5 Python options are you using?
- If you are using a custom conda environment, what is its name?
- Have you checked your shell resource files for anything that may be causing your issues?
- How can we reproduce your error?
If you can provide us with this information right away, it will help us find and solve your problem more quickly.
Is Python broken?¶
If Python seems broken or is exhibiting odd or unexpected behavior, the first thing to do is check your shell resource files (also known as dotfiles).
Some developers like to add things to their shell resource files (e.g.
.bash_profile) to avoid having to type things over and over again. OK, we get it: nobody likes unnecessary typing. Dotfiles can be a good resource, but you should periodically check them to see if they need to be changed or updated. It is helpful to check these files for conflicting Python versions, modules, or additions to
PYTHONPATH that may be causing unexpected behavior in your Python setup.
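As a complement to reading your dotfiles, it can help to ask Python itself what it sees. The following is a minimal sketch, not an official NERSC diagnostic; the function name `python_environment_report` is made up for this illustration. It collects the interpreter path, version, and import paths, which are the settings most likely to reveal a conflict introduced by a dotfile.

```python
import os
import sys


def python_environment_report():
    """Collect settings that often explain unexpected Python behavior.

    Hypothetical helper for illustration; it only reads standard
    attributes of the running interpreter.
    """
    return {
        # Which interpreter is actually running (module, conda env, system?)
        "executable": sys.executable,
        # Interpreter version, e.g. "3.11.4"
        "version": sys.version.split()[0],
        # Root directory of the active installation or environment
        "prefix": sys.prefix,
        # Extra import paths, often set in dotfiles; empty string if unset
        "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
        # Directories searched for imports, in order
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in python_environment_report().items():
        print(f"{key}: {value}")
```

If the reported executable or PYTHONPATH is not what you expect, that is a good hint about which dotfile entry to look for. Including this output in a ticket may also make the problem easier to reproduce.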
Should I use Python 2 or Python 3?¶
Python 3! Python 2 reached its end of life on Jan 1, 2020. Python 2 will remain on Cori for now, but will not be available on Perlmutter.
If you are still using Python 2 at NERSC, you may have noticed our warning:
ATTENTION: Python 2 reached end-of-life Jan 1, 2020. We urge you to transition to Python 3.
Developers of many packages including NumPy, SciPy, Matplotlib, pandas, and scikit-learn pledged to drop support for Python 2 "no later than 2020." You can expect support for all Python 2 libraries to continue to wither away. Using Python 2 past end of life is a risk as new issues will likely go unaddressed by developers. You may already have noticed deprecation warnings from your Python applications' outputs; please do not ignore these warnings.
Can I install my own Anaconda Python "from scratch?"¶
Yes. One reason you might consider this is that you want to install Anaconda Python on
/global/common/software or in a Shifter image to improve launch-time performance for large-scale applications. Or you might want more complete control over what versions of packages are installed and don't want to worry about whether NERSC will upgrade packages to versions that break backwards compatibility you depend on. See here for more information on how you can do this.
How do I use the Intel Distribution for Python at NERSC?¶
Intel Math Kernel Library (MKL), Data Analytics Acceleration Library (DAAL), Threading Building Blocks (TBB), and Integrated Performance Primitives (IPP) are available through Intel Community Licensing. This enabled both Continuum Analytics and Intel to provide access to Intel's performance libraries through Python for free.
Create a conda environment for your Intel Distribution for Python installation:
module load python
conda create -n idp -c intel intelpython3_core python=3
source activate idp
Can I use virtualenv on Cori?¶
The virtualenv tool is not compatible with the conda tool used for maintaining Anaconda Python. But this is not necessarily bad news as conda is an excellent replacement for virtualenv and addresses many of its shortcomings. And of course, there is nothing preventing you from doing a from-source installation of Python of your own, and then using virtualenv if you prefer.
Why does my
mpi4py time out? Or why is it so slow?¶
mpi4py on a large number of nodes can become slow due to all the metadata that must move across our filesystems. You may experience timeouts that look like this:
srun: job 33116771 has been allocated resources
Mon Aug 3 18:24:50 2020: [PE_224]:inet_connect:inet_connect: connect failed after 301 attempts
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_inet_setup:inet_connect failed
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_init:_pmi_inet_setup (full) returned -1
[Mon Aug 3 18:24:50 2020] [c0-0c2s7n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
Easy (but temporary) fix:
but this doesn't fix the problem, it just gives you more time to start up.
Medium fix: move your software stack to
Hard (but most effective fix): use
mpi4py in a Shifter container
Why is my code slow?¶
First, please review our brief overview of filesystem best practices at NERSC here. Moving to Shifter or a different filesystem may substantially improve your performance. If this doesn't help, you can consider profiling your code. We provide a lot of information and examples here.
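As a starting point for profiling, the standard library's cProfile module needs no extra installation. The sketch below is only an illustration under simple assumptions: `slow_sum` is a made-up stand-in for your own code, and `profile_call` is a hypothetical helper name. It runs one function call under the profiler and returns a short report sorted by cumulative time.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical stand-in for the code you want to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Write the statistics into a string instead of printing directly
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the `top` entries
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    result, report = profile_call(slow_sum, 1_000_000)
    print(report)
```

For a whole script, `python -m cProfile -s cumulative myscript.py` gives a similar report without changing the code (`myscript.py` is a placeholder name).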
Can I use my conda environment in Jupyter?¶
Yes! Your conda environment can easily become a Jupyter kernel. If you would like to use your custom environment
myenv in Jupyter:
source activate myenv
conda install ipykernel
python -m ipykernel install --user --name myenv --display-name MyEnv
Then when you log into
jupyter.nersc.gov you should see
MyEnv listed as a kernel option.
For more information about using your kernel at NERSC please see our Jupyter docs.
My conda environments have put me over quota-- what do I do?¶
Conda and all its related files and packages can really add up. If you are installing packages to $HOME and exceed your quota (you can check via
myquota), cleaning up your conda files can make a big difference:
conda clean --all
will clean up all unused files and packages. See here for more information about
How can I fix my broken conda environment?¶
Conda environments are disposable. If something goes wrong, it is often faster and easier to build a new environment than to debug the old environment.
Can I use pip at NERSC?¶
Yes. For more information about using pip at NERSC please see here.
How can I checkpoint my Python code?¶
Checkpointing your code can make your workflow more robust to:
- System issues. If your job crashes because of a system issue, you will be able to restart the checkpointed calculation in a resubmitted job later and it can pick up where it left off.
- User error. The most common use case here is that the calculation takes longer than the user expected when the job was submitted, and doesn't finish before the time limit.
- Preemption. Some HPC systems offer preemptable queues, where jobs can be run with discount charging because they may be interrupted for higher priority jobs. If your code can be preempted because it can checkpoint, you can take advantage of discount charging or submit shorter jobs. The net effect may actually be faster throughput for your workflow.
This example repo demonstrates one simple way to add graceful error handling and checkpointing to a Python code. Note, mpi4py jobs must be run with srun on Cori. For example:
srun -n 2 ./main.py
is suitable for checkpointing. For checkpointing to work, other Python jobs must be run with exec:
so that the
SIGINT signal will be forwarded. (Bash will not do this.) The InterruptHandler class in this example demonstrates how to catch
SIGINT, checkpoint your work, and shut down if necessary.
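The general pattern can be sketched as follows. This is not the code from the example repo: the class `InterruptFlag`, the JSON checkpoint format, and the toy accumulation loop are all assumptions made for illustration. The idea is to record that SIGINT arrived, finish the current step, write the state to disk, and exit cleanly so that a resubmitted job can resume.

```python
import json
import os
import signal


class InterruptFlag:
    """Hypothetical helper: remembers that SIGINT arrived."""

    def __init__(self):
        self.interrupted = False

    def install(self):
        # Must be called from the main thread
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag; the main loop decides when to stop
        self.interrupted = True


def load_checkpoint(path):
    """Return saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file first so a crash cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, flag):
    """Toy calculation; returns (state, finished)."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if flag.interrupted:
            save_checkpoint(path, state)
            return state, False
        state["total"] += state["step"]
        state["step"] += 1
    save_checkpoint(path, state)
    return state, True


if __name__ == "__main__":
    flag = InterruptFlag()
    flag.install()
    state, finished = run("checkpoint.json", 1_000_000, flag)
    print("finished" if finished else "interrupted", state)
```

In this sketch the flag is only checked between steps, so each step should be short relative to the grace period your job receives after the signal. For long steps, checkpointing periodically as well as on interrupt is a reasonable safeguard.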