Frequently Asked Questions and Troubleshooting¶
If you have questions about Python at NERSC, please take a look at this collection of common user questions and problems. If this information does not help and you still have a problem, please open a ticket.
Is Python broken?¶
If Python seems broken or is exhibiting odd or unexpected behavior, the first thing to do is check your shell resource files (also known as dotfiles).
Some developers like to add things to their shell resource files (e.g. .bash_profile) to avoid having to type things over and over again. Dotfiles can be a good resource, but you should periodically check them to see if they need to be changed or updated. It is helpful to check these files for conflicting Python versions, modules, or additions to PYTHONPATH that may be causing unexpected behavior in your Python setup.
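When checking dotfiles, it can help to see what the running interpreter actually picked up. The following is a minimal sketch using only the standard library. The helper name `describe_python_environment` is hypothetical and is not a NERSC tool.

```python
import os
import sys


def describe_python_environment():
    """Collect the settings that dotfiles most often change (hypothetical helper)."""
    raw_pythonpath = os.environ.get("PYTHONPATH", "")
    return {
        # Path of the interpreter that is actually running.
        "executable": sys.executable,
        "version": sys.version.split()[0],
        # Entries injected through PYTHONPATH, in search order.
        "pythonpath": [p for p in raw_pythonpath.split(os.pathsep) if p],
        # The full module search path, including any PYTHONPATH entries.
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print(f"{key}: {value}")
```

If the reported executable or PYTHONPATH entries are not what you expect, a line in one of your dotfiles is a likely place to look.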
conda takes forever to resolve and install packages¶
Try the mamba tool instead. It's already installed when you module load python. You can use mamba exactly how you'd use conda, but it's usually much faster.
Help, I'm over quota¶
Creating many conda environments and/or installing several packages can often use many GB of disk quota.
Cleaning up conda packages¶
The following command can be used to check the size of your conda environments and your conda package cache. If your conda environments or package cache directories are not using the default base path at $HOME/.conda, then you will need to specify your custom paths instead.
du -csh $HOME/.conda/envs/* $HOME/.conda/pkgs
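If you would rather rank directories from Python, the sketch below walks a directory tree and sums file sizes. The function names are hypothetical. It also reports apparent file sizes rather than disk blocks, and counts hard-linked files once per link, so the totals may differ from what du prints.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def largest_subdirectories(parent, count=5):
    """Return up to `count` (name, bytes) pairs for the biggest subdirectories of parent."""
    sizes = [
        (entry.name, directory_size_bytes(entry.path))
        for entry in os.scandir(parent)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:count]
```

For example, `largest_subdirectories(os.path.expanduser("~/.conda/envs"))` would list the largest environments first, assuming your environments live in the default location.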
To delete unwanted conda environments:
conda env list
conda env remove -n <env>
To delete unused conda files and packages:
conda clean -a
You may see many warnings when running conda clean -a such as the following:
WARNING conda.gateways.disk.delete:unlink_or_rename_to_trash(143): Could not remove or rename /global/common/software/nersc/pm-2021q4/sw/python/3.9-anaconda-2021.11/pkgs/curl-7.78.0-h1ccaba5_0/info/about.json. Please remove this file manually (you may need to reboot to free file handles)
These warnings are probably safe to ignore. conda is attempting to remove packages in the python module's package cache but does not have permission to do so. You can prevent conda from attempting to remove those packages by explicitly setting the path to search like so:
CONDA_PKGS_DIRS=$HOME/.conda/pkgs conda clean -a
Cleaning up pip packages¶
pip packages installed inside a conda environment are easily cleaned up when the conda environment is deleted.
pip packages installed via pip install --user (outside a conda environment) are stored in $HOME/.local/<system>/<python module version>, so feel free to delete some or all of the directories there to free up space. This location is controlled by an environment variable.
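To confirm where user-level installs go for the interpreter you are running, you can ask the standard library's site module. This is a small sketch and the helper name `user_install_locations` is hypothetical. In standard Python, `site.getuserbase()` honors the `PYTHONUSERBASE` environment variable if it is set.

```python
import site


def user_install_locations():
    """Report where `pip install --user` places packages for this interpreter."""
    return {
        # Base directory for user-level installs.
        "user_base": site.getuserbase(),
        # The site-packages directory underneath that base.
        "user_site": site.getusersitepackages(),
        # False when the user site directory is disabled, e.g. with `python -s`.
        "enabled": bool(site.ENABLE_USER_SITE),
    }


if __name__ == "__main__":
    for key, value in user_install_locations().items():
        print(f"{key}: {value}")
```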
Break adjusted to free malloc space¶
If you see this error
*** Error in `python': break adjusted to free malloc space: 0x0000010000000000 ***
it most likely means you should rebuild your code and all dependent packages after running module unload craype-hugepages2M. If unloading this module doesn't help, please open a ticket so we can help you troubleshoot further.
What is /opt/mods/ and why is it in my PYTHONPATH?¶
You may have noticed that your default PYTHONPATH is not empty:
elvis@cori07:~> echo $PYTHONPATH
/opt/mods/lib/python3.6/site-packages:/opt/ovis/lib/python3.6/site-packages
The /opt/mods part of this path enables our system-wide Python monitoring. If you allow PYTHONPATH to remain set, we are able to collect data on your Python job and use it to make more informed decisions to better support Python users at NERSC. To learn more about the data we collect, please visit our MODS webpage.
Can I use my conda environment in Jupyter?¶
Yes! Your conda environment can easily become a Jupyter kernel. If you would like to use your custom environment myenv in Jupyter:
source activate myenv
conda install ipykernel
python -m ipykernel install --user --name myenv --display-name MyEnv
Then when you log into jupyter.nersc.gov you should see MyEnv listed as a kernel option.
For more information about using your kernel at NERSC please see our Jupyter docs.
How can I fix my broken conda environment?¶
Conda environments are disposable. If something goes wrong, it is often faster and easier to build a new environment than to debug the old environment.
Can I install my own Anaconda Python "from scratch"?¶
Yes, you are welcome to build your own Python installation.
Can I use virtualenv on Cori?¶
The virtualenv tool is not compatible with the conda tool used for maintaining Anaconda Python. But this is not necessarily bad news as conda is an excellent replacement for virtualenv and addresses many of its shortcomings. And of course, there is nothing preventing you from doing a from-source installation of Python of your own, and then using virtualenv if you prefer.
Why does my mpi4py time out? Or why is it so slow?¶
mpi4py on a large number of nodes can become slow due to all the metadata that must move across our filesystems. You may experience timeouts that look like this:
srun: job 33116771 has been allocated resources
Mon Aug 3 18:24:50 2020: [PE_224]:inet_connect:inet_connect: connect failed after 301 attempts
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_inet_setup:inet_connect failed
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_init:_pmi_inet_setup (full) returned -1
[Mon Aug 3 18:24:50 2020] [c0-0c2s7n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
Easy (but temporary) fix:
but this doesn't fix the problem, it just gives you more time to start up.
Medium fix: move your software stack to
Hard (but most effective) fix: use mpi4py in a Shifter container.
Why is my code slow?¶
First, please review our brief overview of filesystem best practices at NERSC. Moving to Shifter or a different filesystem may substantially improve your performance. If this doesn't help, you can consider profiling your code.
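As a starting point for profiling, the standard library's cProfile and pstats modules can show where time is spent. The sketch below is a minimal example. `profile_call` and `slow_sum` are hypothetical names, and `slow_sum` stands in for whatever function you want to measure.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report) sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Keep only the `top` most expensive entries in the text report.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    """Hypothetical stand-in for the code you want to profile."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 1_000_000)
    print(report)
```

Sorting by cumulative time tends to surface the high-level functions that dominate the run, which is usually the most useful first view.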
How can I checkpoint my Python code?¶
Checkpointing your code can make your workflow more robust to:
- System issues. If your job crashes because of a system issue, you will be able to restart the checkpointed calculation in a resubmitted job later and it can pick up where it left off.
- User error. The most common use case here is that the calculation takes longer than the user expected when the job was submitted, and doesn't finish before the time limit.
- Preemption. Some HPC systems offer preemptable queues, where jobs can be run with discount charging because they may be interrupted for higher priority jobs. If your code can be preempted because it can checkpoint, you can take advantage of discount charging or submit shorter jobs. The net effect may be actually faster throughput for your workflow.
This example repo demonstrates one simple way to add graceful error handling and checkpointing to a Python code. Note, mpi4py jobs must be run with srun on Cori. For example:
srun -n 2 ./main.py
is suitable for checkpointing. For checkpointing to work, other Python jobs must be run with exec:
so that the SIGINT signal will be forwarded. (Bash will not do this.) The InterruptHandler class in this example demonstrates how to catch SIGINT, checkpoint your work, and shut down if necessary.
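The general pattern can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the InterruptHandler class from the example repo: `InterruptFlag`, `load_checkpoint`, `save_checkpoint`, and `run` are hypothetical names, the checkpoint is a small JSON file, and the `interrupt_at` argument exists only to simulate a SIGINT for demonstration. Signal handlers must be installed from the main thread.

```python
import json
import os
import signal


class InterruptFlag:
    """Record SIGINT instead of exiting immediately (hypothetical helper)."""

    def __init__(self, signum=signal.SIGINT):
        self.signum = signum
        self.interrupted = False
        self._previous = None

    def __enter__(self):
        self._previous = signal.signal(self.signum, self._handle)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the handler that was installed before, if Python knows it.
        if self._previous is not None:
            signal.signal(self.signum, self._previous)
        return False

    def _handle(self, signum, frame):
        self.interrupted = True


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, to avoid partial checkpoints."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, interrupt_at=None):
    """Sum 0..n_steps-1, checkpointing and stopping early if SIGINT arrives."""
    state = load_checkpoint(path)
    with InterruptFlag() as flag:
        while state["step"] < n_steps:
            if interrupt_at is not None and state["step"] == interrupt_at:
                # Demo only: simulate the signal a scheduler would send.
                signal.raise_signal(signal.SIGINT)
            if flag.interrupted:
                save_checkpoint(path, state)
                return state, False
            state["total"] += state["step"]
            state["step"] += 1
    save_checkpoint(path, state)
    return state, True
```

Calling `run` again with the same checkpoint path resumes from the saved step, so an interrupted job and its resubmission together produce the same result as one uninterrupted run.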