Hyperparameter optimization (HPO) is the process of tuning a machine learning model's hyperparameters, such as the learning rate or filter sizes. Popular HPO algorithms include grid search, random search, Bayesian optimization, and genetic optimization. Several libraries and tools implement these algorithms, each with its own tradeoffs in usability, flexibility, and feature support.
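To make the idea concrete, here is a minimal random-search sketch in plain Python. It is illustrative only and not tied to any of the libraries below; the search space and the `evaluate` function (standing in for a real training run) are hypothetical.

```python
import random

def evaluate(params):
    # Hypothetical "validation loss"; a real HPO run would train a
    # model with these hyperparameters and return its validation metric.
    return (params["lr"] - 0.01) ** 2 + params["batch_size"] / 1e4

# Each entry maps a hyperparameter name to a sampling function.
search_space = {
    "lr": lambda: 10 ** random.uniform(-4, -1),          # log-uniform learning rate
    "batch_size": lambda: random.choice([32, 64, 128]),  # categorical choice
}

# Random search: sample independent trials and keep the best one.
best_params, best_loss = None, float("inf")
for _ in range(20):
    params = {name: sample() for name, sample in search_space.items()}
    loss = evaluate(params)
    if loss < best_loss:
        best_params, best_loss = params, loss
```

Grid search would instead enumerate a fixed Cartesian product of values, while Bayesian and genetic methods use previous trials to choose the next candidates.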
On this page we will collect recommendations and examples for running distributed HPO tasks on our HPC systems.
Cray provides an HPO library that integrates naturally with Cray systems. It can use Slurm to request and manage an allocation, and it provides genetic search, random search, grid search, and population-based training.
The official Cray HPO documentation can be found here:
You can load the latest version on Cori with:

```shell
module load cray-hpo
```
You can find an example Jupyter notebook for genetic search here:
RayTune is an open-source Python library for experiment execution and hyperparameter tuning at any scale. RayTune:
- supports any ML framework
- implements state of the art HPO strategies
- natively integrates with optimization libraries (HyperOpt, BayesianOpt, and Facebook Ax)
- integrates well with Slurm and handles micro-scheduling of trials on multi-GPU node resources (no GPU binding boilerplate needed)
We provide RayTune in all of our GPU TensorFlow and PyTorch modules and Shifter images. You can also use our slurm-ray-cluster scripts to run multi-GPU-node HPO campaigns; the repo also includes a hello-world MNIST example.