# Distributed Training

Guidelines prepared by Lei Shao, Victor Lee (Intel) and Thorsten Kurth, Prabhat (NERSC) under the Big Data Center collaboration.

## Motivation

• Scientific datasets can be large in volume and complex (multivariate, high dimensional)
• Models get bigger and more compute intensive as they tackle more complex tasks
• Scaling deep learning (DL) applications enables rapid prototyping / model evaluation

## Assumptions

• You have already developed a good DL model on a randomly sampled subset of the dataset and do not need to change the model further for the full dataset
• Training the DL model on the full dataset takes too long on a single node

## Data parallel or model parallel decision-making

• If the DL model fits on one node, choose data parallelism (e.g., Horovod [2])
• If the DL model is too large to run on one node but the dataset fits on one node, choose model parallelism; Mesh-TensorFlow [3] is one option to try
• If both the DL model and the dataset are too large for one node, choose combined model-data parallelism [10]; Mesh-TensorFlow [3] is again an option to try
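
The three rules above can be written down as a small decision helper. This is only an illustrative sketch: the function name `choose_parallelism` and its two boolean inputs are hypothetical, and deciding whether a model or dataset "fits" on a node is left to the user.

```python
def choose_parallelism(model_fits_on_node: bool, dataset_fits_on_node: bool) -> str:
    """Return the parallelization strategy suggested by the guidelines above."""
    if model_fits_on_node:
        # The model can be replicated on every node, so split the data instead.
        return "data parallel"
    if dataset_fits_on_node:
        # The model must be split across nodes; the data need not be.
        return "model parallel"
    # Neither fits on one node: split both the model and the data.
    return "model-data parallel"


if __name__ == "__main__":
    for model_fits, data_fits in [(True, True), (True, False), (False, True), (False, False)]:
        print(model_fits, data_fits, "->", choose_parallelism(model_fits, data_fits))
```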

## How large a dataset is required for scaling/training

• Dataset size: Different problems and datasets require different dataset sizes. In general, the larger the dataset, the better. Some examples can be found in the table below:

| Project name | Model type | Number of model parameters | Dataset size | Comment |
|---|---|---|---|---|
| Etalumis | 3DCNN+LSTM | ~171 million | 15 million Sherpa simulation execution traces as training dataset | Can converge well |
| Etalumis | 3DCNN+LSTM | ~156.96 million | 3 million Sherpa simulation execution traces as training dataset | Can converge well |
| CosmoFlow | 3DCNN | 7 million | 99,456 subvolumes (each with 128^3 voxels) for training dataset with 2X duplication | Can converge |
| ImageNet classification | ResNet50 | 25 million | 1.28 million training images from the ImageNet 2012 classification dataset with 1000 classes | Can converge well |

## How to increase dataset size

• Run more simulations
• Apply data augmentation. Which transformations are valid depends largely on the invariances in your data. Some common augmentation transformations for image and object recognition tasks:
  • Horizontal flips, random crops and scales, color jitter
  • Random mixes/combinations of: translation, rotation, stretching, shearing, lens distortions, ...
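
Two of the transformations listed above, horizontal flips and random crops, can be sketched in a few lines. This is a minimal illustration on a toy 2D image stored as a list of rows, not a production pipeline; the helper names and the flip probability of 0.5 are assumptions made for the example.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    top = rng.randint(0, len(image) - crop_h)      # randint is inclusive on both ends
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng):
    """Flip with probability 0.5 (assumed value), then take a random crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_h, crop_w, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
    for _ in range(3):
        print(augment(image, 3, 3, rng))
```

Whether a flip is a valid augmentation depends on the data: it is only appropriate when the label is unchanged by mirroring.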

## Optimizer selection

• For multi-node scaling, keep the default optimizer from the single-node case as long as the global batch size is not scaled up too far
• Extremely large global batch size (model and dataset dependent): consider combining LARS [7]/LARC with the base optimizer (e.g., Adam, SGD)
• Best accuracy: SGD with momentum (but the hyper-parameters may be difficult to tune)
• Adam [11] is the most popular per-parameter adaptive learning rate optimizer. It works very well in most use cases without difficult learning rate tuning, in both the single-node and multi-node cases. We recommend giving it a try.
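
To make the LARS idea more concrete, the sketch below computes a layer-wise learning rate multiplier of the form trust_coef * ||w|| / (||g|| + weight_decay * ||w||), following the description in [7], and applies it to one plain SGD step for a single layer. It is a simplified illustration: momentum is omitted, the function names and default values are assumptions, and the fallback for zero norms is a choice made here rather than something prescribed by the paper.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0):
    """Layer-wise multiplier: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    denom = g_norm + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # assumption: leave the rate unscaled for degenerate layers
    return trust_coef * w_norm / denom


def lars_sgd_step(weights, grads, global_lr, trust_coef=0.001, weight_decay=0.0):
    """One SGD step (no momentum) for a single layer with the local rate applied."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


if __name__ == "__main__":
    layer_w = [3.0, 4.0]   # ||w|| = 5
    layer_g = [0.6, 0.8]   # ||g|| = 1
    print(lars_local_lr(layer_w, layer_g))
    print(lars_sgd_step(layer_w, layer_g, global_lr=1.0))
```

The point of the multiplier is that each layer's update is sized relative to the norm of that layer's weights, which is what helps keep training stable at very large global batch sizes.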

## Learning rate scheduling

• Apply learning rate scaling for weak scaling with large batch sizes, e.g., linear scaling [9] or square-root scaling [7]
• Use learning rate warm-up [9] when scaling DL training to multiple nodes with a larger global batch size. Start with the single worker/rank/node learning rate (LR) and increase it linearly to the desired value over a couple of epochs
• Consider adding a learning rate decay schedule. Try step decay, exponential decay, 1/t decay, polynomial decay, cosine decay, etc.
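
The three points above (scaling, warm-up, decay) can be combined into one schedule. The sketch below is one possible combination, not a recommended setting: it assumes the global batch size grows in proportion to the number of workers, and the function names, the 5 warm-up epochs, the 90 total epochs, and the choice of cosine decay are all illustrative assumptions.

```python
import math


def scaled_lr(base_lr, num_workers, rule="linear"):
    """Scale the single-worker LR for a global batch that is num_workers times larger."""
    if rule == "linear":
        return base_lr * num_workers
    if rule == "sqrt":
        return base_lr * math.sqrt(num_workers)
    raise ValueError(f"unknown scaling rule: {rule}")


def lr_at_epoch(epoch, base_lr, num_workers, warmup_epochs=5, total_epochs=90, rule="linear"):
    """Linear warm-up from base_lr to the scaled LR, then cosine decay towards zero."""
    target = scaled_lr(base_lr, num_workers, rule)
    if epoch < warmup_epochs:
        # Warm-up: start at the single-worker LR and ramp up linearly.
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    # Decay: cosine curve from the scaled LR down to zero at total_epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * target * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 1, 5, 30, 60, 90):
        print(epoch, lr_at_epoch(epoch, base_lr=0.1, num_workers=8))
```

With `base_lr=0.1` and 8 workers, the schedule starts at 0.1, reaches the linearly scaled value of about 0.8 at the end of warm-up, and then decays.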

## Synchronous SGD or Asynchronous SGD or hybrid SGD

• Sync SGD for proof of concept
• Async SGD for well-studied algorithms, to further improve scaling efficiency
• Consider gradient lag-sync [8] (also called stale-synchronous or pipelining)
• Hybrid SGD (not straightforward with most frameworks)
• Recommendation: use synchronous SGD for reproducibility and easier convergence
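
One reason synchronous SGD is easy to reason about is that, with equally sized shards, averaging the per-worker gradients gives the same update as a single node processing the whole global batch. The toy sketch below simulates this in one process for a one-parameter linear model; `allreduce_mean` is a hypothetical stand-in for the collective operation a real framework would perform, and the data are made up.

```python
def mean_gradient(w, samples):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def allreduce_mean(values):
    """Stand-in for an allreduce-average across workers (hypothetical helper)."""
    return sum(values) / len(values)


def sync_sgd_step(w, shards, lr):
    """Each worker computes a local gradient; all workers apply the same averaged update."""
    local_grads = [mean_gradient(w, shard) for shard in shards]
    return w - lr * allreduce_mean(local_grads)


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy samples of y = 2x
    shards = [data[:2], data[2:]]                            # two workers, equal shards
    w_multi = sync_sgd_step(0.0, shards, lr=0.01)
    w_single = 0.0 - 0.01 * mean_gradient(0.0, data)
    print(w_multi, w_single)  # both routes give the same updated weight
```

The equivalence relies on the shards having equal size; with unequal shards the average would need to be weighted by shard size.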

## Distributed training framework

• Horovod-MPI, Cray ML Plugin, Horovod-MLSL, etc.
• gRPC is not recommended on HPC systems

## Batch size selection and node number selection

• Different workloads (model, training algorithm, dataset) allow different maximum useful batch sizes, which are related to the gradient noise scale [4]
• More complex datasets/tasks have higher gradient noise and can therefore benefit from training with larger batch sizes [4]
• For dataset size N, the maximal global batch size is usually kept $\lt \sqrt{N}$
• Make sure the local batch size is not too small for computational efficiency
• Up to 64 nodes is recommended for shorter queue wait times
• More nodes are not necessarily better
• For weak scaling, the learning rate and global batch size need to be scaled at the same time
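
As a worked example of the $\sqrt{N}$ rule of thumb and the 64-node suggestion above, the sketch below picks a node count for a given dataset size and local batch size. It assumes one worker per node; the function names and the local batch size of 32 used in the demo are illustrative assumptions, not recommendations.

```python
import math


def max_global_batch(dataset_size):
    """Largest integer batch size strictly below sqrt(dataset_size)."""
    root = math.isqrt(dataset_size)
    return root - 1 if root * root == dataset_size else root


def plan_batches(dataset_size, local_batch, max_nodes=64):
    """Largest node count (up to max_nodes) whose global batch stays below sqrt(N).

    Falls back to a single node if even one local batch exceeds the limit.
    """
    limit = max_global_batch(dataset_size)
    nodes = max(1, min(max_nodes, limit // local_batch))
    return nodes, nodes * local_batch


if __name__ == "__main__":
    n_images = 1_280_000  # ImageNet 2012 training set size from the table above
    print(max_global_batch(n_images))
    print(plan_batches(n_images, local_batch=32))
```

For N = 1.28 million the limit is 1131, so with a local batch size of 32 the helper suggests 35 nodes and a global batch size of 1120. Treat the result as a starting point only, since the maximum useful batch size is workload dependent [4].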

Figure 2. The “simple noise scale” roughly predicts the maximum useful batch size for many ML tasks [4]

Figure 3. The relationship between steps to result and batch size has the same characteristic form for all problems [5]

Figure 4. The tradeoff between time and compute resources spent to train a model to a given level of performance takes the form of a Pareto frontier (left). (Right) a concrete example of the Pareto frontiers obtained from training a model to solve the Atari Breakout game to different levels of performance [4]

Figure 5. Effect of larger batch size on estimated gradients and training speed [4]

## References

1. AI and Compute, OpenAI blog
2. Horovod
3. Mesh-Tensorflow
4. McCandlish, Kaplan and Amodei, An Empirical Model of Large-Batch Training, arXiv:1812.06162
5. Shallue, Lee, Antognini, Sohl-Dickstein, Frostig, Dahl, Measuring the Effects of Data Parallelism on Neural Network Training, arXiv:1811.03600
6. Cray HPO
7. You, Gitman, Ginsburg, Large Batch Training of Convolutional Networks, arXiv:1708.03888
8. Kurth, et al, Exascale Deep Learning for Climate Analytics, arXiv:1810.01993
9. Goyal, et al, Accurate, Large Minibatch SGD: Training ImageNet in 1 hour, arXiv:1706.02677
10. Kurth, Zhang, Satish, Mitliagkas, Racah, Patwary, Malas, Sundaram, Bhimji, Smorkalov, Deslippe, Shiryaev, Sridharan, Prabhat, Dubey, Deep Learning at 15PF: Supervised and Semi-supervised Classification for Scientific Data, arXiv:1708.05256
11. Kingma and Ba, Adam: A method for Stochastic Optimization, arXiv:1412.6980
12. IBM AutoAI