
SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement
Heyang Qin · Samyam Rajbhandari · Olatunji Ruwase · Feng Yan · Lei Yang · Yuxiong He

Thu Dec 09 04:30 PM -- 06:00 PM (PST)

Large-scale training requires massive parallelism to finish within a reasonable amount of time. Large-batch training is the key enabler of massive parallelism, but it often comes at the cost of generalization performance. Existing works explore adaptive batching or hand-tuned static large batching to strike a balance between computational efficiency and performance. However, these methods can provide only coarse-grained adaptation (e.g., at an epoch level) because of their intrinsically expensive calculations or hand-tuning requirements. In this paper, we propose a fully automated and lightweight adaptive batching methodology that enables fine-grained batch size adaptation (e.g., at a mini-batch level) and can achieve state-of-the-art performance with record-breaking batch sizes. The core component of our method is a lightweight yet efficient representation of the critical gradient noise information. We open-source the proposed methodology by providing a plugin tool that supports mainstream machine learning frameworks. Extensive evaluations on popular benchmarks (e.g., CIFAR10, ImageNet, and BERT-Large) demonstrate that the proposed methodology outperforms state-of-the-art methodologies using adaptive batching approaches or hand-tuned static strategies in both performance and batch size. In particular, we achieve a new state-of-the-art batch size of 78k in BERT-Large pretraining with a SQuAD score of 90.69, compared to 90.58 reported by the previous state of the art with a 59k batch size.
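The abstract does not spell out how gradient similarity is measured or how it drives the batch size, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that similarity is the cosine similarity between the gradients of two halves of a mini-batch, and that the batch grows when the two halves disagree (noisy gradients) and shrinks when they agree. The function names, the target of 0.9, the adjustment factor of 1.1, and the size bounds are all hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # no direction to compare; treat as dissimilar
    return dot / (n1 * n2)


def adapt_batch_size(batch_size, similarity, target=0.9, factor=1.1,
                     min_size=32, max_size=65536):
    """Illustrative update rule (an assumption, not taken from the paper).

    Low similarity between the half-batch gradients is read as high
    gradient noise, so the batch is enlarged; high similarity shrinks it.
    The result is clamped to [min_size, max_size].
    """
    if similarity < target:
        proposed = int(batch_size * factor)
    else:
        proposed = int(batch_size / factor)
    return max(min_size, min(max_size, proposed))


# Usage: gradients from two halves of one mini-batch (made-up numbers).
grad_half_a = [0.20, -0.10, 0.05]
grad_half_b = [0.18, -0.12, 0.02]
sim = cosine_similarity(grad_half_a, grad_half_b)
next_batch_size = adapt_batch_size(1024, sim)
```

Because the check needs only one dot product and two norms per step, a rule of this shape could in principle run at every mini-batch, which is the granularity the abstract refers to.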

Author Information

Heyang Qin (University of Nevada, Reno)
Samyam Rajbhandari (Microsoft)
Olatunji Ruwase (Microsoft)
Feng Yan (University of Nevada, Reno)
Lei Yang (University of Nevada, Reno)
Yuxiong He (Microsoft)
