Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods, in which each node sorts its gradient components by magnitude, communicates only a subset of them, and accumulates the rest locally, are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to \emph{three orders of magnitude}, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification.
This is the gap we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.
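To make the mechanism concrete, below is a minimal, illustrative sketch of magnitude-based (top-k) gradient sparsification with local error accumulation, written in Python/NumPy for a single worker on a toy least-squares problem. The function names (`topk_sparsify`, `train`) and the problem setup are illustrative assumptions, not taken from the paper; in the data-parallel setting studied there, each node would perform this selection and the sparse updates would be aggregated across nodes.

```python
# Minimal sketch (not the paper's exact algorithm or code): top-k gradient
# sparsification with local error correction, illustrated on a toy
# least-squares problem. All names and parameters here are illustrative.
import numpy as np

def topk_sparsify(vector, k):
    """Keep the k largest-magnitude entries of `vector`; zero out the rest.

    Returns (sparse_vector, residual), where the residual holds the dropped
    mass that a worker would accumulate locally for later steps.
    """
    sparse = np.zeros_like(vector)
    if k > 0:
        idx = np.argpartition(np.abs(vector), -k)[-k:]  # indices of the top-k magnitudes
        sparse[idx] = vector[idx]
    return sparse, vector - sparse

def train(A, b, k=2, lr=0.1, steps=500, seed=0):
    """Single-worker illustration of SGD with top-k sparsification and error feedback."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    error = np.zeros_like(x)              # locally accumulated (unsent) gradient mass
    for _ in range(steps):
        i = rng.integers(A.shape[0])      # sample one data point (stochastic gradient)
        grad = (A[i] @ x - b[i]) * A[i]
        corrected = lr * grad + error             # add back what was withheld earlier
        update, error = topk_sparsify(corrected, k)  # send only top-k; keep the rest
        x -= update                               # apply the sparse update
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 10))
    x_true = rng.normal(size=10)
    b = A @ x_true
    x_hat = train(A, b, k=2)
    print("distance to solution:", np.linalg.norm(x_hat - x_true))
```

The key detail is that the residual of each truncation is folded back into the next step's update, so withheld components are applied later rather than lost; this locally bounded staleness is what the convergence analysis leverages.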
Author Information
Dan Alistarh (IST Austria)
Torsten Hoefler (ETH Zürich)
Mikael Johansson (KTH Royal Institute of Technology)
Nikola Konstantinov (IST Austria)
Sarit Khirirat (KTH Royal Institute of Technology)
Cedric Renggli (ETH Zürich)
More from the Same Authors
- 2021: Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee »
  Cedric Renggli · Luka Rimanic · Nora Hollenstein · Ce Zhang
- 2021 Poster: On the Impossibility of Fairness-Aware Learning from Corrupted Data »
  Nikola Konstantinov · Christoph Lampert
- 2023 Poster: Knowledge Distillation Performs Partial Variance Reduction »
  Mher Safaryan · Alexandra Peste · Dan Alistarh
- 2023 Poster: ZipLM: Inference-Aware Structured Pruning of Language Models »
  Eldar Kurtić · Elias Frantar · Dan Alistarh
- 2023 Poster: Bringing regularized optimal transport to lightspeed: a splitting method adapted for GPUs »
  Jacob Lindbäck · Zesen Wang · Mikael Johansson
- 2023 Poster: CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models »
  Denis Kuznedelev · Eldar Kurtić · Elias Frantar · Dan Alistarh
- 2021: On the Impossibility of Fairness-Aware Learning from Corrupted Data »
  Nikola Konstantinov · Christoph Lampert
- 2021 Poster: On the Convergence of Step Decay Step-Size for Stochastic Optimization »
  Xiaoyu Wang · Sindri Magnússon · Mikael Johansson
- 2020 Poster: Scalable Belief Propagation via Relaxed Scheduling »
  Vitalii Aksenov · Dan Alistarh · Janne H. Korhonen
- 2020 Poster: Adaptive Gradient Quantization for Data-Parallel SGD »
  Fartash Faghri · Iman Tabrizian · Ilia Markov · Dan Alistarh · Daniel Roy · Ali Ramezani-Kebrya
- 2020 Poster: WoodFisher: Efficient Second-Order Approximation for Neural Network Compression »
  Sidak Pal Singh · Dan Alistarh
- 2020 Poster: On Convergence of Nearest Neighbor Classifiers over Feature Transformations »
  Luka Rimanic · Cedric Renggli · Bo Li · Ce Zhang
- 2020 Expo Demonstration: Using Sparse Quantization for Efficient Inference on Deep Neural Networks »
  Mark J Kurtz · Dan Alistarh · Saša Zelenović
- 2018 Poster: Byzantine Stochastic Gradient Descent »
  Dan Alistarh · Zeyuan Allen-Zhu · Jerry Li
- 2018 Poster: Continuous-time Value Function Approximation in Reproducing Kernel Hilbert Spaces »
  Motoya Ohnishi · Masahiro Yukawa · Mikael Johansson · Masashi Sugiyama