SGD Convergence under Stepsize Shrinkage in Low-Precision Training
Abstract
Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this work, we study SGD convergence under a gradient shrinkage model in which each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence whenever \( q_{\min} < 1 \). Under standard smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower rate governed by \( q_{\min} \) and with a higher steady-state error level due to quantization effects. Our analysis thus explains, within the standard SGD convergence framework, how reduced numerical precision slows training by acting as gradient shrinkage.
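To make the shrinkage model concrete, the update rule described above can be written as follows; the iterate \( x_k \), objective \( f \), and stochastic gradient \( g_k \) are notation introduced here for illustration and are not taken from the abstract itself:
\[
x_{k+1} = x_k - \mu_k\, q_k\, g_k, \qquad \mathbb{E}\,[\, g_k \mid x_k \,] = \nabla f(x_k), \qquad q_k \in (0,1],
\]
so the quantized update behaves like standard SGD run with the effective stepsize \( \mu_k q_k \le \mu_k \).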