

Poster

Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Zhiyuan Li · Tianhao Wang · Dingli Yu

Hall J (level 1) #823

Keywords: [ weight decay ] [ Stochastic Gradient Descent ] [ stochastic differential equation ] [ Equilibrium ] [ mixing ]


Abstract: We prove the Fast Equilibrium Conjecture proposed by Li et al., (2020), i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate $\eta$ and weight decay factor $\lambda$ mixes in function space in $\tilde{O}\left(\frac{1}{\lambda\eta}\right)$ steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al., (2021) and shows that for every $T>0$, the iterates of SGD with learning rate $\eta$ and weight decay factor $\lambda$ on the scale-invariant loss converge in distribution in $\Theta\left(\eta^{-1}\lambda^{-1}(T+\ln(\lambda/\eta))\right)$ iterations as $\eta\lambda\to 0$ while satisfying $\eta \le O(\lambda) \le O(1)$. Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as $T\to\infty$.
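
To get a feel for the rates quoted in the abstract, the following is a minimal Python sketch that evaluates the two expressions with all hidden constants and logarithmic factors set to 1. The function names `mixing_scale` and `iteration_scale` are hypothetical, and the check `eta <= lam <= 1` is only a stand-in (an assumption of this sketch) for the regime $\eta \le O(\lambda) \le O(1)$; the results indicate orders of magnitude, not actual step counts.

```python
import math


def mixing_scale(eta: float, lam: float) -> float:
    """Order of magnitude 1 / (lam * eta) of the mixing time.

    Assumption: constants and log factors hidden by the tilde-O are dropped.
    """
    if eta <= 0 or lam <= 0:
        raise ValueError("eta and lam must be positive")
    return 1.0 / (lam * eta)


def iteration_scale(eta: float, lam: float, T: float) -> float:
    """Evaluate eta^-1 * lam^-1 * (T + ln(lam / eta)) with the Theta constant set to 1.

    Assumption: the regime eta <= O(lam) <= O(1) is modelled as eta <= lam <= 1.
    """
    if not (0 < eta <= lam <= 1):
        raise ValueError("this sketch assumes 0 < eta <= lam <= 1")
    if T <= 0:
        raise ValueError("T must be positive")
    return (T + math.log(lam / eta)) / (eta * lam)


if __name__ == "__main__":
    # Hypothetical hyperparameters chosen only for illustration.
    eta, lam = 1e-3, 1e-1
    print(f"1/(lam*eta)        ~ {mixing_scale(eta, lam):.3g}")
    print(f"iterations (T = 2) ~ {iteration_scale(eta, lam, T=2.0):.3g}")
```

When $\eta = \lambda$ the logarithmic term vanishes and the iteration scale reduces to $T/(\eta\lambda)$, which makes the role of $T$ as a rescaled time easy to see.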
