Poster
The alignment property of SGD noise and how it helps select flat minima: A stability analysis
Lei Wu · Mingze Wang · Weijie Su
The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we explain this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F$, $B$, and $\eta$ denote the Frobenius norm of the Hessian at $\theta^*$, the batch size, and the learning rate, respectively. Otherwise, SGD escapes from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the sharpness---as measured by the Frobenius norm of the Hessian---is bounded \emph{independently} of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: the noise concentrates in sharp directions of the local landscape, and its magnitude is proportional to the loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are further supported by extensive experiments on the CIFAR-10 dataset.
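The alignment property and the stability threshold described in the abstract can be illustrated with a small numerical check. Below is a minimal NumPy sketch (not the authors' code): it builds a toy linear regression with square loss, compares the batch-size-1 SGD noise covariance $\Sigma(\theta)$ with the Hessian $H$ near a global minimum, and evaluates the $\sqrt{B}/\eta$ threshold. The data sizes, the perturbation, and the hyperparameters $B$ and $\eta$ are assumptions chosen only for illustration.

```python
# Illustrative sketch (assumptions throughout): for a linear model with square loss,
# check the "alignment" of SGD noise -- the noise covariance concentrates in sharp
# (Hessian) directions and its magnitude scales with the loss value.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                        # sample size and parameter dimension (assumed)
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star                    # realizable targets, so theta_star is a global minimum

theta = theta_star + 0.01 * rng.standard_normal(d)    # a point near the minimum

residual = X @ theta - y                               # r_i = f(x_i; theta) - y_i
loss = 0.5 * np.mean(residual ** 2)                    # square loss L(theta)

# Hessian of the square loss for a linear model: H = X^T X / n (constant in theta).
H = X.T @ X / n

# Per-sample gradients g_i = r_i * x_i and the batch-size-1 SGD noise covariance
# Sigma = (1/n) sum_i g_i g_i^T - grad grad^T.
G = residual[:, None] * X                              # n x d matrix of per-sample gradients
grad = G.mean(axis=0)
Sigma = G.T @ G / n - np.outer(grad, grad)

# Alignment check: near the minimum, Sigma is roughly a multiple (~2*L) of H,
# so their Frobenius cosine similarity should be high and the trace ratio near 1.
cos = np.sum(Sigma * H) / (np.linalg.norm(Sigma) * np.linalg.norm(H))
print(f"cosine similarity between Sigma and H: {cos:.3f}")
print(f"trace(Sigma) / (2 * L * trace(H))    : {np.trace(Sigma) / (2 * loss * np.trace(H)):.3f}")

# Linear-stability heuristic from the abstract: SGD with batch size B and learning rate
# eta can only stay near minima whose Hessian Frobenius norm is O(sqrt(B)/eta).
B, eta = 32, 0.1                                       # assumed hyperparameters
print(f"||H||_F = {np.linalg.norm(H):.2f}  vs  sqrt(B)/eta = {np.sqrt(B)/eta:.2f}")
```

In this toy setting the cosine similarity is close to 1 and the trace ratio is near 1, reflecting that the noise covariance points in the sharp directions of the landscape with magnitude proportional to the loss; comparing $\|H\|_F$ with $\sqrt{B}/\eta$ gives a rough sense of which minima the stability criterion admits for a given batch size and learning rate.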
Author Information
Lei Wu (Peking University)
Mingze Wang (Peking University)
Weijie Su (Computer and Information Science and Wharton, University of Pennsylvania)
More from the Same Authors
- 2021 Spotlight: A Central Limit Theorem for Differentially Private Query Answering »
  Jinshuo Dong · Weijie Su · Linjun Zhang
- 2022 Poster: Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks »
  Mingze Wang · Chao Ma
- 2021 Poster: A Central Limit Theorem for Differentially Private Query Answering »
  Jinshuo Dong · Weijie Su · Linjun Zhang
- 2021 Poster: You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism »
  Weijie Su
- 2021 Poster: Imitating Deep Learning Dynamics via Locally Elastic Stochastic Differential Equations »
  Jiayao Zhang · Hua Wang · Weijie Su
- 2020 Poster: Label-Aware Neural Tangent Kernel: Toward Better Generalization and Local Elasticity »
  Shuxiao Chen · Hangfeng He · Weijie Su
- 2020 Poster: The Complete Lasso Tradeoff Diagram »
  Hua Wang · Yachong Yang · Zhiqi Bu · Weijie Su
- 2020 Spotlight: The Complete Lasso Tradeoff Diagram »
  Hua Wang · Yachong Yang · Zhiqi Bu · Weijie Su
- 2019 Poster: Algorithmic Analysis and Statistical Estimation of SLOPE via Approximate Message Passing »
  Zhiqi Bu · Jason Klusowski · Cynthia Rush · Weijie Su
- 2019 Poster: Acceleration via Symplectic Discretization of High-Resolution Differential Equations »
  Bin Shi · Simon Du · Weijie Su · Michael Jordan