While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some deep learning applications, such as training transformers. However, it remains an open question why Adam converges significantly faster than SGD in these scenarios. In this paper, we explore one explanation of why Adam converges faster than SGD using a new concept, directional sharpness. We argue that the performance of optimization algorithms is closely related to the directional sharpness of the update steps, and show that SGD has much worse directional sharpness than adaptive algorithms. We further observe that only a small fraction of the coordinates causes the bad sharpness and slow convergence of SGD, and propose coordinate-wise clipping as a remedy for SGD and other optimization algorithms. We demonstrate the effect of coordinate-wise clipping in reducing sharpness and speeding up the convergence of optimization algorithms under various settings, and conclude that the sharpness reduction effect of adaptive coordinate-wise scaling is the reason for Adam’s success in practice.
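To make the two central ideas concrete, here is a minimal NumPy sketch on a toy quadratic. It is illustrative only and is not the authors' code. The abstract does not define directional sharpness, so the sketch assumes it means the curvature v^T H v along the normalized update direction v; the function names, the diagonal Hessian, and the clipping threshold are all hypothetical choices.

```python
import numpy as np

def directional_sharpness(hessian, update):
    """Curvature v^T H v along the normalized update direction v (assumed definition)."""
    v = update / np.linalg.norm(update)
    return float(v @ hessian @ v)

def clip_coordinatewise(grad, threshold):
    """Clip every coordinate of grad to the interval [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

# Toy quadratic f(x) = 0.5 * x^T H x in which a single coordinate is very sharp.
H = np.diag([100.0, 1.0, 1.0])
x = np.array([1.0, 1.0, 1.0])
grad = H @ x  # the gradient is dominated by the sharp coordinate

# Sharpness along the raw gradient direction (the SGD update direction).
sgd_sharpness = directional_sharpness(H, grad)

# Sharpness along the coordinate-wise clipped gradient direction.
clipped_sharpness = directional_sharpness(H, clip_coordinatewise(grad, 1.0))

print(f"SGD direction:     {sgd_sharpness:.2f}")
print(f"clipped direction: {clipped_sharpness:.2f}")
```

In this toy setting the raw gradient points almost entirely along the one sharp coordinate, so clipping that coordinate lowers the curvature seen along the update direction. This mirrors, under the stated assumptions, the abstract's observation that a small fraction of coordinates accounts for the bad sharpness.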
Author Information
Yan Pan (CMU, Carnegie Mellon University)
Yuanzhi Li (CMU)
More from the Same Authors
-
2021 : Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization »
Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu -
2022 : Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions »
Sitan Chen · Sinho Chewi · Jerry Li · Yuanzhi Li · Adil Salim · Anru Zhang -
2022 Poster: Towards Understanding the Mixture-of-Experts Layer in Deep Learning »
Zixiang Chen · Yihe Deng · Yue Wu · Quanquan Gu · Yuanzhi Li -
2022 Poster: The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning »
Zixin Wen · Yuanzhi Li -
2022 Poster: Vision Transformers provably learn spatial structure »
Samy Jelassi · Michael Sander · Yuanzhi Li -
2022 Poster: Learning (Very) Simple Generative Models Is Hard »
Sitan Chen · Jerry Li · Yuanzhi Li -
2021 Poster: Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels »
Stefani Karp · Ezra Winston · Yuanzhi Li · Aarti Singh -
2021 Poster: When Is Generalizable Reinforcement Learning Tractable? »
Dhruv Malik · Yuanzhi Li · Pradeep Ravikumar