It is not yet clear why ADAM-like adaptive gradient algorithms generalize worse than SGD despite their faster training speed. This work aims to explain this generalization gap by analyzing the local convergence behaviors of these algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), because an algorithm and its SDE have similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD has a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin; and (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than in SGD. SGD is therefore more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. Since flat minima, which here refer to minima in flat or asymmetric basins/valleys, often generalize better than sharp ones \cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and our theoretical results.
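The two dependencies in claim (1) can be illustrated with a toy simulation. The sketch below is not the paper's model or experiment: it assumes a one-dimensional quadratic basin, uses the basin radius as a crude stand-in for the Radon measure, and drives a discretized gradient flow with symmetric alpha-stable noise drawn by the Chambers-Mallows-Stuck method. The function names and the parameters `eta`, `eps`, `trials`, and `max_steps` are hypothetical choices made for this illustration. A smaller tail index `alpha` means heavier tails.

```python
import numpy as np


def sample_symmetric_alpha_stable(alpha, size, rng):
    """Draw symmetric alpha-stable samples (Chambers-Mallows-Stuck), alpha != 1."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def mean_escape_steps(alpha, radius, eta=0.01, eps=0.3,
                      trials=64, max_steps=20000, seed=0):
    """Mean number of steps until |x| > radius in the toy basin f(x) = x^2 / 2.

    Trials that never escape are counted as max_steps (censored).
    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(trials)                      # every trial starts at the minimum
    escape_step = np.full(trials, max_steps)  # default: censored at max_steps
    alive = np.ones(trials, dtype=bool)
    scale = eps * eta ** (1.0 / alpha)        # Levy-noise scaling of the step size
    for t in range(1, max_steps + 1):
        noise = sample_symmetric_alpha_stable(alpha, trials, rng)
        x = x - eta * x + scale * noise       # gradient step plus heavy-tailed noise
        escaped = alive & (np.abs(x) > radius)
        escape_step[escaped] = t
        alive &= ~escaped
        if not alive.any():
            break
    return escape_step.mean()


if __name__ == "__main__":
    print("heavy tails (alpha=1.2):", mean_escape_steps(1.2, 1.0))
    print("light tails (alpha=1.9):", mean_escape_steps(1.9, 1.0))
    print("narrow basin (radius=0.5):", mean_escape_steps(1.2, 0.5))
    print("wide basin (radius=2.0):", mean_escape_steps(1.2, 2.0))
```

Under these assumptions, the heavier-tailed run and the narrower basin should both give shorter mean escape times, in line with the qualitative statement in the abstract. The toy setting says nothing about the SGD versus ADAM comparison itself.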
Author Information
Pan Zhou (Salesforce)
Jiashi Feng (National University of Singapore)
Chao Ma (Princeton University)
Caiming Xiong (Salesforce)
Steven Chu Hong Hoi (Salesforce)
Weinan E (Princeton University)
More from the Same Authors
- 2020: Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML
  Pan Zhou
- 2021 Spotlight: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
  Junnan Li · Ramprasaath Selvaraju · Akhilesh Gotmare · Shafiq Joty · Caiming Xiong · Steven Chu Hong Hoi
- 2021 Spotlight: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2022 Poster: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
  Hung Le · Yue Wang · Akhilesh Deepak Gotmare · Silvio Savarese · Steven Chu Hong Hoi
- 2021: Weinan E - Machine Learning and PDEs
  Weinan E
- 2021 Workshop: Distribution shifts: connecting methods and applications (DistShift)
  Shiori Sagawa · Pang Wei Koh · Fanny Yang · Hongseok Namkoong · Jiashi Feng · Kate Saenko · Percy Liang · Sarah Bird · Sergey Levine
- 2021 Poster: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
  Junnan Li · Ramprasaath Selvaraju · Akhilesh Gotmare · Shafiq Joty · Caiming Xiong · Steven Chu Hong Hoi
- 2021 Poster: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2020 Poster: Theory-Inspired Path-Regularized Differential Network Architecture Search
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Oral: Theory-Inspired Path-Regularized Differential Network Architecture Search
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Poster: Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts
  Guilin Li · Junlei Zhang · Yunhe Wang · Chuanjian Liu · Matthias Tan · Yunfeng Lin · Wei Zhang · Jiashi Feng · Tong Zhang
- 2020 Poster: Online Structured Meta-learning
  Huaxiu Yao · Yingbo Zhou · Mehrdad Mahdavi · Zhenhui (Jessie) Li · Richard Socher · Caiming Xiong
- 2020 Poster: Improving Generalization in Reinforcement Learning with Mixture Regularization
  Kaixin Wang · Bingyi Kang · Jie Shao · Jiashi Feng
- 2020 Poster: Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation
  Jianfeng Zhang · Xuecheng Nie · Jiashi Feng
- 2020 Poster: Towards Understanding Hierarchical Learning: Benefits of Neural Representations
  Minshuo Chen · Yu Bai · Jason Lee · Tuo Zhao · Huan Wang · Caiming Xiong · Richard Socher
- 2020 Poster: ConvBERT: Improving BERT with Span-based Dynamic Convolution
  Zi-Hang Jiang · Weihao Yu · Daquan Zhou · Yunpeng Chen · Jiashi Feng · Shuicheng Yan
- 2020 Spotlight: ConvBERT: Improving BERT with Span-based Dynamic Convolution
  Zi-Hang Jiang · Weihao Yu · Daquan Zhou · Yunpeng Chen · Jiashi Feng · Shuicheng Yan
- 2019 Poster: LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition
  Zuxuan Wu · Caiming Xiong · Yu-Gang Jiang · Larry Davis
- 2019 Poster: Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards
  Alexander Trott · Stephan Zheng · Caiming Xiong · Richard Socher
- 2019 Poster: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2019 Spotlight: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2018 Poster: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: Efficient Stochastic Gradient Hard Thresholding
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective
  Lei Wu · Chao Ma · Weinan E
- 2018 Poster: A^2-Nets: Double Attention Networks
  Yunpeng Chen · Yannis Kalantidis · Jianshu Li · Shuicheng Yan · Jiashi Feng
- 2018 Poster: End-to-end Symmetry Preserving Inter-atomic Potential Energy Model for Finite and Extended Systems
  Linfeng Zhang · Jiequn Han · Han Wang · Wissam Saidi · Roberto Car · Weinan E
- 2017 Poster: Dual Path Networks
  Yunpeng Chen · Jianan Li · Huaxin Xiao · Xiaojie Jin · Shuicheng Yan · Jiashi Feng
- 2017 Spotlight: Dual Path Networks
  Yunpeng Chen · Jianan Li · Huaxin Xiao · Xiaojie Jin · Shuicheng Yan · Jiashi Feng
- 2017 Poster: Multimodal Learning and Reasoning for Visual Question Answering
  Ilija Ilievski · Jiashi Feng
- 2017 Poster: Learned in Translation: Contextualized Word Vectors
  Bryan McCann · James Bradbury · Caiming Xiong · Richard Socher
- 2017 Poster: Predicting Scene Parsing and Motion Dynamics in the Future
  Xiaojie Jin · Huaxin Xiao · Xiaohui Shen · Jimei Yang · Zhe Lin · Yunpeng Chen · Zequn Jie · Jiashi Feng · Shuicheng Yan
- 2017 Poster: Dual-Agent GANs for Photorealistic and Identity Preserving Profile Face Synthesis
  Jian Zhao · Lin Xiong · Panasonic Karlekar Jayashree · Jianshu Li · Fang Zhao · Zhecan Wang · Panasonic Sugiri Pranata · Panasonic Shengmei Shen · Shuicheng Yan · Jiashi Feng
- 2016 Poster: Tree-Structured Reinforcement Learning for Sequential Object Localization
  Zequn Jie · Xiaodan Liang · Jiashi Feng · Xiaojie Jin · Wen Lu · Shuicheng Yan