Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Xingyu Xie · Pan Zhou · Huan Li · Zhouchen Lin · Shuicheng Yan
Event URL: https://openreview.net/forum?id=HQJEobVV1i
Adaptive gradient algorithms combine the moving-average idea with heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. But Nesterov acceleration, which converges faster than heavy-ball acceleration both in theory and in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm (Adan) to speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method that avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order gradient moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate stationary point within $\mathcal{O}(\epsilon^{-4})$ stochastic gradient complexity on non-convex stochastic problems, matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNeXt, ViT, Swin, MAE, Transformer-XL, and BERT. More surprisingly, Adan can achieve higher or comparable performance on ViT, ResNet, MAE, etc., with half the training cost (epochs) of SoTA optimizers, and also shows great tolerance to a wide range of minibatch sizes, e.g., from 1k to 32k.
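To make the update rules behind the abstract concrete, here is a minimal NumPy sketch of one Adan step following the algorithm described in the paper. The function name adan_step, the state layout, and the default hyperparameters are illustrative assumptions, not the authors' official implementation.

    import numpy as np

    def adan_step(theta, grad, prev_grad, state, lr=1e-3,
                  betas=(0.02, 0.08, 0.01), eps=1e-8, weight_decay=0.0):
        # One Adan update on parameter array `theta` (illustrative sketch).
        # `state` holds the running moments m, v, n, each shaped like theta.
        # The beta defaults follow the paper's convention (beta is the weight
        # on the newest term) and are placeholders, not tuned values.
        b1, b2, b3 = betas
        diff = grad - prev_grad                    # g_k - g_{k-1}
        m = (1 - b1) * state["m"] + b1 * grad      # first-order moment
        v = (1 - b2) * state["v"] + b2 * diff      # moment of gradient differences
        nme = grad + (1 - b2) * diff               # Nesterov momentum estimation:
                                                   # surrogate for the gradient at the
                                                   # extrapolation point, no extra pass
        n = (1 - b3) * state["n"] + b3 * nme ** 2  # second-order moment of the NME
        update = (m + (1 - b2) * v) / (np.sqrt(n) + eps)
        theta = (theta - lr * update) / (1 + lr * weight_decay)  # decoupled decay
        state.update(m=m, v=v, n=n)
        return theta, state

In a training loop one would initialize state = {"m": g0, "v": np.zeros_like(g0), "n": g0 ** 2} from the first gradient g0 and then call adan_step once per minibatch. The key point of the NME term is that the extrapolated gradient is estimated from g_k and g_{k-1} alone, so no second forward/backward pass at an extrapolation point is needed.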
Author Information
Xingyu Xie (Peking University)
Pan Zhou (Sea AI Lab)
Currently, I am a senior research scientist at Sea AI Lab of Sea Group. Before that, I worked at Salesforce as a research scientist from 2019 to 2021. I completed my Ph.D. in 2019 at the National University of Singapore (NUS), where I was fortunate to be advised by Prof. Jiashi Feng and Prof. Shuicheng Yan. Before studying at NUS, I graduated from Peking University (PKU) in 2016, where I was directed by Prof. Zhouchen Lin and Prof. Chao Zhang in the ZERO Lab. During my research, I have also worked closely with Prof. Xiaotong Yuan. In 2018, I spent several wonderful months at Georgia Tech as a visiting student hosted by Prof. Huan Xu.
Huan Li (Peking University)
Zhouchen Lin (Peking University)
Shuicheng Yan (Sea AI Lab)
More from the Same Authors
- 2020 : Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML
  Pan Zhou
- 2021 Spotlight: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2021 Spotlight: Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State
  Mingqing Xiao · Qingyan Meng · Zongpeng Zhang · Yisen Wang · Zhouchen Lin
- 2022 Poster: Rethinking Knowledge Graph Evaluation Under the Open-World Assumption
  Haotong Yang · Zhouchen Lin · Muhan Zhang
- 2022 Poster: Inception Transformer
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 : Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms
  Pan Zhou · Xingyu Xie · Shuicheng Yan
- 2022 : DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD
  Jingyang Li · Pan Zhou · Kuangyu Ding · Kim-Chuan Toh · Yinyu Ye
- 2022 : Boosting Offline Reinforcement Learning via Data Resampling
  Yang Yue · Bingyi Kang · Xiao Ma · Zhongwen Xu · Gao Huang · Shuicheng Yan
- 2022 : Mutual Information Regularized Offline Reinforcement Learning
  Xiao Ma · Bingyi Kang · Zhongwen Xu · Min Lin · Shuicheng Yan
- 2022 : HloEnv: A Graph Rewrite Environment for Deep Learning Compiler Optimization Research
  Chin Yang Oh · Kunhao Zheng · Bingyi Kang · Xinyi Wan · Zhongwen Xu · Shuicheng Yan · Min Lin · Yangzihao Wang
- 2022 : Efficient Offline Policy Optimization with a Learned Model
  Zichen Liu · Siyi Li · Wee Sun Lee · Shuicheng Yan · Zhongwen Xu
- 2022 : Visual Imitation Learning with Patch Rewards
  Minghuan Liu · Tairan He · Weinan Zhang · Shuicheng Yan · Zhongwen Xu
- 2023 Poster: Balance, Imbalance, and Rebalance: Understanding Robust Overfitting from a Minimax Game Perspective
  Yifei Wang · Liangchen Li · Jiansheng Yang · Zhouchen Lin · Yisen Wang
- 2023 Poster: A Single-Loop Accelerated Extra-Gradient Difference Algorithm with Improved Complexity Bounds for Constrained Minimax Optimization
  Yuanyuan Liu · Fanhua Shang · Weixin An · Junhao Liu · Hongying Liu · Zhouchen Lin
- 2023 Poster: GEQ: Gaussian Kernel Inspired Equilibrium Models
  Mingjie Li · Yisen Wang · Zhouchen Lin
- 2023 Poster: Task-Robust Pre-Training for Worst-Case Downstream Adaptation
  Jianghui Wang · Yang Chen · Xingyu Xie · Cong Fang · Zhouchen Lin
- 2023 Poster: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
  Zhongzhan Huang · Pan Zhou · Shuicheng Yan · Liang Lin
- 2023 Oral: A Single-Loop Accelerated Extra-Gradient Difference Algorithm with Improved Complexity Bounds for Constrained Minimax Optimization
  Yuanyuan Liu · Fanhua Shang · Weixin An · Junhao Liu · Hongying Liu · Zhouchen Lin
- 2022 Spotlight: Lightning Talks 4A-3
  Zhihan Gao · Yabin Wang · Xingyu Qu · Luziwei Leng · Mingqing Xiao · Bohan Wang · Yu Shen · Zhiwu Huang · Xingjian Shi · Qi Meng · Yupeng Lu · Diyang Li · Qingyan Meng · Kaiwei Che · Yang Li · Hao Wang · Huishuai Zhang · Zongpeng Zhang · Kaixuan Zhang · Xiaopeng Hong · Xiaohan Zhao · Di He · Jianguo Zhang · Yaofeng Tu · Bin Gu · Yi Zhu · Ruoyu Sun · Yuyang (Bernie) Wang · Zhouchen Lin · Qinghu Meng · Wei Chen · Wentao Zhang · Bin CUI · Jie Cheng · Zhi-Ming Ma · Mu Li · Qinghai Guo · Dit-Yan Yeung · Tie-Yan Liu · Jianxing Liao
- 2022 Spotlight: Online Training Through Time for Spiking Neural Networks
  Mingqing Xiao · Qingyan Meng · Zongpeng Zhang · Di He · Zhouchen Lin
- 2022 Spotlight: Inception Transformer
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 Spotlight: Lightning Talks 2B-1
  Yehui Tang · Jian Wang · Zheng Chen · man zhou · Peng Gao · Chenyang Si · SHANGKUN SUN · Yixing Xu · Weihao Yu · Xinghao Chen · Kai Han · Hu Yu · Yulun Zhang · Chenhui Gou · Teli Ma · Yuanqi Chen · Yunhe Wang · Hongsheng Li · Jinjin Gu · Jianyuan Guo · Qiman Wu · Pan Zhou · Yu Zhu · Jie Huang · Chang Xu · Yichen Zhou · Haocheng Feng · Guodong Guo · yongbing zhang · Ziyi Lin · Feng Zhao · Ge Li · Junyu Han · Jinwei Gu · Jifeng Dai · Chao Xu · Xinchao Wang · Linghe Kong · Shuicheng Yan · Yu Qiao · Chen Change Loy · Xin Yuan · Errui Ding · Yunhe Wang · Deyu Meng · Jingdong Wang · Chongyi Li
- 2022 Poster: Inducing Neural Collapse in Imbalanced Learning: Do We Really Need a Learnable Classifier at the End of Deep Neural Network?
  Yibo Yang · Shixiang Chen · Xiangtai Li · Liang Xie · Zhouchen Lin · Dacheng Tao
- 2022 Poster: Towards Theoretically Inspired Neural Initialization Optimization
  Yibo Yang · Hong Wang · Haobo Yuan · Zhouchen Lin
- 2022 Poster: EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine
  Jiayi Weng · Min Lin · Shengyi Huang · Bo Liu · Denys Makoviichuk · Viktor Makoviychuk · Zichen Liu · Yufan Song · Ting Luo · Yukun Jiang · Zhongwen Xu · Shuicheng Yan
- 2022 Poster: Online Training Through Time for Spiking Neural Networks
  Mingqing Xiao · Qingyan Meng · Zongpeng Zhang · Di He · Zhouchen Lin
- 2021 Poster: On Training Implicit Models
  Zhengyang Geng · Xin-Yu Zhang · Shaojie Bai · Yisen Wang · Zhouchen Lin
- 2021 Poster: Dissecting the Diffusion Process in Linear Graph Convolutional Networks
  Yifei Wang · Yisen Wang · Jiansheng Yang · Zhouchen Lin
- 2021 Poster: Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond
  Pan Zhou · Hanshu Yan · Xiaotong Yuan · Jiashi Feng · Shuicheng Yan
- 2021 Poster: Gauge Equivariant Transformer
  Lingshen He · Yiming Dong · Yisen Wang · Dacheng Tao · Zhouchen Lin
- 2021 Poster: Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State
  Mingqing Xiao · Qingyan Meng · Zongpeng Zhang · Yisen Wang · Zhouchen Lin
- 2021 Poster: Efficient Equivariant Network
  Lingshen He · Yuxuan Chen · zhengyang shen · Yiming Dong · Yisen Wang · Zhouchen Lin
- 2021 Poster: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2021 Poster: Residual Relaxation for Multi-view Representation Learning
  Yifei Wang · Zhengyang Geng · Feng Jiang · Chuming Li · Yisen Wang · Jiansheng Yang · Zhouchen Lin
- 2020 Poster: Towards Theoretically Understanding Why Sgd Generalizes Better Than Adam in Deep Learning
  Pan Zhou · Jiashi Feng · Chao Ma · Caiming Xiong · Steven Chu Hong Hoi · Weinan E
- 2020 Poster: Theory-Inspired Path-Regularized Differential Network Architecture Search
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Oral: Theory-Inspired Path-Regularized Differential Network Architecture Search
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Poster: ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding
  Yibo Yang · Hongyang Li · Shan You · Fei Wang · Chen Qian · Zhouchen Lin
- 2020 Poster: Improving GAN Training with Probability Ratio Clipping and Sample Reweighting
  Yue Wu · Pan Zhou · Andrew Wilson · Eric Xing · Zhiting Hu
- 2019 Poster: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2019 Spotlight: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2018 Workshop: NIPS 2018 workshop on Compact Deep Neural Networks with industrial applications
  Lixin Fan · Zhouchen Lin · Max Welling · Yurong Chen · Werner Bailer
- 2018 Poster: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: Efficient Stochastic Gradient Hard Thresholding
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator
  Cong Fang · Chris Junchi Li · Zhouchen Lin · Tong Zhang
- 2018 Spotlight: SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator
  Cong Fang · Chris Junchi Li · Zhouchen Lin · Tong Zhang
- 2018 Poster: Joint Sub-bands Learning with Clique Structures for Wavelet Domain Super-Resolution
  Zhisheng Zhong · Tiancheng Shen · Yibo Yang · Zhouchen Lin · Chao Zhang
- 2017 Poster: Faster and Non-ergodic O(1/K) Stochastic Alternating Direction Method of Multipliers
  Cong Fang · Feng Cheng · Zhouchen Lin
- 2015 Poster: Accelerated Proximal Gradient Methods for Nonconvex Programming
  Huan Li · Zhouchen Lin