DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD
Jingyang Li · Pan Zhou · Kuangyu Ding · Kim-Chuan Toh · Yinyu Ye
Event URL: https://openreview.net/forum?id=6ofsu7wzSMm
Adaptive gradient methods, such as Adam, have shown faster convergence than SGD across various network models, but at the expense of inferior generalization performance. In this work, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate this generalization gap. DRAG combines SGD and Adam within a trust-region-like framework. We observe that 1) Adam adjusts the stepsize of each gradient coordinate according to the loss curvature, effectively decomposing the $n$-dimensional gradient into $n$ standard basis directions to search along; 2) SGD scales the gradient uniformly across all coordinates and thus has only one descent direction to minimize along. Accordingly, DRAG reduces the high degree of freedom of Adam and improves the flexibility of SGD by optimizing the loss along $k\ (\ll \! n)$ descent directions, e.g. the gradient direction and the momentum direction used in this work. At each iteration, DRAG finds the best stepsizes for the $k$ descent directions by solving a trust-region subproblem whose computational overhead is negligible, since the subproblem is low-dimensional, e.g. $k=2$ in this work. DRAG is compatible with the common deep learning training pipeline, introduces no extra hyper-parameters, and incurs negligible extra computation. Experimental results on representative benchmarks demonstrate the fast convergence and superior generalization of DRAG.
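To make the per-iteration idea concrete, here is a minimal NumPy sketch on a toy quadratic loss: each step restricts the update to the span of the gradient and the momentum ($k=2$) and picks the two stepsizes by solving a tiny 2-D trust-region subproblem. This is an illustration under assumptions, not the authors' implementation: the quadratic test loss, the momentum factor `beta`, the trust-region radius `delta`, and the brute-force boundary search in the hypothetical helper `solve_tr_2d` are all choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A = A @ A.T + np.eye(n)                  # positive-definite quadratic: f(x) = 0.5 x'Ax - b'x
b = rng.standard_normal(n)
loss = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def solve_tr_2d(H, c, delta):
    """Minimize 0.5 a'Ha + c'a over the 2-D disk ||a|| <= delta."""
    candidates = []
    try:
        a0 = np.linalg.solve(H, -c)      # interior stationary point, if it exists
        if np.linalg.norm(a0) <= delta:
            candidates.append(a0)
    except np.linalg.LinAlgError:
        pass
    # The subproblem is only 2-D, so a brute-force sweep of the boundary is cheap.
    theta = np.linspace(0.0, 2 * np.pi, 720, endpoint=False)
    candidates.extend(delta * np.stack([np.cos(theta), np.sin(theta)], axis=1))
    values = [0.5 * a @ H @ a + c @ a for a in candidates]
    return candidates[int(np.argmin(values))]

x = rng.standard_normal(n)
m = np.zeros(n)                          # momentum buffer
beta, delta = 0.9, 1.0                   # hypothetical momentum factor and trust-region radius
for _ in range(200):
    g = grad(x)
    m = beta * m + g
    D = np.stack([g, m], axis=1)         # n x 2: the k = 2 descent directions
    H_red = D.T @ A @ D                  # reduced 2 x 2 curvature along the directions
    c_red = D.T @ g                      # reduced linear term of the local quadratic model
    a = solve_tr_2d(H_red, c_red, delta) # best stepsizes for the two directions
    x = x + D @ a                        # step is a learned combination of g and m
print("final loss:", loss(x))
```

In a real training pipeline the reduced $2 \times 2$ curvature would presumably be estimated, e.g. via Hessian-vector products along the two directions, rather than formed from an explicit Hessian as in this toy example.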
Author Information
Jingyang Li (National University of Singapore)
Pan Zhou (SEA AI Lab)
Currently, I am a Senior Research Scientist at Sea AI Lab of Sea Group. Before that, I worked at Salesforce as a research scientist from 2019 to 2021. I completed my Ph.D. in 2019 at the National University of Singapore (NUS), fortunately advised by Prof. Jiashi Feng and Prof. Shuicheng Yan. Before studying at NUS, I graduated from Peking University (PKU) in 2016, where I was fortunately directed by Prof. Zhouchen Lin and Prof. Chao Zhang in the ZERO Lab. During my research, I have also worked closely with Prof. Xiaotong Yuan. I also spent several wonderful months in 2018 at Georgia Tech as a visiting student hosted by Prof. Huan Xu.
Kuangyu Ding (National University of Singapore)
Kim-Chuan Toh
Yinyu Ye (Stanford)
More from the Same Authors
- 2020: Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML
  Pan Zhou
- 2021 Spotlight: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2022 Poster: Inception Transformer
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022: Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
  Xingyu Xie · Pan Zhou · Huan Li · Zhouchen Lin · Shuicheng Yan
- 2022: Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms
  Pan Zhou · Xingyu Xie · Shuicheng Yan
- 2022: How Small Amount of Data Sharing Benefits Higher-Order Distributed Optimization and Learning
  Mingxi Zhu · Yinyu Ye
- 2023 Poster: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
  Zhongzhan Huang · Pan Zhou · Shuicheng Yan · Liang Lin
- 2022 Spotlight: Inception Transformer
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 Spotlight: Lightning Talks 2B-1
  Yehui Tang · Jian Wang · Zheng Chen · man zhou · Peng Gao · Chenyang Si · SHANGKUN SUN · Yixing Xu · Weihao Yu · Xinghao Chen · Kai Han · Hu Yu · Yulun Zhang · Chenhui Gou · Teli Ma · Yuanqi Chen · Yunhe Wang · Hongsheng Li · Jinjin Gu · Jianyuan Guo · Qiman Wu · Pan Zhou · Yu Zhu · Jie Huang · Chang Xu · Yichen Zhou · Haocheng Feng · Guodong Guo · yongbing zhang · Ziyi Lin · Feng Zhao · Ge Li · Junyu Han · Jinwei Gu · Jifeng Dai · Chao Xu · Xinchao Wang · Linghe Kong · Shuicheng Yan · Yu Qiao · Chen Change Loy · Xin Yuan · Errui Ding · Yunhe Wang · Deyu Meng · Jingdong Wang · Chongyi Li
- 2021: Online Learning via Linear Programming, Yinyu Ye
  Yinyu Ye
- 2021 Poster: Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond
  Pan Zhou · Hanshu Yan · Xiaotong Yuan · Jiashi Feng · Shuicheng Yan
- 2021 Poster: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2020 Poster: Towards Theoretically Understanding Why Sgd Generalizes Better Than Adam in Deep Learning
  Pan Zhou · Jiashi Feng · Chao Ma · Caiming Xiong · Steven Chu Hong Hoi · Weinan E
- 2020 Poster: Theory-Inspired Path-Regularized Differential Network Architecture Search
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Poster: Simple and Fast Algorithm for Binary Integer and Online Linear Programming
  Xiaocheng Li · Chunlin Sun · Yinyu Ye
- 2020 Oral: Theory-Inspired Path-Regularized Differential Network Architecture Search
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Poster: Conic Descent and its Application to Memory-efficient Optimization over Positive Semidefinite Matrices
  John Duchi · Oliver Hinder · Andrew Naber · Yinyu Ye
- 2020 Poster: Improving GAN Training with Probability Ratio Clipping and Sample Reweighting
  Yue Wu · Pan Zhou · Andrew Wilson · Eric Xing · Zhiting Hu
- 2020 Poster: Distributionally Robust Local Non-parametric Conditional Estimation
  Viet Anh Nguyen · Fan Zhang · Jose Blanchet · Erick Delage · Yinyu Ye
- 2019: Poster and Coffee Break 2
  Karol Hausman · Kefan Dong · Ken Goldberg · Lihong Li · Lin Yang · Lingxiao Wang · Lior Shani · Liwei Wang · Loren Amdahl-Culleton · Lucas Cassano · Marc Dymetman · Marc Bellemare · Marcin Tomczak · Margarita Castro · Marius Kloft · Marius-Constantin Dinu · Markus Holzleitner · Martha White · Mengdi Wang · Michael Jordan · Mihailo Jovanovic · Ming Yu · Minshuo Chen · Moonkyung Ryu · Muhammad Zaheer · Naman Agarwal · Nan Jiang · Niao He · Nikolaus Yasui · Nikos Karampatziakis · Nino Vieillard · Ofir Nachum · Olivier Pietquin · Ozan Sener · Pan Xu · Parameswaran Kamalaruban · Paul Mineiro · Paul Rolland · Philip Amortila · Pierre-Luc Bacon · Prakash Panangaden · Qi Cai · Qiang Liu · Quanquan Gu · Raihan Seraj · Richard Sutton · Rick Valenzano · Robert Dadashi · Rodrigo Toro Icarte · Roshan Shariff · Roy Fox · Ruosong Wang · Saeed Ghadimi · Samuel Sokota · Sean Sinclair · Sepp Hochreiter · Sergey Levine · Sergio Valcarcel Macua · Sham Kakade · Shangtong Zhang · Sheila McIlraith · Shie Mannor · Shimon Whiteson · Shuai Li · Shuang Qiu · Wai Lok Li · Siddhartha Banerjee · Sitao Luan · Tamer Basar · Thinh Doan · Tianhe Yu · Tianyi Liu · Tom Zahavy · Toryn Klassen · Tuo Zhao · Vicenç Gómez · Vincent Liu · Volkan Cevher · Wesley Suttle · Xiao-Wen Chang · Xiaohan Wei · Xiaotong Liu · Xingguo Li · Xinyi Chen · Xingyou Song · Yao Liu · YiDing Jiang · Yihao Feng · Yilun Du · Yinlam Chow · Yinyu Ye · Yishay Mansour · · Yonathan Efroni · Yongxin Chen · Yuanhao Wang · Bo Dai · Chen-Yu Wei · Harsh Shrivastava · Hongyang Zhang · Qinqing Zheng · SIDDHARTHA SATPATHI · Xueqing Liu · Andreu Vall
- 2019: Poster Spotlight 2
  Aaron Sidford · Mengdi Wang · Lin Yang · Yinyu Ye · Zuyue Fu · Zhuoran Yang · Yongxin Chen · Zhaoran Wang · Ofir Nachum · Bo Dai · Ilya Kostrikov · Dale Schuurmans · Ziyang Tang · Yihao Feng · Lihong Li · Denny Zhou · Qiang Liu · Rodrigo Toro Icarte · Ethan Waldie · Toryn Klassen · Rick Valenzano · Margarita Castro · Simon Du · Sham Kakade · Ruosong Wang · Minshuo Chen · Tianyi Liu · Xingguo Li · Zhaoran Wang · Tuo Zhao · Philip Amortila · Doina Precup · Prakash Panangaden · Marc Bellemare
- 2019 Poster: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2019 Spotlight: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2019 Poster: Interior-Point Methods Strike Back: Solving the Wasserstein Barycenter Problem
  DongDong Ge · Haoyue Wang · Zikai Xiong · Yinyu Ye
- 2018 Poster: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: Efficient Stochastic Gradient Hard Thresholding
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model
  Aaron Sidford · Mengdi Wang · Xian Wu · Lin Yang · Yinyu Ye