Traditional analyses in non-convex optimization typically rely on the smoothness assumption, namely requiring the gradients to be Lipschitz. However, recent evidence shows that this smoothness condition does not capture the properties of some deep learning objective functions, including those involving Recurrent Neural Networks and LSTMs. Instead, these functions satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this relaxed assumption, it has been shown, both theoretically and empirically, that gradient-clipped SGD has an advantage over vanilla SGD. In this paper, we show that clipping is not indispensable for Adam-type algorithms in such scenarios: we theoretically prove that a generalized SignSGD algorithm can achieve convergence rates similar to those of SGD with clipping, but does not need explicit clipping at all. This family of algorithms recovers SignSGD at one extreme and closely resembles the popular Adam algorithm at the other. Our analysis underlines the critical role that momentum plays in analyzing SignSGD-type and Adam-type algorithms: it not only reduces the effects of noise, thus removing the need for the large mini-batches required in previous analyses of SignSGD-type algorithms, but it also substantially reduces the effects of unbounded smoothness and gradient norms. To the best of our knowledge, this work is the first to show the benefit of Adam-type algorithms over non-adaptive gradient algorithms such as gradient descent in the unbounded smoothness setting. We also compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating the others.
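To make the family concrete: the abstract describes updates built from a momentum term that is normalized coordinate-wise, interpolating between SignSGD and Adam. The sketch below is a rough illustration only, not the paper's exact algorithm or notation; the hyperparameter names (`beta1`, `beta2`, `eps`) and the toy quadratic objective are our own. Intuitively, with no second-moment averaging the normalized step degenerates toward a sign-style update, while heavier second-moment averaging recovers an Adam-like step.

```python
# Illustrative sketch (assumed form, not the paper's exact update rule):
# momentum m_t normalized coordinate-wise by a second-moment estimate v_t.
def generalized_signsgd_step(x, grad, m, v,
                             lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    # First-moment (momentum) estimate: averages out gradient noise.
    m_new = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Second-moment estimate used for coordinate-wise normalization.
    v_new = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Normalized step: magnitude stays ~lr per coordinate, so no explicit
    # clipping is needed even when gradients are large.
    x_new = [xi - lr * mi / ((vi ** 0.5) + eps)
             for xi, mi, vi in zip(x, m_new, v_new)]
    return x_new, m_new, v_new

# Toy usage: minimize f(x) = x1^2 + x2^2 from x = (3, -2).
x, m, v = [3.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    grad = [2.0 * xi for xi in x]
    x, m, v = generalized_signsgd_step(x, grad, m, v)
loss = sum(xi * xi for xi in x)
```

Note the design choice the normalization encodes: because each coordinate's step is bounded by roughly `lr` regardless of the raw gradient magnitude, large gradients under unbounded smoothness are tamed implicitly, which is the clipping-free behavior the abstract refers to.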
Author Information
Michael Crawshaw (George Mason University)
Mingrui Liu (George Mason University)
Francesco Orabona (Boston University)
Wei Zhang (IBM T.J.Watson Research Center)
BE, Beijing University of Technology, 2005; MSc, Technical University of Denmark, 2008; PhD, University of Wisconsin–Madison, 2013, all in computer science. Has published papers in ASPLOS, OOPSLA, OSDI, PLDI, IJCAI, ICDM, and NIPS.
Zhenxun Zhuang (Meta)
More from the Same Authors
- 2021: CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
  Ruchir Puri · David Kung · Geert Janssen · Wei Zhang · Giacomo Domeniconi · Vladimir Zolotov · Julian T Dolby · Jie Chen · Mihir Choudhury · Lindsey Decker · Veronika Thost · Luca Buratti · Saurabh Pujar · Shyam Ramji · Ulrich Finkler · Susan Malaika · Frederick Reiss
- 2023 Poster: Global Convergence Analysis of Local SGD for One-hidden-layer Convolutional Neural Network without Overparameterization
  Yajie Bao · Amarda Shehu · Mingrui Liu
- 2023 Poster: Federated Learning with Client Subsampling, Data Heterogeneity, and Unbounded Smoothness: A New Algorithm and Lower Bounds
  Michael Crawshaw · Yajie Bao · Mingrui Liu
- 2023 Poster: Bilevel Coreset Selection in Continual Learning: A New Formulation and Algorithm
  Jie Hao · Kaiyi Ji · Mingrui Liu
- 2022 Spotlight: A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
  Mingrui Liu · Zhenxun Zhuang · Yunwen Lei · Chunyang Liao
- 2022 Spotlight: Will Bilevel Optimizers Benefit from Loops
  Kaiyi Ji · Mingrui Liu · Yingbin Liang · Lei Ying
- 2022 Poster: A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks
  Mingrui Liu · Zhenxun Zhuang · Yunwen Lei · Chunyang Liao
- 2022 Poster: Will Bilevel Optimizers Benefit from Loops
  Kaiyi Ji · Mingrui Liu · Yingbin Liang · Lei Ying
- 2021 Poster: Minimax Optimal Quantile and Semi-Adversarial Regret via Root-Logarithmic Regularizers
  Jeffrey Negrea · Blair Bilodeau · Nicolò Campolongo · Francesco Orabona · Dan Roy
- 2021 Poster: Generalization Guarantee of SGD for Pairwise Learning
  Yunwen Lei · Mingrui Liu · Yiming Ying
- 2020 Poster: Improved Schemes for Episodic Memory-based Lifelong Learning
  Yunhui Guo · Mingrui Liu · Tianbao Yang · Tajana S Rosing
- 2020 Spotlight: Improved Schemes for Episodic Memory-based Lifelong Learning
  Yunhui Guo · Mingrui Liu · Tianbao Yang · Tajana S Rosing
- 2020 Poster: A Decentralized Parallel Algorithm for Training Generative Adversarial Nets
  Mingrui Liu · Wei Zhang · Youssef Mroueh · Xiaodong Cui · Jarret Ross · Tianbao Yang · Payel Das
- 2020 Poster: ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training
  Chia-Yu Chen · Jiamin Ni · Songtao Lu · Xiaodong Cui · Pin-Yu Chen · Xiao Sun · Naigang Wang · Swagath Venkataramani · Vijayalakshmi (Viji) Srinivasan · Wei Zhang · Kailash Gopalakrishnan
- 2020 Poster: Temporal Variability in Implicit Online Learning
  Nicolò Campolongo · Francesco Orabona
- 2019: Coffee/Poster session 2
  Xingyou Song · Puneet Mangla · David Salinas · Zhenxun Zhuang · Leo Feng · Shell Xu Hu · Raul Puri · Wesley Maddox · Aniruddh Raghu · Prudencio Tossou · Mingzhang Yin · Ishita Dasgupta · Kangwook Lee · Ferran Alet · Zhen Xu · Jörg Franke · James Harrison · Jonathan Warrell · Guneet Dhillon · Arber Zela · Xin Qiu · Julien Niklas Siems · Russell Mendonca · Louis Schlessinger · Jeffrey Li · Georgiana Manolache · Debojyoti Dutta · Lucas Glass · Abhishek Singh · Gregor Koehler
- 2019 Poster: Momentum-Based Variance Reduction in Non-Convex SGD
  Ashok Cutkosky · Francesco Orabona
- 2019 Poster: Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration
  Kwang-Sung Jun · Ashok Cutkosky · Francesco Orabona
- 2019 Poster: Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks
  Xiao Sun · Jungwook Choi · Chia-Yu Chen · Naigang Wang · Swagath Venkataramani · Vijayalakshmi (Viji) Srinivasan · Xiaodong Cui · Wei Zhang · Kailash Gopalakrishnan
- 2018 Poster: Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks
  Xiaodong Cui · Wei Zhang · Zoltán Tüske · Michael Picheny
- 2017 Poster: Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
  Xiangru Lian · Ce Zhang · Huan Zhang · Cho-Jui Hsieh · Wei Zhang · Ji Liu
- 2017 Oral: Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
  Xiangru Lian · Ce Zhang · Huan Zhang · Cho-Jui Hsieh · Wei Zhang · Ji Liu
- 2017 Poster: Training Deep Networks without Learning Rates Through Coin Betting
  Francesco Orabona · Tatiana Tommasi
- 2016 Poster: Coin Betting and Parameter-Free Online Learning
  Francesco Orabona · David Pal
- 2014 Workshop: Second Workshop on Transfer and Multi-Task Learning: Theory meets Practice
  Urun Dogan · Tatiana Tommasi · Yoshua Bengio · Francesco Orabona · Marius Kloft · Andres Munoz · Gunnar Rätsch · Hal Daumé III · Mehryar Mohri · Xuezhi Wang · Daniel Hernández-lobato · Song Liu · Thomas Unterthiner · Pascal Germain · Vinay P Namboodiri · Michael Goetz · Christopher Berlind · Sigurd Spieckermann · Marta Soare · Yujia Li · Vitaly Kuznetsov · Wenzhao Lian · Daniele Calandriello · Emilie Morvant
- 2014 Workshop: Modern Nonparametrics 3: Automating the Learning Pipeline
  Eric Xing · Mladen Kolar · Arthur Gretton · Samory Kpotufe · Han Liu · Zoltán Szabó · Alan Yuille · Andrew G Wilson · Ryan Tibshirani · Sasha Rakhlin · Damian Kozbur · Bharath Sriperumbudur · David Lopez-Paz · Kirthevasan Kandasamy · Francesco Orabona · Andreas Damianou · Wacha Bounliphone · Yanshuai Cao · Arijit Das · Yingzhen Yang · Giulia DeSalvo · Dmitry Storcheus · Roberto Valerio
- 2014 Poster: Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning
  Francesco Orabona
- 2013 Workshop: New Directions in Transfer and Multi-Task: Learning Across Domains and Tasks
  Urun Dogan · Marius Kloft · Tatiana Tommasi · Francesco Orabona · Massimiliano Pontil · Sinno Jialin Pan · Shai Ben-David · Arthur Gretton · Fei Sha · Marco Signoretto · Rajhans Samdani · Yun-Qian Miao · Mohammad Gheshlaghi azar · Ruth Urner · Christoph Lampert · Jonathan How
- 2013 Poster: Dimension-Free Exponentiated Gradient
  Francesco Orabona
- 2013 Spotlight: Dimension-Free Exponentiated Gradient
  Francesco Orabona
- 2013 Poster: Regression-tree Tuning in a Streaming Setting
  Samory Kpotufe · Francesco Orabona
- 2013 Spotlight: Regression-tree Tuning in a Streaming Setting
  Samory Kpotufe · Francesco Orabona
- 2012 Poster: On Multilabel Classification and Ranking with Partial Feedback
  Claudio Gentile · Francesco Orabona
- 2012 Spotlight: On Multilabel Classification and Ranking with Partial Feedback
  Claudio Gentile · Francesco Orabona
- 2010 Poster: New Adaptive Algorithms for Online Classification
  Francesco Orabona · Yacov Crammer
- 2010 Spotlight: Learning from Candidate Labeling Sets
  Jie Luo · Francesco Orabona
- 2010 Poster: Learning from Candidate Labeling Sets
  Jie Luo · Francesco Orabona
- 2009 Workshop: Learning from Multiple Sources with Applications to Robotics
  Barbara Caputo · Nicolò Cesa-Bianchi · David R Hardoon · Gayle Leen · Francesco Orabona · Jaakko Peltonen · Simon Rogers