On the Maximum Hessian Eigenvalue and Generalization
Simran Kaur · Jeremy M Cohen · Zachary Lipton
Event URL: https://openreview.net/forum?id=llza_S8mHT- »
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM), that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, generalization benefits also vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{max}$; and (5) while batch normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.
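For readers unfamiliar with the flatness metric named in the abstract, the following is a minimal sketch of one common way to estimate $\lambda_{max}$: power iteration on Hessian-vector products. It is not taken from the paper. The function names (`hessian_vector_product`, `lambda_max`), the finite-difference approximation, and the toy quadratic loss are all assumptions chosen for illustration, and power iteration returns the eigenvalue of largest magnitude, which equals $\lambda_{max}$ only when the top eigenvalue dominates in absolute value (for example, when the Hessian is positive semidefinite).

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H(w) @ v with a central finite difference of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def lambda_max(grad_fn, w, num_iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eigenvalue = float(v @ hv)  # Rayleigh quotient, valid because ||v|| = 1
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break  # v lies in the null space of the Hessian
        v = hv / norm
    return eigenvalue

# Hypothetical toy loss L(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.diag([4.0, 1.0, 0.25])

def grad_fn(w):
    return A @ w

estimate = lambda_max(grad_fn, np.ones(3))
print(estimate)  # should be close to 4.0, the largest eigenvalue of A
```

For a real network, the finite-difference step would typically be replaced by an automatic-differentiation Hessian-vector product, and the gradient would be evaluated on a fixed batch of training data; those choices are left open here.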
Author Information
Simran Kaur (Princeton University)
Jeremy M Cohen (Carnegie Mellon University)
Zachary Lipton (Carnegie Mellon University)
More from the Same Authors
-
2021 : Model-Free Learning for Continuous Timing as an Action »
Helen Zhou · David Childers · Zachary Lipton -
2022 : Downstream Datasets Make Surprisingly Good Pretraining Corpora »
Kundan Krishna · Saurabh Garg · Jeffrey Bigham · Zachary Lipton -
2022 : Disentangling the Mechanisms Behind Implicit Regularization in SGD »
Zachary Novack · Simran Kaur · Tanya Marwah · Saurabh Garg · Zachary Lipton -
2022 : RLSBench: A Large-Scale Empirical Study of Domain Adaptation Under Relaxed Label Shift »
Saurabh Garg · Nick Erickson · James Sharpnack · Alexander Smola · Sivaraman Balakrishnan · Zachary Lipton -
2022 : Local Causal Discovery for Estimating Causal Effects »
Shantanu Gupta · David Childers · Zachary Lipton -
2022 : Panel on Technical Challenges Associated with Reliable Human Evaluations of Generative Models »
Long Ouyang · Tongshuang Wu · Zachary Lipton -
2022 Workshop: Human Evaluation of Generative Models »
Divyansh Kaushik · Jennifer Hsia · Jessica Huynh · Yonadav Shavit · Samuel Bowman · Ting-Hao Huang · Douwe Kiela · Zachary Lipton · Eric Michael Smith -
2022 Poster: Characterizing Datapoints via Second-Split Forgetting »
Pratyush Maini · Saurabh Garg · Zachary Lipton · J. Zico Kolter -
2022 Poster: Unsupervised Learning under Latent Label Shift »
Manley Roberts · Pranav Mani · Saurabh Garg · Zachary Lipton -
2022 Poster: Domain Adaptation under Open Set Label Shift »
Saurabh Garg · Sivaraman Balakrishnan · Zachary Lipton -
2020 : Contributed Talk 1: Fairness Under Partial Compliance »
Jessica Dai · Zachary Lipton -
2020 : Q & A and Panel Session with Tom Mitchell, Jenn Wortman Vaughan, Sanjoy Dasgupta, and Finale Doshi-Velez »
Tom Mitchell · Jennifer Wortman Vaughan · Sanjoy Dasgupta · Finale Doshi-Velez · Zachary Lipton -
2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies »
Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela -
2020 Poster: A Unified View of Label Shift Estimation »
Saurabh Garg · Yifan Wu · Sivaraman Balakrishnan · Zachary Lipton -
2019 Poster: Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift »
Stephan Rabanser · Stephan Günnemann · Zachary Lipton -
2019 Poster: Learning Robust Global Representations by Penalizing Local Predictive Power »
Haohan Wang · Songwei Ge · Zachary Lipton · Eric Xing -
2019 Poster: Game Design for Eliciting Distinguishable Behavior »
Fan Yang · Liu Leqi · Yifan Wu · Zachary Lipton · Pradeep Ravikumar · Tom M Mitchell · William Cohen -
2018 : Invited Talk 1 »
Zachary Lipton -
2018 : Panel on research process »
Zachary Lipton · Charles Sutton · Finale Doshi-Velez · Hanna Wallach · Suchi Saria · Rich Caruana · Thomas Rainforth -
2018 : Zachary Lipton »
Zachary Lipton -
2018 Poster: Does mitigating ML's impact disparity require treatment disparity? »
Zachary Lipton · Julian McAuley · Alexandra Chouldechova