Adam, Gauss–Newton, and What’s Left on the Table for Second-Order Preconditioning, Sham Kakade
Abstract
Large-scale deep learning is dominated by first-order or diagonal methods such as Adam, SOAP, and Muon, even though training objectives exhibit rich second-order structure. This talk asks: how much performance is being left on the table by these approximations, and which properties of the preconditioner actually matter? The first part of the talk uses full Gauss–Newton (GN) preconditioning as an “oracle” to study LLM pretraining, showing that full GN (and even precise layerwise GN) can reduce the number of training iterations by more than a factor of five compared to strong diagonal baselines, highlighting a substantial gap between current practice and an idealized second-order method. The second part compares Adam-style and GN-style diagonal preconditioners through the lenses of basis alignment and SGD noise, identifying regimes where each is provably and empirically superior. The talk concludes with design principles and open questions for scalable second-order–inspired optimizers.
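For context, the update rules being contrasted can be summarized in a minimal LaTeX sketch (not taken from the talk): the damping term, step size, and the simplified Adam-style form shown below are illustrative assumptions, not quantities specified in the abstract.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}

% Full Gauss--Newton (GN) preconditioned step (illustrative sketch).
% J is the Jacobian of the model outputs w.r.t. the parameters, H_\ell the
% Hessian of the loss w.r.t. the outputs, g_t the current gradient, and
% \lambda a damping term (an assumption added for well-posedness).
\[
  G = J^\top H_\ell J, \qquad
  \theta_{t+1} = \theta_t - \eta \left( G + \lambda I \right)^{-1} g_t .
\]

% Schematic contrast between a GN-style diagonal preconditioner and an
% Adam-style one (second-moment normalization); both forms are simplified.
\[
  \text{GN-style diagonal:}\quad
  \theta_{t+1} = \theta_t - \eta \,\operatorname{diag}(G)^{-1} g_t ,
  \qquad
  \text{Adam-style:}\quad
  \theta_{t+1} = \theta_t - \eta \,\frac{g_t}{\sqrt{v_t} + \epsilon},
  \quad v_t \approx \mathbb{E}\!\left[ g_t^{\odot 2} \right].
\]

\end{document}
```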
Work done with: Natalie Abreu, Rachit Bansal, Bingbin Liu, David Alvarez-Melis, Depen Morwani & Nikhil Vyas.