We propose a framework for online meta-optimization of parameters that govern optimization, called Amortized Proximal Optimization (APO). We first interpret various existing neural network optimizers as approximate stochastic proximal point methods which trade off the current-batch loss with proximity terms in both function space and weight space. The idea behind APO is to amortize the minimization of the proximal point objective by meta-learning the parameters of an update rule. We show how APO can be used to adapt a learning rate or a structured preconditioning matrix. Under appropriate assumptions, APO can recover existing optimizers such as natural gradient descent and KFAC. It enjoys low computational overhead and avoids expensive and numerically sensitive operations required by some second-order optimizers, such as matrix inverses. We empirically evaluate APO for online adaptation of learning rates and structured preconditioning matrices on regression, image reconstruction, image classification, and natural language translation tasks. The learning rate schedules found by APO generally outperform optimal fixed learning rates and are competitive with manually tuned decay schedules. Using APO to adapt a structured preconditioning matrix generally yields optimization performance competitive with second-order methods. Moreover, the absence of matrix inversion provides numerical stability, making APO effective for low-precision training.
Author Information
Juhan Bae (University of Toronto, Vector Institute)
Paul Vicol (University of Toronto)
Jeff Z. HaoChen (Stanford University)
Roger Grosse (University of Toronto)
More from the Same Authors
- 2021: Self-supervised Learning is More Robust to Dataset Imbalance
  Hong Liu · Jeff Z. HaoChen · Adrien Gaidon · Tengyu Ma
- 2022: DrML: Diagnosing and Rectifying Vision Models using Language
  Yuhui Zhang · Jeff Z. HaoChen · Shih-Cheng Huang · Kuan-Chieh Wang · James Zou · Serena Yeung
- 2022 Poster: Proximal Learning With Opponent-Learning Awareness
  Stephen Zhao · Chris Lu · Roger Grosse · Jakob Foerster
- 2022 Poster: If Influence Functions are the Answer, Then What is the Question?
  Juhan Bae · Nathan Ng · Alston Lo · Marzyeh Ghassemi · Roger Grosse
- 2022 Poster: Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations
  Jeff Z. HaoChen · Colin Wei · Ananya Kumar · Tengyu Ma
- 2022 Poster: Path Independent Equilibrium Models Can Better Exploit Test-Time Computation
  Cem Anil · Ashwini Pokle · Kaiqu Liang · Johannes Treutlein · Yuhuai Wu · Shaojie Bai · J. Zico Kolter · Roger Grosse
- 2021 Oral: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss
  Jeff Z. HaoChen · Colin Wei · Adrien Gaidon · Tengyu Ma
- 2021 Poster: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss
  Jeff Z. HaoChen · Colin Wei · Adrien Gaidon · Tengyu Ma
- 2021 Poster: Differentiable Annealed Importance Sampling and the Perils of Gradient Noise
  Guodong Zhang · Kyle Hsu · Jianing Li · Chelsea Finn · Roger Grosse
- 2020 Invited Talk: Roger Grosse - Why Isn’t Everyone Using Second-Order Optimization?
  Roger Grosse
- 2020 Poster: Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians
  Juhan Bae · Roger Grosse
- 2020 Poster: Regularized linear autoencoders recover the principal components, eventually
  Xuchan Bao · James Lucas · Sushant Sachdeva · Roger Grosse
- 2019 Poster: Fast Convergence of Natural Gradient Descent for Over-Parameterized Neural Networks
  Guodong Zhang · James Martens · Roger Grosse
- 2019 Poster: Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
  Guodong Zhang · Lala Li · Zachary Nado · James Martens · Sushant Sachdeva · George Dahl · Chris Shallue · Roger Grosse
- 2019 Poster: Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks
  Qiyang Li · Saminul Haque · Cem Anil · James Lucas · Roger Grosse · Joern-Henrik Jacobsen
- 2019 Poster: Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse
  James Lucas · George Tucker · Roger Grosse · Mohammad Norouzi
- 2018 Poster: Isolating Sources of Disentanglement in Variational Autoencoders
  Tian Qi Chen · Xuechen (Chen) Li · Roger Grosse · David Duvenaud
- 2018 Oral: Isolating Sources of Disentanglement in Variational Autoencoders
  Tian Qi Chen · Xuechen (Chen) Li · Roger Grosse · David Duvenaud
- 2018 Poster: Reversible Recurrent Neural Networks
  Matthew MacKay · Paul Vicol · Jimmy Ba · Roger Grosse
- 2017 Poster: Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
  Yuhuai Wu · Elman Mansimov · Roger Grosse · Shun Liao · Jimmy Ba
- 2017 Spotlight: Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
  Yuhuai Wu · Elman Mansimov · Roger Grosse · Shun Liao · Jimmy Ba
- 2017 Poster: The Reversible Residual Network: Backpropagation Without Storing Activations
  Aidan Gomez · Mengye Ren · Raquel Urtasun · Roger Grosse
- 2016 Symposium: Deep Learning Symposium
  Yoshua Bengio · Yann LeCun · Navdeep Jaitly · Roger Grosse
- 2016 Poster: Measuring the reliability of MCMC inference with bidirectional Monte Carlo
  Roger Grosse · Siddharth Ancha · Daniel Roy
- 2015 Poster: Learning Wake-Sleep Recurrent Attention Models
  Jimmy Ba · Russ Salakhutdinov · Roger Grosse · Brendan J Frey
- 2015 Spotlight: Learning Wake-Sleep Recurrent Attention Models
  Jimmy Ba · Russ Salakhutdinov · Roger Grosse · Brendan J Frey
- 2013 Poster: Annealing between distributions by averaging moments
  Roger Grosse · Chris Maddison · Russ Salakhutdinov
- 2013 Oral: Annealing between distributions by averaging moments
  Roger Grosse · Chris Maddison · Russ Salakhutdinov