Workshop
Beyond first order methods in machine learning systems
Anastasios Kyrillidis · Albert Berahas · Fred Roosta · Michael W Mahoney

Fri Dec 13 08:00 AM -- 06:00 PM (PST) @ West 211 - 214
Event URL: https://sites.google.com/site/optneurips19/

Optimization lies at the heart of many exciting developments in machine learning, statistics and signal processing. As models become more complex and datasets get larger, finding efficient, reliable and provable methods is one of the primary goals in these fields.

In the last few decades, much effort has been devoted to the development of first-order methods. These methods enjoy a low per-iteration cost and have optimal complexity, are easy to implement, and have proven to be effective for most machine learning applications. First-order methods, however, have significant limitations: (1) they require fine hyper-parameter tuning, (2) they do not incorporate curvature information, and thus are sensitive to ill-conditioning, and (3) they are often unable to fully exploit the power of distributed computing architectures.

Higher-order methods, such as Newton, quasi-Newton and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several nice features: they exploit local curvature information to mitigate the effects of ill-conditioning, they avoid or diminish the need for hyper-parameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higher-order methods that feature speed and scalability by incorporating curvature information in an economical and judicious manner. However, higher-order methods are often “undervalued.”

This workshop will attempt to shed light on this statement. Topics of interest include, but are not limited to, second-order methods, adaptive gradient descent methods, regularization techniques, as well as techniques based on higher-order derivatives.
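
The ill-conditioning point made in the overview can be illustrated with a small sketch. It is not taken from any workshop talk; the quadratic test function, the condition number `kappa`, and the function names are assumptions chosen only for illustration. Gradient descent with step 1/L crawls along the flat direction, while a single Newton step, which rescales by the curvature, lands on the minimizer.

```python
def gradient(x, kappa):
    """Gradient of f(x) = 0.5 * (x[0]**2 + kappa * x[1]**2)."""
    return [x[0], kappa * x[1]]

def gradient_descent(x, kappa, steps):
    """Fixed-step gradient descent with step 1/L, where L = kappa."""
    step = 1.0 / kappa
    for _ in range(steps):
        g = gradient(x, kappa)
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x

def newton_step(x, kappa):
    """One Newton step; the Hessian is diag(1, kappa), so it is inverted entrywise."""
    g = gradient(x, kappa)
    return [x[0] - g[0] / 1.0, x[1] - g[1] / kappa]

kappa = 1000.0                     # condition number of the Hessian
x0 = [1.0, 1.0]
x_gd = gradient_descent(x0, kappa, steps=100)
x_newton = newton_step(x0, kappa)
print("gradient descent after 100 steps:", x_gd)
print("Newton after 1 step:", x_newton)
```

After 100 gradient steps the first coordinate has shrunk only by a factor of (1 - 1/kappa)^100, whereas the Newton iterate is already at the origin. On real problems the Hessian is neither diagonal nor cheap to invert, which is the practical difficulty the workshop topics address.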

Fri 8:00 a.m. - 8:30 a.m.

Opening remarks for the workshop by the organizers

Tasos Kyrillidis, Albert Berahas, Fred Roosta, Michael W Mahoney
Fri 8:30 a.m. - 9:15 a.m.

Stochastic gradient descent (SGD) and variants such as Adagrad and Adam are extensively used today to train modern machine learning models. In this talk we will discuss ways to economically use second-order information to modify both the step size (learning rate) used in SGD and the direction taken by SGD. Our methods adaptively control the batch sizes used to compute gradient and Hessian approximations and ensure that the steps that are taken decrease the loss function with high probability, assuming that the latter is self-concordant, as is true for many problems in empirical risk minimization. For such cases we prove that our basic algorithm is globally linearly convergent. A slightly modified version of our method is presented for training deep learning models. Numerical results will be presented that show that it exhibits excellent performance without the need for learning rate tuning. If there is time, additional ways to efficiently make use of second-order information will be presented.

Donald Goldfarb
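
As a rough illustration of how curvature can replace a hand-tuned learning rate, the sketch below picks the SGD step size from the mini-batch curvature along the gradient direction. It is not the algorithm from the talk: there is no adaptive batch-size control and no self-concordance argument, and the least-squares model, the noiseless labels, the fixed batch size of 20, and all names are assumptions made to keep the example short.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_and_curvature(w, batch):
    """Mini-batch gradient g of 0.5 * mean((x.w - y)^2) and the curvature
    g^T H g, where H = mean(x x^T) is never formed explicitly."""
    m = len(batch)
    g = [0.0] * len(w)
    for x, y in batch:
        residual = dot(x, w) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / m
    curvature = sum(dot(x, g) ** 2 for x, _ in batch) / m
    return g, curvature

rng = random.Random(0)
w_true = [2.0, -1.0]
data = []
for _ in range(200):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 10.0)]  # badly scaled second feature
    data.append((x, dot(x, w_true)))                 # noiseless labels (assumption)

w = [0.0, 0.0]
for _ in range(300):
    batch = rng.sample(data, 20)
    g, curvature = grad_and_curvature(w, batch)
    if curvature > 0.0:
        step = dot(g, g) / curvature   # minimizes the batch loss along -g
        w = [wi - step * gi for wi, gi in zip(w, g)]

error = sum((a - b) ** 2 for a, b in zip(w, w_true)) ** 0.5
print(w, error)
```

No learning rate appears anywhere in the loop; the step is determined by one extra pass over the batch. With noisy labels this simple rule would need safeguards, which is where batch-size control of the kind described in the abstract comes in.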
Fri 9:00 a.m. - 9:45 a.m.

How does mini-batching affect Curvature information for second order deep learning optimization? Diego Granziol (Oxford); Stephen Roberts (Oxford); Xingchen Wan (Oxford University); Stefan Zohren (University of Oxford); Binxin Ru (University of Oxford); Michael A. Osborne (University of Oxford); Andrew Wilson (NYU); Sebastien Ehrhardt (Oxford); Dmitry P Vetrov (Higher School of Economics); Timur Garipov (Samsung AI Center in Moscow)

Acceleration through Spectral Modeling. Fabian Pedregosa (Google); Damien Scieur (Princeton University)

Using better models in stochastic optimization. Hilal Asi (Stanford University); John Duchi (Stanford University)

Diego Granziol, Fabian Pedregosa, Hilal Asi
Fri 9:45 a.m. - 10:30 a.m.

Poster Session

Eduard Gorbunov, Alexandre d'Aspremont, Lingxiao Wang, Liwei Wang, Boris Ginsburg, Alessio Quaglino, Camille Castera, Saurabh Adya, Diego Granziol, Rudrajit Das, Raghu Bollapragada, Fabian Pedregosa, Martin Takac, Majid Jahani, Sai Praneeth Karimireddy, Hilal Asi, Balint Daroczy, Leonard Adolphs, Aditya Rawal, Nicolas Brandt, Minhan Li, Giuseppe Ughi, Orlando Romero, Ivan Skorokhodov, Damien Scieur, Kiwook Bae, Konstantin Mishchenko, Rohan Anil, Vatsal Sharan, Aditya Balu, Chao Chen, Zhewei Yao, Tolga Ergen, Paul Grigas, Chris Junchi Li, Jimmy Ba, Stephen J Roberts, Sharan Vaswani, Armin Eftekhari, Chhavi Sharma
Fri 10:30 a.m. - 11:15 a.m.

Adaptive gradient methods have had a transformative impact in deep learning. We will describe recent theoretical and experimental advances in their understanding, including low-memory adaptive preconditioning, and insights into their generalization ability.
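
For readers unfamiliar with adaptive preconditioning, here is a minimal sketch of the diagonal AdaGrad update, the simplest low-memory member of this family (one accumulator per parameter). It is a textbook version, not the method presented in the talk; the test problem and the values of `lr`, `steps`, and `eps` are assumptions.

```python
import math

def adagrad(grad_fn, x0, lr=0.5, steps=50, eps=1e-8):
    """Diagonal AdaGrad: divide each coordinate's gradient by the root of
    its accumulated squared gradients."""
    x = list(x0)
    accum = [0.0] * len(x)          # one accumulator per parameter
    for _ in range(steps):
        g = grad_fn(x)
        for j, gj in enumerate(g):
            accum[j] += gj * gj
            x[j] -= lr * gj / (math.sqrt(accum[j]) + eps)
    return x

def grad(x):
    """Gradient of f(x) = 0.5 * (x[0]**2 + 1000 * x[1]**2)."""
    return [x[0], 1000.0 * x[1]]

x_final = adagrad(grad, [1.0, 1.0])
print(x_final)
```

Both coordinates follow essentially the same trajectory even though their curvatures differ by a factor of 1000, because the per-coordinate scaling cancels any diagonal rescaling of the gradient. Full-matrix variants also capture correlations between coordinates, at a much higher memory cost.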

Fri 11:15 a.m. - 12:00 p.m.

Symmetric Multisecant quasi-Newton methods. Damien Scieur (Samsung AI Research Montreal); Thomas Pumir (Princeton University); Nicolas Boumal (Princeton University)

Stochastic Newton Method and its Cubic Regularization via Majorization-Minimization. Konstantin Mishchenko (King Abdullah University of Science & Technology (KAUST)); Peter Richtarik (KAUST); Dmitry Kovalev (KAUST)

Full Matrix Preconditioning Made Practical. Rohan Anil (Google); Vineet Gupta (Google); Tomer Koren (Google); Kevin Regan (Google); Yoram Singer (Princeton)

Damien Scieur, Konstantin Mishchenko, Rohan Anil
Fri 12:00 p.m. - 2:00 p.m.
Lunch break
Fri 2:00 p.m. - 2:45 p.m.

Second order optimization methods have the potential to be much faster than first order methods in the deterministic case, or pre-asymptotically in the stochastic case. However, traditional second order methods have proven ineffective or impractical for neural network training, due in part to the extremely high dimension of the parameter space. Kronecker-factored Approximate Curvature (K-FAC) is a second-order optimization method based on a tractable approximation to the Gauss-Newton/Fisher matrix that exploits the special structure present in neural network training objectives. This approximation is neither low-rank nor diagonal, but instead involves Kronecker-products, which allows for efficient estimation, storage and inversion of the curvature matrix. In this talk I will introduce the basic K-FAC method for standard MLPs and then present some more recent work in this direction, including extensions to CNNs and RNNs, both of which require new approximations to the Fisher. For these I will provide mathematical intuitions and empirical results which speak to their efficacy in neural network optimization. Time permitting, I will also discuss some recent results on large-batch optimization with K-FAC, and the use of adaptive adjustment methods that can eliminate the need for costly hyperparameter tuning.

James Martens
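
The reason a Kronecker-product structure makes inversion cheap can be checked numerically. The sketch below is not K-FAC itself: it only verifies the linear-algebra identity (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) for symmetric A, with column-stacking vec. The small matrices `A`, `G`, and `V` are made-up stand-ins for a layer's input statistics, output-gradient statistics, and weight gradient.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting; fine for tiny matrices."""
    n = len(M)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[P[i][j] * Q[k][l] for j in range(len(P[0])) for l in range(len(Q[0]))]
            for i in range(len(P)) for k in range(len(Q))]

def vec(M):
    """Stack the columns of M into one long vector (column-major)."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]

A = [[2.0, 1.0], [1.0, 3.0]]                             # 2 x 2, symmetric
G = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # 3 x 3, symmetric
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                 # 3 x 2 "gradient"

# Cheap route: invert the two small factors separately.
update = matmul(matmul(inverse(G), V), inverse(A))

# Check against the full 6 x 6 system: (A kron G) vec(update) should equal vec(V).
big = kron(A, G)
u = vec(update)
recovered = [sum(row[k] * u[k] for k in range(len(u))) for row in big]
residual = max(abs(a - b) for a, b in zip(recovered, vec(V)))
print(residual)
```

For a layer with n inputs and m outputs, the cheap route inverts one n x n and one m x m matrix instead of a single nm x nm matrix, and the large matrix never has to be stored.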
Fri 2:45 p.m. - 3:30 p.m.

Hessian-Aware trace-Weighted Quantization. Zhen Dong (UC Berkeley); Zhewei Yao (University of California, Berkeley); Amir Gholami (UC Berkeley); Yaohui Cai (Peking University); Daiyaan Arfeen (UC Berkeley); Michael Mahoney (University of California, Berkeley); Kurt Keutzer (UC Berkeley)

New Methods for Regularization Path Optimization via Differential Equations. Paul Grigas (UC Berkeley); Heyuan Liu (University of California, Berkeley)

Ellipsoidal Trust Region Methods for Neural Nets. Leonard Adolphs (ETHZ); Jonas Kohler (ETHZ)

Sub-sampled Newton Methods Under Interpolation. Si Yi Meng (University of British Columbia); Sharan Vaswani (Mila, Université de Montréal); Issam Laradji (University of British Columbia); Mark Schmidt (University of British Columbia); Simon Lacoste-Julien (Mila, Université de Montréal)

Paul Grigas, Zhewei Yao, Aurelien Lucchi, Si Yi (Cathy) Meng
Fri 3:30 p.m. - 4:15 p.m.

An Accelerated Method for Derivative-Free Smooth Stochastic Convex Optimization. Eduard Gorbunov (Moscow Institute of Physics and Technology); Pavel Dvurechenskii (WIAS Germany); Alexander Gasnikov (Moscow Institute of Physics and Technology)

Fast Bregman Gradient Methods for Low-Rank Minimization Problems. Radu-Alexandru Dragomir (Université Toulouse 1); Jérôme Bolte (Université Toulouse 1); Alexandre d'Aspremont (Ecole Normale Superieure)

Gluster: Variance Reduced Mini-Batch SGD with Gradient Clustering. Fartash Faghri (University of Toronto); David Duvenaud (University of Toronto); David Fleet (University of Toronto); Jimmy Ba (University of Toronto)

Neural Policy Gradient Methods: Global Optimality and Rates of Convergence. Lingxiao Wang (Northwestern University); Qi Cai (Northwestern University); Zhuoran Yang (Princeton University); Zhaoran Wang (Northwestern University)

A Gram-Gauss-Newton Method Learning Overparameterized Deep Neural Networks for Regression Problems. Tianle Cai (Peking University); Ruiqi Gao (Peking University); Jikai Hou (Peking University); Siyu Chen (Peking University); Dong Wang (Peking University); Di He (Peking University); Zhihua Zhang (Peking University); Liwei Wang (Peking University)

Stochastic Gradient Methods with Layerwise Adaptive Moments for Training of Deep Networks. Boris Ginsburg (NVIDIA); Oleksii Hrinchuk (NVIDIA); Jason Li (NVIDIA); Vitaly Lavrukhin (NVIDIA); Ryan Leary (NVIDIA); Oleksii Kuchaiev (NVIDIA); Jonathan Cohen (NVIDIA); Huyen Nguyen (NVIDIA); Yang Zhang (NVIDIA)

Accelerating Neural ODEs with Spectral Elements. Alessio Quaglino (NNAISENSE SA); Marco Gallieri (NNAISENSE); Jonathan Masci (NNAISENSE); Jan Koutnik (NNAISENSE)

An Inertial Newton Algorithm for Deep Learning. Camille Castera (CNRS, IRIT); Jérôme Bolte (Université Toulouse 1); Cédric Févotte (CNRS, IRIT); Edouard Pauwels (Toulouse 3 University)

Nonlinear Conjugate Gradients for Scaling Synchronous Distributed DNN Training. Saurabh Adya (Apple); Vinay Palakkode (Apple Inc.); Oncel Tuzel (Apple Inc.)

  • How does mini-batching affect Curvature information for second order deep learning optimization? Diego Granziol (Oxford); Stephen Roberts (Oxford); Xingchen Wan (Oxford University); Stefan Zohren (University of Oxford); Binxin Ru (University of Oxford); Michael A. Osborne (University of Oxford); Andrew Wilson (NYU); Sebastien Ehrhardt (Oxford); Dmitry P Vetrov (Higher School of Economics); Timur Garipov (Samsung AI Center in Moscow)

On the Convergence of a Biased Version of Stochastic Gradient Descent. Rudrajit Das (University of Texas at Austin); Jiong Zhang (UT-Austin); Inderjit S. Dhillon (UT Austin & Amazon)

Adaptive Sampling Quasi-Newton Methods for Derivative-Free Stochastic Optimization. Raghu Bollapragada (Argonne National Laboratory); Stefan Wild (Argonne National Laboratory)

  • Acceleration through Spectral Modeling. Fabian Pedregosa (Google); Damien Scieur (Princeton University)

Accelerating Distributed Stochastic L-BFGS by sampled 2nd-Order Information. Jie Liu (Lehigh University); Yu Rong (Tencent AI Lab); Martin Takac (Lehigh University); Junzhou Huang (Tencent AI Lab)

Grow Your Samples and Optimize Better via Distributed Newton CG and Accumulating Strategy. Majid Jahani (Lehigh University); Xi He (Lehigh University); Chenxin Ma (Lehigh University); Aryan Mokhtari (UT Austin); Dheevatsa Mudigere (Intel Labs); Alejandro Ribeiro (University of Pennsylvania); Martin Takac (Lehigh University)

Global linear convergence of trust-region Newton's method without strong-convexity or smoothness. Sai Praneeth Karimireddy (EPFL); Sebastian Stich (EPFL); Martin Jaggi (EPFL)

FD-Net with Auxiliary Time Steps: Fast Prediction of PDEs using Hessian-Free Trust-Region Methods. Nur Sila Gulgec (Lehigh University); Zheng Shi (Lehigh University); Neil Deshmukh (MIT BeaverWorks - Medlytics); Shamim Pakzad (Lehigh University); Martin Takac (Lehigh University)

  • Using better models in stochastic optimization. Hilal Asi (Stanford University); John Duchi (Stanford University)

Tangent space separability in feedforward neural networks. Bálint Daróczy (Institute for Computer Science and Control, Hungarian Academy of Sciences); Rita Aleksziev (Institute for Computer Science and Control, Hungarian Academy of Sciences); Andras Benczur (Hungarian Academy of Sciences)

  • Ellipsoidal Trust Region Methods for Neural Nets. Leonard Adolphs (ETHZ); Jonas Kohler (ETHZ)

Closing the K-FAC Generalisation Gap Using Stochastic Weight Averaging. Xingchen Wan (University of Oxford); Diego Granziol (Oxford); Stefan Zohren (University of Oxford); Stephen Roberts (Oxford)

  • Sub-sampled Newton Methods Under Interpolation. Si Yi Meng (University of British Columbia); Sharan Vaswani (Mila, Université de Montréal); Issam Laradji (University of British Columbia); Mark Schmidt (University of British Columbia); Simon Lacoste-Julien (Mila, Université de Montréal)

Learned First-Order Preconditioning. Aditya Rawal (Uber AI Labs); Rui Wang (Uber AI); Theodore Moskovitz (Gatsby Computational Neuroscience Unit); Sanyam Kapoor (Uber); Janice Lan (Uber AI); Jason Yosinski (Uber AI Labs); Thomas Miconi (Uber AI Labs)

Iterative Hessian Sketch in Input Sparsity Time. Charlie Dickens (University of Warwick); Graham Cormode (University of Warwick)

Nonlinear matrix recovery. Florentin Goyens (University of Oxford); Coralia Cartis (Oxford University); Armin Eftekhari (EPFL)

Making Variance Reduction more Effective for Deep Networks. Nicolas Brandt (EPFL); Farnood Salehi (EPFL); Patrick Thiran (EPFL)

Novel and Efficient Approximations for Zero-One Loss of Linear Classifiers. Hiva Ghanbari (Lehigh University); Minhan Li (Lehigh University); Katya Scheinberg (Lehigh)

A Model-Based Derivative-Free Approach to Black-Box Adversarial Examples: BOBYQA. Giuseppe Ughi (University of Oxford)

Distributed Accelerated Inexact Proximal Gradient Method via System of Coupled Ordinary Differential Equations. Chhavi Sharma (IIT Bombay); Vishnu Narayanan (IIT Bombay); Balamurugan Palaniappan (IIT Bombay)

Finite-Time Convergence of Continuous-Time Optimization Algorithms via Differential Inclusions. Orlando Romero (Rensselaer Polytechnic Institute); Mouhacine Benosman (MERL)

Loss Landscape Sightseeing by Multi-Point Optimization. Ivan Skorokhodov (MIPT); Mikhail Burtsev (NI)

  • Symmetric Multisecant quasi-Newton methods. Damien Scieur (Samsung AI Research Montreal); Thomas Pumir (Princeton University); Nicolas Boumal (Princeton University)

Does Adam optimizer keep close to the optimal point? Kiwook Bae (KAIST)*; Heechang Ryu (KAIST); Hayong Shin (KAIST)

  • Stochastic Newton Method and its Cubic Regularization via Majorization-Minimization. Konstantin Mishchenko (King Abdullah University of Science & Technology (KAUST)); Peter Richtarik (KAUST); Dmitry Kovalev (KAUST)

  • Full Matrix Preconditioning Made Practical. Rohan Anil (Google); Vineet Gupta (Google); Tomer Koren (Google); Kevin Regan (Google); Yoram Singer (Princeton)

Memory-Sample Tradeoffs for Linear Regression with Small Error. Vatsal Sharan (Stanford University); Aaron Sidford (Stanford); Gregory Valiant (Stanford University)

On the Higher-order Moments in Adam. Zhanhong Jiang (Johnson Controls International); Aditya Balu (Iowa State University); Sin Yong Tan (Iowa State University); Young M Lee (Johnson Controls International); Chinmay Hegde (Iowa State University); Soumik Sarkar (Iowa State University)

h-matrix approximation for Gauss-Newton Hessian. Chao Chen (UT Austin)

  • Hessian-Aware trace-Weighted Quantization. Zhen Dong (UC Berkeley); Zhewei Yao (University of California, Berkeley); Amir Gholami (UC Berkeley); Yaohui Cai (Peking University); Daiyaan Arfeen (UC Berkeley); Michael Mahoney (University of California, Berkeley); Kurt Keutzer (UC Berkeley)

Random Projections for Learning Non-convex Models. Tolga Ergen (Stanford University); Emmanuel Candes (Stanford University); Mert Pilanci (Stanford)

  • New Methods for Regularization Path Optimization via Differential Equations. Paul Grigas (UC Berkeley); Heyuan Liu (University of California, Berkeley)

Hessian-Aware Zeroth-Order Optimization. Haishan Ye (HKUST); Zhichao Huang (HKUST); Cong Fang (Peking University); Chris Junchi Li (Tencent); Tong Zhang (HKUST)

Higher-Order Accelerated Methods for Faster Non-Smooth Optimization. Brian Bullins (TTIC)

Fri 4:15 p.m. - 5:00 p.m.

We develop convergence analysis of a modified line search method for objective functions whose value is computed with noise and whose gradient estimates are not directly available. The noise is assumed to be bounded in absolute value without any additional assumptions. In this case, gradient approximations can be constructed via interpolation or sample average approximation of smoothing gradients, and thus they are always inexact and possibly random. We extend the framework based on stochastic methods, which was developed to provide analysis of a standard line-search method with exact function values and random gradients, to the case of noisy functions. We introduce a condition on the gradient which, when satisfied with some sufficiently large probability at each iteration, guarantees convergence properties of the line search method. We derive expected complexity bounds for convex, strongly convex and nonconvex functions. We motivate these results with several recent papers related to policy optimization.

Katya Scheinberg
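
The setting in this abstract can be sketched as follows, under stated assumptions. The gradient is estimated by forward differences of noisy function values, and the sufficient-decrease test is relaxed by twice the noise bound so that noise alone cannot reject every step. This is one plausible form of a noise-tolerant line search, not necessarily the exact method analyzed in the talk; the constants `h`, `c`, the doubling and halving of `alpha`, and the test function are all assumptions.

```python
import random

def noisy(f, eps_f, rng):
    """Wrap f so each evaluation carries noise bounded by eps_f in absolute value."""
    return lambda x: f(x) + rng.uniform(-eps_f, eps_f)

def fd_gradient(f_noisy, x, h):
    """Forward-difference gradient estimate built only from noisy function values."""
    fx = f_noisy(x)
    grad = []
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += h
        grad.append((f_noisy(shifted) - fx) / h)
    return grad

def noisy_line_search(f_noisy, x, eps_f, h=1e-2, c=0.1, iters=100):
    """Line search whose decrease test is relaxed by 2 * eps_f."""
    alpha = 1.0
    for _ in range(iters):
        g = fd_gradient(f_noisy, x, h)
        trial = [xi - alpha * gi for xi, gi in zip(x, g)]
        decrease = c * alpha * sum(gi * gi for gi in g)
        if f_noisy(trial) <= f_noisy(x) - decrease + 2.0 * eps_f:
            x, alpha = trial, min(2.0 * alpha, 1.0)   # accept, try a longer step
        else:
            alpha *= 0.5                              # reject, shrink the step
    return x

def f(x):
    return sum((xi - 1.0) ** 2 for xi in x)

eps_f = 1e-4
rng = random.Random(0)
x_final = noisy_line_search(noisy(f, eps_f, rng), [5.0, -3.0], eps_f)
print(x_final, f(x_final))
```

The iterates settle in a neighborhood of the minimizer whose size depends on `eps_f` and `h`; they cannot be expected to converge exactly, since function values below the noise level are indistinguishable.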
Fri 5:00 p.m. - 5:45 p.m.

We consider problems of smooth nonconvex optimization: unconstrained, bound-constrained, and with general equality constraints. We show that algorithms for these problems that are widely used in practice can be modified slightly in ways that guarantee convergence to approximate first- and second-order optimal points with complexity guarantees that depend on the desired accuracy. The methods we discuss are constructed from Newton's method, the conjugate gradient method, log-barrier method, and augmented Lagrangians. (In some cases, special structure of the objective function makes for only a weak dependence on the accuracy parameter.) Our methods require Hessian information only in the form of Hessian-vector products, so they do not require the Hessian to be evaluated and stored explicitly. This talk describes joint work with Clement Royer, Yue Xie, and Michael O'Neill.

Stephen Wright
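
To show what "Hessian information only in the form of Hessian-vector products" means in practice, here is a minimal Newton-CG sketch. It approximates each product by a finite difference of gradients and solves the Newton system with plain conjugate gradients. It is not the algorithm from the talk: it has no complexity safeguards and simply stops if it meets non-positive curvature. The difference parameter `r`, the tolerance, and the quadratic test problem are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hessian_vector_product(grad_fn, x, v, r=1e-6):
    """Approximate H(x) v by differencing gradients; H is never formed."""
    g0 = grad_fn(x)
    g1 = grad_fn([xi + r * vi for xi, vi in zip(x, v)])
    return [(a - b) / r for a, b in zip(g1, g0)]

def newton_cg_step(grad_fn, x, iters=50, tol=1e-10):
    """Solve H p = -g by conjugate gradients using only Hessian-vector products."""
    g = grad_fn(x)
    p = [0.0] * len(x)
    res = [-gi for gi in g]          # residual of H p = -g at p = 0
    d = list(res)
    rs = dot(res, res)
    for _ in range(iters):
        if rs <= tol:
            break
        Hd = hessian_vector_product(grad_fn, x, d)
        curvature = dot(d, Hd)
        if curvature <= 0.0:         # non-positive curvature: stop (no safeguard here)
            break
        a = rs / curvature
        p = [pi + a * di for pi, di in zip(p, d)]
        res = [ri - a * hi for ri, hi in zip(res, Hd)]
        rs_new = dot(res, res)
        d = [ri + (rs_new / rs) * di for ri, di in zip(res, d)]
        rs = rs_new
    return p

def grad(x):
    """Gradient of 0.5 * x^T A x - b^T x with A = [[4, 1], [1, 3]], b = [1, 2]."""
    return [4.0 * x[0] + x[1] - 1.0, x[0] + 3.0 * x[1] - 2.0]

x0 = [3.0, -2.0]
step = newton_cg_step(grad, x0)
x1 = [xi + si for xi, si in zip(x0, step)]
print(x1)   # close to the minimizer (1/11, 7/11)
```

Each conjugate gradient iteration costs one extra gradient evaluation, so memory stays linear in the number of variables. With automatic differentiation the finite difference could be replaced by an exact Hessian-vector product.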
Fri 5:45 p.m. - 6:00 p.m.

Final remarks for the workshop

Tasos Kyrillidis, Albert Berahas, Fred Roosta, Michael W Mahoney

Author Information

Tasos Kyrillidis (Rice University)
Albert Berahas (Lehigh University)
Fred Roosta (University of Queensland)
Michael W Mahoney (UC Berkeley)
