Workshop
OPT 2021: Optimization for Machine Learning
Courtney Paquette · Quanquan Gu · Oliver Hinder · Katya Scheinberg · Sebastian Stich · Martin Takac

Mon Dec 13 03:15 AM -- 02:35 PM (PST)
Event URL: https://opt-ml.org/

OPT 2021 will bring experts in optimization to share their perspectives while leveraging crossover experts in ML to share their views and recent advances. OPT 2021 honors this tradition of bringing together people from optimization and from ML in order to promote and generate new interactions between the two communities.

To foster the spirit of innovation and collaboration, a goal of this workshop, OPT 2021 will focus the contributed talks on research in “Beyond Worst-case Complexity”. Classical optimization analyses measure the performance of algorithms based on (1) the computational cost and (2) convergence for any input to the algorithm. Yet algorithms with worse traditional complexity (e.g., SGD and its variants, ADAM, etc.) are increasingly popular in practice for training deep neural networks and other ML tasks. This raises questions such as: what are good modeling assumptions for ML problems under which to measure an optimization algorithm’s success, and how can we leverage these to better understand the performance of known (and new) algorithms? For instance, typical optimization problems in ML may be better conditioned than their worst-case counterparts, in part because the problems are highly structured and/or high-dimensional (large number of features/samples). One could leverage this observation to design algorithms with better “average-case” complexity. Moreover, a growing body of research indicates an intimate connection between the optimization algorithm and how well it performs on the test data (generalization). This new area of research in ML and its deep ties to optimization warrant a discussion between the two communities. Specifically, we aim to continue the discussion on the precise meaning of generalization and average-case complexity and to formalize what this means for optimization algorithms. By bringing together experts in both fields, OPT 2021 will foster insightful discussions around these topics and more.

Mon 3:15 a.m. - 3:50 a.m.
(Social event/Break)

Please join us in gather.town (see link below).

Mon 3:58 a.m. - 4:00 a.m.
Opening Remarks to Session 1 (Organizer intro)
Sebastian Stich
Mon 4:00 a.m. - 4:25 a.m.
(Plenary Speaker)   

Abstract: Deep learning revolutionized computer vision, speech recognition, natural language understanding, and more. However, from the theoretical perspective we know that training neural networks is computationally hard and one can construct distributions on which deep learning fails. The talk will propose several parameters of distributions that can move them from being easy-to-train to being hard-to-train.

Shai Shalev-Shwartz
Mon 4:25 a.m. - 4:30 a.m.
Q&A with Shai Shalev-Shwartz (Q&A)
Shai Shalev-Shwartz
Mon 4:30 a.m. - 4:55 a.m.
(Plenary Speaker)   

Abstract: Gradient methods form the foundation of current machine learning. A vast literature covers the use of stochastic gradients as simple unbiased estimators of the full gradient of our objective. In this talk, we discuss four applications motivated by practical machine learning where this key assumption is violated, and show new ways to cope with gradients that are only loosely related to the original objective. We demonstrate that algorithms with rigorous convergence guarantees can still be obtained in such settings, for:

  1. federated learning on heterogeneous data,

  2. personalized collaborative learning,

  3. masked training of neural networks with partial gradients,

  4. learning with malicious participants, in the sense of Byzantine robust training.

Martin Jaggi
Mon 4:55 a.m. - 5:00 a.m.
Q&A with Martin Jaggi (Q&A)
Martin Jaggi
Mon 5:00 a.m. - 5:30 a.m.
(Orals and spotlights)

Oral (10 min)

  • Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation, Futong Liu

Spotlights (5 min)

  • Decentralized Personalized Federated Learning: Lower Bounds and Optimal Algorithm for All Personalization Modes, Abdurakhmon Sadiev

  • Fast, Exact Subsampled Natural Gradients and First-Order KFAC, Frederik Benzing

  • Spherical Perspective on Learning with Normalization Layers, Simon Roburin

There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule.

Sebastian Stich · Futong Liu · Abdurakhmon Sadiev · Frederik Benzing · Simon Roburin
Mon 5:30 a.m. - 6:30 a.m.
(Poster session)

Please join us in gather.town (see link above). To see the abstracts of the posters presented in this session, please see below the schedule.

Authors/papers presenting posters in gather.town for this session:

  • Gaussian Graphical Models as an Ensemble Method for Distributed Gaussian Processes, Hamed Jalali

  • DAdaQuant: Doubly-adaptive quantization for communication-efficient Federated Learning, Robert Hönig

  • Using a one dimensional parabolic model of the full-batch loss to estimate learning rates during training, Maximus Mutschler

  • COCO Denoiser: Using Co-Coercivity for Variance Reduction in Stochastic Convex Optimization, Manuel Madeira

  • Random-reshuffled SARAH does not need a full gradient computations, Aleksandr Beznosikov

  • Decentralized Personalized Federated Learning: Lower Bounds and Optimal Algorithm for All Personalization Modes, Abdurakhmon Sadiev

  • Shifted Compression Framework: Generalizations and Improvements, Egor Shulgin

  • Faking Interpolation Until You Make It, Alasdair J Paren

  • Towards Modeling and Resolving Singular Parameter Spaces using Stratifolds, Pascal M Esser

  • Spherical Perspective on Learning with Normalization Layers, Simon W Roburin

  • Adaptive Optimization with Examplewise Gradients, Julius Kunze

  • On the Relation between Distributionally Robust Optimization and Data Curation, Agnieszka Słowik

  • Fast, Exact Subsampled Natural Gradients and First-Order KFAC, Frederik Benzing

  • Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation, Futong Liu

  • Community-based Layerwise Distributed Training of Graph Convolutional Networks, Hongyi Li

  • A New Scheme for Boosting with an Average Margin Distribution Oracle, Ryotaro Mitsuboshi

  • Better Linear Rates for SGD with Data Shuffling, Grigory Malinovsky

  • Structured Low-Rank Tensor Learning, Jayadev Naram

  • ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method, Zhize Li

  • EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback, Igor Sokolov

  • On Server-Side Stepsizes in Federated Optimization: Theory Explaining the Heuristics, Grigory Malinovsky

Hamed Jalali · Robert Hönig · Maximus Mutschler · Manuel Madeira · Aleksandr Beznosikov · Abdurakhmon Sadiev · Egor Shulgin · Alasdair Paren · Pascal Esser · Simon Roburin · Julius Kunze · Agnieszka Słowik · Frederik Benzing · Futong Liu · Hongyi Li · Ryotaro Mitsuboshi · Grigory Malinovsky · Jayadev Naram · Zhize Li · Igor Sokolov
Mon 6:30 a.m. - 6:58 a.m.
(Break)

Please join us in gather.town (see link below).

Mon 6:58 a.m. - 7:00 a.m.
Opening Remarks to Session 2 (Organizer intro)
Courtney Paquette
Mon 7:00 a.m. - 7:25 a.m.
(Plenary Speaker)   

Authors: Coralia Cartis, Estelle Massart and Adilet Otemissov (Mathematical Institute, University of Oxford and Alan Turing Institute, London)

Abstract: We investigate the unconstrained and constrained global optimization of functions with low effective dimension, which are constant along an (unknown) linear subspace and only vary over the effective (complement) subspace. Known also as multi-ridge or active subspace objectives, these functions appear frequently in applications, such as in those involving some complex engineering simulations, overly parametrised models, and more. We aim to implicitly explore the intrinsic low dimensionality of the constrained/unconstrained landscape using (feasible) random embeddings, in order to understand and improve the scalability of algorithms for the global optimization of these special-structure problems. We show that for these functions, the convergence and complexity of a random embedding algorithmic framework can potentially be exponentially better than if no special structure was present, and furthermore, under certain assumptions/problems, they are independent of the ambient dimension. Our numerical experiments on functions with low effective dimension illustrate the improved scalability of the framework when coupled with state-of-the-art global — and even local — optimization solvers for the subproblems. Our analysis uses tools from random matrix theory, stochastic optimization and conic integral geometry.

Papers:

  • https://arxiv.org/pdf/2003.09673.pdf

  • https://arxiv.org/pdf/2009.10446.pdf

  • https://arxiv.org/pdf/2107.12102.pdf
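The basic random-embedding trick behind this line of work can be sketched in a few lines: draw a random Gaussian matrix A and minimize the reduced objective g(y) = f(Ay) over far fewer variables. The toy function, dimensions, and plain gradient descent below are illustrative choices, not the algorithmic framework from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 100, 5  # ambient and embedding dimensions (illustrative)

# f lives in R^100 but has effective dimension 2: it only varies
# along its first two coordinates.
def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

A = rng.standard_normal((D, d))  # random Gaussian embedding
g = lambda y: f(A @ y)           # reduced d-dimensional objective

def num_grad(h, y, eps=1e-6):
    """Central finite-difference gradient, to keep the sketch self-contained."""
    grad = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = eps
        grad[i] = (h(y + e) - h(y - e)) / (2 * eps)
    return grad

# Plain gradient descent on the 5-dimensional reduced problem.
y = np.zeros(d)
for _ in range(500):
    y -= 0.05 * num_grad(g, y)

val = g(y)
print(val)  # near the global minimum value 0
```

Generically, a Gaussian embedding with d at least the effective dimension captures the subspace along which f varies, so the reduced problem retains the global minimum value.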

Coralia Cartis
Mon 7:25 a.m. - 7:30 a.m.
Q&A with Coralia Cartis (Q&A)
Coralia Cartis
Mon 7:30 a.m. - 7:55 a.m.
(Plenary Speaker)   
Abstract: Ensuring privacy of users' data in machine learning models has become a crucial requirement in multiple domains. In this respect, differential privacy (DP) is the gold standard, due to its general and rigorous privacy guarantees, as well as its high composability. For the particular case of stochastic convex optimization (SCO), recent efforts have established optimal rates for the excess risk under differential privacy in Euclidean setups. These bounds suffer a polynomial degradation of accuracy with respect to the dimension, which limits their applicability in high-dimensional settings. In this talk, I will present nearly-dimension independent rates on the excess risk for DP-SCO in the $\ell_1$ setup, as well as the investigation of more general $\ell_p$ setups, where $1\leq p\leq \infty$. Based on joint work with Raef Bassily and Anupama Nandi.
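The gradient-perturbation primitive underlying DP optimization of this kind can be sketched as follows; the clipping threshold, noise multiplier, and toy objective are illustrative stand-ins, not the talk's algorithms or their calibration to an (epsilon, delta) budget.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_sgd_step(w, grad, lr=0.05, clip=1.0, sigma=0.5):
    """One gradient step with norm clipping and Gaussian noise.

    Clipping bounds the gradient's sensitivity; sigma scales the noise.
    A real DP method calibrates sigma to a target (epsilon, delta) budget."""
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)  # bound the sensitivity
    noise = sigma * clip * rng.standard_normal(grad.shape)
    return w - lr * (grad + noise)

# Toy strongly convex objective f(w) = 0.5 * ||w - w_star||^2.
w_star = np.array([1.0, -1.0])
w = np.zeros(2)
for _ in range(2000):
    w = dp_sgd_step(w, w - w_star)

dist = np.linalg.norm(w - w_star)
print(dist)  # small: w hovers near w_star, up to the injected noise
```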
Cristóbal Guzmán
Mon 7:55 a.m. - 8:00 a.m.
Q&A with Cristóbal Guzmán (Q&A)
Cristóbal Guzmán
Mon 8:00 a.m. - 8:30 a.m.
(Orals and spotlights)

Oral (10 min)

  • On the convergence of stochastic extragradient for bilinear games using restarted iteration averaging, Junchi Li

Spotlights (5 min)

  • Farkas' Theorem of the Alternative for Prior Knowledge in Deep Networks, Jeffery Kline
  • Acceleration and Stability of the Stochastic Proximal Point Algorithm, Lyle Kim
  • Towards Modeling and Resolving Singular Parameter Spaces using Stratifolds, Pascal Esser

There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule.

Courtney Paquette · Chris Junchi Li · Jeff Kline · J. Lyle Kim · Pascal Esser
Mon 8:30 a.m. - 9:58 a.m.
Break
Mon 9:58 a.m. - 10:00 a.m.
Opening Remarks to Session 3 (Organizer intro)
Oliver Hinder
Mon 10:00 a.m. - 10:25 a.m.
(Plenary Speaker)   

Abstract: We introduce a geometrically transparent strict saddle property for nonsmooth functions. When present, this property guarantees that simple randomly initialized proximal algorithms on weakly convex problems converge only to local minimizers. We argue that the strict saddle property may be a realistic assumption in applications since it provably holds for generic semi-algebraic optimization problems. Finally, we close the talk with an extension of the result to "perturbed" subgradient methods.

Damek Davis
Mon 10:25 a.m. - 10:30 a.m.
Q&A with Damek Davis (Q&A)
Damek Davis
Mon 10:30 a.m. - 10:55 a.m.
(Plenary Speaker)   

Abstract: Empirical Risk Minimization (ERM) problems are central to machine learning, and their efficient optimization has been studied from different perspectives, often taking advantage of the finite sum structure present in typical problem formulations. In particular, tight oracle complexity bounds have been established under fairly general assumptions about the loss functions. In this talk, I will present a rather surprising and general result that takes advantage of the separability of nonsmooth convex loss functions with efficiently computable proximal operators -- such as, e.g., the hinge loss and the sum of absolute errors -- to obtain an algorithm that exhibits significantly lower complexity than what is predicted by the lower bounds for general nonsmooth convex losses. I will then talk about how this result can be further improved for problems that can be stated in a form close to a linear program and discuss how these results relate to robust empirical risk minimization. The talk is based on joint results with Chaobing Song, Eric Lin, and Stephen Wright.
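For concreteness, the "efficiently computable proximal operator" of one of the losses mentioned, the absolute error, is the classical soft-thresholding map; this closed form is standard and not specific to the talk.

```python
import numpy as np

def prox_abs(v, lam):
    """Proximal operator of x -> lam * |x|, applied elementwise:
    argmin_x  lam * |x| + 0.5 * (x - v)^2,
    whose closed form is soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

out = prox_abs(np.array([3.0, -0.2, 0.5]), 0.5)
print(out)  # shrinks each entry toward zero by lam; entries with |v| <= lam vanish
```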

Jelena Diakonikolas
Mon 10:55 a.m. - 11:00 a.m.
Q&A with Jelena Diakonikolas (Q&A)
Jelena Diakonikolas
Mon 11:00 a.m. - 11:30 a.m.
(Orals and spotlights)

Oral (10 min)

  • Policy Mirror Descent for Regularized RL: A Generalized Framework with Linear Convergence, Wenhao Zhan

Spotlights (5 min)

  • Integer Programming Approaches To Subspace Clustering With Missing Data, Akhilesh Soni
  • DESTRESS: Computation-Optimal and Communication-Efficient Decentralized Nonconvex Finite-Sum Optimization, Boyue Li
  • Better Linear Rates for SGD with Data Shuffling, Grigory Malinovsky

There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule.

Oliver Hinder · Wenhao Zhan · Akhilesh Soni · Grigory Malinovsky · Boyue Li
Mon 11:30 a.m. - 12:30 p.m.
(Poster session)

Please join us in gather.town (see link above). To see the abstracts of the posters presented in this session, please see below the schedule.

Authors/papers presenting posters in gather.town for this session:

  • Optimum-statistical Collaboration Towards Efficient Black-box Optimization, Wenjie Li

  • Integer Programming Approaches To Subspace Clustering With Missing Data, Akhilesh Soni

  • Stochastic Learning Equation using Monotone Increasing Resolution of Quantization, Jinwuk Seok

  • Sign-RIP: A Robust Restricted Isometry Property for Low-rank Matrix Recovery, Jianhao Ma

  • Farkas' Theorem of the Alternative for Prior Knowledge in Deep Networks, Jeffery Kline

  • Towards Robust and Automatic Hyper-Parameter Tuning, Mahdi S. Hosseini

  • High Probability Step Size Lower Bound for Adaptive Stochastic Optimization, Miaolan Xie

  • Stochastic Polyak Stepsize with a Moving Target, Robert M Gower

  • A Stochastic Momentum Method for Min-max Bilevel Optimization, Quanqi Hu

  • Deep Neural Networks pruning via the Structured Perspective Regularization, Matteo Cacciola

  • Efficient Calibration of Multi-Agent Market Simulators from Time Series with Bayesian Optimization, Yuanlu Bai

  • DESTRESS: Computation-Optimal and Communication-Efficient Decentralized Nonconvex Finite-Sum Optimization, Boyue Li

  • Policy Mirror Descent for Regularized RL: A Generalized Framework with Linear Convergence, Shicong Cen

  • Simulated Annealing for Neural Architecture Search, Shentong Mo

  • Acceleration and Stability of Stochastic Proximal Point Algorithm, Junhyung Lyle Kim

  • Barzilai and Borwein conjugate gradient method equipped with a non-monotone line search technique, Sajad Fathi Hafshejani

  • On the convergence of stochastic extragradient for bilinear games using restarted iteration averaging, Chris Junchi Li

  • Practice-Consistent Analysis of Adam-Style Methods, Zhishuai Guo

  • Escaping Local Minima With Stochastic Noise, Harsh Vardhan

  • Optimization with Adaptive Step Size Selection from a Dynamical Systems Perspective, Neha S Wadia

  • Last-Iterate Convergence of Saddle Point Optimizers via High-Resolution Differential Equations, Tatjana Chavdarova

  • Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent, Sharan Vaswani

  • Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization, Difan Zou

  • Faster Perturbed Stochastic Gradient Methods for Finding Local Minima, Zixiang Chen

  • Adam vs. SGD: Closing the generalization gap on image classification, Aman Gupta

  • Heavy-tailed noise does not explain the gap between SGD and Adam on Transformers, Frederik Kunstner

  • Faster Quasi-Newton Methods for Linear Composition Problems, Betty Shea

  • The Geometric Occam's Razor Implicit in Deep Learning, Benoit Dherin

Wenjie Li · Akhilesh Soni · Jinwuk Seok · Jianhao Ma · Jeff Kline · Mathieu Tuli · Miaolan Xie · Robert Gower · Quanqi Hu · Matteo Cacciola · Yuanlu Bai · Boyue Li · Wenhao Zhan · Shentong Mo · J. Lyle Kim · Sajad Fathi Hafshejani · Chris Junchi Li · Zhishuai Guo · Harsh Harshvardhan · Neha Wadia · Tatjana Chavdarova · Sharan Vaswani · Difan Zou · Zixiang Chen · Aman Gupta · Jacques Chen · Betty Shea · Benoit Dherin
Mon 12:30 p.m. - 12:58 p.m.
(Break)

Please join us in gather.town (see link below).

Mon 12:58 p.m. - 1:00 p.m.
Opening Remarks to Session 4 (Organizer intro)
Quanquan Gu
Mon 1:00 p.m. - 1:25 p.m.
(Plenary Speaker)   

Abstract: A natural optimization model that formulates many online resource allocation and decision-making problems is online linear programming (OLP), where the constraint matrix, along with the objective coefficients and decision variables, is revealed and decided column by column sequentially. We review the near-optimal algorithms and theories for solving this surprisingly general class of online problems under the assumption of random order of arrival and/or iid distributions of the input data. Then we present a few recent applications of the model/algorithm, including a fast online algorithm as a pre-solver for solving large-scale offline (binary) LPs, an interior-point online algorithm to address “fairness” for resource allocation, and a provable logarithmic regret bound for the Bandits with Knapsacks (BwK) problem.
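A toy version of the dual-price idea behind many near-optimal OLP algorithms under random arrival order: estimate a price from an initial fraction of arrivals, then accept an arrival iff its reward clears the price. The single-resource setup, uniform rewards, and thresholds below are illustrative, not the algorithms from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

n, budget = 1000, 200          # arrivals and resource capacity
rewards = rng.uniform(size=n)  # reward of each arrival, revealed online

# Phase 1: observe the first k arrivals (accepting none) and set a dual
# price so that roughly `budget` of all n arrivals would clear it.
k = 100
price = np.quantile(rewards[:k], 1 - budget / n)

# Phase 2: accept iff the reward exceeds the price and capacity remains.
value, used = 0.0, 0
for r in rewards[k:]:
    if r > price and used < budget:
        value += r
        used += 1

# Offline benchmark: the `budget` largest rewards in hindsight.
offline = np.sort(rewards)[-budget:].sum()
ratio = value / offline
print(ratio)  # close to 1 despite deciding online
```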

Yinyu Ye
Mon 1:25 p.m. - 1:30 p.m.
Q&A with Yinyu Ye (Q&A)
Yinyu Ye
Mon 1:30 p.m. - 1:55 p.m.
(Plenary Speaker)   

Abstract: LAPACK (Linear Algebra PACKage) is a widely-used high-quality software library for numerical linear algebra that provides routines for solving systems of linear equations, linear least squares, eigenvalue problems, singular value decomposition, and matrix factorizations such as LU, QR, Cholesky and Schur decomposition. Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems. In addition to providing some of the best linear algebra algorithms (in worst-case theory, numerical implementations, and non-trivial implicit statistical properties and machine learning implementations), RandNLA techniques are the basis for the best stochastic second-order optimization algorithms (such as SubSampled Newton's methods, Iterative Hessian Sketch, and Newton-LESS). The time has come to put RandNLA methods into the next generation of LAPACK, and we have begun to do that. We will present our high level plan to introduce RandNLA algorithms into LAPACK. The RandLAPACK library will implement state of the art randomized algorithms for problems such as low-rank approximation and least-squares, but its full scope will be larger than linear algebra per se. It will include certain higher-level primitives in optimization and machine learning that require judicious use of RandNLA. We will describe building blocks, the modular design that will help RandLAPACK evolve with the field of RandNLA (as well as the needs of machine learning, scientific computing, and other users) over time, as well as connections and implications for machine learning optimization. Joint work with Riley Murray, Jim Demmel, and others.
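One of the basic RandNLA primitives the talk alludes to, sketch-and-solve least squares, fits in a few lines: compress the tall problem with a random sketch and solve the small surrogate. The Gaussian sketch and problem sizes below are illustrative; RandLAPACK itself targets more refined sketches and algorithms.

```python
import numpy as np

rng = np.random.default_rng(3)

# Tall least-squares problem: min_x ||A x - b||.
m, n = 2000, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Sketch-and-solve: compress rows with a random Gaussian sketch S (s << m),
# then solve the much smaller problem min_x ||S A x - S b||.
s = 200
S = rng.standard_normal((s, m)) / np.sqrt(s)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

# Compare against the exact solution on the *original* residual.
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
res_sketch = np.linalg.norm(A @ x_sketch - b)
res_exact = np.linalg.norm(A @ x_exact - b)
ratio = res_sketch / res_exact
print(ratio)  # slightly above 1: near-optimal residual at a fraction of the cost
```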

Michael Mahoney
Mon 1:55 p.m. - 2:00 p.m.
Q&A with Michael Mahoney (Q&A)
Michael Mahoney
Mon 2:00 p.m. - 2:30 p.m.
(Orals and spotlights)

Oral (10 min)

  • On the Relation between Distributionally Robust Optimization and Data Curation, Agnieszka Slowik

Spotlights (5 min)

  • Heavy-tailed noise does not explain the gap between SGD and Adam on Transformers, Jacques Chen
  • Optimization with Adaptive Step Size Selection from a Dynamical Systems Perspective, Neha Wadia
  • Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization, Difan Zou

There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule.

Quanquan Gu · Agnieszka Słowik · Jacques Chen · Neha Wadia · Difan Zou
Mon 2:30 p.m. - 2:35 p.m.
Closing remarks (Organizer closing)
Courtney Paquette
-
(Poster) [ Visit Poster at Spot D5 in Virtual World ]
In the \emph{Subspace Clustering with Missing Data} (SCMD) problem, we are given a collection of $n$ data points, arranged into columns of a matrix $X \in \mathbb{R}^{d \times n}$, and each of the data points is observed only on a subset of its coordinates. The data points are assumed to be concentrated near a union of low-dimensional subspaces. The goal of SCMD is to cluster the vectors of the data matrix $X$ as per their subspace membership. State-of-the-art algorithms for SCMD can perform poorly on instances with a large amount of missing data or if the data matrix $X$ is nearly full-rank. We propose a novel integer programming-based method for SCMD. The approach is based on dynamically determining a set of candidate subspaces and optimally assigning points to selected subspaces. The problem structure is identical to the classical facility-location problem, with subspaces playing the role of facilities and data points that of customers. We propose a column-generation approach for identifying candidate subspaces combined with a Benders decomposition approach for solving the linear programming relaxation of the formulation. An empirical study demonstrates that the proposed approach can achieve better clustering accuracy than state-of-the-art methods when the data is high-rank or the percentage of missing data is high.
Akhilesh Soni · Daniel Pimentel-Alarcón
-
(Poster) [ Visit Poster at Spot D4 in Virtual World ]

In this paper, prior knowledge expressed in the form of polyhedral sets is introduced into the training of a deep neural network. This approach to using prior knowledge extends earlier work that applies Farkas' Theorem of the Alternative to linear support vector machine classifiers. Through this extension, we gain the ability to sculpt the decision boundary of a neural network by training on a set of discrete data while simultaneously fitting an uncountable number of points that live within a polytope that is defined by prior knowledge. We demonstrate viability of this approach on both synthetic and benchmark data sets.

Jeff Kline · Joseph Bockhorst
-
(Poster) [ Visit Poster at Spot D0 in Virtual World ]

In this paper, we consider a formulation of the federated learning problem that is relevant to both decentralized personalized federated learning and multi-task learning. This formulation is widespread in the literature and represents the minimization of local losses with a regularization term that takes into account the communication matrix of the network. First, we give lower bounds for the considered problem in different regularization regimes. We also construct an optimal algorithm that matches these lower bounds. Additionally, we validate the theoretical results with experiments.

Abdurakhmon Sadiev · Ekaterina Borodich · Darina Dvinskikh · Aleksandr Beznosikov · Alexander Gasnikov
-
(Poster) [ Visit Poster at Spot C6 in Virtual World ]

When analyzing parametric statistical models, a useful approach consists in modeling geometrically the parameter space. However, even for very simple and commonly used hierarchical models like statistical mixtures or stochastic deep neural networks, the smoothness assumption of manifolds is violated at singular points which exhibit non-smooth neighborhoods in the parameter space. These singular models have been analyzed in the context of learning dynamics, where singularities can act as attractors on the learning trajectory and, therefore, negatively influence the convergence speed of models. We propose a general approach to circumvent the problem arising from singularities by using stratifolds, a concept from algebraic topology, to formally model singular parameter spaces. We use the property that specific stratifolds are equipped with a resolution method to construct a smooth manifold approximation of the singular space. We empirically show that using (natural) gradient descent on the smooth manifold approximation instead of the singular space allows us to avoid the attractor behavior and therefore improve the convergence speed in learning.

Pascal Esser · Frank Nielsen
-
(Poster) [ Visit Poster at Spot C5 in Virtual World ]
Normalization Layers (NL) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. We introduce a spherical framework to study the optimization of neural networks with NL from a geometric perspective. Concretely, we leverage the radial invariance of groups of parameters to translate the optimization steps on the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. We use it to derive the first effective learning rate expression of Adam. We then show theoretically and empirically that, in the presence of NL, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere.
Simon Roburin · Yann de Mont-Marin · Andrei Bursuc · Renaud Marlet · Patrick Pérez · Mathieu Aubry
-
(Poster) [ Visit Poster at Spot D3 in Virtual World ]
We investigate how adaptive step size methods from numerical analysis can be used to speed up optimization routines. In contrast to line search strategies, the proposed methods recycle available gradient evaluations. Thus, neither evaluations of the objective function nor additional gradient evaluations are required, and the computational complexity per iteration remains at $\mathcal{O}(d)$, where $d$ is the problem dimension. On strongly convex functions, our results show a consistent improvement in the number of iterations by up to a factor of $2$ with the gradient method and of $1.4$ with the heavy ball method across a range of condition numbers.
Neha Wadia · Michael Jordan · Michael Muehlebach
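One classical instance of a step-size rule that recycles gradients already in hand is the Barzilai-Borwein step, shown below on a quadratic; it is a standard method used here only to illustrate the theme, not the rule proposed in the paper.

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 * x^T H x, condition number 100.
H = np.diag(np.linspace(1.0, 100.0, 10))
grad = lambda x: H @ x

x = np.ones(10)
g = grad(x)
step = 1e-3  # conservative first step before the rule kicks in
for _ in range(100):
    x_new = x - step * g
    g_new = grad(x_new)
    s, y = x_new - x, g_new - g  # differences reuse the stored gradient
    if s @ s == 0.0:             # already converged exactly
        break
    step = (s @ s) / (s @ y)     # Barzilai-Borwein step: no extra evaluations
    x, g = x_new, g_new

err = np.linalg.norm(x)
print(err)  # near 0, the minimizer
```

No function values and no extra gradient evaluations are needed: the step is computed purely from the iterate and gradient differences, matching the O(d) per-iteration cost highlighted in the abstract.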
-
(Poster) [ Visit Poster at Spot C4 in Virtual World ]

Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD, enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. However, one of the most basic questions in the design of successful SGD methods, one that is orthogonal to the aforementioned tricks, is the choice of the {\em next} training data point to learn from. Standard variants of SGD employ a {\em sampling with replacement} strategy, which means that the next training data point is sampled from the entire data set, often independently of all previous samples. While standard SGD is well understood theoretically, virtually all widely used machine learning software is based on {\em sampling without replacement}, as this is often empirically superior. That is, the training data is randomly shuffled/permuted, either only once at the beginning, a strategy known as {\em random shuffling} (RS), or before every epoch, a strategy known as {\em random reshuffling} (RR), and training proceeds in the data order dictated by the shuffling. RS and RR strategies have long remained beyond the reach of theoretical analysis that would satisfactorily explain their success. However, very recently, Mishchenko et al. (2020) provided tight {\em sublinear} convergence rates through a novel analysis, and showed that these strategies can improve upon standard SGD in certain regimes. Inspired by these results, we seek to further improve the rates of shuffling-based methods. In particular, we show that it is possible to enhance them with a variance reduction mechanism, obtaining {\em linear} convergence rates in the strongly convex regime. To the best of our knowledge, our linear convergence rates are the best for any method based on sampling without replacement. Moreover, we obtain improved convergence guarantees in the convex and non-convex regimes as well.

Grigory Malinovsky · Alibek Sailanbayev · Peter Richtarik
-
(Spotlight)

(Poster) [ Visit Poster at Spot C3 in Virtual World ]
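The random reshuffling (RR) strategy discussed above is simple to sketch in isolation (without the paper's variance reduction mechanism); the per-sample gradient list and toy objective below are illustrative assumptions, not the authors' code:

```python
import random

def sgd_random_reshuffling(grads, x0, lr, epochs, seed=0):
    """SGD with random reshuffling (RR): permute the data before every
    epoch and take one step per data point, sampling without replacement.
    `grads` is a list of per-sample gradient functions (toy 1-D setting)."""
    rng = random.Random(seed)
    x = x0
    order = list(range(len(grads)))
    for _ in range(epochs):
        rng.shuffle(order)                   # RR: reshuffle every epoch
        for i in order:                      # one full pass, no replacement
            x = x - lr * grads[i](x)
    return x

# Toy problem: f(x) = 0.5 * sum_i (x - a_i)^2, minimized at mean(a) = 2.5.
data = [1.0, 2.0, 3.0, 4.0]
grads = [(lambda a: (lambda x: x - a))(a) for a in data]
x_star = sgd_random_reshuffling(grads, x0=0.0, lr=0.05, epochs=200)
```

With a constant step size, RR settles into a small neighborhood of the minimizer; the paper's variance-reduced variant is what upgrades this to linear convergence in the strongly convex case.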

Second-order optimization algorithms hold the potential to speed up learning in neural networks, but are notoriously hard to compute due to the enormous size of the curvature matrix. This problem has inspired approximations of the curvature matrix that allow for efficient computations, most prominently the Kronecker-factored curvature approximation (KFAC). Indeed, KFAC shows significant speed-ups for optimization compared to standard baselines. In this context, we challenge two common beliefs: First, we show that, when subsampling the curvature matrix (in our case the Fisher information), second-order updates can be computed efficiently and exactly: the PyTorch implementation of our method requires less than twice as much amortised wall-clock time per parameter update as SGD. Second, through a careful set of experiments, we demonstrate that KFAC does not owe its performance to approximating the curvature matrix, but rather is closely linked to a new, simple first-order optimization algorithm. We propose and analyse this first-order optimizer and demonstrate that it outperforms KFAC both in terms of computation cost and optimization progress per parameter update.

Frederik Benzing
-
(Spotlight)

(Poster) [ Visit Poster at Spot D2 in Virtual World ]

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that, compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with fine-tuned regularization. In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization.
In contrast, we show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam and GD, will converge to the same solution if the training is successful. This suggests that the inferior generalization performance of Adam is fundamentally tied to the nonconvex landscape of deep learning optimization.

Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu
-
(Spotlight)

(Poster) [ Visit Poster at Spot D1 in Virtual World ]

Emerging applications in multi-agent environments, such as the internet of things, networked sensing, autonomous systems, and federated learning, call for decentralized algorithms for finite-sum optimization that are resource-efficient in terms of both computation and communication. In this paper, we consider the prototypical setting where the agents work collaboratively to minimize the sum of local loss functions by only communicating with their neighbors over a predetermined network topology. We develop a new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS), for nonconvex finite-sum optimization, which matches the near-optimal incremental first-order oracle (IFO) complexity of state-of-the-art centralized algorithms for finding first-order stationary points, and significantly improves over existing decentralized algorithms. The communication complexity of DESTRESS also improves upon prior art over a wide range of parameter regimes. DESTRESS leverages several key algorithm design ideas, including stochastic recursive gradient updates with mini-batches for local computation, gradient tracking with extra mixing for per-iteration communication, together with careful choices of hyper-parameters and new analysis frameworks, to provably achieve a desirable computation-communication trade-off.

Boyue Li · Zhize Li · Yuejie Chi
-
(Spotlight)

(Poster) [ Visit Poster at Spot D0 in Virtual World ]

The success of Adam on deep learning architectures such as transformers has made it the default option in many applications where stochastic gradient descent (SGD) does not work. Our theoretical understanding of this discrepancy is lagging and has prevented us from significantly improving either algorithm. Recent work advanced the hypothesis that Adam, and other heuristics like gradient clipping, outperform SGD on language tasks because the distribution of the stochastic gradients is more heavy-tailed. This suggests that Adam performs better because it is more robust to outliers, which is a promising avenue for designing better stochastic gradient estimators. However, it is unclear whether heavy-tailed noise is the cause of this discrepancy or merely another symptom. Through experiments that control for stochasticity by increasing the batch size, we present evidence that stochasticity and heavy-tailed noise are not the major factors in this gap.

Jacques Chen · Frederik Kunstner · Mark Schmidt
-
(Spotlight)

(Poster) [ Visit Poster at Spot C6 in Virtual World ]
Stochastic gradient descent (SGD) has emerged as the de facto method for solving (unconstrained) stochastic optimization problems. However, it suffers from two fundamental limitations: $(i)$ slow convergence due to inaccurate gradient approximation, and $(ii)$ numerical instability, especially with respect to step size selection. To improve the slow convergence, accelerated variants such as stochastic gradient descent with momentum (SGDM) have been studied; however, the interference of gradient noise and momentum can aggravate the numerical instability. Proximal point methods, on the other hand, have gained much attention due to their numerical stability, but their stochastic accelerated variants have received limited attention. To bridge this gap, we propose the stochastic proximal point algorithm with momentum (SPPAM), and study its convergence and stability. We show that SPPAM enjoys a better contraction factor than the stochastic proximal point algorithm (SPPA), leading to faster convergence. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM.
J. Lyle Kim · Panos Toulis · Tasos Kyrillidis
-
(Spotlight)
(Poster) [ Visit Poster at Spot C2 in Virtual World ]
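As a rough illustration of a proximal point step combined with momentum extrapolation, here is a sketch for a quadratic loss, where the proximal map has a closed form; this is our own simplification for illustration, not the SPPAM algorithm as analyzed in the paper:

```python
import numpy as np

def sppam_step(x, x_prev, A, b, eta, beta):
    """One proximal-point-with-momentum style step for the quadratic loss
    f(x) = 0.5 * ||A x - b||^2, whose proximal map has the closed form
    (I + eta * A^T A)^{-1} (v + eta * A^T b). Simplified sketch only."""
    v = x + beta * (x - x_prev)              # momentum extrapolation
    M = np.eye(x.shape[0]) + eta * A.T @ A   # implicit (proximal) system
    return np.linalg.solve(M, v + eta * A.T @ b)

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 3.0])                     # minimizer of f is [1, 3]
x_prev = x = np.zeros(2)
for _ in range(100):
    x, x_prev = sppam_step(x, x_prev, A, b, eta=0.5, beta=0.5), x
```

Because the step is implicit (it solves a small linear system rather than taking a raw gradient step), it remains stable even for large `eta`, which is the numerical-stability advantage proximal methods trade on.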

Distributed Gaussian process (DGP) is a popular approach to scaling GPs to big data: it divides the training data into subsets, performs local inference on each partition, and aggregates the results to obtain a global prediction. To combine the local predictions, a conditional independence assumption is used, which effectively posits perfect diversity between the subsets. Although it keeps the aggregation tractable, this assumption is often violated in practice and generally yields poor results. In this paper, we propose a novel approach for aggregating the Gaussian experts' predictions by a Gaussian graphical model (GGM), where the target aggregation is defined as an unobserved latent variable and the local predictions are the observed variables. We first estimate the joint distribution of latent and observed variables using the Expectation-Maximization (EM) algorithm. The interaction between experts can be encoded by the precision matrix of the joint distribution, and the aggregated predictions are obtained based on the properties of the conditional Gaussian distribution. Using both synthetic and real datasets, our experimental evaluations illustrate that our new method outperforms other state-of-the-art DGP approaches.

Hamed Jalali · Gjergji Kasneci
-
(Poster) [ Visit Poster at Spot C1 in Virtual World ]

Federated Learning (FL) is a powerful technique to train a model on a server with data from several clients in a privacy-preserving manner. FL incurs significant communication costs because it repeatedly transmits the model between the server and clients. Recently proposed algorithms quantize the model parameters to efficiently compress FL communication. We find that dynamic adaptation of the quantization level can boost compression without sacrificing model quality. We introduce DAdaQuant as a doubly-adaptive quantization algorithm that dynamically changes the quantization level across time and across clients. Our experiments show that DAdaQuant consistently improves client-to-server compression, outperforming the strongest non-adaptive baselines by up to 2.8x.

Robert Hönig · Yiren Zhao · Robert Mullins
-
(Poster) [ Visit Poster at Spot C5 in Virtual World ]
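A generic example of the kind of compressor such quantization schemes build on is unbiased stochastic uniform quantization (QSGD-style); the sketch below is a standard illustration of that building block, not DAdaQuant itself:

```python
import numpy as np

def quantize(v, levels, rng):
    """Unbiased stochastic uniform quantization of a vector to `levels`
    quantization levels per sign. Generic QSGD-style compressor used
    here only to illustrate the building block adaptive schemes tune."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part -> unbiased.
    q = lower + (rng.random(v.shape) < (scaled - lower))
    return np.sign(v) * q * norm / levels

rng = np.random.default_rng(0)
v = np.array([0.3, -1.2, 0.5])
# Averaging many independent quantizations recovers v (unbiasedness).
avg = np.mean([quantize(v, levels=4, rng=rng) for _ in range(20000)], axis=0)
```

Raising `levels` lowers the quantization variance but costs more bits per coordinate; dynamically choosing that level per round and per client is the knob an adaptive scheme turns.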

In this paper, we propose a new non-monotone conjugate gradient method for solving unconstrained nonlinear optimization problems. We first modify the non-monotone line search method by introducing a new trigonometric function to calculate the non-monotone parameter, which plays an essential role in the algorithm's efficiency. Then, we apply a convex combination of the Barzilai-Borwein step sizes to calculate the step size in each iteration. Under some suitable assumptions, we prove that the new algorithm is globally convergent. The efficiency and effectiveness of the proposed method are demonstrated in practice by applying the algorithm to some standard test problems and to non-negative matrix factorization problems.

Sajad Fathi Hafshejani · Daya Gaur · Shahadat Hossain · Robert Benkoczi
-
(Poster) [ Visit Poster at Spot C0 in Virtual World ]
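For reference, the (long) Barzilai-Borwein step size mentioned above sets $\alpha_k = s_k^\top s_k / s_k^\top y_k$ with $s_k = x_k - x_{k-1}$ and $y_k = g_k - g_{k-1}$. A minimal sketch follows; it uses only one of the two BB formulas (unlike the paper's convex combination) and a plain gradient method rather than conjugate gradients:

```python
import numpy as np

def bb_gradient_descent(grad, x0, iters, lr0=0.1):
    """Gradient descent with the (long) Barzilai-Borwein step size
    alpha_k = (s^T s) / (s^T y), where s = x_k - x_{k-1} and
    y = g_k - g_{k-1}. Hypothetical illustration of the BB rule only."""
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - lr0 * g_prev          # bootstrap with a fixed first step
    for _ in range(iters):
        g = grad(x)
        s, y = x - x_prev, g - g_prev
        denom = s @ y
        if abs(denom) < 1e-16:         # converged: gradients stopped changing
            break
        x_prev, g_prev = x, g
        x = x - (s @ s) / denom * g
    return x

# Quadratic test problem f(x) = 0.5 * x^T diag(1, 10) x, minimized at 0.
D = np.array([1.0, 10.0])
x = bb_gradient_descent(lambda x: D * x, x0=np.array([1.0, 1.0]), iters=30)
```

The BB step adapts to the local curvature at the cost of one stored previous iterate and gradient, which is why it pairs naturally with non-monotone line searches.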

A fundamental challenge in Deep Learning is to automatically find optimal step sizes for stochastic gradient descent. In traditional optimization, line searches are a commonly used method to determine step sizes. One problem in Deep Learning is that finding appropriate step sizes on the full-batch loss is prohibitively expensive. Therefore, classical line search approaches, designed for losses without inherent noise, are usually not applicable. Recent empirical findings suggest that the full-batch loss behaves locally parabolically along noisy update step directions, and that the optimal update step size changes only slowly. By exploiting these findings, this work introduces a line search method that approximates the full-batch loss with a parabola estimated over several mini-batches. In the experiments conducted, our approach mostly outperforms SGD tuned with a piece-wise constant learning rate schedule, as well as other line search approaches for Deep Learning, across models, datasets, and batch sizes on validation and test accuracy.

Maximus Mutschler · Andreas Zell
-
(Poster) [ Visit Poster at Spot B6 in Virtual World ]
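The parabolic approximation idea can be sketched by fitting $l(t) \approx a t^2 + b t + c$ to three loss evaluations along the update direction and stepping to the vertex $t^* = -b/(2a)$. The toy version below uses a single noiseless loss rather than an estimate over several mini-batches, so it only illustrates the fitting step:

```python
def parabolic_step(loss_along, delta=0.1):
    """Fit l(t) ~ a*t^2 + b*t + c to three loss evaluations along the
    update direction and step to the vertex t* = -b / (2a). Toy version:
    one noiseless loss instead of an estimate over several mini-batches."""
    l0, l1, l2 = loss_along(0.0), loss_along(delta), loss_along(2 * delta)
    a = (l2 - 2 * l1 + l0) / (2 * delta ** 2)    # second finite difference
    b = (l1 - l0) / delta - a * delta            # first finite difference
    if a <= 0:                                   # not locally convex here
        return delta                             # fall back to default step
    return -b / (2 * a)

# Exact parabola loss(t) = (t - 0.7)^2 + 1: the vertex 0.7 is recovered.
t_star = parabolic_step(lambda t: (t - 0.7) ** 2 + 1.0)
```

When the loss really is locally parabolic along the direction, three evaluations determine the step exactly; with mini-batch noise one would average the fit over several batches, as the paper does.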

Graph convolutional networks (GCNs) have been successfully applied to many graph-based applications. Training a large-scale GCN model, however, is still challenging: due to the node dependency and layer dependency of the GCN architecture, a huge amount of computational time and memory is required in the training process. In this paper, we propose a parallel and distributed GCN training algorithm based on the Alternating Direction Method of Multipliers (ADMM) to tackle the two challenges simultaneously. We first split the GCN layers into independent blocks to achieve layer parallelism. Furthermore, we reduce node dependency by dividing the graph into several dense communities such that each of them can be trained with an agent in parallel. Finally, we provide solutions for all subproblems in the community-based ADMM algorithm. Preliminary results demonstrate that our proposed community-based ADMM training algorithm can lead to a more than threefold speedup while achieving the best performance compared to state-of-the-art methods.

Hongyi Li · Junxiang Wang · Yongchao Wang · Yue Cheng · Liang Zhao
-
(Poster) [ Visit Poster at Spot C4 in Virtual World ]

We propose a general optimum-statistical collaboration framework for sequential black-box optimization. Based on general definitions of the resolution descriptor and the uncertainty quantifier, we provide a general regret analysis of the proposed framework. We then show that the proposed framework can be applied to a broader range of functions that have different smoothness, and it inspires tighter measures of the statistical uncertainty and thus a faster algorithm.

Wenjie Li · Chi-Hua Wang · Guang Cheng
-
(Poster) [ Visit Poster at Spot B5 in Virtual World ]

First-order methods for stochastic optimization have undeniable relevance. Variance reduction for these algorithms has become an important research topic. We exploit convexity and $L$-smoothness to improve the noisy estimates output by the stochastic gradient oracle. Our method, named COCO denoiser, is the joint maximum likelihood estimator of multiple function gradients from their noisy observations, subject to co-coercivity constraints between them. The resulting estimate is the solution of a convex quadratically constrained quadratic program. Although this problem is expensive to solve by interior point methods, we exploit its structure to apply an accelerated first-order algorithm, the Fast Dual Proximal Gradient method. Besides analytically characterizing the proposed estimator, we show empirically that increasing the number and proximity of the queried points leads to better gradient estimates. We also apply COCO in stochastic settings by plugging it into existing algorithms, such as SGD, Adam, and STRSAGA, outperforming their vanilla versions, even in scenarios where our modelling assumptions are mismatched.

Manuel Madeira · Renato Negrinho · Joao Xavier · Pedro Aguiar
-
(Poster) [ Visit Poster at Spot C3 in Virtual World ]

In this paper, we propose a quantized learning equation with monotonically increasing quantization resolution, together with a stochastic analysis of the proposed algorithm. Under the white noise hypothesis for the quantization error with a dense and uniform distribution, we can regard the quantization error as i.i.d. white noise. Based on this, we show that the learning equation with monotonically increasing quantization resolution converges weakly, i.e., in the sense of distributions. Our analysis shows that global optimization is possible on a domain that satisfies the Lipschitz condition, instead of relying on local convergence properties such as Hessian constraints on the objective function.

Jinwuk Seok · sikim00
-
(Poster) [ Visit Poster at Spot C2 in Virtual World ]

Restricted isometry property (RIP), essentially stating that the linear measurements are approximately norm-preserving, plays a crucial role in studying low-rank matrix recovery problems. However, RIP fails in the robust setting, when a subset of the measurements is grossly corrupted with noise. In this work, we propose a robust restricted isometry property, called Sign-RIP, and show its broad applications in robust low-rank matrix recovery. In particular, we show that Sign-RIP can guarantee the uniform convergence of the subdifferentials of robust matrix recovery with a nonsmooth loss function, even in the presence of arbitrarily dense and arbitrarily large outliers. Based on Sign-RIP, we characterize the location of the critical points in robust rank-1 matrix recovery, and prove that they are either close to the true solution or have small norm. Moreover, in the over-parameterized regime, where the rank of the true solution is over-estimated, we show that the subgradient method converges to the true solution at a (nearly) dimension-free rate. We show that the new notion of Sign-RIP enjoys almost the same sample complexity as its classical counterparts, but provides significantly better robustness against noise.

Jianhao Ma · Salar Fattahi
-
(Poster) [ Visit Poster at Spot C1 in Virtual World ]

In this paper, we present a simple and intuitive proof of convergence for a family of Adam-style methods (including Adam, AMSGrad, AdaBound, etc.) with an increasing or large "momentum" parameter for the first-order moment, which gives an alternative and more natural way to guarantee that Adam converges in stochastic non-convex minimization. We also establish a variance-diminishing result for the stochastic gradient estimators used. The analysis is based on a widely used but not fully understood stochastic estimator using a moving average (SEMA), which only requires a general unbiased stochastic oracle. In particular, we analyze Adam-style methods based on the variance recursion property of SEMA for stochastic non-convex minimization.

Zhishuai Guo · Yi Xu · Wotao Yin · Rong Jin · Tianbao Yang
-
(Poster) [ Visit Poster at Spot C0 in Virtual World ]
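The SEMA estimator referred to above is just an exponential moving average of stochastic gradients; a sketch in Adam's convention (the constant `beta` and the toy data below are our own illustration, not the paper's setup):

```python
import random

def sema(gradients, beta=0.99):
    """Stochastic estimator using a moving average (SEMA) in Adam's
    convention: m_t = beta * m_{t-1} + (1 - beta) * g_t. A large
    "momentum" beta averages out gradient noise, which is the
    variance-diminishing effect the analysis builds on."""
    m = 0.0
    for g in gradients:
        m = beta * m + (1 - beta) * g
    return m

# Noisy unbiased observations of a constant true gradient 2.0: the
# moving average tracks the true value with much-reduced variance.
rng = random.Random(0)
est = sema([2.0 + rng.gauss(0.0, 0.1) for _ in range(1000)])
```

In stationary noise, the estimator's variance is roughly a $(1-\beta)/(1+\beta)$ fraction of the per-sample variance, which is why a large first-moment parameter helps.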

The task of hyper-parameter optimization (HPO) is burdened with heavy computational costs due to the intractability of optimizing both a model's weights and its hyper-parameters simultaneously. In this work, we introduce a new class of HPO methods and explore how the low-rank factorization of the convolutional weights of intermediate layers of a convolutional neural network can be used to define an analytical response surface for optimizing hyper-parameters, using only training data. We quantify how this surface behaves as a surrogate for model performance and can be solved using a trust-region search algorithm, which we call \AlgName. The algorithm outperforms state-of-the-art methods such as Bayesian optimization and generalizes across model, optimizer, and dataset selection.

Mathieu Tuli · Mahdi Hosseini · Konstantinos N Plataniotis
-
(Poster) [ Visit Poster at Spot B4 in Virtual World ]

The StochAstic Recursive grAdient algoritHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm that needs a full gradient of the objective function from time to time. In this paper, we remove the necessity of a full gradient computation. This is achieved by using a randomized reshuffling strategy and aggregating the stochastic gradients obtained in each epoch. The aggregated stochastic gradients serve as an estimate of the full gradient in the SARAH algorithm. We provide a theoretical analysis of the proposed approach and conclude the paper with numerical experiments that demonstrate its efficiency.

Aleksandr Beznosikov · Martin Takac
-
(Poster) [ Visit Poster at Spot B3 in Virtual World ]

Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the squared norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no a priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed SGD, which has adverse effects on the convergence speed. Due to this issue, several more elaborate and seemingly very different algorithms have been proposed recently, with the goal of circumventing it. These methods are based on the idea of compressing the difference between the vector we would normally wish to compress and some auxiliary vector that changes throughout the iterative process. In this work we take a step back and develop a unified framework for studying such methods, both conceptually and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. Finally, we performed several numerical experiments which illustrate and support our theoretical findings.

Egor Shulgin · Peter Richtarik
-
(Poster) [ Visit Poster at Spot B2 in Virtual World ]

Boosting is a standard framework for learning a large-margin sparse ensemble of base hypotheses, where the algorithm assumes an oracle called the base learner, which returns a base hypothesis $h$ with maximum edge $\mathbb{E}_i[y_i h(x_i)]$ with respect to a given distribution over the sample $((x_1, y_1), \dots, (x_m, y_m))$. In particular, for the $\ell_1$-norm regularized soft margin optimization problem, there are several boosting algorithms that have theoretical iteration bounds for finding $\epsilon$-approximate solutions. They are not as fast as classical LPBoost by Demiriz et al., which has no non-trivial iteration bound. In this paper, we propose a new boosting scheme, where we assume a special base learner, which returns the average margin distribution vector $\mathbb{E}_h[(y_i h(x_i))_{i=1}^m]$ with respect to a certain distribution over the base hypothesis class $H$. Under this scheme, we propose a boosting algorithm whose iteration bound is $O((r \ln m)/\epsilon^2)$ and whose running time is $O(m \ln m)$ per iteration, where $r$ is the VC dimension of $H$. Moreover, we also propose an efficient implementation for the new base learner, given that a relevant sub-class $H|_S$ of $H$ is represented by a non-deterministic version of the Zero-suppressed Decision Diagram (NZDD), where an NZDD is a data structure for representing a family of sets succinctly.

Ryotaro Mitsuboshi · Kohei Hatano · Eiji Takimoto
-
(Poster) [ Visit Poster at Spot D6 in Virtual World ]

In over-parameterized deep neural networks there can be many possible parameter configurations that fit the training data exactly. However, the properties of these interpolating solutions are poorly understood. We argue that over-parameterized neural networks trained with stochastic gradient descent are subject to a Geometric Occam's Razor: these networks are implicitly regularized by the geometric model complexity. For one-dimensional regression, the geometric model complexity is simply given by the arc length of the function. For higher-dimensional settings, the geometric model complexity depends on the Dirichlet energy of the function. We explore the relationship between the Geometric Occam's Razor, Dirichlet energy, and known forms of implicit regularization. Finally, for ResNets trained on CIFAR-10, we observe that Dirichlet energy measurements are consistent with the action of this implicit Geometric Occam's Razor.

Benoit Dherin · Michael Munn · David Barrett
-
(Poster) [ Visit Poster at Spot B6 in Virtual World ]

With the recent advancements in Deep Learning, non-convex problems have received widespread attention. However, while stochastic gradient descent (SGD) can often successfully optimize these non-convex models in practice, the theoretical analysis---mapping SGD to Gradient Descent (GD) in expectation---cannot explain global convergence, as GD gets trapped in local minima and stationary points. We introduce a novel proof framework to analyze stochastic methods on a class of structured non-convex functions. We prove that SGD converges linearly to approximate global minima on non-convex functions that are `close' to a convex-like (strongly convex or Polyak-Łojasiewicz) function when its iterates are perturbed with stochastic noise of appropriate strength (smoothing). Our analysis applies to a strictly larger class of functions than studied in prior work. We demonstrate that special cases of smoothing are equivalent to vanilla SGD in expectation, which explains why SGD can escape local minima in practice.
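A toy sketch of the smoothing idea (our own illustration, with hypothetical parameter choices): the gradient is evaluated at an iterate perturbed by Gaussian noise of a chosen strength, so the method effectively optimizes a smoothed version of the objective:

```python
import numpy as np

def smoothed_sgd(grad, x0, lr=0.1, noise_std=0.1, steps=500, seed=0):
    """Gradient descent whose gradient is evaluated at a stochastically
    perturbed iterate -- a simple form of the smoothing discussed above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        z = rng.normal(0.0, noise_std, size=x.shape)
        x = x - lr * grad(x + z)  # perturb first, then take the gradient step
    return x
```

On a convex quadratic this behaves like SGD converging to a noise-dominated neighborhood of the minimizer; the paper's point is that on suitably structured non-convex functions the same perturbation helps the iterates escape poor local minima.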

Harsh Harshvardhan · Sebastian Stich
-
(Poster) [ Visit Poster at Spot B0 in Virtual World ]

Deep over-parameterized neural networks exhibit the interpolation property on many data sets. That is, these models are able to achieve approximately zero loss on all training samples simultaneously. Recently, this property has been exploited to develop novel optimisation algorithms for this setting. These algorithms use the fact that the optimal loss value is known to employ a variation of a Polyak Step-size calculated on a stochastic batch of data. In this work, we introduce an algorithm that extends this idea to tasks where the interpolation property does not hold. As we no longer have access to the optimal loss values a priori, we instead estimate them for each sample online. To realise this, we introduce a simple but highly effective heuristic for approximating the optimal value based on previous loss evaluations. Through rigorous experimentation we show the effectiveness of our approach, which outperforms adaptive gradient and line search methods.
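The core step-size rule can be sketched as follows (a simplified illustration with our own names; the paper's heuristic for estimating the optimal loss is more refined than the running-minimum tracker shown here):

```python
import numpy as np

def polyak_step_size(batch_loss, batch_grad, est_opt_loss, max_lr=1.0, eps=1e-8):
    """Polyak step size on a stochastic batch:
    gamma = (f_i(x) - f_i^*) / ||grad f_i(x)||^2, capped at max_lr,
    where f_i^* is the (estimated) optimal loss for this batch."""
    gap = max(batch_loss - est_opt_loss, 0.0)
    return min(gap / (float(np.dot(batch_grad, batch_grad)) + eps), max_lr)

class RunningMinEstimate:
    """A naive online estimate of a sample's optimal loss: the smallest loss
    observed so far for that sample (illustrative only)."""
    def __init__(self):
        self.best = {}

    def update(self, idx, loss):
        self.best[idx] = min(self.best.get(idx, loss), loss)
        return self.best[idx]
```

Under interpolation the estimate f_i^* is simply zero and the rule reduces to the stochastic Polyak step size; the estimation step is what extends it to non-interpolating tasks.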

Alasdair Paren · Rudra Poudel · Pawan K Mudigonda
-
(Poster) [ Visit Poster at Spot B5 in Virtual World ]

Several classical adaptive optimization algorithms, such as line search and trust-region methods, have been recently extended to stochastic settings. Unlike the stochastic gradient method and its many variants, these algorithms do not use a pre-specified sequence of step sizes, but increase or decrease the step size adaptively according to the estimated progress of the algorithm. These algorithms rely on stochastic oracles that estimate function values, gradients, and Hessians in some cases. The accuracy requirement of these oracles is also adaptive and depends on the step size. In the deterministic setting, a lower bound on the step size is easily derived, however, in the stochastic setting, due to possible oracle failures, bounds on the step size have not been previously derived. In this paper, we give a lower bound on the step size that holds with high probability. This bound is dependent on the probability of the oracle failures, recovering the deterministic result as an extreme case when this probability is zero.

Katya Scheinberg · Miaolan Xie
-
(Poster) [ Visit Poster at Spot A6 in Virtual World ]

We propose a new, more general approach to the design of stochastic gradient-based optimization methods for machine learning. In this new framework, optimizers assume access to a batch of gradient estimates per iteration, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups. To demonstrate the usefulness of this generalized approach, we develop Eve, an adaptation of the Adam optimizer which uses examplewise gradients to obtain more accurate second-moment estimates. We provide preliminary experiments, without hyperparameter tuning, which show that the new optimizer slightly outperforms Adam on a small scale benchmark and performs the same or worse on larger scale benchmarks. Further work is needed to refine the algorithm and tune hyperparameters.
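To illustrate what a batch of per-example gradients buys (our own sketch, not the Eve implementation): one can average the squared per-example gradients rather than squaring the averaged gradient, which is the only quantity a single-estimate optimizer like Adam sees:

```python
import numpy as np

def second_moment_estimates(example_grads):
    """Contrast Adam's second-moment source (square of the mean gradient)
    with an examplewise estimate (mean of the squared per-example gradients).
    example_grads has shape (batch_size, num_params)."""
    mean_grad = example_grads.mean(axis=0)
    adam_style = mean_grad ** 2
    examplewise = (example_grads ** 2).mean(axis=0)
    return adam_style, examplewise
```

When per-example gradients cancel (e.g. +1 and -1), the squared mean is zero while the examplewise estimate correctly reports high gradient variance.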

Julius Kunze · James Townsend · David Barber
-
(Poster) [ Visit Poster at Spot A5 in Virtual World ]

We consider the problem of learning low-rank tensors from partial observations with structural constraints and propose a novel factorization of such tensors, which leads to a simpler optimization problem. The resulting problem is an optimization problem on manifolds. We develop first-order and second-order Riemannian optimization algorithms to solve it. The duality gap for the resulting problem is derived, and we experimentally verify the correctness of the proposed algorithm. We demonstrate the algorithm on nonnegativity constraints and Hankel constraints.

Jayadev Naram · Tanmay Sinha · Pawan Kumar
-
(Poster) [ Visit Poster at Spot A4 in Virtual World ]
In this paper, we propose a novel accelerated gradient method called ANITA for solving the fundamental finite-sum optimization problems. Concretely, we consider both general convex and strongly convex settings: i) For general convex finite-sum problems, ANITA improves the previous state-of-the-art result given by Varag (Lan et al., 2019). In particular, for large-scale problems or when the target error is not very small, i.e., $n \geq \frac{1}{\epsilon^2}$, ANITA obtains the \emph{first} optimal result $O(n)$, matching the lower bound $\Omega(n)$ provided by Woodworth and Srebro (2016), while previous results are $O(n \log \frac{1}{\epsilon})$ of Varag (Lan et al., 2019) and $O(\frac{n}{\sqrt{\epsilon}})$ of Katyusha (Allen-Zhu, 2017). ii) For strongly convex finite-sum problems, we also show that ANITA can achieve the optimal convergence rate $O\big((n+\sqrt{\frac{nL}{\mu}})\log\frac{1}{\epsilon}\big)$ matching the lower bound $\Omega\big((n+\sqrt{\frac{nL}{\mu}})\log\frac{1}{\epsilon}\big)$ provided by Lan and Zhou (2015). Besides, ANITA enjoys a simpler loopless algorithmic structure, unlike previous accelerated algorithms such as Varag (Lan et al., 2019) and Katyusha (Allen-Zhu, 2017), which use an inconvenient double-loop structure. Moreover, by exploiting the loopless structure of ANITA, we provide a new \emph{dynamic multi-stage convergence analysis}, which is the key technical part for improving previous results to the optimal rates. Finally, the numerical experiments show that ANITA converges faster than the previous state-of-the-art Varag (Lan et al., 2019), validating our theoretical results and confirming the practical superiority of ANITA. We believe that our new theoretical rates and convergence analysis for this fundamental finite-sum problem will directly lead to key improvements for many other related problems, such as distributed/federated/decentralized optimization problems.
For instance, Li and Richtarik (2021) obtain the first compressed and accelerated result, substantially improving previous state-of-the-art results, by applying ANITA to the distributed optimization problems with compressed communication.
Zhize Li
-
(Poster) [ Visit Poster at Spot A3 in Virtual World ]
First proposed by Seide et al. (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is $O(1/T^{2/3})$, the rate of gradient descent in the same regime is $O(1/T)$). Recently, Richtárik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression. Several of these techniques were never analyzed in conjunction with EF before, and in cases where they were (e.g., bidirectional compression), our rates are vastly superior.
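The EF21 mechanism itself is compact; here is a single-worker sketch (our own simplification of the multi-worker method) with a Top-k contractive compressor: the state g tracks the gradient via compressed corrections, and the iterate moves along g rather than along the raw gradient.

```python
import numpy as np

def top_k(v, k):
    """Top-k contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_step(x, g, grad_fn, lr=0.1, k=1):
    """One EF21 update (single worker, for illustration):
    x moves along the gradient tracker g, and g is corrected by a
    compressed difference to the fresh gradient."""
    x_new = x - lr * g
    g_new = g + top_k(grad_fn(x_new) - g, k)
    return x_new, g_new
```

With the identity compressor (k equal to the dimension) this reduces to plain gradient descent with a one-step-delayed gradient; the point of EF21 is that convergence survives aggressive compression.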
Peter Richtarik · Igor Sokolov · Ilyas Fatkhullin · Eduard Gorbunov · Zhize Li
-
(Poster) [ Visit Poster at Spot B4 in Virtual World ]

We propose a new stochastic gradient method that uses recorded past loss values to compute adaptive stepsizes. Our starting point is to show that the SP (Stochastic Polyak) method directly exploits interpolated models. That is, SP is a subsampled Newton-Raphson method applied to solving certain interpolation equations. These interpolation equations only hold for models that interpolate the data. We then use this viewpoint to develop a new variant of the SP method that converges without interpolation called MOTAPS. The MOTAPS method uses n auxiliary variables, one for each data point, that track the loss value for each data point. These auxiliary variables and the loss values are then used to set the step size.

We provide a global convergence theory for MOTAPS by showing that it can be interpreted as a special variant of online SGD. We also perform several numerical experiments on convex learning problems, and non-convex learning problem based on image classification and language translation. In all of our tasks we show that MOTAPS is competitive with the relevant baseline method.

Robert Gower · Aaron Defazio · Mike Rabbat
-
(Poster) [ Visit Poster at Spot B3 in Virtual World ]

Several widely-used first-order saddle point optimization methods yield the same continuous-time ordinary differential equation (ODE) as that of the Gradient Descent Ascent (GDA) method when derived naively. However, their convergence properties are very different even in simple bilinear games. In this paper, we use a technique from fluid dynamics called High Resolution Differential Equations (HRDEs) to design ODEs that converge to saddle points. As our main result, we design an ODE that has last iterate convergence for monotone variational inequalities. To our knowledge, this is the first continuous-time dynamics shown to converge for such a general setting. We also provide an implicit discretization of our ODE and we show it has last iterate convergence at a rate O(1/\sqrt{T}), which was previously shown to be optimal (Golowich et al., 2020), using only first-order smoothness of the monotone operator, in contrast to previous results that need second-order smoothness as well.

Tatjana Chavdarova · Michael Jordan · Emmanouil Zampetakis
-
(Poster) [ Visit Poster at Spot B2 in Virtual World ]

We aim to design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise \sigma^2 in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number \kappa, we first prove that T iterations of SGD with Nesterov acceleration and exponentially decreasing step-sizes can achieve a near-optimal \tilde{O}(\exp(-T/\sqrt{\kappa}) + \sigma^2/T) convergence rate. Under a relaxed assumption on the noise, with the same step-size scheme and knowledge of the smoothness, we prove that SGD can achieve an \tilde{O}(\exp(-T/\kappa) + \sigma^2/T) rate. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower bounds) that SGD converges at the desired rate, but only to a neighborhood of the solution. Next, we use SGD with an offline estimate of the smoothness and prove convergence to the minimizer. However, its convergence is slowed down proportional to the estimation error and we prove a lower bound justifying this slowdown. Compared to other step-size schemes, we empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
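The exponentially decreasing schedule at the heart of these results can be sketched as follows (a generic instantiation on our part; the paper's exact constants may differ):

```python
def exp_step_sizes(eta0, T):
    """Exponentially decaying step-size schedule: eta_t = eta0 * alpha^t
    with alpha = (1/T)^(1/T), so the step size falls from eta0 to roughly
    eta0 / T over the T iterations."""
    alpha = (1.0 / T) ** (1.0 / T)
    return [eta0 * alpha ** t for t in range(T)]
```

The appeal of this shape is that early iterations take near-constant steps (fast bias decay) while late iterations take O(1/T) steps (noise averaging), without the schedule needing to know \sigma^2.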

Sharan Vaswani · Benjamin Dubois-Taine · Reza Babanezhad Harikandeh
-
(Poster) [ Visit Poster at Spot D1 in Virtual World ]
We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is extremely useful in the context of Federated Averaging (FedAvg) with local passes over the client data. In particular, we prove that whenever the local stepsizes are small and the update direction is given by FedAvg over all clients, one can take a big leap in the obtained direction and improve the nonconvex rate of convergence from $\mathcal{O}(\varepsilon^{-3})$ to $\mathcal{O}(\varepsilon^{-2})$. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small server-side stepsize. Together, our results on the advantage of large and small server-side stepsizes give a formal theoretical justification for the practice of adaptive server-side optimization in federated learning. Moreover, we consider multiple strategies of client participation and cover the options of uniform client sampling, deterministic or adversarial permutation of clients, as well as random permutation of clients.
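A minimal sketch of one FedAvg round with a separate server-side stepsize (function names and simplifications are ours): each client runs local gradient steps, and the server averages the client updates and scales the averaged direction by its own stepsize, which is the "big leap" discussed above.

```python
import numpy as np

def fedavg_round(x, client_grad, clients, local_lr, server_lr, local_steps=1):
    """One FedAvg round with a server-side stepsize.
    client_grad(c, x) returns client c's gradient at x."""
    deltas = []
    for c in clients:
        xc = x.copy()
        for _ in range(local_steps):
            xc -= local_lr * client_grad(c, xc)
        deltas.append(x - xc)  # client's total local update direction
    return x - server_lr * np.mean(deltas, axis=0)
```

With quadratic client objectives f_c(x) = ||x - c||^2 / 2, repeated rounds drive x to the average of the client optima, and server_lr rescales how far each round travels along the averaged direction.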
Grigory Malinovsky · Konstantin Mishchenko · Peter Richtarik
-
(Poster) [ Visit Poster at Spot B1 in Virtual World ]
In this paper, we study a nonconvex min-max bilevel optimization problem where the outer objective function is nonconvex in the minimization variable and strongly concave in the maximization variable, and the inner objective function is strongly convex. This paper develops a single-loop single-timescale stochastic algorithm based on a moving-average estimator, which only requires a general unbiased stochastic oracle with bounded variance. To the best of our knowledge, the only existing work on min-max bilevel optimization focuses on problems whose upper objective has a certain structure and only achieves an oracle complexity of $\mathcal{O}(\epsilon^{-5})$. Under some mild assumptions on the partial derivatives of both outer and inner objective functions, we provide the first convergence guarantee with an oracle complexity of $\mathcal{O}(\epsilon^{-4})$ for a general class of min-max bilevel problems, which matches the optimal complexity order for solving stochastic nonconvex optimization under a general unbiased stochastic oracle model.
Quanqi Hu · Bokun Wang · Tianbao Yang
-
(Poster) [ Visit Poster at Spot B0 in Virtual World ]

In Machine Learning, Artificial Neural Networks (ANNs) are a very powerful tool, broadly used in many applications. Often, the selected (deep) architectures include many layers, and therefore a large amount of parameters, which makes training, storage and inference expensive. This motivated a stream of research about compressing the original networks into smaller ones without excessively sacrificing performance. Among the many proposed compression approaches, one of the most popular is \emph{pruning}, whereby entire elements of the ANN (links, nodes, channels, etc.) and the corresponding weights are deleted. Since the nature of the problem is inherently combinatorial (what elements to prune and what not), we propose a new pruning method based on Operations Research tools. We start from a natural Mixed-Integer Programming model for the problem, and we use the Perspective Reformulation technique to strengthen its continuous relaxation. Projecting away the indicator variables from this reformulation yields a new regularization term, which we call the Structured Perspective Regularization. That leads to structured pruning of the initial architecture. We test our method on some ResNet architectures applied to CIFAR-10, CIFAR-100 and ImageNet datasets, obtaining competitive performance w.r.t. the state of the art for structured pruning.

Matteo Cacciola · Andrea Lodi · Xinlin Li
-
(Poster) [ Visit Poster at Spot A6 in Virtual World ]

Multi-agent market simulation is commonly used for testing trading strategies before deploying them to real-time trading. In electronic trading markets only the price or volume time series, that result from interaction of multiple market participants, are typically directly observable. Therefore, multi-agent market environments need to be calibrated so that the time series that result from the interaction of simulated agents resemble the historical ones -- which amounts to solving a highly complex large-scale optimization problem. In this paper, we propose a simple and efficient framework for calibrating multi-agent market simulator parameters from historical time series observations. First, we consider a novel concept of eligibility set to bypass the potential non-identifiability issue. Second, we generalize the two-sample Kolmogorov-Smirnov (K-S) test with Bonferroni correction to test the similarity between two high-dimensional time series distributions, which gives a simple yet effective distance metric between the time series sample sets. Third, we suggest using Bayesian optimization (BO) and trust-region BO (TuRBO) to minimize the aforementioned distance metric. Finally, we demonstrate the efficiency of our framework using numerical experiments.
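The distance metric in the second step can be sketched as follows (our own minimal implementation of per-coordinate two-sample K-S statistics; a complete test would further convert these statistics to Bonferroni-corrected p-values):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup_x |F_a(x) - F_b(x)|,
    computed from the empirical CDFs of the two samples."""
    a, b = np.sort(a), np.sort(b)
    all_v = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_v, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_v, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def multidim_ks_distance(X, Y):
    """Distance between two multivariate sample sets (rows = samples):
    the largest per-coordinate K-S statistic, in the spirit of a
    Bonferroni-corrected battery of univariate tests."""
    return max(ks_statistic(X[:, j], Y[:, j]) for j in range(X.shape[1]))
```

The resulting scalar distance is cheap to evaluate, which is what makes it a practical objective for the Bayesian optimization loop in the third step.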

Yuanlu Bai · Svitlana Vyetrenko · Henry Lam · Tucker Balch
-
(Poster) [ Visit Poster at Spot A5 in Virtual World ]
Escaping from saddle points and finding local minima is a central problem in nonconvex optimization. Perturbed gradient methods are perhaps the simplest approach for this problem. However, to find $(\epsilon, \sqrt{\epsilon})$-approximate local minima, the existing best stochastic gradient complexity for this type of algorithms is $\tilde O(\epsilon^{-3.5})$, which is not optimal. In this paper, we propose \texttt{Pullback}, a faster perturbed stochastic gradient framework for finding local minima. We show that Pullback with stochastic gradient estimators such as SARAH/SPIDER and STORM can find $(\epsilon, \epsilon_{H})$-approximate local minima within $\tilde O(\epsilon^{-3} + \epsilon_{H}^{-6})$ stochastic gradient evaluations (or $\tilde O(\epsilon^{-3})$ when $\epsilon_H = \sqrt{\epsilon}$). The core idea of our framework is a step-size ``pullback'' scheme to control the average movement of the iterates, which leads to faster convergence to the local minima.
Zixiang Chen · Dongruo Zhou · Quanquan Gu
-
(Poster) [ Visit Poster at Spot A4 in Virtual World ]

Adam is an adaptive deep neural network training optimizer that has been widely used across a variety of applications. However, on image classification problems, its generalization performance is significantly worse than stochastic gradient descent (SGD). By tuning several inner hyperparameters of Adam, it is possible to lift the performance of Adam and close this gap; but this makes the use of Adam computationally expensive. In this paper, we use a new training approach based on layer-wise weight normalization (LAWN) to solidly improve Adam's performance and close the gap with SGD. LAWN also helps reduce the impact of batch size on Adam's performance. With speed intact and performance vastly improved, the Adam-LAWN combination becomes an attractive optimizer for use in image classification.

Aman Gupta · Rohan Ramanath · Jun Shi · Sathiya Keerthi
-
(Poster) [ Visit Poster at Spot A3 in Virtual World ]

Gradient-based Neural Architecture Search (NAS) approaches have achieved remarkable progress in the automated machine learning community. However, previous methods require substantial search time and computation resources when seeking an optimal network structure in a large search space. In this work, we propose a novel Simulated Annealing algorithm for NAS, namely SA-NAS, which adds perturbations to the gradient descent to save search cost and boost the predictive performance of the searched architecture. Our proposed algorithm is easily adapted to current state-of-the-art methods in the literature. We conduct extensive experiments on various benchmarks where the results demonstrate the effectiveness and efficiency of our SA-NAS in reducing search time and saving computation resources. Compared to previous differentiable methods, our SA-NAS achieves comparable or better predictive performance under the same setting.

Shentong Mo · Jingfei Xia · Pinxu Ren
-
(Poster) [ Visit Poster at Spot A2 in Virtual World ]

Despite its lack of theoretical guarantees, the limited memory version of BFGS (l-BFGS) is widely used in practice for large-scale optimization problems. In particular, when combined with the Wolfe conditions, l-BFGS tends to find solutions significantly faster than methods with known convergence rates such as gradient descent (GD). Similarly, inexact stepsize searches have outsized practical impact despite, in theory, having the same worst-case complexity as using a constant step size. These search techniques are often used for linear composition problems which are known to allow efficient linesearches and subspace optimization (SO). In this paper, we propose a version of l-BFGS that is faster for linear composition problems. Our method combines an l-BFGS direction with a momentum direction using SO. We explore practical SO issues that include extending the Wolfe conditions from one dimension to multiple dimensions. We experimentally compare our method to standard l-BFGS and to a SO method that is known to be efficient.

Betty Shea · Mark Schmidt
-
(Poster) [ Visit Poster at Spot A1 in Virtual World ]

We study the stochastic bilinear minimax optimization problem, presenting an analysis of the same-sample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. We first note that the last iterate of the basic SEG method only contracts to a fixed neighborhood of the Nash equilibrium, independent of the step size. This contrasts sharply with the standard setting of minimization where standard stochastic algorithms converge to a neighborhood that vanishes in proportion to the square-root (constant) step size. Under the same setting, however, we prove that when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure. In the interpolation setting, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting.
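A sketch of same-sample SEG with iteration averaging on a bilinear game min_x max_y x^T A y (our own illustration; the scheduled restarting procedure is omitted):

```python
import numpy as np

def seg_bilinear(A, x0, y0, lr=0.1, steps=3000, noise=0.0, seed=0):
    """Same-sample Stochastic ExtraGradient with constant step size on the
    bilinear game min_x max_y x^T A y. Returns the last iterate and the
    iteration average; the averaged iterate is the one that converges to
    the Nash equilibrium under noise, per the discussion above."""
    rng = np.random.default_rng(seed)
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    xs, ys = [], []
    for _ in range(steps):
        # the SAME noisy sample perturbs both the extrapolation and the update
        zx = rng.normal(0.0, noise, size=x.shape)
        zy = rng.normal(0.0, noise, size=y.shape)
        x_half = x - lr * (A @ y + zx)      # extrapolation step
        y_half = y + lr * (A.T @ x + zy)
        x = x - lr * (A @ y_half + zx)      # update step
        y = y + lr * (A.T @ x_half + zy)
        xs.append(x.copy())
        ys.append(y.copy())
    return (x, y), (np.mean(xs, axis=0), np.mean(ys, axis=0))
```

In the noiseless case both the last iterate and the average approach the equilibrium at the origin; with noise, the last iterate only reaches a step-size-independent neighborhood, which is exactly the phenomenon the averaging (and restarting) is designed to fix.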

Chris Junchi Li · Yaodong Yu · Nicolas Loizou · Gauthier Gidel · Yi Ma · Nicolas Le Roux · Michael Jordan
-
(Oral)

(Poster) [ Visit Poster at Spot A2 in Virtual World ]

Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. A practical implication of our results is that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation.
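The relationship at the center of the paper can be seen in a two-line sketch (ours): the group-DRO objective coincides with an average loss on a reweighted training set once the weights concentrate on the worst-off group:

```python
import numpy as np

def worst_group_risk(group_losses):
    """Group DRO objective: the worst expected risk across subpopulations."""
    return max(group_losses)

def reweighted_risk(group_losses, weights):
    """Average risk on a reweighted training set: sum_g w_g * risk_g.
    For weights concentrated on the worst group, this equals the DRO
    objective, illustrating the equivalence discussed above."""
    return float(np.dot(weights, group_losses))
```

The caveat in the abstract follows from this picture: both formulations hinge on choosing (or learning) the right weights, so neither is by itself a complete solution for bias mitigation.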

Agnieszka Słowik · Leon Bottou
-
(Oral)

(Poster) [ Visit Poster at Spot A0 in Virtual World ]

Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations arise commonly as well, including the need to encourage exploration, and to ensure certain structural properties due to safety, resource and operational constraints. These considerations can often be accounted for by resorting to regularized RL. Focusing on an infinite-horizon discounted Markov decision process, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent [Lan, 2021], the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergences cognizant of the regularizer in use. We prove that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution, even when the regularizer lacks strong convexity and smoothness. In addition, this fast convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to confirm the applicability of GPMD.
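For the KL Bregman divergence with no regularizer, one mirror-descent policy update for a single state has the familiar closed form of a multiplicative-weights / softmax step (our own sketch of this special case, not the full GPMD algorithm):

```python
import numpy as np

def kl_mirror_descent_step(policy, q_values, eta):
    """Mirror descent with KL divergence on the probability simplex:
    pi'(a) proportional to pi(a) * exp(eta * Q(a)),
    computed stably in log space."""
    logits = np.log(policy) + eta * np.asarray(q_values, dtype=float)
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()
```

GPMD generalizes this step by building the Bregman divergence out of the regularizer itself, which is what preserves linear convergence for non-smooth, non-strongly-convex regularizers.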

Wenhao Zhan · Shicong Cen · Baihe Huang · Yuxin Chen · Jason Lee · Yuejie Chi
-
(Oral)

(Poster) [ Visit Poster at Spot A0 in Virtual World ]

Over-parameterized deep neural networks are able to achieve excellent training accuracy while maintaining a small generalization error. It has also been found that they are able to fit arbitrary labels, and this behaviour is referred to as the phenomenon of memorization. In this work, we study the phenomenon of memorization with turn-over dropout, an efficient method to estimate influence and memorization, for data with true labels (real data) and data with random labels (random data). Our main findings are: (i) For both real data and random data, the optimization of easy examples (e.g. real data) and difficult examples (e.g. random data) are conducted by the network simultaneously, with easy ones at a higher speed; (ii) For real data, a correct difficult example in the training dataset is more informative than an easy one. By showing the existence of memorization on random data and real data, we highlight the consistency between them regarding optimization and we emphasize the implication of memorization during optimization.

Futong Liu · Tao Lin · Martin Jaggi
-
(Oral)


Author Information

Courtney Paquette (McGill University)
Quanquan Gu (UCLA)
Oliver Hinder (University of Pittsburgh)
Katya Scheinberg (Cornell)
Sebastian Stich (CISPA)

Dr. [Sebastian U. Stich](https://sstich.ch/) is a faculty at the CISPA Helmholtz Center for Information Security. Research interests: - *methods for machine learning and statistics*—at the interface of theory and practice - *collaborative learning* (distributed, federated and decentralized methods) - *optimization for machine learning* (adaptive stochastic methods and generalization performance)

Martin Takac (Mohamed bin Zayed University of Artificial Intelligence (MBZUAI))
