### Session

## Orals & Spotlights Track 21: Optimization

### Peter Richtarik · Marco Cuturi

**Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent**

Benjamin Recht · Christopher Ré · Stephen Wright · Feng Niu

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.

**Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form**

Hicham Janati · Boris Muzellec · Gabriel Peyré · Marco Cuturi

Although optimal transport (OT) problems admit closed form solutions in a very few notable cases, e.g. in 1D or between Gaussians, these closed forms have proved extremely fecund for practitioners to define tools inspired from the OT geometry. On the other hand, the numerical resolution of OT problems using entropic regularization has given rise to many applications, but because there are no known closed-form solutions for entropic regularized OT problems, these approaches are mostly algorithmic, not informed by elegant closed forms. In this paper, we propose to fill the void at the intersection between these two schools of thought in OT by proving that the entropy-regularized optimal transport problem between two Gaussian measures admits a closed form. Contrary to the unregularized case, for which the explicit form is given by the Wasserstein-Bures distance, the closed form we obtain is differentiable everywhere, even for Gaussians with degenerate covariance matrices. We obtain this closed form solution by solving the fixed-point equation behind Sinkhorn's algorithm, the default method for computing entropic regularized OT. Remarkably, this approach extends to the generalized unbalanced case --- where Gaussian measures are scaled by positive constants. This extension leads to a closed form expression for unbalanced Gaussians as well, and highlights the mass transportation / destruction trade-off seen in unbalanced optimal transport. Moreover, in both settings, we show that the optimal transportation plans are (scaled) Gaussians and provide analytical formulas of their parameters. These formulas constitute the first non-trivial closed forms for entropy-regularized optimal transport, thus providing a ground truth for the analysis of entropic OT and Sinkhorn's algorithm.

**Acceleration with a Ball Optimization Oracle**

Yair Carmon · Arun Jambulapati · Qijia Jiang · Yujia Jin · Yin Tat Lee · Aaron Sidford · Kevin Tian

Consider an oracle which takes a point x and returns the minimizer of a convex function f in an l*2 ball of radius r around x. It is straightforward to show that roughly r^{-1}\log(1/epsilon) calls to the oracle suffice to find an \epsilon-approximate minimizer of f in an l*2 unit ball. Perhaps surprisingly, this is not optimal: we design an accelerated algorithm which attains an epsilon-approximate minimizer with roughly r^{-2/3} \log(1/epsilon) oracle queries, and give a matching lower bound. Further, we implement ball optimization oracles for functions with a locally stable Hessian using a variant of Newton's method and, in certain cases, stochastic first-order methods. The resulting algorithms apply to a number of problems of practical and theoretical import, improving upon previous results for logistic and
l*infinity regression and achieving guarantees comparable to the
state-of-the-art for l*p regression.

**Convex optimization based on global lower second-order models**

Nikita Doikov · Yurii Nesterov

In this work, we present new second-order algorithms for composite convex optimization, called Contracting-domain Newton methods. These algorithms are affine-invariant and based on global second-order lower approximation for the smooth component of the objective. Our approach has an interpretation both as a second-order generalization of the conditional gradient method, or as a variant of trust-region scheme. Under the assumption, that the problem domain is bounded, we prove $O(1/k^2)$ global rate of convergence in functional residual, where $k$ is the iteration counter, minimizing convex functions with Lipschitz continuous Hessian. This significantly improves the previously known bound $O(1/k)$ for this type of algorithms. Additionally, we propose a stochastic extension of our method, and present computational results for solving empirical risk minimization problem.

**Adam with Bandit Sampling for Deep Learning**

Rui Liu · Tianyi Wu · Barzan Mozafari

Adam is a widely used optimization method for training deep learning models. It computes individual adaptive learning rates for different parameters. In this paper, we propose a generalization of Adam, called Adambs, that allows us to also adapt to different training examples based on their importance in the model's convergence. To achieve this, we maintain a distribution over all examples, selecting a mini-batch in each iteration by sampling according to this distribution, which we update using a multi-armed bandit algorithm. This ensures that examples that are more beneficial to the model training are sampled with higher probabilities. We theoretically show that Adambs improves the convergence rate of Adam---$O(\sqrt{\frac{\log n}{T} })$ instead of $O(\sqrt{\frac{n}{T}})$ in some cases. Experiments on various models and datasets demonstrate Adambs's fast convergence in practice.

**Explore Aggressively, Update Conservatively: Stochastic Extragradient Methods with Variable Stepsize Scaling**

Yu-Guan Hsieh · Franck Iutzeler · Jérôme Malick · Panayotis Mertikopoulos

Owing to their stability and convergence speed, extragradient methods have become a staple for solving large-scale saddle-point problems in machine learning. The basic premise of these algorithms is the use of an extrapolation step before performing an update; thanks to this exploration step, extra-gradient methods overcome many of the non-convergence issues that plague gradient descent/ascent schemes. On the other hand, as we show in this paper, running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models. To overcome this failure, we investigate a double stepsize extragradient algorithm where the exploration step evolves at a more aggressive time-scale compared to the update step. We show that this modification allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.

**IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method**

Yossi Arjevani · Joan Bruna · Bugra Can · Mert Gurbuzbalaban · Stefanie Jegelka · Hongzhou Lin

We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known decentralized algorithms including EXTRA and SSDA. When coupled with accelerated gradient descent, our framework yields a novel primal algorithm whose convergence rate is optimal and matched by recently derived lower bounds. We provide experimental results that demonstrate the effectiveness of the proposed algorithm on highly ill-conditioned problems.

**Revisiting Frank-Wolfe for Polytopes: Strict Complementarity and Sparsity**

Dan Garber

In recent years it was proved that simple modifications of the classical Frank-Wolfe algorithm (aka conditional gradient algorithm) for smooth convex minimization over convex and compact polytopes, converge with linear rate, assuming the objective function has the quadratic growth property. However, the rate of these methods depends explicitly on the dimension of the problem which cannot explain their empirical success for large scale problems. In this paper we first demonstrate that already for very simple problems and even when the optimal solution lies on a low-dimensional face of the polytope, such dependence on the dimension cannot be avoided in worst case. We then revisit the addition of a strict complementarity assumption already considered in Wolfe's classical book \cite{Wolfe1970}, and prove that under this condition, the Frank-Wolfe method with away-steps and line-search converges linearly with rate that depends explicitly only on the dimension of the optimal face, hence providing a significant improvement in case the optimal solution is sparse. We motivate this strict complementarity condition by proving that it implies sparsity-robustness of optimal solutions to noise.

**Minibatch Stochastic Approximate Proximal Point Methods**

Hilal Asi · Karan Chadha · Gary Cheng · John Duchi

We extend the Approximate-Proximal Point (aProx) family of model-based methods for solving stochastic convex optimization problems, including stochastic subgradient, proximal point, and bundle methods, to the minibatch setting. To do this, we propose two minibatched algorithms for which we prove a non-asymptotic upper bound on the rate of convergence, revealing a linear speedup in minibatch size. In contrast to standard stochastic gradient methods, these methods may have linear speedup in the minibatch setting even for non-smooth functions. Our algorithms maintain the desirable traits characteristic of the aProx family, such as robustness to initial step size choice. Additionally, we show improved convergence rates for "interpolation" problems, which (for example) gives a new parallelization strategy for alternating projections. We corroborate our theoretical results with extensive empirical testing, which demonstrates the gains provided by accurate modeling and minibatching.

**Finding Second-Order Stationary Points Efficiently in Smooth Nonconvex Linearly Constrained Optimization Problems**

Songtao Lu · Meisam Razaviyayn · Bo Yang · Kejun Huang · Mingyi Hong

This paper proposes two efficient algorithms for computing approximate second-order stationary points (SOSPs) of problems with generic smooth non-convex objective functions and generic linear constraints. While finding (approximate) SOSPs for the class of smooth non-convex linearly constrained problems is computationally intractable, we show that generic problem instances in this class can be solved efficiently. Specifically, for a generic problem instance, we show that certain strict complementarity (SC) condition holds for all Karush-Kuhn-Tucker (KKT) solutions. Based on this condition, we design an algorithm named Successive Negative-curvature grAdient Projection (SNAP), which performs either conventional gradient projection or some negative curvature-based projection steps to find SOSPs. SNAP is a second-order algorithm that requires $\widetilde{\mathcal{O}}(\max\{1/\epsilon^2_G,1/\epsilon^3_H\})$ iterations to compute an $(\epsilon_G,\epsilon_H)$-SOSP, where $\widetilde{\mathcal{O}}$ hides the iteration complexity for eigenvalue-decomposition. Building on SNAP, we propose a first-order algorithm, named SNAP$^+$, that requires $\mathcal{O}(1/\epsilon^{2.5})$ iterations to compute $(\epsilon, \sqrt{\epsilon})$-SOSP. The per-iteration computational complexities of our algorithms are polynomial in the number of constraints and problem dimension. To the best of our knowledge, this is the first time that first-order algorithms with polynomial per-iteration complexity and global sublinear rate are designed to find SOSPs of the important class of non-convex problems with linear constraints (almost surely).

**Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms**

Dheeraj Nagaraj · Xian Wu · Guy Bresler · Prateek Jain · Praneeth Netrapalli

We study the problem of least squares linear regression where the datapoints are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tmix$, the mixing time of the underlying Markov chain, under different noise settings. Our results establish that in general, optimization with Markovian data is strictly harder than optimization with independent data and a trivial algorithm (SGD-DD) that works with only one in every $\tmix$ samples, which are approximately independent, is minimax optimal. In fact, it is strictly better than the popular Stochastic Gradient Descent (SGD) method with constant step-size which is otherwise minimax optimal in the regression with independent data setting. Beyond a worst case analysis, we investigate whether structured datasets seen in practice such as Gaussian auto-regressive dynamics can admit more efficient optimization schemes. Surprisingly, even in this specific and natural setting, Stochastic Gradient Descent (SGD) with constant step-size is still no better than SGD-DD. Instead, we propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate. Our improved rate serves as one of the first results where an algorithm outperforms SGD-DD on an interesting Markov chain and also provides one of the first theoretical analyses to support the use of experience replay in practice.

**Linearly Converging Error Compensated SGD**

Eduard Gorbunov · Dmitry Kovalev · Dmitry Makarenko · Peter Richtarik

In this paper, we propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates. Our framework is general enough to cover different variants of quantized SGD, Error-Compensated SGD (EC-SGD), and SGD with delayed updates (D-SGD). Via single theorem, we derive the complexity results for all the methods that fit our framework. For the existing methods, this theorem gives the best-known complexity results. Moreover, using our general scheme, we develop new variants of SGD that combine variance reduction or arbitrary sampling with error feedback and quantization and derive the convergence rates for these methods beating the state-of-the-art results. In order to illustrate the strength of our framework, we develop 16 new methods that fit this. In particular, we propose the first method called EC-SGD-DIANA that is based on error-feedback for biased compression operator and quantization of gradient differences and prove the convergence guarantees showing that EC-SGD-DIANA converges to the exact optimum asymptotically in expectation with constant learning rate for both convex and strongly convex objectives when workers compute full gradients of their loss functions. Moreover, for the case when the loss function of the worker has the form of finite sum, we modified the method and got a new one called EC-LSVRG-DIANA which is the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with constant learning rate.

**Learning Augmented Energy Minimization via Speed Scaling**

Etienne Bamas · Andreas Maggiori · Lars Rohwedder · Ola Svensson

As power management has become a primary concern in modern data centers, computing resources are being scaled dynamically to minimize energy consumption. We initiate the study of a variant of the classic online speed scaling problem, in which machine learning predictions about the future can be integrated naturally. Inspired by recent work on learning-augmented online algorithms, we propose an algorithm which incorporates predictions in a black-box manner and outperforms any online algorithm if the accuracy is high, yet maintains provable guarantees if the prediction is very inaccurate. We provide both theoretical and experimental evidence to support our claims.