Timezone: »

Workshop
OPT 2022: Optimization for Machine Learning
Courtney Paquette · Sebastian Stich · Quanquan Gu · Cristóbal Guzmán · John Duchi

Sat Dec 03 06:55 AM -- 02:50 PM (PST) @ Room 295 - 296

OPT 2022 will bring experts in optimization to share their perspectives while leveraging crossover experts in ML to share their views and recent advances. OPT 2022 honors this tradition of bringing together people from optimization and from ML in order to promote and generate new interactions between the two communities.

To foster the spirit of innovation and collaboration, a goal of this workshop, OPT 2022 will focus the contributed talks on research in Reliable Optimization Methods for ML. Many optimization algorithms for ML were originally developed with the goal of handling computational constraints (e.g., stochastic gradient based algorithms). Moreover, the analyses of these algorithms followed the classical optimization approach where one measures the performances of algorithms based on (i) the computation cost and (ii) convergence for any input into the algorithm. As engineering capabilities increase and the wide adoption of ML into many real world usages, practitioners of ML are seeking optimization algorithms that go beyond finding the minimizer with the fastest algorithm. They want reliable methods that solve real-world complications that arise. For example, increasingly bad actors are attempting to fool models with deceptive data. This leads to questions such as what algorithms are more robust to adversarial attacks and can one design new algorithms that can thwart these attacks? The latter question motivates a new area of optimization focusing on game-theoretic environments, that is, environments where there are competing forces at play and devising guarantees. Beyond this, a main reason for the success of ML is that optimization algorithms seemingly generate points that learn from training data; that is, we want minimizers of training data to provide meaningful interpretations on new data (generalization) yet we do not understand what features (e.g., loss function, algorithm, depth of the architectures (deep learning), and/or training samples) yield better generalization properties. These new areas of solving practical ML problems and their deep ties to the optimization community warrants a necessary discussion between the two communities. Specifically, we aim to discuss the meanings of generalization as well as the challenges facing real-world applications of ML and the new paradigms for optimizers seeking to solve them.

Plenary Speakers: All invited speakers have agreed to coming in-person to the workshop.

* Niao He (ETH, Zurich, assistant professor)

* Zico Kolter (Carnegie Mellon University, associate professor)

* Lorenzo Rosasco (U Genova/MIT, assistant professor)

* Katya Scheinberg (Cornell, full professor)

* Aaron Sidford (Stanford, assistant professor)

 Sat 6:55 a.m. - 7:00 a.m. Welcome Remarks (Intro) Courtney Paquette 🔗 Sat 7:00 a.m. - 7:30 a.m. Katya Scheinberg, Stochastic Oracles and Where to Find Them (Plenary Speaker)    Title: Stochastic Oracles and Where to Find Them Abstract: Continuous optimization is a mature field, which has recently undergone major expansion and change. One of the key new directions is the development of methods that do not require exact information about the objective function. Nevertheless, the majority of these methods, from stochastic gradient descent to "zero-th order" methods use some kind of approximate first order information. We will introduce a general definition of a stochastic oracle and show how this definition applies in a variety of familiar settings, including simple stochastic gradient via sampling, traditional and randomized finite difference methods, but also more specialized settings, such as robust gradient estimation. We will also overview several stochastic methods and how the general definition extends to the oracles used by these methods. Katya Scheinberg 🔗 Sat 7:30 a.m. - 8:00 a.m. Contributed Talks 1 (Contributed talks)    Two (15 mins) contributed talks in this session. Tian Li, Differentially Private Adaptive Optimization with Delayed Preconditioners Guy Kornowski, On the Complexity of Finding Small Subgradients in Nonsmooth Optimization Courtney Paquette · Tian Li · Guy Kornowski 🔗 Sat 8:00 a.m. - 9:00 a.m. Poster Session 1 (Poster Session) Andrew Lowy · Thomas Bonnier · Yiling Xie · Guy Kornowski · Simon Schug · Seungyub Han · Nicolas Loizou · xinwei zhang · Laurent Condat · Tabea E. Röber · Si Yi Meng · Marco Mondelli · Runlong Zhou · Eshaan Nichani · Adrian Goldwaser · Rudrajit Das · Kayhan Behdin · Atish Agarwala · Mukul Gagrani · Gary Cheng · Tian Li · Haoran Sun · Hossein Taheri · Allen Liu · Siqi Zhang · Dmitrii Avdiukhin · Bradley Brown · Miaolan Xie · Junhyung Lyle Kim · Sharan Vaswani · Xinmeng Huang · Ganesh Ramachandra Kini · Angela Yuan · Weiqiang Zheng · Jiajin Li 🔗 Sat 9:00 a.m. - 9:30 a.m. Contributed Talks 2 (Contributed talks)    Two (15 min) contributed talks in this session. Aaron Defazio, Parameter Free Dual Averaging: Optimizing Lipschitz Functions in a Single Pass Jiajin Li, Nonsmooth Composite Nonconvex-Concave Minimax Optimization Quanquan Gu · Aaron Defazio · Jiajin Li 🔗 Sat 9:30 a.m. - 10:00 a.m. Niao He, Simple Fixes for Adaptive Gradient Methods for Nonconvex Min-Max Optimization (Plenary Speaker)    Title: Simple Fixes for Adaptive Gradient Methods for Nonconvex Min-Max Optimization Abstract: Adaptive gradient methods such as AdaGrad and Adam have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner and are successful in nonconvex minimization. When it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. In fact, even for a quadratic example, the naive combination of Gradient Descent Ascent with any existing adaptive stepsizes is proven to diverge if the initial primal-dual stepsize ratio is not carefully chosen. We introduce two simple fixes for these adaptive methods, allowing automatic adaptation to the time-scale separation necessary for fast convergence. The resulting algorithms are fully parameter-agnostic and achieve near-optimal complexities in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems, without a priori knowledge about problem-specific parameters. This is based on joint work with Junchi Yang and Xiang Li. Niao He 🔗 Sat 10:00 a.m. - 12:00 p.m. Lunch (Break) 🔗 Sat 12:00 p.m. - 12:30 p.m. Zico Kolter, Adapt like you train: How optimization at training time affects model finetuning and adaptation (Plenary Speaker)    Title: Adapt like you train: How optimization at training time affects model finetuning and adaptation Abstract: With the growing use of large-scale machine learning models pretrained on massive datasets, it is becoming increasingly important to understand how we can efficiently adapt these models to downstream tasks at test time. In this talk, I will discuss our recent work that highlights an important but often overlooked factor in this process: specifically, we have found in several cases that the loss function used to train the model has important implications as to the best way to finetune or adapt the model. I will highlight two specific examples of this phenomenon: 1) illustrating that using contrastive loss outperforms alternatives for fine-tuning contrastively-pretrained vision-language models; and 2) showing how we can leverage the convex conjugate of the training loss to perform label-free test time adaptation. I will end by highlighting open questions and directions for this work. J. Zico Kolter 🔗 Sat 12:30 p.m. - 1:15 p.m. Contributed Talks 3 (Contributed talks)    Three (15 min) contributed talks in this session. Fangshuo Liao, Strong Lottery Ticket Hypothesis with \epsilon–Perturbation Vishwak Srinivasan, Sufficient conditions for non-asymptotic convergence of Riemannian optimization methods Zhiyuan Li, How Does Sharpness-Aware Minimization Minimizes Sharpness? Cristóbal Guzmán · Fangshuo Liao · Vishwak Srinivasan · Zhiyuan Li 🔗 Sat 1:15 p.m. - 1:45 p.m. Aaron Sidford, Efficiently Minimizing the Maximum Loss (Plenary Speaker)    Title: Efficiently Minimizing the Maximum Loss Abstract: In this talk I will discuss recent advances in the fundamental robust optimization problem of minimizing the maximum of a finite number of convex loss functions. In particular I will show how to develop stochastic methods for approximately solving this problem with a near-optimal number of gradient queries. Along the way, I will cover several optimization techniques of broader utility, including accelerated methods for using ball-optimization oracles and stochastic bias-reduced gradient methods. This talk will include joint work with Hilal Asi, Yair Carmon, Arun Jambulapati, and Yujia Jin including https://arxiv.org/abs/2105.01778 and https://arxiv.org/abs/2106.09481. Aaron Sidford 🔗 Sat 1:45 p.m. - 1:50 p.m. Closing Remarks (Closing) Courtney Paquette 🔗 Sat 1:50 p.m. - 2:50 p.m. Poster Session 2 (Poster Session) Jinwuk Seok · Bo Liu · Ryotaro Mitsuboshi · David Martinez-Rubio · Weiqiang Zheng · Ilgee Hong · Chen Fan · Kazusato Oko · Bo Tang · Miao Cheng · Aaron Defazio · Tim G. J. Rudner · Gabriele Farina · Vishwak Srinivasan · Ruichen Jiang · Peng Wang · Jane Lee · Nathan Wycoff · Nikhil Ghosh · Yinbin Han · David Mueller · Liu Yang · Amrutha Varshini Ramesh · Siqi Zhang · Kaifeng Lyu · David Yunis · Kumar Kshitij Patel · Fangshuo Liao · Dmitrii Avdiukhin · Xiang Li · Sattar Vakili · Jiaxin Shi 🔗 - A Finite-Particle Convergence Rate for Stein Variational Gradient Descent (Poster)  link »    We provide a first finite-particle convergence rate for Stein variational gradient descent (SVGD). Specifically, with certain choices of step size sequences, SVGD with n particles drive the kernel Stein discrepancy to zero at the rate $O\left(\frac{1}{\sqrt{\log\log n}} \right)$. We suspect that the dependence on $n$ can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements. Link » Jiaxin Shi · Lester Mackey 🔗 - Data-heterogeneity-aware Mixing for Decentralized Learning (Poster)  link »    Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches towards decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes. We propose a metric that quantifies the ability of a graph to mix the current gradients. We further prove that the metric controls the convergence rate, particularly in settings where the heterogeneity across nodes dominates the stochasticity between updates for a given node. Motivated by our analysis, we propose an approach that periodically and efficiently optimizes the metric using standard convex constrained optimization and sketching techniques. Link » Yatin Dandi · Anastasiia Koloskova · Martin Jaggi · Sebastian Stich 🔗 - Rieoptax: Riemannian Optimization in JAX (Poster)  link »    We present Rieoptax, an open source Python library for Riemannian optimization in JAX. We show that many differential geometric primitives, such as Riemannian exponential and logarithm maps, are usually faster in Rieoptax than existing frameworks in Python, both on CPU and GPU. We support various range of basic and advanced stochastic optimization solvers like Riemannian stochastic gradient, stochastic variance reduction, and adaptive gradient methods. A distinguishing feature of the proposed toolbox is that we also support differentially private optimization on Riemannian manifolds. Link » Saiteja Utpala · Andi Han · Pratik Kumar Jawanpuria · Bamdev Mishra 🔗 - TorchOpt: An Efficient library for Differentiable Optimization (Poster)  link »    Recent years have witnessed the booming of various differentiable optimization algorithms. These algorithms exhibit different execution patterns, and their execution needs massive computational resources that go beyond a single CPU and GPU. Existing differentiable optimization libraries, however, cannot support efficient algorithm development and multi-CPU/GPU execution, making the development of differentiable optimization algorithms often cumbersome and expensive.This paper introduces TorchOpt, a PyTorch-based efficient library for differentiable optimization. TorchOpt provides a unified and expressive bi-level optimization programming abstraction. This abstraction allows users to efficiently declare and analyze various differentiable optimization programs with explicit gradients, implicit gradients, and zero-order gradients. TorchOpt further provides a high-performance distributed execution runtime. This runtime can fully parallelize computation-intensive differentiation operations (e.g. tensor tree flatten) on CPUs/GPUs and automatically distribute computation to distributed devices. Experimental results show that TorchOpt outperforms state-of-the-art libraries by $7\times$ on an 8-GPU server. TorchOpt will be open source with a permissive Apache-2.0 License. Link » Jie Ren · Xidong Feng · Bo Liu · Xuehai Pan · Yao Fu · Luo Mai · Yaodong Yang 🔗 - On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data (Poster)  link » It has been empirically observed that training large models with weighted cross-entropy (CE) beyond the zero-training-error regime is not a satisfactory remedy for label-imbalanced data. Instead, researchers have proposed the vector-scaling (VS) loss, as a parameterization of the CE loss that is tailored to this modern training regime. The driving force to understand the impact of such parameterizations on the gradient-descent path has been the theory of implicit bias. Specifically for linear(ized) models, this theory allows to explain why weighted CE fails and how the VS-loss biases the optimization path towards solutions that favor minorities. However, beyond linear models the description of implicit bias is more obscure. In order to gain insights on the impact of different CE-parameterizations in non-linear models, we investigate their implicit geometry of learnt classifiers and embeddings. Our main result characterizes the global minimizers of a non-convex cost-sensitive SVM classifier for the so-called unconstrained features model, which serves as an abstraction of deep models. We also study empirically the convergence of SGD to this global minimizer observing slow-downs with increasing imbalance ratios and scalings of the loss hyperparameters. Link » Tina Behnia · Ganesh Ramachandra Kini · Vala Vakilian · Christos Thrampoulidis 🔗 - Rethinking Sharpness-Aware Minimization as Variational Inference (Poster)  link »    Sharpness-aware minimisation (SAM) aims to improve the generalisation of gradient-based learn-ing by seeking out flat minima. In this work, we establish connections between SAM and mean-field variational inference (MFVI) of neural network parameters. We show that both these methodshave interpretations as optimizing notions of flatness, and when using the reparametrisation trick,they both boil down to calculating the gradient at a perturbed version of the current mean param-eter. This thinking motivates our study of algorithms that combine or interpolate between SAMand MFVI. We evaluate the proposed variational algorithms on several benchmark datasets, andcompare their performance to variants of SAM. Taking a broader perspective, our work suggeststhat SAM-like updates can be used as a drop-in replacement for the reparametrisation trick. Link » Szilvia Ujváry · Zsigmond Telek · Anna Kerekes · Anna Mészáros · Ferenc Huszar 🔗 - TiAda: A Time-scale Adaptive Algorithm For Nonconvex Minimax Optimization (Poster)  link » Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications. Link » Xiang Li · Junchi YANG · Niao He 🔗 - TiAda: A Time-scale Adaptive Algorithm For Nonconvex Minimax Optimization (Oral)  link »    Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications. Link » Xiang Li · Junchi YANG · Niao He 🔗 - Toward Understanding Why Adam Converges Faster Than SGD for Transformers (Poster)  link »    While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some deep learning applications such as training transformers. However, it remains a question why Adam converges significantly faster than SGD in these scenarios. In this paper, we explore one explanation of why Adam converges faster than SGD using a new concept directional sharpness. We argue that the performance of optimization algorithms is closely related to the directional sharpness of the update steps, and show SGD has much worse directional sharpness compared to adaptive algorithms. We further observe that only a small fraction of the coordinates causes the bad sharpness and slow convergence of SGD, and propose to use coordinate-wise clipping as a solution to SGD and other optimization algorithms. We demonstrate the effect of coordinate-wise clipping in sharpness reduction and speeding up the convergence of optimization algorithms under various settings, and conclude that the sharpness reduction effect of adaptive coordinate-wise scaling is the reason for Adam’s success in practice. Link » Yan Pan · Yuanzhi Li 🔗 - Bidirectional Adaptive Communication for Heterogeneous Distributed Learning (Poster)  link » Communication is a key bottleneck in distributed optimization, and, in particular, bandwidth and latency can be limiting factors when devices are connected over commodity networks, such as in Federated Learning. State-of-the-art techniques tackle these challenges by advanced compression techniques or delaying communication rounds according to predefined schedules. We present a new scheme that adaptively skips communication (broadcast and client uploads) by detecting slow-varying updates. The scheme automatically adjusts the communication frequency independently for each worker and the server. By utilizing an error-feedback mechanism~-- borrowed from the compression literature~--~we prove that the convergence rate is the same as for batch gradient descent %strongly-convex, in the convex and nonconvex smooth cases. We show that the total number of communication rounds between server and clients needed to achieve a targeted accuracy is reduced, even in the case when the data distribution is highly non-IID. Link » Dmitrii Avdiukhin · Vladimir Braverman · Nikita Ivkin · Sebastian Stich 🔗 - How Sharpness-Aware Minimization Minimizes Sharpness? (Poster)  link » Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes another notion of sharpness, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the implicit alignment between the gradient and the top eigenvector of Hessian when running SAM. Link » Kaiyue Wen · Tengyu Ma · Zhiyuan Li 🔗 - How Sharpness-Aware Minimization Minimizes Sharpness? (Oral)  link » Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes another notion of sharpness, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the implicit alignment between the gradient and the top eigenvector of Hessian when running SAM. Link » Kaiyue Wen · Tengyu Ma · Zhiyuan Li 🔗 - Strong Lottery Ticket Hypothesis with $\epsilon$–perturbation (Poster)  link »    The strong Lottery Ticket Hypothesis (LTH) claims that there exists a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural networks without the need of training. This work extends the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change achieved in the pre-training step to some perturbation around the initialization.In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change by SGD coincide with a good set of such perturbation? Link » Fangshuo Liao · Zheyang Xiong · Anastasios Kyrillidis 🔗 - Strong Lottery Ticket Hypothesis with $\epsilon$–perturbation (Oral)  link »    The strong Lottery Ticket Hypothesis (LTH) claims that there exists a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural networks without the need of training. This work extends the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change achieved in the pre-training step to some perturbation around the initialization.In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change by SGD coincide with a good set of such perturbation? Link » Fangshuo Liao · Zheyang Xiong · Anastasios Kyrillidis 🔗 - Optimization for Robustness Evaluation beyond ℓp Metrics (Poster)  link »    Empirical evaluations of neural network models against adversarial attacks entail solving nontrivial constrained optimization problems. Popular algorithms for solving these constrained problems rely on projected gradient descent (PGD) and require careful tuning of multiple hyperparameters. Moreover, PGD can only handle $\ell_1$, $\ell_2$, and $\ell_\infty$ attacks due to the use of analytical projectors. In this paper, we introduce an alternative algorithmic framework that blends a general-purpose constrained-optimization solver \pygranso, \textbf{W}ith \textbf{C}onstraint-\textbf{F}olding (PWCF), to add reliability and generality to the existing adversarial evaluations. PWCF 1) finds good-quality solutions without delicate tuning of multiple hyperparameters; 2) can handle general attack models which are inaccessible to the existing algorithms, e.g., $\ell_{p > 0}$, and perceptual attacks. Link » Hengyue Liang · Buyun Liang · Ying Cui · Tim Mitchell · Ju Sun 🔗 - Neural Networks Efficiently Learn Low-Dimensional Representations with SGD (Poster)  link »    We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u_1},\boldsymbol{x}\rangle,...,\langle\boldsymbol{u_k},\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u_1},...,\boldsymbol{u_k}$ of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $\mathcal{O}(\sqrt{{kd}/{T}})$ after $T$ iterations of SGD, which is independent of the width of the NN. We further demonstrate that, SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth, and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ sample requirement to learn any degree $p$ polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Link » Alireza Mousavi-Hosseini · Sejun Park · Manuela Girotti · Ioannis Mitliagkas · Murat Erdogdu 🔗 - Nesterov Meets Optimism: Rate-Optimal Optimistic-Gradient-Based Method for Stochastic Bilinearly-Coupled Minimax Optimization (Poster)  link »    We provide a novel first-order optimization for bilinearly-coupled strongly-convex-concave minimax optimization we call Accelerated Optimistic Gradient (AG-OG). The main idea of our algorithm is to leverage the structure of the considered minimax problem by using Nesterov’s acceleration on the individual parts and optimism on the coupling term of the objective. We first motivate our method by showing that the dynamics of its continuous version correspond to a linear combination of the ODEs of the dynamics of the Optimistic Gradient and the dynamics of Nesterov‘s acceleration. This continuous time approach allows us to showcase the main properties of our method that will eventually guide our analysis in the discrete case. Furthermore by properly restarting AG-OG, we show that we can achieve optimal (up to a constant) convergence rates with respect to the conditioning of the coupling and individual parts in the stochastic and deterministic cases. Link » Chris Junchi Li · Angela Yuan · Gauthier Gidel · Michael Jordan 🔗 - Distributed Online and Bandit Convex Optimization (Poster)  link »    We study the problems of distributed online and bandit convex optimization against an adaptive adversary. We want to minimize the average regret on M machines that communicate R times intermittently. We show that collaboration is not beneficial if the machines can access gradients of the cost functions at each time step, i.e., they have first-order feedback. In this setting, simple non-collaborative algorithms are min-max optimal. This contrasts with the provable benefit of collaboration when optimizing against a stochastic adversary, which samples the cost functions from fixed distributions. To identify the benefit of collaboration, we consider the harder setting where the machines can only access values of their cost function, i.e., they have bandit feedback. Here, we identify the high-dimensional regime where collaboration is beneficial and may even lead to a linear speedup in the number of machines. Our results bridge the gap between distributed online optimization against stochastic and adaptive adversaries. Link » Kumar Kshitij Patel · Aadirupa Saha · Nati Srebro · Lingxiao Wang 🔗 - On Convexity and Linear Mode Connectivity in Neural Networks (Poster)  link »    In many cases, neural networks trained with stochastic gradient descent (SGD) that share an early and often small portion of the training trajectory have solutions connected by a linear path of low loss. This phenomenon, called linear mode connectivity (LMC), has been leveraged for pruning and model averaging in large neural network models, but it is not well understood how broadly or why it occurs. LMC suggests that SGD trajectories somehow end up in a \textit{convex"} region of the loss landscape and stay there. In this work, we confirm that this eventually does happen by finding a high-dimensional convex hull of low loss between the endpoints of several SGD trajectories. But to our surprise, simple measures of convexity do not show any obvious transition at the point when SGD will converge into this region. To understand this convex hull better, we investigate the functional behaviors of its endpoints. We find that only a small number of correct predictions are shared between all endpoints of a hull, and an even smaller number of correct predictions are shared between the hulls, even when the final accuracy is high for every endpoint. Thus, we tie LMC more tightly to convexity, and raise several new questions about the source of this convexity in neural network optimization. Link » David Yunis · Kumar Kshitij Patel · Pedro Savarese · Gal Vardi · Jonathan Frankle · Matthew Walter · Karen Livescu · Michael Maire 🔗 - Optimal Complexity in Non-Convex Decentralized Learning over Time-Varying Networks (Poster)  link »    Decentralized optimization with time-varying networks is an emerging paradigm in machine learning. It saves remarkable communication overhead in large-scale deep training and is more robust in wireless scenarios especially when nodes are moving. Federated learning can also be regarded as decentralized optimization with time-varying communication patterns alternating between global averaging and local updates.While numerous studies exist to clarify its theoretical limits and develop efficient algorithms, it remains unclear what the optimal complexity is for non-convex decentralized stochastic optimization over time-varying networks. The main difficulties lie in how to gauge the effectiveness when transmitting messages between two nodes via time-varying communications, and how to establish the lower bound when the network size is fixed (which is a prerequisite in stochastic optimization). This paper resolves these challenges and establish the first lower bound complexity. We also develop a new decentralized algorithm to nearly attain the lower bound, showing the tightness of the lower bound and the optimality of our algorithm. Link » Xinmeng Huang · Kun Yuan 🔗 - Target-based Surrogates for Stochastic Optimization (Poster)  link »    We consider minimizing functions for which it is computationally expensive to query the (stochastic) gradient. Such functions are prevalent in applications like reinforcement learning, online imitation learning and bilevel optimization. We exploit the composite structure in these functions and propose a target optimization framework. Our framework leverages the smoothness of the loss with respect to an intermediate target space (e.g. the output of a neural network model), and uses gradient information to construct surrogate functions. In the full-batch setting, we prove that the surrogate function is a global upper-bound on the overall loss, and can be (locally) minimized using any black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point at an $O\left(\frac{1}{T}\right)$ rate, thus matching gradient descent. In the stochastic setting, we propose a stochastic surrogate optimization (SSO) algorithm that can be viewed as projected stochastic gradient descent in the target space. We leverage this connection in order to prove that SSO can match the SGD rate for strongly-convex functions. Experimentally, we evaluate the SSO algorithm on convex supervised learning losses and show competitive performance compared to SGD and its variants. Link » Jonathan Lavington · Sharan Vaswani · Reza Babanezhad Harikandeh · Mark Schmidt · Nicolas Le Roux 🔗 - Why (and When) does Local SGD Generalize Better than SGD? (Poster)  link »    Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically. It has been recently observed that Local SGD can not only achieve the design goal of reducing the communication overhead but also lead to higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), though the training regimes for this to happen are still in debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation. The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD with a small learning rate, after approaching the manifold of minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that having small learning rate and long enough training time enables the generalization improvement over SGD but removing either of the two conditions leads to no improvement. Link » Xinran Gu · Kaifeng Lyu · Longbo Huang · Sanjeev Arora 🔗 - Momentum Extragradient is Optimal for Games with Cross-Shaped Spectrum (Poster)  link »    The extragradient method has recently gained a lot of attention, due to its convergence behavior on smooth games. In games, the eigenvalues of the Jacobian of the vector field are distributed on the complex plane, exhibiting more convoluted dynamics compared to minimization. In this work, we take a polynomial-based analysis of the extragradient with momentum for optimizing games with \emph{cross-shaped} spectrum on the complex plane. We show two results: first, the extragradient with momentum exhibits three different modes of convergence based on the hyperparameter setup: when the eigenvalues are distributed $(i)$ on the real line, $(ii)$ both on the real line along with complex conjugates, and $(iii)$ only as complex conjugates. Then, we focus on the case $(ii)$, i.e., when the spectrum of the Jacobian has \emph{cross-shaped} structure, as observed in training generative adversarial networks. For this problem class, we derive the optimal parameters and show that the extragradient with momentum achieves accelerated convergence rate. Link » Junhyung Lyle Kim · Gauthier Gidel · Anastasios Kyrillidis · Fabian Pedregosa 🔗 - Stochastic Adaptive Regularization Method with Cubics: A High Probability Complexity Bound (Poster)  link »    We present a high probability complexity bound for a stochastic adaptive regularization method with cubics, also known as regularized Newton method. The method makes use of stochastic zeroth, first and second-order oracles that satisfy certain accuracy and reliability assumptions. Such oracles have been used in the literature by other adaptive stochastic methods, such as trust region and line search. These oracles capture many settings, such as expected risk minimization, stochastic zeroth order optimization, and others. In this paper, we give the first high-probability iteration bound for stochastic cubic regularization and show that just as in the deterministic case, it is superior to other adaptive methods. Link » Katya Scheinberg · Miaolan Xie 🔗 - ProxSkip for Stochastic Variational Inequalities: A Federated Learning Algorithm for Provable Communication Acceleration (Poster)  link »    Recently Mishchenko et al. proposed and analyzed ProxSkip, a provably efficient method for minimizing the sum of a smooth $(f)$ and an expensive nonsmooth proximable $(R)$ function (i.e. $\min_{x \in \mathbb{R}^d} f(x) + R(x)$). The main advantage of ProxSkip, is that in the federated learning (FL) setting, offers provably an effective acceleration of communication complexity.This work extends this approach to the more general regularized variational inequality problems (VIP). In particular, we propose ProxSkip-VIP algorithm, which generalizes the original ProxSkip framework to VIP, and we provide convergence guarantees for a class of structured non-monotone problems. In the federated learning setting, we explain how our approach achieves acceleration in terms of the communication complexity over existing state-of-the-art FL algorithms. Link » Siqi Zhang · Nicolas Loizou 🔗 - DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD (Poster)  link »    Adaptive gradient methods, such as Adam, have shown faster convergence speed than SGD across various kinds of network models at the expense of inferior generalization performance. In this work, we proposed a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate the generalization gap. DRAG makes an elegant combination of SGD and Adam by adopting a trust-region like framework. We observe that 1) Adam adjusts stepsizes for each gradient coordinate according to some loss curvature, and indeed decomposes the $n$-dimensional gradient into $n$ standard basis directions to search; 2) SGD uniformly scales gradient for all gradient coordinates and actually has only one descent direction to minimize. Accordingly, DRAG reduces the high degree of freedom of Adam and also improves the flexibility of SGD via optimizing the loss along $k\ (\ll \! n)$ descent directions, e.g. the gradient direction and momentum direction used in this work. Then per iteration, DRAG finds the best stepsizes for $k$ descent directions by solving a trust-region subproblem whose computational overhead is negligible since the trust-region subproblem is low-dimensional, e.g. $k=2$ in this work. DRAG is compatible with the common deep learning training pipeline without introducing extra hyper-parameters and with negligible extra computation. Experimental results on representative benchmarks testify the fast convergence speed and also superior generalization of DRAG. Link » Jingyang Li · Pan Zhou · Kuangyu Ding · Kim-Chuan Toh · Yinyu Ye 🔗 - Fast Convergence of Greedy 2-Coordinate Updates for Optimizing with an Equality Constraint (Poster)  link »    In this work, we study the Block Coordinate Descent (BCD) algorithm with greedy block selection rule for minimizing functions with one linear equality constraint. We show a faster linear convergence rate for BCD with block size 2 (2-BCD) on functions satisfying the Polyak-Lojasiewicz inequality. Our analysis exploits similarity between the solutions of 2-BCD and equality-constrained steepest descent (SD) in the L1-norm. This yields a simple proof. Link » Amrutha Varshini Ramesh · Aaron Mishkin · Mark Schmidt 🔗 - Quadratic minimization: from conjugate gradients to an adaptive heavy-ball method with Polyak step-sizes (Poster)  link »    In this work, we propose an adaptive variation on the classical heavy-ball method for convex quadratic minimization. The adaptivity crucially relies on so-called Polyak step-sizes'', which consists in using the knowledge of the optimal value of the optimization problem at hand instead of problem parameters such as a few eigenvalues of the Hessian of the problem. This method happens to also be equivalent to a variation of the classical conjugate gradient method, and thereby inherits many of its attractive features, including its finite-time convergence, instance optimality, and its worst-case convergence rates.The classical gradient method with Polyak step-sizes is known to behave very well in situations in which it can be used, and the question of whether incorporating momentum in this method is possible and can improve the method itself appeared to be open.We provide a definitive answer to this question for minimizing convex quadratic functions, a arguably necessary first step for developing such methods in more general setups. Link » Baptiste Goujaud · Adrien Taylor · Aymeric Dieuleveut 🔗 - NCVX: A General-Purpose Optimization Solver for Constrained Machine and Deep Learning (Poster)  link »    Imposing explicit constraints are new and increasingly trendy in deep learning, stimulated by, e.g., trustworthy AI that performs robust optimization over complicated perturbation sets and scientific applications that need to respect physical laws and constraints. However, it can be hard to reliably solve constrained deep learning problems without optimization expertise. Existing deep learning frameworks do not admit constraints. General-purpose optimization packages can handle constraints but do not perform auto-differentiation and have trouble dealing with nonsmoothness. In this paper, we introduce a new software package called NCVX, whose initial release contains the solver PyGRANSO, a PyTorch-enabled general-purpose optimization package for constrained machine and deep learning problems, the first of its kind. NCVX inherits auto-differentiation, GPU acceleration, and tensor variables from PyTorch, is built on freely available and widely used open- source frameworks. NCVX is available at https://ncvx.org, with detailed documentation and numerous examples from machine/deep learning and other fields. Link » Buyun Liang · Tim Mitchell · Ju Sun 🔗 - A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets (Poster)  link »    Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of $\ell_2$ (not squared) norms of the input and output weights associated each ReLU. This alternative \emph{(and effectively equivalent)} regularization suggests a novel proximal gradient algorithm for network training. Theory and experiments support the new training approach, showing that it can converge much faster to the \emph{sparse} solutions it shares with standard weight decay training. Link » Liu Yang · Jifan Zhang · Joseph Shenouda · Dimitris Papailiopoulos · Kangwook Lee · Robert Nowak 🔗 - Policy gradient finds global optimum of nearly linear-quadratic control systems (Poster)  link »    We explore reinforcement learning methods for finding the optimal policy in the nearly linear-quadratic control systems. In particular, we consider a dynamic system composed of the summation of a linear and a nonlinear components, which is governed by a policy with the same structure. Assuming that the nonlinear part consists of kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. While the resulting landscape is generally nonconvex, we show local strong convexity and smoothness of the cost function around the global optimizer. In addition, we design a policy gradient algorithm with a carefully chosen initialization and prove that the algorithm is guaranteed to converge to the globally optimal policy with a linear rate. Link » Yinbin Han · Meisam Razaviyayn · Renyuan Xu 🔗 - Relating Regularization and Generalization through the Intrinsic Dimension of Activations (Poster)  link »    Given a pair of models with similar training set performance, it is natural to assume that the model that possesses simpler internal representations would exhibit better generalization. In this work, we provide empirical evidence for this intuition through an analysis of the intrinsic dimension (ID) of model activations, which can be thought of as the minimal number of factors of variation in the model's representation of the data. First, we show that common regularization techniques uniformly decrease the last-layer ID (LLID) of validation set activations for image classification models and show how this strongly affects generalization performance. We also investigate how excessive regularization decreases a model's ability to extract features from data in earlier layers, leading to a negative effect on validation accuracy even while LLID continues to decrease and training accuracy remains near-perfect. Finally, we examine the LLID over the course of training of models that exhibit grokking. We observe that well after training accuracy saturates, when models "grok" and validation accuracy suddenly improves from random to perfect, there is a co-occurent sudden drop in LLID, thus providing more insight into the dynamics of sudden generalization. Link » Bradley Brown · Jordan Juravsky · Anthony Caterini · Gabriel Loaiza-Ganem 🔗 - The Importance of Temperature in Multi-Task Optimization (Poster)  link »    The promise of multi-task learning is that optimizing a single model on multiplerelated tasks will lead to a better solution for all tasks than independently trained models.In practice, optimization difficulties, such as conflicting gradients, can result in negative transfer, where multi-task models which perform worse than single-task models.In this work, we identify the optimization temperature---the ratio of learning rate to batch size---asa key factor in negative transfer.Temperature controls the level of noise in each optimization step, which prior work has shown to havea strong correlation with generalization.We demonstrate that, in some multi-task settings, negative transfer may arise due to poorly set optimization temperature,rather than inherently high task conflict.The implication of this finding is that in some settings, SGD with a carefully controlledtemperature achieves comparable, and in some cases superior, performance tothat of specialized optimization procedures such as PCGrad, MGDA, and GradNorm.In particular, our results suggest that the significant additional computational burden of these specialized methods may not always be necessary.Finally, we observe a conflict between the optimal temperatures of different tasks in amulti-task objective, with different levels of noise promoting better generalization for differenttasks.Our work suggests the need for novel multi-task optimization methods which considerindividual task noise-levels, and their impact on generalization. Link » David Mueller · Mark Dredze · Nicholas Andrews 🔗 - Network Pruning at Scale: A Discrete Optimization Approach (Poster)  link »    Due to the ever-growing size of neural network models, there has been an emerging interest in compressing (i.e., pruning) neural networks by sparsifying weights in a pre-trained neural network, while maintaining the performance of dense model as much as possible. In this work, we focus on a neural network pruning framework based on local quadratic models of the loss function. We present an optimization-based approach with an $\ell_0$-regression formulation, and propose novel algorithms to obtain good solutions to the combinatorial optimization problem. In practice, our basic (single-stage) approach, based on one local quadratic model approximation, is up to $10^3$ times faster than existing methods while achieving similar accuracy. We also propose a multi-stage method that outperforms other methods in terms of accuracy for a given sparsity constraint while remaining computationally efficient. In particular, our approach results in a 98\% sparse (i.e., 98\% of weights in dense model are set to zero) MLPNet with 90\% test accuracy (i.e., 3\% reduction in test accuracy compared to the dense model), which is an improvement over the previous best accuracy (55\%). Link » Wenyu Chen · Riade Benbaki · Xiang Meng · Rahul Mazumder 🔗 - A Novel Stochastic Gradient Descent Algorithm for LearningPrincipal Subspaces (Poster)  link » In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can bescaled to datasets with an effectively infinite number of rows and columns. Our method consistsin defining a loss function whose minimizer is the desired principal subspace, and constructing agradient estimate of this loss whose bias can be controlled. Link » Charline Le Lan · Joshua Greaves · Jesse Farebrother · Mark Rowland · Fabian Pedregosa · Rishabh Agarwal · Marc Bellemare 🔗 - Nonsmooth Composite Nonconvex-Concave Minimax Optimization (Poster)    Nonconvex-concave minimax optimization has received intense interest in machine learning, including learning with robustness to data distribution, learning with non-decomposable loss, adversarial learning, to name a few. Nevertheless, most existing works focus on the gradient-descent-ascent (GDA) variants that can only be applied in smooth settings. In this paper, we consider a family of minimax problems whose objective function enjoys the nonsmooth composite structure in the variable of minimization and is concave in the variables of maximization. By fully exploiting the composite structure, we propose a smoothed proximal linear descent ascent (\textit{smoothed} PLDA) algorithm and further establish its $\mathcal{O}(\epsilon^4)$ iteration complexity, which matches that of smoothed GDA~\cite{zhang2020single} under smooth settings. Moreover, under the mild assumption that the objective function satisfies the one-sided Kurdyka-Lojasiewicz condition with exponent $\theta \in (0,1)$, we can further improve the iteration complexity to $\mathcal{O}(\epsilon^{-2 \max\{2 \theta, 1\}})$. To the best of our knowledge, this is the first provably efficient algorithm for nonsmooth nonconvex-concave problems that can achieve the optimal iteration complexity $\mathcal{O}(\epsilon^{-2})$ if $\theta \in (0, 1/2]$. Jiajin Li · Linglingzhi Zhu · Anthony Man-Cho So 🔗 - Decentralized Stochastic Optimization with Client Sampling (Poster)  link »    Decentralized optimization is a key setting toward enabling data privacy and on-device learning over networks.Existing research primarily focuses on distributing the objective function across $n$ nodes/clients, lagging behind the real-world challenges such as i) node availability---not all $n$ nodes are always available during the optimization---and ii) slow information propagation (caused by a large number of nodes $n$). In this work, we study Decentralized Stochastic Gradient Descent (D-SGD) with node subsampling, i.e. when only $s~(s \leq n)$ nodes are randomly sampled out of $n$ nodes per iteration.We provide the theoretical convergence rates in smooth (convex and non-convex) problems with heterogeneous (non-identically distributed data) functions.Our theoretical results capture the effect of node subsampling and choice of the topology on the sampled nodes, through a metric termed \emph{the expected consensus rate}.On a number of common topologies, including ring and torus, we theoretically and empirically demonstrate the effectiveness of such a metric. Link » Ziwei Liu · Anastasiia Koloskova · Martin Jaggi · Tao Lin 🔗 - Escaping from Moderately Constrained Saddles (Poster)  link » We give polynomial time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f \colon \mathbb R^d \to \mathbb R$ we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. This constitutes progress (without reliance on NP-oracles or altering the definitions to only account for certain constraints) on the main open question of the breakthrough work of Ge et al. who showed an analogous result for unconstrained and equality-constrained problems. Our results hold for both regular and stochastic gradient descent. Link » Dmitrii Avdiukhin · Grigory Yaroslavtsev 🔗 - Uniform Convergence and Generalization for Nonconvex Stochastic Minimax Problems (Poster)  link »    This paper studies the uniform convergence and generalization bounds for nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization. We first establish the uniform convergence between the empirical minimax problem and the population minimax problem and show the $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ sample complexities respectively for the NC-SC and NC-C settings, where $d$ is the dimension number and $\kappa$ is the condition number. To the best of our knowledge, this is the first uniform convergence result measured by the first-order stationarity in stochastic minimax optimization literature. Link » Siqi Zhang · Yifan Hu · Liang Zhang · Niao He 🔗 - Semi-Random Sparse Recovery in Nearly-Linear Time (Poster)  link »    Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings.We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + \xi$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models relaxing our planted RIP row subset assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time.Our approach differs from that of prior fast iterative methods with provable guarantees under semi-random generative models (Cheng-Ge '18, Li et al. '20), which typically separate the problem of learning the planted instance from the estimation problem, i.e. they attempt to first learn the planted "good" instance (in our case, $\mathbf{G}$). However, natural conditions which make sparse recovery tractable, such as RIP, are NP-hard to verify and hence first learning a sufficient row reweighting appears challenging. We eschew this approach and design a new iterative method, tailored to the geometry of sparse recovery, which is provably robust to our semi-random model. We hope our approach opens the door to new robust, efficient algorithms for other natural statistical inverse problems. Link » Jonathan Kelner · Jerry Li · Allen Liu · Aaron Sidford · Kevin Tian 🔗 - Generalization of Decentralized Gradient Descent with Separable Data (Poster)  link » Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds match in order their centralized counterparts. Critical behind this, and of independent interest, is establishing novel bounds on the training loss and the rate-of-consensus of DGD for a class of self-bounded losses. Finally, we conduct numerical experiments which corroborate our theoretical results. Link » Hossein Taheri · Christos Thrampoulidis 🔗 - Gradient dynamics of single-neuron autoencoders on orthogonal data (Poster)  link »    In this work we investigate the dynamics of (stochastic) gradient descent when training a single-neuron ReLU autoencoder on orthogonal inputs. We show that for this non-convex problem there exists a manifold of global minima all with the same maximum Hessian eigenvalue and that gradient descent reaches a particular global minimum when initialized randomly. Interestingly, which minimum is reached depends heavily on the batch-size. For full batch gradient descent, the directions of the neuron that are initially positively correlated with the data are merely rescaled uniformly, hence in high-dimensions the learned neuron is a near uniform mixture of these directions. On the other hand, with batch-size one the neuron exactly aligns with a single such direction, showing that when using a small batch-size a qualitatively different type of feature selection" occurs. Link » Nikhil Ghosh · Spencer Frei · Wooseok Ha · Bin Yu 🔗 - A Variable-Coefficient Nuclear Norm Penalty for Low Rank Inference (Poster)  link »    Low rank structure is expected in many applications, so it is often desirable to be able to specify cost functions that induce low rank.A common approach is to augment the cost with a penalty function approximating the rank function, such as the nuclear norm which is given by the $\ell_1$ norm of the matrix's singular values.This has the advantage of being a convex function, but it biases matrix entries towards zero.On the other hand, nonconvex approximations to the rank function can make better surrogates but invariably introduce additional hyperparameters.In this article, we instead study a weighted nuclear norm approach with learnable weights which provides the behavior of nonconvex penalties without introducing any additional hyperparameters.This approach can also benefit from the fast proximal methods which make nuclear norm approaches scalable.We demonstrate the potential of this technique by comparing it against the standard nuclear norm approach on synthetic and realistic matrix denoising and completion problems.We also outline the future work necessary to deploy this algorithm to large scale problems. Link » Nathan Wycoff · Ali Arab · Lisa Singh 🔗 - Exact Gradient Computation for Spiking Neural Networks (Poster)  link »    Spiking neural networks (SNNs) have recently emerged as an alternative to traditional neural networks, holding promise for energy efficiency benefits. However, the classic backpropagation algorithm for training traditional networks has been notoriously difficult to apply to SNNs due to the hard-thresholding and discontinuities at spike times. Therefore, a large majority of prior work believes that exact gradients for SNN w.r.t. their weights do not exist and has focused on approximation methods to produce surrogate gradients. In this paper, (1)\,by applying the implicit function theorem to SNN at the discrete spike times, we prove that, albeit being non-differentiable in time, SNNs have well-defined gradients w.r.t. their weights, and (2)\,we propose a novel training algorithm, called \emph{forward propagation} (FP), that computes exact gradients for SNNs. Our derivation of FP in this paper provides insights on why other related algorithms such as Hebbian learning and also recently-proposed surrogate gradient methods may perform well. Link » Jane Lee · Saeid Haghighatshoar · Amin Karbasi 🔗 - Linear Convergence Analysis of Neural Collapse with Unconstrained Features (Poster)  link »    In this work, we study the recently discovered neural collapse (NC) phenomenon, which is prevalent in training over-parameterized deep neural networks for classification tasks. Existing work has shown that any optimal solution of the trained problem for classification tasks is an NC solution and has a benign landscape under the unconstrained feature model. However, these results do not provide an answer to the question of how quickly gradient descent can find an NC solution. To answer this question, under the unconstrained feature model we prove an error bound property of the trained loss, which refers to the inequality that bounds the distance of a point in the optimal solution set by the norm of its gradient. Using this error bound, we can show linear convergence of gradient descent for finding an NC solution. Link » Peng Wang · Huikang Liu · Can Yaras · Laura Balzano · Qing Qu 🔗 - The Solution Path of the Group Lasso (Poster)  link »    We prove continuity of the solution path for the group lasso, a popular method of computing group- sparse models. Unlike the more classical lasso method, the group lasso solution path is non-linear and cannot be determined in closed form. To circumvent this, we first characterize the group lasso solution set and then show how to construct an implicit function for the min-norm path. We prove our implicit representation is continuous almost everywhere and extend this to continuity every- where when the group lasso solution is unique. Our work can be viewed as extending solution path analyses from lasso setting to the group lasso and implies that grid-search is a sensible approach to hyper-parameter selection. Our results also have applications to convex reformulations of neural neural networks and so are deeply connected to solution paths for shallow neural networks. Link » Aaron Mishkin · Mert Pilanci 🔗 - Conditional gradient-based method for bilevel optimization with convex lower-level problem (Poster)  link »    In this paper, we study simple bilevel optimization problems, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory as they are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a conditional gradient-based (CG-based) method to solve the considered problem. The main idea is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a CG-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem. Link » Ruichen Jiang · Nazanin Abolfazli · Aryan Mokhtari · Erfan Yazdandoost Hamedani 🔗 - Sufficient conditions for non-asymptotic convergence of Riemannian optimization methods (Poster)  link »    Motivated by energy based analyses for descent methods in the Euclidean setting, we investigate a generalisation of such analyses for descent methods over Riemannian manifolds. In doing so, we find that it is possible to derive curvature-free guarantees for such descent methods. This also enables us to give the first known guarantees for a Riemannian cubic-regularised Newton algorithm over g-convex functions, which extends the guarantees by Agarwal et al for an adaptive Riemannian cubic-regularised Newton algorithm over general non-convex functions. This analysis motivates us to study acceleration of Riemannian gradient descent in the g-convex setting, and we improve on an existing result by Alimisis et al, albeit with a curvature-dependent rate. Finally, extending the analysis by Ahn and Sra, we attempt to provide some sufficient conditions for the acceleration of Riemannian descent methods in the strongly geodesically convex setting. Link » Vishwak Srinivasan · Ashia Wilson 🔗 - A Neural Tangent Kernel Perspective on Function-Space Regularization in Neural Networks (Poster)  link »    Loss regularization can help reduce the gap between training and test error by systematically limiting model complexity. Popular regularization techniques such as L2 weight regularization act directly on the network parameters, but do not explicitly take into account how the interplay between the parameters and the network architecture may affect the induced predictive functions.To address this shortcoming, we propose a simple technique for effective function-space regularization. Drawing on the result that fully-trained wide multi-layer perceptrons are equivalent to kernel regression under the Neural Tangent Kernel (NTK), we propose to approximate the norm of neural network functions by the reproducing kernel Hilbert space norm under the NTK and use it as a function-space regularizer. We prove that neural networks trained using this regularizer are arbitrarily close to kernel ridge regression solutions under the NTK. Furthermore, we provide a generalization error bound under the proposed regularizer and empirically demonstrate improved generalization and state-of-the-art performance on downstream tasks where effective regularization on the induced space of functions is essential. Link » Zonghao Chen · Xupeng Shi · Tim G. J. Rudner · Qixuan Feng · Weizhong Zhang · Tong Zhang 🔗 - Annealed Training for Combinatorial Optimization on Graphs (Poster)  link »    Learning neural networks for CO problems is notoriously difficult given the lack of labeled data as the training gets trapped easily at local optima. However, the hardness of combinatorial optimization (CO) problems hinders collecting solutions for supervised learning. We propose a simple but effective unsupervised annealed training framework for CO problems in this work. In particular, we transform CO problems into unbiased energy-based models (EBMs). We carefully selected the penalties terms to make the EBMs as smooth as possible. Then we train graph neural networks to approximate the EBMs and we introduce an annealed loss function to prevent the training from being stuck at local optima near the initialization.An experimental evaluation demonstrates that our annealed training framework obtains substantial improvements. In four types of CO problems, our method achieves performance substantially better than other unsupervised neural methods on both synthetic and real-world graphs. Link » Haoran Sun · Etash Guha · Hanjun Dai 🔗 - Clairvoyant Regret Minimization: Equivalence with Nemirovski’s Conceptual Prox Method and Extension to General Convex Games (Poster)  link »    A recent paper by Piliouras et al. introduces an uncoupled learning algorithm for normal-form games---called Clairvoyant MWU (CMWU). In this paper we show that CMWU is equivalent to the conceptual prox method described by Nemirovski. This connection immediately shows that it is possible to extend the CMWU algorithm to any convex game, a question left open by Piliouras et al. We call the resulting algorithm---again equivalent to the conceptual prox method---Clairvoyant OMD. At the same time, we show that our analysis yields an improved regret bound compared to the original bound by Piliouras et al., in that the regret of CMWU scales only with the square root of the number of players, rather than the number of players themselves. Link » Gabriele Farina · Christian Kroer · Chung-Wei Lee · Haipeng Luo 🔗 - Parameter Free Dual Averaging: Optimizing Lipschitz Functions in a Single Pass (Poster)  link »    Both gradient descent and dual averaging for convex Lipschitz functionshave convergence rates that are highly dependent on the choice of learningrate. Even when the Lipschitz constant is known, setting the learning rate to achieve the optimal convergence rate requires knowing a bound on the distance from the initial point to the solution set $D$. A numberof approaches are known that relax this requirement, but they eitherrequire line searches, restarting (hyper-parameter grid search), or do not derivefrom the gradient descent or dual averaging frameworks (coin-betting).In this work we describe a single pass method, with no back-tracking or line searches, derived from dual averaging,which does not require knowledge of $D$ yet asymptotically achievesthe optimal rate of convergence for the complexity class of ConvexLipschitz functions. Link » Aaron Defazio · Konstantin Mishchenko 🔗 - Differentially Private Adaptive Optimization with Delayed Preconditioners (Poster)  link »    Privacy costs may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data. Link » Tian Li · Manzil Zaheer · Ken Liu · Sashank Reddi · H. Brendan McMahan · Virginia Smith 🔗 - Differentially Private Adaptive Optimization with Delayed Preconditioners (Oral)  link » Privacy costs may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data. Link » Tian Li · Manzil Zaheer · Ken Liu · Sashank Reddi · H. Brendan McMahan · Virginia Smith 🔗 - A Light-speed Linear Program Solver for Personalized Recommendation with Diversity Constraints (Poster)  link »    We study a structured linear program (LP) that emerges in the need of ranking candidates or items in personalized recommender systems. Since the candidate set is only known in real time, the LP also needs to be solved in real time. Latency and user experience are major considerations, requiring the LP to be solved within just a few milliseconds. Although typical instances of the problem are not very large in size, this stringent time limit is beyond the capability of most existing commercial LP solvers, which can take 20 milliseconds or more to find a solution. Thus, reliable methods that address the real-world complication of latency become necessary.In this paper, we propose a fast specialized LP solver for a structured problem with diversity constraints.Our method solves the dual problem, making use of the piece-wise affine structure of the dual objective function, with an additional screening technique that helps reduce the dimensionality of the problem as the algorithm progresses. Experiments reveal that our method can solve the problem within roughly 1 millisecond, yielding a 20x improvement in speed over the most performant standard LP solvers. This speed-up can help improve the quality of recommendations without affecting user experience, highlighting how optimization can provide solid orthogonal value to machine-learned recommender systems. Link » Miao Cheng · Haoyue Wang · Aman Gupta · Rahul Mazumder · Sathiya Selvaraj · Kinjal Basu 🔗 - adaStar: A Method for Adapting to Interpolation (Poster)  link » Stochastic convex optimization methods are much faster at minimizing interpolation problems—problems where all sample losses share a common minimizer—than non-interpolating problems. However, stochastic gradient methods need to use step sizes tailored for the interpolation setting, which are sub-optimal for non-interpolating problems, to attain these fast rates. This is problematic because verifying whether a problem is interpolating, without minimizing it, is difficult. Moreover, because interpolation is not a stable property—the addition of a single datapoint can transform an interpolating dataset into a non-interpolating one—we would like our methods to get the fast interpolation rate when it can while being robust to these perturbations. We address these problems by presenting an adaptive stochastic gradient method, termed adaStar, which attains the optimal, fast rate on smooth interpolation problems (up to log factors) and gracefully degrades with the minimal objective value for non-interpolating problems. We use this method as a building block to construct another stochastic gradient method, termed adaStar-G which adapts to interpolation and growth conditions, getting even faster rates. Link » Gary Cheng · John Duchi 🔗 - PyEPO: A PyTorch-based End-to-End Predict-then-Optimize Library with Linear Objective Function (Poster)  link »    In many practical settings, some parameters of an optimization problem may be a priori unknown but can be estimated from historical data. Recently, end-to-end predict-then-optimize has emerged as an attractive alternative to the two-stage approach of separately fitting a predictive model for the unknown parameters, then optimizing. In this work, we present the PyEPO package, a PyTorch-based end-to-end predict-then-optimize library in Python for linear and integer programming. It provides two base algorithms: the first is based on the convex surrogate loss function from the seminal work of Elmachtoub & Grigas (2021), and the second is based on the differentiable black-box solver approach of Vlastelica et al. (2019). PyEPO provides a simple interface for the definition of new optimization problems, the implementation of state-of-the-art predict-then-optimize training algorithms, the use of custom neural network architectures, and the comparison of end-to-end approaches with the two-stage approach. Link » Bo Tang · Elias Khalil 🔗 - Accelerating Perturbed Stochastic Iterates in Asynchronous Lock-Free Optimization (Poster)  link »    We show that stochastic acceleration can be achieved under the perturbed iterate framework (Mania et al., 2017) in asynchronous lock-free optimization, which leads to the optimal incremental gradient complexity for finite-sum objectives. We prove that our new accelerated method requires the same linear speed-up condition as existing non-accelerated methods. Our key algorithmic discovery is a new accelerated SVRG variant with sparse updates. Empirical results are presented to verify our theoretical findings. Link » Kaiwen Zhou · Anthony Man-Cho So · James Cheng 🔗 - Neural DAG Scheduling via One-Shot Priority Sampling (Poster)  link » We consider the problem of scheduling operations/nodes, the dependency among which is characterized by a Directed Acyclic Graph (DAG). Due to its NP-hard nature, heuristic algorithms were traditionally used to acquire reasonably good solutions, and more recent works have proposed Machine Learning (ML) heuristics that can generalize to unseen graphs and outperform the non-ML heuristics. However, it is computationally costly to generate solutions using existing ML schedulers since they adopt the episodic reinforcement learning framework that necessitates multi-round neural network processing. We propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities which are converted by list scheduling to the final schedules. Since the one-shot encoder can efficiently sample the priorities in parallel, our algorithm runs significantly faster than existing ML baselines and has comparable run time with the fast traditional heuristics. We empirically show that our algorithm generates better schedules than both non-neural and neural baselines across various real-world and synthetic scheduling tasks. Link » Wonseok Jeon · Mukul Gagrani · Burak Bartan · Weiliang Zeng · Harris Teague · Piero Zappi · Christopher Lott 🔗 - Stochastic Gradient Estimator for Differentiable NAS (Poster)  link » Neural architecture search (NAS) has recently attracted more attention due to its ability to design deep neural networks automatically. Differentiable NAS methods have predominated due to their search efficiency. However, differentiable NAS methods consistently adopt approximate gradient-based methods to solve bilevel optimization problems. While second derivative approximation optimizes Jacobian or/and Hessian vector computation, it is imprecise and time-consuming in practice. In this paper, we revisit the hypergradient of bilevel optimization problems in NAS, then propose a new optimizer based on a stochastic gradient estimator(SGE) for the computation of the Jacobian matrix in the hypergradient. The SGE is adaptable to previous differentiable NAS methods and eliminates the second-order computation in the optimization process. In the experiments on commonly differentiable NAS benchmarks, the proposed SGE-NAS algorithm outperforms the baseline algorithm. The test result demonstrates that the proposed SGE-NAS can effectively reduce search time and find the model with higher classification performance. Link » Libin Hou · Linyuan Wang · Qi Peng · Bin Yan 🔗 - Near-optimal decentralized algorithms for network dynamic optimization (Poster)  link »    We study dynamic decision-making problems in networks under stochastic uncertainty about future payoffs. The network has a bounded degree, and each node takes a discrete decision at each period, leading to a per-period payoff that is a sum of three parts: node rewards for individual node decisions, temporal interactions between individual node decisions from the current and previous periods, and spatial interactions between decisions from pairs of neighboring nodes. The objective is to maximize the expected total payoffs over a finite horizon. We propose a decentralized algorithm whose computational requirement is linear in the graph size and planning horizon, and characterize sufficient conditions under which our decentralized algorithm achieves near optimality compared to the centralized global optimal. The class of decentralized algorithms is parameterized by locality parameter $L$. An $L$-local algorithm makes its decision at each node $v$ based on current and (simulated) future payoffs only up to $L$ periods ahead, and only in an $L$-radius neighborhood around $v$. Given any permitted error $\epsilon > 0$, with $L = O(\log(1/\epsilon))$, we show that $L$-local algorithm has an average per-node-per-period optimality loss of up to $\epsilon$ when temporal and spatial interactions are relatively small compared to the randomness in the node rewards and the graph degree. Link » Judy Gan · Yashodhan Kanoria · Xuan Zhang 🔗 - A Second-order Regression Model Shows Edge of Stability Behavior (Poster)  link »    Recent studies of learning algorithms have shown that there is a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). We consider a class of predictive models that are quadratic in the parameters, which we call second-order regression models. This is in contrast with the neural tangent kernel regime, where the predictive function is linear in the parameters. For quadratic objectives in two dimensions, we prove that this second order regression model exhibits both progressive sharpening and edge of stability behavior. We then show that in higher dimensions, the model shows this behavior generically without the structure of a neural network, due to a non-linearity induced in the learning dynamics. Finally, we show that edge of stability behavior in neural networks is correlated with the behavior in quadratic regression models. Link » Fabian Pedregosa · Atish Agarwala · Jeffrey Pennington 🔗 - Online Min-max Optimization: Nonconvexity, Nonstationarity, and Dynamic Regret (Poster)  link »    Online min-max optimization has recently gained considerable interest due to its rich applications to game theory, multi-agent reinforcement learning, online robust learning, etc. Theoretical understanding in this field has been mainly focused on convex-concave settings. Online min-max optimization with nonconvex geometries, which captures various online deep learning problems, has yet been studied so far. In this paper, we make the first effort and investigate online nonconvex-strongly-concave min-max optimization in the nonstationary environment. We first introduce a natural notion of dynamic Nash equilibrium (NE) regret, and then propose a novel algorithm coined SODA to achieve the optimal regret. We further generalize our study to the setting with stochastic first-order feedback, and show that a variation of SODA can also achieve the same optimal regret in expectation. Our theoretical results and the superior performance of the proposed method are further validated by empirical experiments. To our best knowledge, this is the first exploration of efficient online nonconvex min-max optimization. Link » Yu Huang · Yuan Cheng · Yingbin Liang · Longbo Huang 🔗 - Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization (Poster)  link »    Modern deep learning models are over-parameterized, where the optimization setup strongly affects the generalization performance. A key element of reliable optimization for these systems is the modification of the loss function. Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably have better generalization abilities. In this paper, we focus on a variant of SAM known as mSAM, which, during training, averages the updates generated by adversarial perturbations across several disjoint shards of a mini-batch. Recent work suggests that mSAM can outperform SAM in terms of test accuracy. However, a comprehensive empirical study of mSAM is missing from the literature---previous results have mostly been limited to specific architectures and datasets. To that end, this paper presents a thorough empirical evaluation of mSAM on various tasks and datasets. We provide a flexible implementation of mSAM and compare the generalization performance of mSAM to the performance of SAM and vanilla training on different image classification and natural language processing tasks. We also conduct careful experiments to understand the computational cost of training with mSAM, its sensitivity to hyperparameters and its correlation with the flatness of the loss landscape. Our analysis reveals that mSAM yields superior generalization performance and flatter minima, compared to SAM, across a wide range of tasks without significantly increasing computational costs. Link » Kayhan Behdin · Qingquan Song · Aman Gupta · Sathiya Selvaraj · David Durfee · Ayan Acharya · Rahul Mazumder 🔗 - Reducing Communication in Nonconvex Federated Learning with a Novel Single-Loop Variance Reduction Method (Poster)  link »    In Federated Learning (FL), inter-client heterogeneity causes two types of errors: (i) \emph{client drift error} which is induced by multiple local updates, (ii) \emph{client sampling error} due to partial participation of clients at each communication. While several solutions have been offered to the former one, there is still much room of improvement on the latter one.We provide a fundamental solution to this client sampling error. The key is a novel single-loop variance reduction algorithm, SLEDGE (Single-Loop mEthoD for Gradient Estimator), which does not require periodic computation of full gradient but achieves optimal gradient complexity in the nonconvex finite-sum setting. While sampling a small number of clients at each communication round, the proposed FL algorithm, FLEDGE, requires provably fewer or at least equivalent communication rounds compared to any existing method, for finding first and even second-order stationary points in the general nonconvex setting, and under the PL condition. Moreover, under less Hessian-heterogeneity between clients, the required number of communication rounds approaches to $\tilde{\Theta}(1)$. Link » Kazusato Oko · Shunta Akiyama · Tomoya Murata · Taiji Suzuki 🔗 - Fast Convergence of Random Reshuffling under Interpolation and the Polyak-Łojasiewicz Condition (Poster)  link »    Modern machine learning models are often over-parameterized and as a result they can interpolate the training data. Under such a scenario, we study the convergence properties of a sampling- without-replacement variant of Stochastic Gradient Descent (SGD), known as Random Reshuffling (RR). Unlike SGD that samples data with replacement at every iteration, RR chooses a random permutation of data at the beginning of each epoch. For under-parameterized models, it has been recently shown that RR converges faster than SGD when the number of epochs is larger than the condition number (κ) of the problem under standard assumptions like strong convexity. However, previous works do not show that RR outperforms SGD under interpolation for strongly convex objectives. Here, we show that for the class of Polyak-Łojasiewicz (PL) functions that generalizes strong convexity, RR can outperform SGD as long as the number of samples (n) is less than the parameter (ρ) of a strong growth condition (SGC). Link » Chen Fan · Christos Thrampoulidis · Mark Schmidt 🔗 - ZerO Initialization: Initializing Neural Networks with only Zeros and Ones (Poster)  link » Deep neural networks are usually initialized with random weights, with adequately selected initial variance to ensure stable signal propagation during training. However, selecting the appropriate variance becomes challenging especially as the number of layers grows. In this work, we replace random weight initialization with a fully deterministic initialization scheme, viz., ZerO, which initializes the weights of networks with only zeros and ones (up to a normalization factor), based on identity and Hadamard transforms. Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In addition, ZerO has many benefits, such as training ultra deep networks (without batch-normalization), exhibiting low-rank learning trajectories that result in low-rank and sparse solutions, and improving training reproducibility. Link » Jiawei Zhao · Florian Schaefer · Anima Anandkumar 🔗 - Differentially Private Federated Learning with Normalized Updates (Poster)  link »    The customary approach for client-level differentially private federated learning (FL) is to add Gaussian noise to the average of the clipped client updates. Clipping is associated with the following issue: as the client updates fall below the clipping threshold, they get drowned out by the added noise, inhibiting convergence. To mitigate this issue, we propose replacing clipping with normalization, where we use only a scaled version of the unit vector along the client updates. Normalization ensures that the noise does not drown out the client updates even when the original updates are small. We theoretically show that the resulting normalization-based private FL algorithm attains better convergence than its clipping-based counterpart on convex objectives in over-parameterized settings. Link » Rudrajit Das · Abolfazl Hashemi · Sujay Sanghavi · Inderjit Dhillon 🔗 - Adaptive Inexact Sequential Quadratic Programming via Iterative Randomized Sketching (Poster)  link »    We consider solving nonlinear optimization problems with equality constraints. We propose a randomized algorithm based on sequential quadratic programming (SQP) with a differentiable exact augmented Lagrangian as the merit function. In each SQP iteration, we solve the Newton system inexactly via iterative randomized sketching. The accuracy of the inexact solution and the penalty parameter of the augmented Lagrangian are adaptively controlled in the algorithm to ensure that the inexact random search direction is a descent direction of the augmented Lagrangian. This allows us to establish global convergence almost surely. Moreover, we show that a unit stepsize is admissible for the inexact search direction provided the iterate lies in a neighborhood of the solution. Based on this result, we show that the proposed algorithm exploits local linear convergence. We apply the algorithm on benchmark nonlinear problems in CUTEst test set and on constrained logistic regression with datasets from LIBSVM to demonstrate its superior performance. Link » Ilgee Hong · Sen Na · Mladen Kolar 🔗 - Enhanced Index Tracking via Differentiable Assets Sorting (Poster)  link » Enhanced index tracking (EIT) aims to achieve better performance over a target equity index while maintaining a relatively low tracking error. It can be formulated as a quadratic programming problem, but remains challenging when several practical constraints exist, especially the fixed number of assets in the portfolio. In this work, we propose a new method for enhanced index tracking, subject to common practical constraints, including cardinality, which is based on a novel reparametrisation of portfolio weights integrated with a stochastic optimisation. It can simultaneously tackle asset selection and capital allocation, while being optimised by vanilla gradient descent effectively and efficiently. The proposed method is backtested with S&P 500 and Russell 1000 indices data for over a decade. Empirical results demonstrate its superiority over widely used alternatives. Link » Yuanyuan Liu · Yongxin Yang 🔗 - Learning deep neural networks by iterative linearisation (Poster)  link »    The excellent real-world performance of deep neural networks has received increasing attention. Despite the capacity to overfit significantly, such large models work better than smaller ones. This phenomenon is often referred to as the scaling law by practitioners. It is of fundamental interest to study why the scaling law exists and how it avoids/controls overfitting. One approach has been looking at infinite width limits of neural networks (e.g., Neural Tangent Kernels, Gaussian Processes); however, in practise, these do not fully explain finite networks as their infinite counterparts do not learn features. Furthermore, the empirical kernel for finite networks (i.e., the inner product of feature vectors), changes significantly during training in contrast to infinite width networks. In this work we derive a iterative linearised training method. We justify iterative lineralisation as an interpolation between finite analogs of the infinite width regime, which do not learn features, and standard gradient descent training which does. We show some preliminary results where iterative linearised training works well, noting in particular how much feature learning is required to achieve comparable performance. We also provide novel insights into the training behaviour of neural networks. Link » Adrian Goldwaser · Hong Ge 🔗 - Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability (Poster)  link » Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "stable" and the training loss decreases monotonically. However, Cohen et al. (2021) recently observed two important phenomena. The first, \emph{progressive sharpening}, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/\eta$. The second, \emph{edge of stability}, is that the sharpness hovers at $2/\eta$ for the remainder of training while the loss non-monotonically decreases.We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call \emph{self-stabilization}, is a general property of gradient descent and explains its behavior at the edge of stability.A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows \emph{projected} gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability. Link » Alex Damian · Eshaan Nichani · Jason Lee 🔗 - Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization (Poster)  link »    Over the recent years, reinforcement learning (RL) starts to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study on why RL helps is still at its early stage. This paper presents the first systematic study on policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, Secretary Problem, we formally prove that distribution shift is reduced exponentially with curriculum learning even if the curriculum is randomly generated. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on Secretary Problem and Online Knapsack to verify our findings. Link » Runlong Zhou · Yuandong Tian · YI WU · Simon Du 🔗 - Solving Constrained Variational Inequalities via a First-order Interior Point-based Method (Poster)  link »    We focus on the open problem to develop a first-order method that can solve constrained Variational Inequality (cVI) problems when given general constraints. We generalize the \textit{alternating direction method of multipliers} (ADMM) method and combine it with interior-point approaches, yielding a first-order method that we refer to as ADMM-based interior-point method for cVIs (ACVI). We provide convergence guarantees for ACVI in two general classes of problems: (i) when the operator is $\xi$-monotone, and (i) when it is monotone, some constraints are active and the game is not purely rotational. When the operator is, in addition, $L$-Lipschitz for the latter case, we match known lower bounds on rates for the gap function of $\mathcal{O}(1/\sqrt{K})$ and $\mathcal{O}(1/K)$ for the last and average iterate, respectively. To our knowledge, this is the first \emph{first}-order interior-point method for the general cVI problem that has a global convergence guarantee. Empirical analyses demonstrate clear advantages of ACVI over common first-order methods. In particular, (i) cyclical behavior is notably reduced as our method approaches the solution from the analytic center, and (ii) unlike projection-based methods that zigzag when near a constraint, ACVI efficiently handles the constraints. Link » Tong Yang · Michael Jordan · Tatjana Chavdarova 🔗 - Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence (Poster)  link »    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum.To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions. Link » Diyuan Wu · Vyacheslav Kungurtsev · Marco Mondelli 🔗 - A Stochastic Prox-Linear Method for CVaR Minimization (Poster)  link »    We develop an instance of the stochastic prox-linear method for minimizing the Conditional Value-at-Risk (CVaR) objective. CVaR is a risk measure focused on minimizing worst-case performance, defined as the average of the top quantile of the losses. In machine learning, such a risk measure is useful to train more robust models. Although the stochastic subgradient method (SGM) is a natural choice for minimizing CVaR objective, we show that the prox-linear algorithm can be used to better exploit the structure of the objective, while still providing a convenient closed form update. We then specialize a general convergence theorem for the prox-linear method to our setting, and show that it allows for a wider selection of step sizes compared to SGM. We support this theoretical finding experimentally, by showing that the performance of stochastic prox-linear is more robust to the choice of step size compared to SGM. Link » Si Yi Meng · Vasileios Charisopoulos · Robert Gower 🔗 - Counterfactual Explanations Using Optimization With Constraint Learning (Poster)  link »    Counterfactual explanations have an invaluable potential to make model predictions more sensible to the users. To increase their adoption in practice, several criteria that counterfactual explanations should adhere to have been put forward in the literature. We propose counterfactual explanations using optimization with constraint learning (CE-OCL), a generic and flexible approach that addresses all these criteria and allows room for further extensions. Specifically, we discuss how we can leverage an optimization with constraint learning framework for the generation of counterfactual explanations, and how components of this framework readily map to the criteria. We also propose two novel modeling approaches to address data manifold closeness and diversity, which are two key criteria for practical counterfactual explanations. We test CE-OCL on several datasets and present our results in a case study. Compared against the current state-of-the-art methods, CE-OCL allows for more flexibility and has an overall superior performance in terms of several evaluation metrics proposed in related work. Link » Donato Maragno · Tabea E. Röber · Ilker Birbil 🔗 - Accelerated Single-Call Methods for Constrained Min-Max Optimization (Poster)  link »    We study first-order methods for constrained min-max optimization. Existing methods either requires two gradient calls or two projections in each iteration, which may be costly in applications. In this paper, we first show that the \emph{Optimistic Gradient (OG)} method, a \emph{single-call single-projection} algorithm, has $O(\frac{1}{\sqrt{T}})$ convergence rate for inclusion problems with operators that satisfy the weak Minty variation inequality (MVI). Our second result is the first single-call single-projection algorithm -- the \emph{Accelerated Reflected Gradient (ARG)} method that achieves the \emph{optimal $O(\frac{1}{T})$} convergence rate for inclusion problems that satisfy negative comonotonicity. Both the weak MVI and negative comonotonicity are well-studied assumptions and capture a rich set of non-convex non-concave min-max optimization problems. Finally, we show that the \emph{Reflected Gradient (RG)} method, another \emph{single-call single-projection} algorithm, has $O(\frac{1}{\sqrt{T}})$ last-iterate convergence rate for constrained convex-concave min-max optimization, answering an open problem of (Hsieh et al., 2019). Link » Yang Cai · Weiqiang Zheng 🔗 - Accelerated Algorithms for Monotone Inclusion and Constrained Nonconvex-Nonconcave Min-Max Optimization (Poster)  link »    We study monotone inclusions and monotone variational inequalities, as well as their generalizations to non-monotone settings. We first show that the \emph{Extra Anchored Gradient (EAG)} algorithm, originally proposed by [Yoon and Ryu, 2021] for unconstrained convex-concave min-max optimization, can be applied to solve the more general problem of Lipschitz monotone inclusion. More specifically, we prove that the EAG solves Lipschitz monotone inclusion problems with an \emph{accelerated convergence rate} of $O(\frac{1}{T})$, which is \emph{optimal among all first-order methods} [Diakonikolas, 2020, Yoon and Ryu, 2021]. Our second result is an {accelerated forward-backward splitting algorithm (AS),} which not only achieves the accelerated $O(\frac{1}{T})$ convergence rate for all monotone inclusion problems, but also exhibits the same accelerated rate for a family of general (non-monotone) inclusion problems that concern negative comonotone operators. As a special case of our second result, AS enjoys the $O(\frac{1}{T})$ convergence rate for solving a non-trivial class of nonconvex-nonconcave min-max optimization problems. Our analyses are based on simple potential function arguments, which might be useful for analysing other accelerated algorithms. Link » Yang Cai · Argyris Oikonomou · Weiqiang Zheng 🔗 - Accelerated Riemannian Optimization: Handling Constraints to Bound Geometric Penalties (Poster)  link »    We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in Hadamard manifolds. Our algorithm enjoys the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works resort to assuming that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods, whose applicability is limited to local optimization and to spaces of constant curvature, respectively. Achieving global and general Riemannian acceleration without iterates assumptively staying in the feasible set was asked as an open question in (Kim & Yang, 2022), which we solve for Hadamard manifolds. In our solution, we show that we can use a linearly convergent algorithm for constrained strongly g-convex smooth problems to implement a Riemannian inexact proximal point operator that we use as a subroutine, which is of independent interest. Link » David Martinez-Rubio · Sebastian Pokutta 🔗 - Gradient Descent: Robustness to Adversarial Corruption (Poster)  link »    Optimization using gradient descent (GD) is a ubiquitous practice in various machine learning problems including training large neural networks. Noise-free GD and stochastic GD--corrupted by random noise--have been extensively studied in the literature, but less attention has been paid to an adversarial setting, that is subject to adversarial corruptions in the gradient values. In this work, we analyze the performance of GD under a proposed general adversarial framework. For the class of functions satisfying the Polyak-Łojasiewicz condition, we derive finite time bounds on a minimax optimization error. Based on this bound, we provide a guideline on the choice of learning rate sequence with theoretical guarantees on the robustness of GD against adversarial corruption. Link » Fu-Chieh Chang · Farhang Nabiei · Pei-Yuan Wu · Alexandru Cioba · Sattar Vakili · Alberto Bernacchia 🔗 - Boosting as Frank-Wolfe (Poster)  link »    Some boosting algorithms, such as LPBoost, ERLPBoost, and C-ERLPBoost, aim to solve the soft margin optimization problem with the $\ell_1$-norm regularization. LPBoost rapidly converges to an $\epsilon$-approximate solution in practice, but it is known to take $\Omega(m)$ iterations in the worst case, where $m$ is the sample size.On the other hand, ERLPBoost and C-ERLPBoost are guaranteed to converge to an $\epsilon$-approximate solution in $O(\frac{1}{\epsilon^2} \ln \frac{m}{\nu})$ iterations. However, the computation per iteration is very high compared to LPBoost. To address this issue, we propose a generic boosting scheme that combines the Frank-Wolfe algorithm and any secondary algorithm and switches one to the other iteratively. We show that the scheme retains the same convergence guarantee as ERLPBoost and C-ERLPBoost. One can incorporate any secondary algorithm to improve in practice.This scheme comes from a unified view of boosting algorithms for soft margin optimization. More specifically, we show that LPBoost, ERLPBoost, and C-ERLPBoost are instances of the Frank-Wolfe algorithm. In experiments on real datasets, one of the instances of our scheme exploits the better updates of the second algorithm and performs comparably with LPBoost. Link » Ryotaro Mitsuboshi · Kohei Hatano · Eiji Takimoto 🔗 - RandProx: Primal-Dual Optimization Algorithms with Randomized Proximal Updates (Poster)  link »    Proximal splitting algorithms are well suited for large-scale nonsmooth optimization problems. We propose a primal-dual algorithm, in which the dual update is randomized, with the proximity operator of one of the function replaced by a stochastic oracle. A nonsmooth variance-reduction technique is implemented so that the algorithm finds an exact minimizer of the general problem. We derive linear convergence results in presence of strong convexity. Several existing randomized algorithms, like Point-SAGA, are recovered as particular cases. Randomness helps getting faster algorithms; this has long been known for stochastic-gradient-type algorithms, and our work shows that this fully applies in the more general primal-dual setting as well. Link » Laurent Condat · Peter Richtarik 🔗 - A Unified Framework to Understand Decentralized and Federated Optimization Algorithms: A Multi-Rate Feedback Control Perspective (Poster)  link »    Distributed algorithms have been playing an increasingly important role in many applications such as machine learning, signal processing, and control. In this work, we provide a fresh perspective to understand, analyze, and design distributed optimization algorithms. Through the lens of multi-rate feedback control, we show that a wide class of distributed algorithms, including popular decentralized/federated schemes, can be viewed as discretizing a certain continuous-time feedback control system, possibly with multiple sampling rates, such as decentralized gradient descent, gradient tracking, and federated averaging. This key observation not only allows us to develop a generic framework to analyze the convergence of the entire algorithm class. More importantly, it also leads to an interesting way of designing new distributed algorithms. We develop the theory behind our framework and provide examples to highlight how the framework can be used in practice. Link » xinwei zhang · Nicola Elia · Mingyi Hong 🔗 - Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods (Poster)  link »    Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although the variants of these methods were known for the minimization problems, they were never considered for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments. Link » Aleksandr Beznosikov · Eduard Gorbunov · Hugo Berard · Nicolas Loizou 🔗 - Optimization using Parallel Gradient Evaluations on Multiple Parameters (Poster)  link »    We propose a first-order method for convex optimization, where instead of being restricted to the gradient from a single parameter, gradients from multiple parameters can be used during each step of gradient descent. This setup is particularly useful when a few processors are available that can be used in parallel for optimization. Our method uses gradients from multiple parameters in synergy to update these parameters together towards the optima. While doing so, it is ensured that the computational and memory complexity is of the same order as that of gradient descent. Empirical results demonstrate that even using gradients from as low as \textit{two} parameters, our method can often obtain significant acceleration and provide robustness to hyper-parameter settings. We remark that the primary goal of this work is less theoretical, and is instead aimed at exploring the understudied case of using multiple gradients during each step of optimization. Link » Yash Chandak · Shiv Shankar · Venkata Gandikota · Philip Thomas · Arya Mazumdar 🔗 - An Accuracy Guaranteed Online Solver for Learning in Dynamic Feature Space (Poster)  link »    We study the problem of adding or deleting features of data from machine learning models trained using empirical risk minimization. Our focus is on algorithms in an online manner which is capable for a more general regularization term, and present practical guides to two classical regularizers, i.e., the group Lasso and $\ell_p$-norm regularizer. Across a variety of benchmark datasets, our algorithm improves upon the runtime of prior methods while maintaining the \emph{same} generalization accuracy. Link » Diyang Li · Bin Gu 🔗 - Adaptive Methods for Nonconvex Continual Learning (Poster)  link »    One of the objectives of continual learning is to prevent catastrophic forgetting in learning multiple tasks sequentially,and the existing solutions have been driven by the conceptualization of the plasticity-stability dilemma.However, the convergence of continual learning for each sequential task is less studied so far.In this paper, we provide a convergence analysis of memory-based continual learning with stochastic gradient descentand empirical evidence that training current tasks causes the cumulative degradation of previous tasks.We propose an adaptive method for nonconvex continual learning (NCCL), which adjusts step sizes of both previous and current tasks with the gradients.The proposed method can achieve the same convergence rate as the SGD method when the catastrophic forgetting term which we define in the paper is suppressed at each iteration.Further, we demonstrate that the proposed algorithm improves the performance of continual learning over existing methods for several image classification tasks. Link » Seungyub Han · Yeongmo Kim · Taehyun Cho · Jungwoo Lee 🔗 - Random initialisations performing above chance and how to find them (Poster)  link » Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation and averaging their random, but suitably permuted initialisation performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large learning rate regime, SGD seems to discover diverse modes. Link » Frederik Benzing · Simon Schug · Robert Meier · Johannes von Oswald · Yassir Akram · Nicolas Zucchet · Laurence Aitchison · Angelika Steger 🔗 - On the Complexity of Finding Small Subgradients in Nonsmooth Optimization (Poster)  link »    We study the oracle complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/\delta\epsilon^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that this rate can be derandomized for smooth functions with merely a logarithmic dependence on the smoothness parameter. Moreover, we establish several lower bounds for this task which hold for any randomized algorithm, with or without convexity. Finally, we show how the convergence rate of finding $(\delta,\epsilon)$-stationary points can be improved in case the function is convex, a setting which we motivate by proving that in general no finite time algorithm can produce points with small subgradients even for convex functions. Link » Guy Kornowski · Ohad Shamir 🔗 - On the Complexity of Finding Small Subgradients in Nonsmooth Optimization (Oral)  link » We study the oracle complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/\delta\epsilon^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that this rate can be derandomized for smooth functions with merely a logarithmic dependence on the smoothness parameter. Moreover, we establish several lower bounds for this task which hold for any randomized algorithm, with or without convexity. Finally, we show how the convergence rate of finding $(\delta,\epsilon)$-stationary points can be improved in case the function is convex, a setting which we motivate by proving that in general no finite time algorithm can produce points with small subgradients even for convex functions. Link » Guy Kornowski · Ohad Shamir 🔗 - Solving a Special Type of Optimal Transport Problem by a Modified Hungarian Algorithm (Poster)  link »    We observe that computing empirical Wasserstein distance in the independence test is an optimal transport (OT) problem with a special structure. This observation inspires us to study a special type of OT problem and propose a modified Hungarian algorithm to solve it exactly. For an OT problem between marginals with $m$ and $n$ atoms ($m\geq n$), the computational complexity of the proposed algorithm is $\mathcal{O}(m^2n)$. Computing the empirical Wasserstein distance in the independence test requires solving this special type of OT problem, where we have $m=n^2$. The associate computational complexity of our algorithm is $\mathcal{O}(n^5)$, while the order of applying the classic Hungarian algorithm is $\mathcal{O}(n^6)$. Numerical experiments validate our theoretical analysis. Broader applications of the proposed algorithm are discussed at the end. Link » Yiling Xie · Yiling Luo · Xiaoming Huo 🔗 - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach (Poster)  link »    Bilevel optimization (BO) is useful for solving a variety of important machine learning problems including but not limited to hyperparameter optimization, meta-learning, continual learning, and reinforcement learning.Conventional BO methods need to differentiate through the low-level optimization process with implicit differentiation, which requires expensive calculations related to the Hessian matrix. There has been a recent quest for first-order methods for BO, but the methods proposed to date tend to be complicated and impractical for large-scale deep learning applications. In this work, we propose a simple first-order BO algorithm that depends only on first-order gradient information, requires no implicit differentiation, and is practical and efficient for large-scale non-convex functions in deep learning. We provide non-asymptotic convergence analysis of the proposed method to stationary points for non-convex objectives and present empirical results that show its superior practical performance. Link » Mao Ye · Bo Liu · Stephen Wright · Peter Stone · Qiang Liu 🔗 - On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly-Communicating MDPs (Poster)  link » We show two average-reward off-policy control algorithms, Differential Q Learning (Wan, Naik, \& Sutton 2021a) and RVI Q Learning (Abounadi Bertsekas \& Borkar 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs that a learning algorithm with a single stream of experience can guarantee obtaining a policy achieving optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true for weakly-communicating MDPs. To the best of our knowledge, our results are the first showing average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that average-reward options algorithms introduced by (Wan, Naik, \& Sutton 2021b) converge if the Semi-MDP induced by options is weakly-communicating. Link » Yi Wan · Richard Sutton 🔗 - Optimizing the Performative Risk under Weak Convexity Assumptions (Poster)  link »    In performative prediction, a predictive model impacts the distribution that generates future data, a phenomenon that is being ignored in classical supervised learning. In this closed-loop setting, the natural measure of performance, named performative risk ($\mathrm{PR}$), captures the expected loss incurred by a predictive model \emph{after} deployment. The core difficulty of using the performative risk as an optimization objective is that the data distribution itself depends on the model parameters.This dependence is governed by the environment and not under the control of the learner. As a consequence, even the choice of a convex loss function can result in a highly non-convex $\mathrm{PR}$ minimization problem. Prior work has identified a pair of general conditions on the loss and the mapping from model parameters to distributions that implies convexity of the performative risk. In this paper, we relax these assumptions and focus on obtaining weaker notions of convexity, without sacrificing the amenability of the $\mathrm{PR}$ minimization problem for iterative optimization methods. Link » Yulai Zhao 🔗 - Quantization based Optimization : Alternative Stochastic Approximation of Global Optimization (Poster)  link »    In this study, we propose a global optimization algorithm based on quantizing the energy level of an objective function in an NP-hard problem.According to the white noise hypothesis for a quantization error with a dense and uniform distribution, we can regard the quantization error as i.i.d. white noise. According to stochastic analysis, the proposed algorithm converges weakly only under conditions satisfying Lipschitz continuity, instead of local convergence properties such as the Hessian constraint of the objective function. This shows that the proposed algorithm ensures global optimization by Laplace's condition. Numerical experiments show that the proposed algorithm outperforms conventional learning methods in solving NP-hard optimization problems such as the traveling salesman problem. Link » Jinwuk Seok · Changsik Cho 🔗 - Completing the Model Optimization Process by Correcting Patterns of Failure in Regression Tasks (Poster)  link »    Model selection and hyper-parameter optimization sometimes prove to be complex and costly processes with unfinished outcomes. In fact, a so-called optimized model can still suffer from patterns of failure when predicting on new data, affecting the generalization error. In this paper, we focus on regression tasks and introduce an additional stage to the model optimization process in order to render it more reliable. This new step aims to correct error patterns when the model makes predictions on unlabeled data. To that end, our method includes two techniques. AutoCorrect Rules leverage the model under/overestimation bias and applies simple rules to adjust predictions. AutoCorrect Model is a supervised approach which exploits different representations to predict residuals in order to revise model predictions. We empirically prove the relevance of our method on the outcome of an AutoML tool using different time budgets, and on a specific optimization case leveraging a pre-trained model for an image regression task. Link » Thomas Bonnier 🔗 - Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses & Extension to Non-Convex Losses (Poster)  link »    We study differentially private (DP) stochastic optimization (SO) with data containing outliers and loss functions that are (possibly) not Lipschitz continuous. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz over data (i.e. stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it is often unrealistic: in many practical problems, the loss function may not be uniformly Lipschitz. Even when the loss function is Lipschitz continuous, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work does not require the loss function to be uniformly Lipschitz. Instead, building on a recent line of work (Wang et al., 2020; Kamath et al., 2022), we make the weaker assumption that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on DP Lipschitz SO, our excess risk scales with the $k$-th moment bound instead of the Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to the prior works, our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk. Link » Andrew Lowy · Meisam Razaviyayn 🔗 - Sufficient Conditions for Non-asymptotic Convergence of Riemannian Optimization Methods (Oral)  link » Motivated by energy based analyses for descent methods in the Euclidean setting, we investigate a generalisation of such energy based analyses for descent methods over Riemannian manifolds. In doing so, we find that it is possible to derive curvature-free guarantees for such descent methods, improving on work by Zhang and Sra [2016]. This analysis allows us to study acceleration of Riemannian gradient descent in the geodesically-convex setting, and improve on an existing result by Alimisis et al 2021]. Finally, extending the analysis by Ahn and Sra [2020], we attempt to provide some sufficient conditions for the acceleration of Riemannian descent methods in the strongly geodesically convex setting. Link » Vishwak Srinivasan · Ashia Wilson 🔗