Workshop
OPT 2022: Optimization for Machine Learning
Courtney Paquette · Sebastian Stich · Quanquan Gu · Cristóbal Guzmán · John Duchi
Room 295–296
OPT 2022 will bring together experts in optimization to share their perspectives, alongside crossover experts in ML sharing their views and recent advances. OPT 2022 honors the tradition of bringing together people from optimization and from ML in order to promote and generate new interactions between the two communities.
To foster the spirit of innovation and collaboration that is a goal of this workshop, OPT 2022 will focus the contributed talks on research in Reliable Optimization Methods for ML. Many optimization algorithms for ML were originally developed to handle computational constraints (e.g., stochastic gradient-based algorithms), and their analyses followed the classical optimization approach of measuring performance by (i) computational cost and (ii) convergence for any input to the algorithm. As engineering capabilities increase and ML is widely adopted in real-world applications, practitioners are seeking optimization algorithms that go beyond finding the minimizer as fast as possible: they want reliable methods that handle the complications that arise in the real world. For example, bad actors increasingly attempt to fool models with deceptive data. This raises questions such as: which algorithms are more robust to adversarial attacks, and can one design new algorithms that thwart such attacks? The latter question motivates a new area of optimization focused on game-theoretic environments, that is, environments where competing forces are at play, and on devising guarantees for them. Beyond this, a main reason for the success of ML is that optimization algorithms seemingly generate points that learn from training data; that is, we want minimizers of the training objective to remain meaningful on new data (generalization), yet we do not understand which features (e.g., the loss function, the algorithm, the depth of the architecture in deep learning, and/or the training samples) yield better generalization properties. These new areas of solving practical ML problems, and their deep ties to the optimization community, warrant a discussion between the two communities.
Specifically, we aim to discuss the meaning of generalization as well as the challenges facing real-world applications of ML and the new paradigms for optimizers seeking to solve them.
Plenary Speakers: All invited speakers have agreed to attend the workshop in person.
* Niao He (ETH, Zurich, assistant professor)
* Zico Kolter (Carnegie Mellon University, associate professor)
* Lorenzo Rosasco (U Genova/MIT, assistant professor)
* Katya Scheinberg (Cornell, full professor)
* Aaron Sidford (Stanford, assistant professor)
Schedule
Sat 6:55 a.m. – 7:00 a.m.

Welcome Remarks (Intro)
Courtney Paquette
Sat 7:00 a.m. – 7:30 a.m.

Katya Scheinberg, Stochastic Oracles and Where to Find Them (Plenary Speaker)
Title: Stochastic Oracles and Where to Find Them
Abstract: Continuous optimization is a mature field, which has recently undergone major expansion and change. One of the key new directions is the development of methods that do not require exact information about the objective function. Nevertheless, the majority of these methods, from stochastic gradient descent to "zeroth-order" methods, use some kind of approximate first-order information. We will introduce a general definition of a stochastic oracle and show how this definition applies in a variety of familiar settings, including simple stochastic gradients via sampling and traditional and randomized finite-difference methods, but also more specialized settings, such as robust gradient estimation. We will also overview several stochastic methods and how the general definition extends to the oracles used by these methods.
Katya Scheinberg
Sat 7:30 a.m. – 8:00 a.m.

Contributed Talks 1 (Contributed talks)
Two (15 min) contributed talks in this session.
Courtney Paquette · Tian Li · Guy Kornowski
Sat 8:00 a.m. – 9:00 a.m.

Poster Session 1 (Poster Session)
Andrew Lowy · Thomas Bonnier · Yiling Xie · Guy Kornowski · Simon Schug · Seungyub Han · Nicolas Loizou · Xinwei Zhang · Laurent Condat · Tabea E. Röber · Si Yi Meng · Marco Mondelli · Runlong Zhou · Eshaan Nichani · Adrian Goldwaser · Rudrajit Das · Kayhan Behdin · Atish Agarwala · Mukul Gagrani · Gary Cheng · Tian Li · Haoran Sun · Hossein Taheri · Allen Liu · Siqi Zhang · Dmitrii Avdiukhin · Bradley Brown · Miaolan Xie · Junhyung Lyle Kim · Sharan Vaswani · Xinmeng Huang · Ganesh Ramachandra Kini · Angela Yuan · Weiqiang Zheng · Jiajin Li

Sat 9:00 a.m. – 9:30 a.m.

Contributed Talks 2 (Contributed talks)
Two (15 min) contributed talks in this session.
Quanquan Gu · Aaron Defazio · Jiajin Li
Sat 9:30 a.m. – 10:00 a.m.

Niao He, Simple Fixes for Adaptive Gradient Methods for Nonconvex Min-Max Optimization (Plenary Speaker)
Title: Simple Fixes for Adaptive Gradient Methods for Nonconvex Min-Max Optimization
Abstract: Adaptive gradient methods such as AdaGrad and Adam have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner and are successful in nonconvex minimization. When it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper timescale separation may fail to work in practice. In fact, even for a quadratic example, the naive combination of gradient descent ascent with any existing adaptive stepsizes is proven to diverge if the initial primal-dual stepsize ratio is not carefully chosen. We introduce two simple fixes for these adaptive methods, allowing automatic adaptation to the timescale separation necessary for fast convergence. The resulting algorithms are fully parameter-agnostic and achieve near-optimal complexities in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems, without a priori knowledge about problem-specific parameters. This is based on joint work with Junchi Yang and Xiang Li.
Niao He
Sat 10:00 a.m. – 12:00 p.m.

Lunch (Break)
Sat 12:00 p.m. – 12:30 p.m.

Zico Kolter, Adapt like you train: How optimization at training time affects model fine-tuning and adaptation (Plenary Speaker)
Title: Adapt like you train: How optimization at training time affects model fine-tuning and adaptation
Abstract: With the growing use of large-scale machine learning models pretrained on massive datasets, it is becoming increasingly important to understand how we can efficiently adapt these models to downstream tasks at test time. In this talk, I will discuss our recent work that highlights an important but often overlooked factor in this process: specifically, we have found in several cases that the loss function used to train the model has important implications as to the best way to fine-tune or adapt the model. I will highlight two specific examples of this phenomenon: 1) illustrating that using a contrastive loss outperforms alternatives for fine-tuning contrastively-pretrained vision-language models; and 2) showing how we can leverage the convex conjugate of the training loss to perform label-free test-time adaptation. I will end by highlighting open questions and directions for this work.
J. Zico Kolter
Sat 12:30 p.m. – 1:15 p.m.

Contributed Talks 3 (Contributed talks)
Three (15 min) contributed talks in this session.
Cristóbal Guzmán · Fangshuo Liao · Vishwak Srinivasan · Zhiyuan Li
Sat 1:15 p.m. – 1:45 p.m.

Aaron Sidford, Efficiently Minimizing the Maximum Loss (Plenary Speaker)
Title: Efficiently Minimizing the Maximum Loss
Abstract: In this talk I will discuss recent advances in the fundamental robust optimization problem of minimizing the maximum of a finite number of convex loss functions. In particular, I will show how to develop stochastic methods for approximately solving this problem with a near-optimal number of gradient queries. Along the way, I will cover several optimization techniques of broader utility, including accelerated methods for using ball-optimization oracles and stochastic bias-reduced gradient methods. This talk will include joint work with Hilal Asi, Yair Carmon, Arun Jambulapati, and Yujia Jin, including https://arxiv.org/abs/2105.01778 and https://arxiv.org/abs/2106.09481.
Aaron Sidford
Sat 1:45 p.m. – 1:50 p.m.

Closing Remarks (Closing)
Courtney Paquette
Sat 1:50 p.m. – 2:50 p.m.

Poster Session 2 (Poster Session)
Jinwuk Seok · Bo Liu · Ryotaro Mitsuboshi · David Martinez-Rubio · Weiqiang Zheng · Ilgee Hong · Chen Fan · Kazusato Oko · Bo Tang · Miao Cheng · Aaron Defazio · Tim G. J. Rudner · Gabriele Farina · Vishwak Srinivasan · Ruichen Jiang · Peng Wang · Jane Lee · Nathan Wycoff · Nikhil Ghosh · Yinbin Han · David Mueller · Liu Yang · Amrutha Varshini Ramesh · Siqi Zhang · Kaifeng Lyu · David Yunis · Kumar Kshitij Patel · Fangshuo Liao · Dmitrii Avdiukhin · Xiang Li · Sattar Vakili · Jiaxin Shi



A Finite-Particle Convergence Rate for Stein Variational Gradient Descent (Poster)
link
We provide the first finite-particle convergence rate for Stein variational gradient descent (SVGD). Specifically, with certain choices of step-size sequences, SVGD with $n$ particles drives the kernel Stein discrepancy to zero at the rate $O\left(\frac{1}{\sqrt{\log\log n}}\right)$. We suspect that the dependence on $n$ can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.
Jiaxin Shi · Lester Mackey
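To make the dynamics being analyzed concrete, here is a minimal NumPy sketch of one SVGD update with an RBF kernel; the bandwidth, step size, and standard-normal target below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def svgd_update(x, grad_logp, stepsize=0.1, h=1.0):
    """One SVGD step: each particle moves along a kernel-weighted average of the
    scores plus a repulsive term that keeps the particles spread apart."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                 # pairwise differences x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h))           # RBF kernel matrix k(x_j, x_i)
    drive = K @ grad_logp(x)                             # kernel-smoothed score term
    repulse = ((diff / h) * K[:, :, None]).sum(axis=1)   # sum_j grad_{x_j} k(x_j, x_i)
    return x + stepsize * (drive + repulse) / n

# Illustration: push 20 particles toward a standard normal target (score = -x).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(20, 1))
for _ in range(300):
    x = svgd_update(x, lambda z: -z)
```

The repulsive term is what distinguishes SVGD from running n independent gradient ascents on log p: it prevents the particles from collapsing onto the mode.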


Data-heterogeneity-aware Mixing for Decentralized Learning (Poster)
link
Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches to decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes. We propose a metric that quantifies the ability of a graph to mix the current gradients. We further prove that the metric controls the convergence rate, particularly in settings where the heterogeneity across nodes dominates the stochasticity between updates for a given node. Motivated by our analysis, we propose an approach that periodically and efficiently optimizes the metric using standard convex constrained optimization and sketching techniques.
Yatin Dandi · Anastasiia Koloskova · Martin Jaggi · Sebastian Stich


Rieoptax: Riemannian Optimization in JAX (Poster)
link
We present Rieoptax, an open-source Python library for Riemannian optimization in JAX. We show that many differential-geometric primitives, such as Riemannian exponential and logarithm maps, are usually faster in Rieoptax than in existing frameworks in Python, both on CPU and GPU. We support a wide range of basic and advanced stochastic optimization solvers, such as Riemannian stochastic gradient, stochastic variance reduction, and adaptive gradient methods. A distinguishing feature of the proposed toolbox is that we also support differentially private optimization on Riemannian manifolds.
Saiteja Utpala · Andi Han · Pratik Kumar Jawanpuria · Bamdev Mishra
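As a plain-NumPy illustration of the kind of primitive such a toolbox provides (this is not Rieoptax's API), a Riemannian gradient step on the unit sphere projects the Euclidean gradient onto the tangent space and then retracts back onto the manifold:

```python
import numpy as np

def sphere_rgrad(egrad, x):
    # Riemannian gradient on the unit sphere: remove the radial component of egrad.
    return egrad - np.dot(egrad, x) * x

def sphere_retract(x, v):
    # Retraction: take the tangent step, then renormalize back onto the sphere.
    y = x + v
    return y / np.linalg.norm(y)

def rsgd_step(grad_fn, x, lr=0.1):
    # One Riemannian (stochastic) gradient descent step.
    return sphere_retract(x, -lr * sphere_rgrad(grad_fn(x), x))

# Illustration: minimizing -x^T A x on the sphere recovers the top eigenvector of A.
A = np.diag([2.0, 1.0])
x = np.array([0.6, 0.8])
for _ in range(200):
    x = rsgd_step(lambda z: -2.0 * A @ z, x)
```

The same project-then-retract pattern underlies Riemannian SGD on other matrix manifolds; only the projection and retraction formulas change.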


TorchOpt: An Efficient Library for Differentiable Optimization (Poster)
link
Recent years have witnessed the booming of various differentiable optimization algorithms. These algorithms exhibit different execution patterns, and their execution needs massive computational resources that go beyond a single CPU and GPU. Existing differentiable optimization libraries, however, cannot support efficient algorithm development and multi-CPU/GPU execution, making the development of differentiable optimization algorithms often cumbersome and expensive. This paper introduces TorchOpt, a PyTorch-based efficient library for differentiable optimization. TorchOpt provides a unified and expressive bilevel optimization programming abstraction. This abstraction allows users to efficiently declare and analyze various differentiable optimization programs with explicit gradients, implicit gradients, and zero-order gradients. TorchOpt further provides a high-performance distributed execution runtime. This runtime can fully parallelize computation-intensive differentiation operations (e.g., tensor tree flattening) on CPUs/GPUs and automatically distribute computation to distributed devices. Experimental results show that TorchOpt outperforms state-of-the-art libraries by $7\times$ on an 8-GPU server. TorchOpt will be open-sourced under the permissive Apache-2.0 License.
Jie Ren · Xidong Feng · Bo Liu · Xuehai Pan · Yao Fu · Luo Mai · Yaodong Yang


On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data (Poster)
link
It has been empirically observed that training large models with weighted cross-entropy (CE) beyond the zero-training-error regime is not a satisfactory remedy for label-imbalanced data. Instead, researchers have proposed the vector-scaling (VS) loss, a parameterization of the CE loss that is tailored to this modern training regime. The driving force behind understanding the impact of such parameterizations on the gradient-descent path has been the theory of implicit bias. Specifically for linear(ized) models, this theory allows one to explain why weighted CE fails and how the VS loss biases the optimization path towards solutions that favor minorities. However, beyond linear models the description of implicit bias is more obscure. In order to gain insight into the impact of different CE parameterizations in nonlinear models, we investigate the implicit geometry of the learnt classifiers and embeddings. Our main result characterizes the global minimizers of a nonconvex cost-sensitive SVM classifier for the so-called unconstrained-features model, which serves as an abstraction of deep models. We also study empirically the convergence of SGD to this global minimizer, observing slowdowns with increasing imbalance ratios and scalings of the loss hyperparameters.
Tina Behnia · Ganesh Ramachandra Kini · Vala Vakilian · Christos Thrampoulidis


Rethinking Sharpness-Aware Minimization as Variational Inference (Poster)
link
Sharpness-aware minimisation (SAM) aims to improve the generalisation of gradient-based learning by seeking out flat minima. In this work, we establish connections between SAM and mean-field variational inference (MFVI) of neural network parameters. We show that both of these methods have interpretations as optimizing notions of flatness, and that when using the reparametrisation trick, they both boil down to calculating the gradient at a perturbed version of the current mean parameter. This thinking motivates our study of algorithms that combine or interpolate between SAM and MFVI. We evaluate the proposed variational algorithms on several benchmark datasets, and compare their performance to variants of SAM. Taking a broader perspective, our work suggests that SAM-like updates can be used as a drop-in replacement for the reparametrisation trick.
Szilvia Ujváry · Zsigmond Telek · Anna Kerekes · Anna Mészáros · Ferenc Huszar
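For reference, the "gradient at a perturbed version of the current parameter" structure of SAM can be sketched in a few lines; the quadratic toy loss and the hyperparameters here are illustrative assumptions, not from the paper:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update (in the style of Foret et al.): evaluate the gradient at the
    adversarially perturbed point w + rho * g / ||g||, then step from w with it."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction toward higher loss
    g_sam = grad_fn(w + eps)                     # gradient at the perturbed parameters
    return w - lr * g_sam

# Toy example: quadratic loss f(w) = 0.5 * ||w||^2, so grad f(w) = w.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w, lambda z: z)
```

The MFVI connection drawn in the abstract replaces the worst-case perturbation `eps` with a random (reparametrised) one around the mean parameter.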


TiAda: A Timescale Adaptive Algorithm for Nonconvex Minimax Optimization (Poster)
link
Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyperparameters and knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate timescale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the timescale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.
Xiang Li · Junchi Yang · Niao He
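The timescale-separation issue can be illustrated with a generic AdaGrad-style GDA skeleton. This is not TiAda itself (TiAda removes the need for the hand-picked ratio below); the toy objective f(x, y) = xy - y²/2 and the stepsizes are assumptions for illustration:

```python
import numpy as np

def adagrad_gda(grad_x, grad_y, x, y, steps=5000, eta_x=0.05, eta_y=0.5):
    """AdaGrad-style gradient descent ascent with a hand-tuned faster dual
    timescale (eta_y >> eta_x) -- the kind of primal-dual stepsize ratio that,
    per the abstract, must otherwise be chosen carefully to avoid divergence."""
    vx, vy = 1e-8, 1e-8
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        vx, vy = vx + gx ** 2, vy + gy ** 2
        x = x - eta_x * gx / np.sqrt(vx)   # slow primal (descent) update
        y = y + eta_y * gy / np.sqrt(vy)   # fast dual (ascent) update
    return x, y

# Toy strongly-concave-in-y saddle: f(x, y) = x*y - 0.5*y**2, saddle point (0, 0).
x, y = adagrad_gda(lambda x, y: y, lambda x, y: x - y, 1.0, 1.0)
```

With the dual running faster, y tracks its best response y = x while x slowly descends; shrinking eta_y toward eta_x degrades the tracking and can stall or destabilize the primal iterate.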


TiAda: A Timescale Adaptive Algorithm for Nonconvex Minimax Optimization (Oral)
link
Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyperparameters and knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate timescale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the timescale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.
Xiang Li · Junchi Yang · Niao He


Toward Understanding Why Adam Converges Faster Than SGD for Transformers (Poster)
link
While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some deep learning applications such as training transformers. However, it remains a question why Adam converges significantly faster than SGD in these scenarios. In this paper, we explore one explanation of why Adam converges faster than SGD using a new concept, directional sharpness. We argue that the performance of optimization algorithms is closely related to the directional sharpness of the update steps, and show that SGD has much worse directional sharpness than adaptive algorithms. We further observe that only a small fraction of the coordinates cause the bad sharpness and slow convergence of SGD, and propose to use coordinate-wise clipping as a solution for SGD and other optimization algorithms. We demonstrate the effect of coordinate-wise clipping in reducing sharpness and speeding up the convergence of optimization algorithms under various settings, and conclude that the sharpness-reduction effect of adaptive coordinate-wise scaling is the reason for Adam's success in practice.
Yan Pan · Yuanzhi Li
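A minimal sketch of the coordinate-wise clipping idea; the threshold `tau` and the toy gradient are illustrative choices, not the paper's settings:

```python
import numpy as np

def clipped_sgd_step(w, g, lr=0.1, tau=1.0):
    # Coordinate-wise clipping: cap each coordinate of the gradient at tau,
    # limiting the influence of the few coordinates with extreme magnitudes.
    return w - lr * np.clip(g, -tau, tau)

# One badly scaled coordinate no longer dominates the update direction.
w = np.zeros(2)
g = np.array([100.0, 0.5])   # one coordinate is 200x larger than the other
w = clipped_sgd_step(w, g)   # -> [-0.1, -0.05]
```

Like Adam's per-coordinate normalization, this rescales only the offending coordinates instead of shrinking the whole update as global norm clipping would.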


Bidirectional Adaptive Communication for Heterogeneous Distributed Learning (Poster)
link
Communication is a key bottleneck in distributed optimization, and, in particular, bandwidth and latency can be limiting factors when devices are connected over commodity networks, such as in federated learning. State-of-the-art techniques tackle these challenges with advanced compression techniques or by delaying communication rounds according to predefined schedules. We present a new scheme that adaptively skips communication (broadcast and client uploads) by detecting slow-varying updates. The scheme automatically adjusts the communication frequency independently for each worker and the server. By utilizing an error-feedback mechanism borrowed from the compression literature, we prove that the convergence rate is the same as for batch gradient descent in the convex and nonconvex smooth cases. We show that the total number of communication rounds between server and clients needed to achieve a targeted accuracy is reduced, even when the data distribution is highly non-IID.
Dmitrii Avdiukhin · Vladimir Braverman · Nikita Ivkin · Sebastian Stich
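The error-feedback mechanism referenced above can be sketched as follows; the top-1 sparsifier and quadratic example are illustrative assumptions from the compression literature, not the paper's skipping scheme:

```python
import numpy as np

def top1_compress(v):
    # Keep only the largest-magnitude coordinate (an aggressive sparsifier).
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

def ef_sgd(grad_fn, w, steps=200, lr=0.1):
    """Error-feedback SGD: the compression error is stored locally and added
    back to the next update, so dropped information is eventually transmitted."""
    e = np.zeros_like(w)
    for _ in range(steps):
        u = lr * grad_fn(w) + e   # proposed update plus accumulated error
        c = top1_compress(u)      # what actually gets communicated
        e = u - c                 # remember what was dropped
        w = w - c
    return w

# Toy quadratic f(w) = 0.5 * ||w||^2, so grad f(w) = w.
w = ef_sgd(lambda z: z, np.array([1.0, 0.5]))
```

Without the residual `e`, the smaller coordinate would be starved indefinitely; with it, the method matches plain gradient descent up to a delayed tail.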


How Sharpness-Aware Minimization Minimizes Sharpness? (Poster)
link
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks in various settings. However, the underlying workings of SAM remain elusive because of various intriguing approximations in its theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences between these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximation in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect when full-batch gradients are applied. Furthermore, we prove that the stochastic version of SAM in fact regularizes another notion of sharpness, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the implicit alignment between the gradient and the top eigenvector of the Hessian when running SAM.
Kaiyue Wen · Tengyu Ma · Zhiyuan Li


How Sharpness-Aware Minimization Minimizes Sharpness? (Oral)
link
Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks in various settings. However, the underlying workings of SAM remain elusive because of various intriguing approximations in its theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences between these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximation in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect when full-batch gradients are applied. Furthermore, we prove that the stochastic version of SAM in fact regularizes another notion of sharpness, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the implicit alignment between the gradient and the top eigenvector of the Hessian when running SAM.
Kaiyue Wen · Tengyu Ma · Zhiyuan Li


Strong Lottery Ticket Hypothesis with $\epsilon$-perturbation (Poster)
link
The strong Lottery Ticket Hypothesis (LTH) claims that there exists a subnetwork in a sufficiently large, randomly initialized neural network that approximates a target neural network without the need for training. This work extends the theoretical guarantees of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change achieved in the pretraining step to a perturbation around the initialization. In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the overparameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change induced by SGD coincide with a good set of such perturbations?
Fangshuo Liao · Zheyang Xiong · Anastasios Kyrillidis


Strong Lottery Ticket Hypothesis with $\epsilon$-perturbation (Oral)
link
The strong Lottery Ticket Hypothesis (LTH) claims that there exists a subnetwork in a sufficiently large, randomly initialized neural network that approximates a target neural network without the need for training. This work extends the theoretical guarantees of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change achieved in the pretraining step to a perturbation around the initialization. In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the overparameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change induced by SGD coincide with a good set of such perturbations?
Fangshuo Liao · Zheyang Xiong · Anastasios Kyrillidis


Optimization for Robustness Evaluation beyond ℓp Metrics (Poster)
link
Empirical evaluations of neural network models against adversarial attacks entail solving nontrivial constrained optimization problems. Popular algorithms for solving these constrained problems rely on projected gradient descent (PGD) and require careful tuning of multiple hyperparameters. Moreover, PGD can only handle $\ell_1$, $\ell_2$, and $\ell_\infty$ attacks due to the use of analytical projectors. In this paper, we introduce an alternative algorithmic framework that blends the general-purpose constrained-optimization solver PyGRANSO With Constraint-Folding (PWCF) to add reliability and generality to the existing adversarial evaluations. PWCF 1) finds good-quality solutions without delicate tuning of multiple hyperparameters; and 2) can handle general attack models that are inaccessible to the existing algorithms, e.g., $\ell_{p > 0}$ and perceptual attacks.
Hengyue Liang · Buyun Liang · Ying Cui · Tim Mitchell · Ju Sun


Neural Networks Efficiently Learn Low-Dimensional Representations with SGD (Poster)
link
We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD), where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u_1},\boldsymbol{x}\rangle,...,\langle\boldsymbol{u_k},\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u_1},...,\boldsymbol{u_k}$ of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $\mathcal{O}(\sqrt{{kd}/{T}})$ after $T$ iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ sample requirement to learn any degree-$p$ polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization.
Alireza Mousavi-Hosseini · Sejun Park · Manuela Girotti · Ioannis Mitliagkas · Murat Erdogdu


Nesterov Meets Optimism: Rate-Optimal Optimistic-Gradient-Based Method for Stochastic Bilinearly-Coupled Minimax Optimization (Poster)
link
We provide a novel first-order optimization method for bilinearly-coupled strongly-convex-concave minimax optimization, which we call Accelerated Optimistic Gradient (AG-OG). The main idea of our algorithm is to leverage the structure of the considered minimax problem by using Nesterov's acceleration on the individual parts and optimism on the coupling term of the objective. We first motivate our method by showing that the dynamics of its continuous version correspond to a linear combination of the ODE dynamics of Optimistic Gradient and of Nesterov's acceleration. This continuous-time approach allows us to showcase the main properties of our method that will eventually guide our analysis in the discrete case. Furthermore, by properly restarting AG-OG, we show that we can achieve optimal (up to a constant) convergence rates with respect to the conditioning of the coupling and individual parts in the stochastic and deterministic cases.
Chris Junchi Li · Angela Yuan · Gauthier Gidel · Michael Jordan


Distributed Online and Bandit Convex Optimization (Poster)
link
We study the problems of distributed online and bandit convex optimization against an adaptive adversary. We want to minimize the average regret on $M$ machines that communicate $R$ times intermittently. We show that collaboration is not beneficial if the machines can access gradients of the cost functions at each time step, i.e., they have first-order feedback. In this setting, simple non-collaborative algorithms are minimax optimal. This contrasts with the provable benefit of collaboration when optimizing against a stochastic adversary, which samples the cost functions from fixed distributions. To identify the benefit of collaboration, we consider the harder setting where the machines can only access values of their cost functions, i.e., they have bandit feedback. Here, we identify the high-dimensional regime where collaboration is beneficial and may even lead to a linear speedup in the number of machines. Our results bridge the gap between distributed online optimization against stochastic and adaptive adversaries.
Kumar Kshitij Patel · Aadirupa Saha · Nati Srebro · Lingxiao Wang


On Convexity and Linear Mode Connectivity in Neural Networks (Poster)
link
In many cases, neural networks trained with stochastic gradient descent (SGD) that share an early and often small portion of the training trajectory have solutions connected by a linear path of low loss. This phenomenon, called linear mode connectivity (LMC), has been leveraged for pruning and model averaging in large neural network models, but it is not well understood how broadly or why it occurs. LMC suggests that SGD trajectories somehow end up in a "convex" region of the loss landscape and stay there. In this work, we confirm that this eventually does happen by finding a high-dimensional convex hull of low loss between the endpoints of several SGD trajectories. But to our surprise, simple measures of convexity do not show any obvious transition at the point when SGD will converge into this region. To understand this convex hull better, we investigate the functional behaviors of its endpoints. We find that only a small number of correct predictions are shared between all endpoints of a hull, and an even smaller number of correct predictions are shared between hulls, even when the final accuracy is high for every endpoint. Thus, we tie LMC more tightly to convexity, and raise several new questions about the source of this convexity in neural network optimization.
David Yunis · Kumar Kshitij Patel · Pedro Savarese · Gal Vardi · Jonathan Frankle · Matthew Walter · Karen Livescu · Michael Maire


Optimal Complexity in Non-Convex Decentralized Learning over Time-Varying Networks (Poster)
link
Decentralized optimization with time-varying networks is an emerging paradigm in machine learning. It saves remarkable communication overhead in large-scale deep training and is more robust in wireless scenarios, especially when nodes are moving. Federated learning can also be regarded as decentralized optimization with time-varying communication patterns alternating between global averaging and local updates. While numerous studies exist to clarify its theoretical limits and develop efficient algorithms, it remains unclear what the optimal complexity is for nonconvex decentralized stochastic optimization over time-varying networks. The main difficulties lie in how to gauge the effectiveness of transmitting messages between two nodes via time-varying communications, and how to establish the lower bound when the network size is fixed (which is a prerequisite in stochastic optimization). This paper resolves these challenges and establishes the first complexity lower bound. We also develop a new decentralized algorithm to nearly attain the lower bound, showing the tightness of the lower bound and the optimality of our algorithm.
Xinmeng Huang · Kun Yuan


Target-based Surrogates for Stochastic Optimization
(
Poster
)
link
SlidesLive Video
We consider minimizing functions for which it is computationally expensive to query the (stochastic) gradient. Such functions are prevalent in applications like reinforcement learning, online imitation learning and bilevel optimization. We exploit the composite structure in these functions and propose a target optimization framework. Our framework leverages the smoothness of the loss with respect to an intermediate target space (e.g. the output of a neural network model), and uses gradient information to construct surrogate functions. In the full-batch setting, we prove that the surrogate function is a global upper bound on the overall loss, and can be (locally) minimized using any black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point at an $O\left(\frac{1}{T}\right)$ rate, thus matching gradient descent. In the stochastic setting, we propose a stochastic surrogate optimization (SSO) algorithm that can be viewed as projected stochastic gradient descent in the target space. We leverage this connection in order to prove that SSO can match the SGD rate for strongly-convex functions. Experimentally, we evaluate the SSO algorithm on convex supervised learning losses and show competitive performance compared to SGD and its variants.

Jonathan Lavington · Sharan Vaswani · Reza Babanezhad Harikandeh · Mark Schmidt · Nicolas Le Roux 🔗 


Why (and When) does Local SGD Generalize Better than SGD?
(
Poster
)
link
SlidesLive Video Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically. It has been recently observed that Local SGD can not only achieve the design goal of reducing the communication overhead but also lead to higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), though the training regimes for this to happen are still in debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation. The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD with a small learning rate, after approaching the manifold of minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that having a small learning rate and long enough training time enables the generalization improvement over SGD, but removing either of the two conditions leads to no improvement. 
Xinran Gu · Kaifeng Lyu · Longbo Huang · Sanjeev Arora 🔗 
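The periodic parameter averaging that defines Local SGD can be sketched in a few lines. This is a minimal illustration only; the toy quadratic objective, noise level, worker count, and step size below are invented for the example, not taken from the paper:

```python
import numpy as np

def local_sgd(grad, workers, steps, sync_every, lr, rng):
    # Each worker runs SGD independently; every `sync_every` steps all
    # workers average their parameters (one communication round).
    params = [np.zeros(2) for _ in range(workers)]
    for t in range(steps):
        for i in range(workers):
            params[i] = params[i] - lr * grad(params[i], rng)
        if (t + 1) % sync_every == 0:
            avg = sum(params) / workers
            params = [avg.copy() for _ in range(workers)]
    return sum(params) / workers

# Toy objective f(x) = 0.5 * ||x - 1||^2 with Gaussian gradient noise.
def noisy_grad(x, rng):
    return (x - 1.0) + 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_hat = local_sgd(noisy_grad, workers=4, steps=200, sync_every=8, lr=0.1, rng=rng)
```

Averaging only every few steps keeps communication cheap, while the final iterate still lands close to the minimizer of the toy objective.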


Momentum Extragradient is Optimal for Games with Cross-Shaped Spectrum
(
Poster
)
link
SlidesLive Video
The extragradient method has recently gained a lot of attention, due to its convergence behavior on smooth games. In games, the eigenvalues of the Jacobian of the vector field are distributed on the complex plane, exhibiting more convoluted dynamics compared to minimization. In this work, we take a polynomial-based analysis of the extragradient with momentum for optimizing games with \emph{cross-shaped} spectrum on the complex plane. We show two results: first, the extragradient with momentum exhibits three different modes of convergence based on the hyperparameter setup: when the eigenvalues are distributed $(i)$ on the real line, $(ii)$ both on the real line along with complex conjugates, and $(iii)$ only as complex conjugates. Then, we focus on the case $(ii)$, i.e., when the spectrum of the Jacobian has a \emph{cross-shaped} structure, as observed in training generative adversarial networks. For this problem class, we derive the optimal parameters and show that the extragradient with momentum achieves an accelerated convergence rate.

Junhyung Lyle Kim · Gauthier Gidel · Anastasios Kyrillidis · Fabian Pedregosa 🔗 
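For reference, the base (momentum-free) extragradient update that the abstract builds on is easy to state: extrapolate along the vector field, then update using the field evaluated at the extrapolated point. The bilinear toy game, step size, and iteration count below are invented for illustration; the paper's momentum variant adds a heavy-ball term on top of this scheme:

```python
import numpy as np

def extragradient(F, z0, lr, steps):
    # Plain extragradient (no momentum): an extrapolation step followed
    # by an update step, both querying the vector field F.
    z = np.asarray(z0, dtype=float)
    for _ in range(steps):
        z_half = z - lr * F(z)   # extrapolation
        z = z - lr * F(z_half)   # update at the extrapolated point
    return z

# Bilinear game min_x max_y x*y: F(x, y) = (y, -x), equilibrium at the
# origin. The Jacobian's eigenvalues are purely imaginary, a regime where
# plain gradient descent-ascent diverges but extragradient converges.
F = lambda z: np.array([z[1], -z[0]])
z_star = extragradient(F, z0=[1.0, 1.0], lr=0.2, steps=200)
```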


Stochastic Adaptive Regularization Method with Cubics: A High Probability Complexity Bound
(
Poster
)
link
SlidesLive Video We present a high probability complexity bound for a stochastic adaptive regularization method with cubics, also known as the regularized Newton method. The method makes use of stochastic zeroth-, first- and second-order oracles that satisfy certain accuracy and reliability assumptions. Such oracles have been used in the literature by other adaptive stochastic methods, such as trust region and line search. These oracles capture many settings, such as expected risk minimization, stochastic zeroth-order optimization, and others. In this paper, we give the first high-probability iteration bound for stochastic cubic regularization and show that, just as in the deterministic case, it is superior to other adaptive methods. 
Katya Scheinberg · Miaolan Xie 🔗 


ProxSkip for Stochastic Variational Inequalities: A Federated Learning Algorithm for Provable Communication Acceleration
(
Poster
)
link
SlidesLive Video
Recently Mishchenko et al. proposed and analyzed ProxSkip, a provably efficient method for minimizing the sum of a smooth $(f)$ and an expensive nonsmooth proximable $(R)$ function (i.e. $\min_{x \in \mathbb{R}^d} f(x) + R(x)$). The main advantage of ProxSkip is that, in the federated learning (FL) setting, it provably offers an effective acceleration of communication complexity. This work extends this approach to the more general regularized variational inequality problems (VIP). In particular, we propose the ProxSkip-VIP algorithm, which generalizes the original ProxSkip framework to VIP, and we provide convergence guarantees for a class of structured nonmonotone problems. In the federated learning setting, we explain how our approach achieves acceleration in terms of the communication complexity over existing state-of-the-art FL algorithms.

Siqi Zhang · Nicolas Loizou 🔗 


DIMENSION-REDUCED ADAPTIVE GRADIENT METHOD
(
Poster
)
link
SlidesLive Video
Adaptive gradient methods, such as Adam, have shown faster convergence speed than SGD across various kinds of network models, at the expense of inferior generalization performance. In this work, we propose a Dimension-Reduced Adaptive Gradient Method (DRAG) to eliminate the generalization gap. DRAG makes an elegant combination of SGD and Adam by adopting a trust-region-like framework. We observe that 1) Adam adjusts stepsizes for each gradient coordinate according to some loss curvature, and indeed decomposes the $n$-dimensional gradient into $n$ standard basis directions to search; 2) SGD uniformly scales the gradient for all gradient coordinates and actually has only one descent direction to minimize. Accordingly, DRAG reduces the high degree of freedom of Adam and also improves the flexibility of SGD via optimizing the loss along $k\ (\ll \! n)$ descent directions, e.g. the gradient direction and momentum direction used in this work. Then per iteration, DRAG finds the best stepsizes for the $k$ descent directions by solving a trust-region subproblem whose computational overhead is negligible, since the trust-region subproblem is low-dimensional, e.g. $k=2$ in this work. DRAG is compatible with the common deep learning training pipeline without introducing extra hyperparameters and with negligible extra computation. Experimental results on representative benchmarks testify to the fast convergence speed and also the superior generalization of DRAG.

Jingyang Li · Pan Zhou · Kuangyu Ding · KimChuan Toh · Yinyu Ye 🔗 


Fast Convergence of Greedy 2-Coordinate Updates for Optimizing with an Equality Constraint
(
Poster
)
link
SlidesLive Video In this work, we study the Block Coordinate Descent (BCD) algorithm with a greedy block selection rule for minimizing functions with one linear equality constraint. We show a faster linear convergence rate for BCD with block size 2 (2-BCD) on functions satisfying the Polyak-Lojasiewicz inequality. Our analysis exploits the similarity between the solutions of 2-BCD and equality-constrained steepest descent (SD) in the $\ell_1$-norm. This yields a simple proof. 
Amrutha Varshini Ramesh · Aaron Mishkin · Mark Schmidt 🔗 
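To make the setting concrete: a greedy 2-coordinate step under the constraint $\sum_i x_i = c$ can pick the coordinates with the largest and smallest partial derivatives and move mass between them, which leaves the constraint intact. The sketch below is a simplified variant of this idea on an invented toy quadratic (the step rule and problem data are for illustration, not the paper's exact algorithm):

```python
import numpy as np

def greedy_2bcd(grad, x0, lr, steps):
    # Greedy 2-coordinate descent for min f(x) s.t. sum(x) fixed: move
    # mass from the coordinate with the largest partial derivative to
    # the one with the smallest, so sum(x) never changes.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        i, j = int(np.argmax(g)), int(np.argmin(g))
        delta = lr * (g[i] - g[j])
        x[i] -= delta
        x[j] += delta
    return x

# Toy problem: min 0.5*||x - t||^2 s.t. sum(x) = 6. Since sum(t) = 6,
# the constrained minimizer is t itself.
t = np.array([3.0, 1.0, 2.0])
x_hat = greedy_2bcd(lambda x: x - t, x0=np.array([2.0, 2.0, 2.0]),
                    lr=0.4, steps=100)
```

Because each update subtracts and adds the same amount on two coordinates, every iterate stays exactly on the constraint hyperplane.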


Quadratic minimization: from conjugate gradients to an adaptive heavyball method with Polyak stepsizes
(
Poster
)
link
SlidesLive Video In this work, we propose an adaptive variation on the classical heavy-ball method for convex quadratic minimization. The adaptivity crucially relies on so-called ``Polyak stepsizes'', which consists in using the knowledge of the optimal value of the optimization problem at hand instead of problem parameters such as a few eigenvalues of the Hessian of the problem. This method happens to also be equivalent to a variation of the classical conjugate gradient method, and thereby inherits many of its attractive features, including its finite-time convergence, instance optimality, and its worst-case convergence rates. The classical gradient method with Polyak stepsizes is known to behave very well in situations in which it can be used, and the question of whether incorporating momentum in this method is possible and can improve the method itself appeared to be open. We provide a definitive answer to this question for minimizing convex quadratic functions, an arguably necessary first step for developing such methods in more general setups. 
Baptiste Goujaud · Adrien Taylor · Aymeric Dieuleveut 🔗 
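The momentum-free baseline the abstract refers to, gradient descent with Polyak stepsizes, uses the step length $(f(x_k) - f_\star)/\|\nabla f(x_k)\|^2$, so it needs the optimal value $f_\star$ rather than eigenvalue information. Below is a minimal sketch of that classical baseline on an invented diagonal quadratic (not the paper's adaptive heavy-ball method):

```python
import numpy as np

def polyak_gd(f, grad, x0, f_star, steps):
    # Gradient descent with the Polyak stepsize
    # (f(x) - f_star) / ||grad f(x)||^2: only the optimal value
    # f_star is required, not the Hessian spectrum.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        gnorm2 = float(g @ g)
        if gnorm2 == 0.0:   # already at the minimizer
            break
        x = x - ((f(x) - f_star) / gnorm2) * g
    return x

# Convex quadratic f(x) = 0.5 x^T A x with A = diag(1, 2); f_star = 0.
A = np.diag([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x
x_hat = polyak_gd(f, lambda x: A @ x, x0=np.array([1.0, 1.0]),
                  f_star=0.0, steps=100)
```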


NCVX: A General-Purpose Optimization Solver for Constrained Machine and Deep Learning
(
Poster
)
link
SlidesLive Video Imposing explicit constraints is new and increasingly trendy in deep learning, stimulated by, e.g., trustworthy AI that performs robust optimization over complicated perturbation sets and scientific applications that need to respect physical laws and constraints. However, it can be hard to reliably solve constrained deep learning problems without optimization expertise. Existing deep learning frameworks do not admit constraints. General-purpose optimization packages can handle constraints but do not perform auto-differentiation and have trouble dealing with nonsmoothness. In this paper, we introduce a new software package called NCVX, whose initial release contains the solver PyGRANSO, a PyTorch-enabled general-purpose optimization package for constrained machine and deep learning problems, the first of its kind. NCVX inherits auto-differentiation, GPU acceleration, and tensor variables from PyTorch, and is built on freely available and widely used open-source frameworks. NCVX is available at https://ncvx.org, with detailed documentation and numerous examples from machine/deep learning and other fields. 
Buyun Liang · Tim Mitchell · Ju Sun 🔗 


A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets
(
Poster
)
link
SlidesLive Video
Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of $\ell_2$ (not squared) norms of the input and output weights associated with each ReLU. This alternative \emph{(and effectively equivalent)} regularization suggests a novel proximal gradient algorithm for network training. Theory and experiments support the new training approach, showing that it can converge much faster to the \emph{sparse} solutions it shares with standard weight decay training.

Liu Yang · Jifan Zhang · Joseph Shenouda · Dimitris Papailiopoulos · Kangwook Lee · Robert Nowak 🔗 
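To make the proximal-gradient idea concrete: the proximal operator of a non-squared $\ell_2$ norm shrinks a whole weight block toward zero and sets it exactly to zero once its norm falls below the threshold, which is the mechanism that produces sparse solutions. The sketch below shows this generic operator and one proximal-gradient step on an invented toy objective; it is not the authors' exact training algorithm:

```python
import numpy as np

def prox_group_l2(w, lam):
    # Proximal operator of lam * ||w||_2 (not squared): block-wise
    # soft-thresholding, which zeroes the block when ||w|| <= lam.
    norm = np.linalg.norm(w)
    if norm <= lam:
        return np.zeros_like(w)
    return (1.0 - lam / norm) * w

# Shrinkage on a block with norm 5 and threshold 1: scaled by 0.8.
shrunk = prox_group_l2(np.array([3.0, 4.0]), 1.0)

# One proximal-gradient step on 0.5*||w - t||^2 + lam*||w||_2, starting
# from w = 0; since ||t|| = 0.5 < lam, the block stays exactly at zero.
t = np.array([0.3, -0.4])
lr, lam = 1.0, 0.6
w = prox_group_l2(np.zeros(2) - lr * (np.zeros(2) - t), lr * lam)
```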


Policy gradient finds global optimum of nearly linear-quadratic control systems
(
Poster
)
link
SlidesLive Video We explore reinforcement learning methods for finding the optimal policy in nearly linear-quadratic control systems. In particular, we consider a dynamic system composed of the summation of a linear and a nonlinear component, which is governed by a policy with the same structure. Assuming that the nonlinear part consists of kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. While the resulting landscape is generally nonconvex, we show local strong convexity and smoothness of the cost function around the global optimizer. In addition, we design a policy gradient algorithm with a carefully chosen initialization and prove that the algorithm is guaranteed to converge to the globally optimal policy at a linear rate. 
Yinbin Han · Meisam Razaviyayn · Renyuan Xu 🔗 


Relating Regularization and Generalization through the Intrinsic Dimension of Activations
(
Poster
)
link
SlidesLive Video Given a pair of models with similar training set performance, it is natural to assume that the model that possesses simpler internal representations would exhibit better generalization. In this work, we provide empirical evidence for this intuition through an analysis of the intrinsic dimension (ID) of model activations, which can be thought of as the minimal number of factors of variation in the model's representation of the data. First, we show that common regularization techniques uniformly decrease the last-layer ID (LLID) of validation set activations for image classification models and show how this strongly affects generalization performance. We also investigate how excessive regularization decreases a model's ability to extract features from data in earlier layers, leading to a negative effect on validation accuracy even while LLID continues to decrease and training accuracy remains near-perfect. Finally, we examine the LLID over the course of training of models that exhibit grokking. We observe that well after training accuracy saturates, when models "grok" and validation accuracy suddenly improves from random to perfect, there is a co-occurring sudden drop in LLID, thus providing more insight into the dynamics of sudden generalization. 
Bradley Brown · Jordan Juravsky · Anthony Caterini · Gabriel LoaizaGanem 🔗 


The Importance of Temperature in Multi-Task Optimization
(
Poster
)
link
SlidesLive Video The promise of multi-task learning is that optimizing a single model on multiple related tasks will lead to a better solution for all tasks than independently trained models. In practice, optimization difficulties, such as conflicting gradients, can result in negative transfer, where multi-task models perform worse than single-task models. In this work, we identify the optimization temperature, the ratio of learning rate to batch size, as a key factor in negative transfer. Temperature controls the level of noise in each optimization step, which prior work has shown to have a strong correlation with generalization. We demonstrate that, in some multi-task settings, negative transfer may arise due to a poorly set optimization temperature, rather than inherently high task conflict. The implication of this finding is that in some settings, SGD with a carefully controlled temperature achieves comparable, and in some cases superior, performance to that of specialized optimization procedures such as PCGrad, MGDA, and GradNorm. In particular, our results suggest that the significant additional computational burden of these specialized methods may not always be necessary. Finally, we observe a conflict between the optimal temperatures of different tasks in a multi-task objective, with different levels of noise promoting better generalization for different tasks. Our work suggests the need for novel multi-task optimization methods which consider individual task noise levels, and their impact on generalization. 
David Mueller · Mark Dredze · Nicholas Andrews 🔗 


Network Pruning at Scale: A Discrete Optimization Approach
(
Poster
)
link
SlidesLive Video
Due to the ever-growing size of neural network models, there has been an emerging interest in compressing (i.e., pruning) neural networks by sparsifying the weights in a pretrained neural network, while maintaining the performance of the dense model as much as possible. In this work, we focus on a neural network pruning framework based on local quadratic models of the loss function. We present an optimization-based approach with an $\ell_0$-regression formulation, and propose novel algorithms to obtain good solutions to the combinatorial optimization problem. In practice, our basic (single-stage) approach, based on one local quadratic model approximation, is up to $10^3$ times faster than existing methods while achieving similar accuracy. We also propose a multi-stage method that outperforms other methods in terms of accuracy for a given sparsity constraint while remaining computationally efficient. In particular, our approach results in a 98\% sparse (i.e., 98\% of the weights in the dense model are set to zero) MLPNet with 90\% test accuracy (i.e., a 3\% reduction in test accuracy compared to the dense model), which is an improvement over the previous best accuracy (55\%).

Wenyu Chen · Riade Benbaki · Xiang Meng · Rahul Mazumder 🔗 


A Novel Stochastic Gradient Descent Algorithm for Learning Principal Subspaces
(
Poster
)
link
In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. 
Charline Le Lan · Joshua Greaves · Jesse Farebrother · Mark Rowland · Fabian Pedregosa · Rishabh Agarwal · Marc Bellemare 🔗 


Nonsmooth Composite Nonconvex-Concave Minimax Optimization
(
Poster
)
SlidesLive Video
Nonconvex-concave minimax optimization has received intense interest in machine learning, including learning with robustness to data distribution, learning with non-decomposable loss, and adversarial learning, to name a few. Nevertheless, most existing works focus on the gradient descent-ascent (GDA) variants that can only be applied in smooth settings. In this paper, we consider a family of minimax problems whose objective function enjoys the nonsmooth composite structure in the variable of minimization and is concave in the variables of maximization. By fully exploiting the composite structure, we propose a smoothed proximal linear descent ascent (\textit{smoothed} PLDA) algorithm and further establish its $\mathcal{O}(\epsilon^{-4})$ iteration complexity, which matches that of smoothed GDA~\cite{zhang2020single} under smooth settings. Moreover, under the mild assumption that the objective function satisfies the one-sided Kurdyka-Lojasiewicz condition with exponent $\theta \in (0,1)$, we can further improve the iteration complexity to $\mathcal{O}(\epsilon^{-2 \max\{2 \theta, 1\}})$. To the best of our knowledge, this is the first provably efficient algorithm for nonsmooth nonconvex-concave problems that can achieve the optimal iteration complexity $\mathcal{O}(\epsilon^{-2})$ if $\theta \in (0, 1/2]$.

Jiajin Li · Linglingzhi Zhu · Anthony ManCho So 🔗 


Decentralized Stochastic Optimization with Client Sampling
(
Poster
)
link
SlidesLive Video
Decentralized optimization is a key setting toward enabling data privacy and on-device learning over networks. Existing research primarily focuses on distributing the objective function across $n$ nodes/clients, lagging behind the real-world challenges such as i) node availability (not all $n$ nodes are always available during the optimization) and ii) slow information propagation (caused by a large number of nodes $n$). In this work, we study Decentralized Stochastic Gradient Descent (D-SGD) with node subsampling, i.e. when only $s~(s \leq n)$ nodes are randomly sampled out of $n$ nodes per iteration. We provide the theoretical convergence rates in smooth (convex and nonconvex) problems with heterogeneous (non-identically distributed data) functions. Our theoretical results capture the effect of node subsampling and the choice of the topology on the sampled nodes, through a metric termed \emph{the expected consensus rate}. On a number of common topologies, including ring and torus, we theoretically and empirically demonstrate the effectiveness of such a metric.

Ziwei Liu · Anastasiia Koloskova · Martin Jaggi · Tao Lin 🔗 


Escaping from Moderately Constrained Saddles
(
Poster
)
link
We give polynomial time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f \colon \mathbb R^d \to \mathbb R$, we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. This constitutes progress (without reliance on NP-oracles or altering the definitions to only account for certain constraints) on the main open question of the breakthrough work of Ge et al., who showed an analogous result for unconstrained and equality-constrained problems. Our results hold for both regular and stochastic gradient descent.

Dmitrii Avdiukhin · Grigory Yaroslavtsev 🔗 


Uniform Convergence and Generalization for Nonconvex Stochastic Minimax Problems
(
Poster
)
link
SlidesLive Video
This paper studies the uniform convergence and generalization bounds for nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization. We first establish the uniform convergence between the empirical minimax problem and the population minimax problem, and show the $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ sample complexities respectively for the NC-SC and NC-C settings, where $d$ is the dimension and $\kappa$ is the condition number. To the best of our knowledge, this is the first uniform convergence result measured by the first-order stationarity in the stochastic minimax optimization literature.

Siqi Zhang · Yifan Hu · Liang Zhang · Niao He 🔗 


Semi-Random Sparse Recovery in Nearly-Linear Time
(
Poster
)
link
SlidesLive Video
Sparse recovery is one of the most fundamental and well-studied inverse problems. Standard statistical formulations of the problem are provably solved by general convex programming techniques and more practical, fast (nearly-linear time) iterative methods. However, these latter "fast algorithms" have previously been observed to be brittle in various real-world settings. We investigate the brittleness of fast sparse recovery algorithms to generative model changes through the lens of studying their robustness to a "helpful" semi-random adversary, a framework which tests whether an algorithm overfits to input assumptions. We consider the following basic model: let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be a measurement matrix which contains an unknown subset of rows $\mathbf{G} \in \mathbb{R}^{m \times d}$ which are bounded and satisfy the restricted isometry property (RIP), but is otherwise arbitrary. Letting $x^\star \in \mathbb{R}^d$ be $s$-sparse, and given either exact measurements $b = \mathbf{A} x^\star$ or noisy measurements $b = \mathbf{A} x^\star + \xi$, we design algorithms recovering $x^\star$ information-theoretically optimally in nearly-linear time. We extend our algorithm to hold for weaker generative models, relaxing our planted RIP row subset assumption to a natural weighted variant, and show that our method's guarantees naturally interpolate the quality of the measurement matrix to, in some parameter regimes, run in sublinear time. Our approach differs from that of prior fast iterative methods with provable guarantees under semi-random generative models (Cheng-Ge '18, Li et al. '20), which typically separate the problem of learning the planted instance from the estimation problem, i.e. they attempt to first learn the planted "good" instance (in our case, $\mathbf{G}$). However, natural conditions which make sparse recovery tractable, such as RIP, are NP-hard to verify, and hence first learning a sufficient row reweighting appears challenging. 
We eschew this approach and design a new iterative method, tailored to the geometry of sparse recovery, which is provably robust to our semirandom model. We hope our approach opens the door to new robust, efficient algorithms for other natural statistical inverse problems.

Jonathan Kelner · Jerry Li · Allen Liu · Aaron Sidford · Kevin Tian 🔗 


Generalization of Decentralized Gradient Descent with Separable Data
(
Poster
)
link
Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds match in order their centralized counterparts. Critical behind this, and of independent interest, is establishing novel bounds on the training loss and the rate-of-consensus of DGD for a class of self-bounded losses. Finally, we conduct numerical experiments which corroborate our theoretical results. 
Hossein Taheri · Christos Thrampoulidis 🔗 


Gradient dynamics of single-neuron autoencoders on orthogonal data
(
Poster
)
link
SlidesLive Video In this work we investigate the dynamics of (stochastic) gradient descent when training a single-neuron ReLU autoencoder on orthogonal inputs. We show that for this nonconvex problem there exists a manifold of global minima, all with the same maximum Hessian eigenvalue, and that gradient descent reaches a particular global minimum when initialized randomly. Interestingly, which minimum is reached depends heavily on the batch size. For full-batch gradient descent, the directions of the neuron that are initially positively correlated with the data are merely rescaled uniformly, hence in high dimensions the learned neuron is a near uniform mixture of these directions. On the other hand, with batch size one the neuron exactly aligns with a single such direction, showing that when using a small batch size a qualitatively different type of ``feature selection'' occurs. 
Nikhil Ghosh · Spencer Frei · Wooseok Ha · Bin Yu 🔗 


A Variable-Coefficient Nuclear Norm Penalty for Low Rank Inference
(
Poster
)
link
SlidesLive Video
Low rank structure is expected in many applications, so it is often desirable to be able to specify cost functions that induce low rank. A common approach is to augment the cost with a penalty function approximating the rank function, such as the nuclear norm, which is given by the $\ell_1$ norm of the matrix's singular values. This has the advantage of being a convex function, but it biases matrix entries towards zero. On the other hand, nonconvex approximations to the rank function can make better surrogates but invariably introduce additional hyperparameters. In this article, we instead study a weighted nuclear norm approach with learnable weights, which provides the behavior of nonconvex penalties without introducing any additional hyperparameters. This approach can also benefit from the fast proximal methods which make nuclear norm approaches scalable. We demonstrate the potential of this technique by comparing it against the standard nuclear norm approach on synthetic and realistic matrix denoising and completion problems. We also outline the future work necessary to deploy this algorithm to large scale problems.

Nathan Wycoff · Ali Arab · Lisa Singh 🔗 
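The fast proximal methods mentioned in the abstract rest on the fact that the proximal operator of the (unweighted) nuclear norm is singular value soft-thresholding. Below is a minimal sketch of that standard building block on an invented rank-1 denoising example; the paper's variable-coefficient penalty would instead apply learned per-singular-value weights:

```python
import numpy as np

def svt(X, tau):
    # Singular value thresholding: the prox of tau * nuclear norm.
    # Soft-threshold the singular values and reassemble the matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Denoise a noisy rank-1 matrix with a single prox step: the threshold
# removes the small singular values contributed by the noise, leaving a
# rank-1 estimate.
rng = np.random.default_rng(0)
low_rank = np.outer(rng.standard_normal(20), rng.standard_normal(15))
noisy = low_rank + 0.1 * rng.standard_normal((20, 15))
denoised = svt(noisy, tau=2.0)
```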


Exact Gradient Computation for Spiking Neural Networks
(
Poster
)
link
SlidesLive Video Spiking neural networks (SNNs) have recently emerged as an alternative to traditional neural networks, holding promise for energy efficiency benefits. However, the classic backpropagation algorithm for training traditional networks has been notoriously difficult to apply to SNNs due to the hard-thresholding and discontinuities at spike times. Therefore, a large majority of prior work believes that exact gradients for SNNs w.r.t. their weights do not exist and has focused on approximation methods to produce surrogate gradients. In this paper, (1) by applying the implicit function theorem to SNNs at the discrete spike times, we prove that, albeit being non-differentiable in time, SNNs have well-defined gradients w.r.t. their weights, and (2) we propose a novel training algorithm, called \emph{forward propagation} (FP), that computes exact gradients for SNNs. Our derivation of FP in this paper provides insights on why other related algorithms, such as Hebbian learning and the recently proposed surrogate gradient methods, may perform well. 
Jane Lee · Saeid Haghighatshoar · Amin Karbasi 🔗 


Linear Convergence Analysis of Neural Collapse with Unconstrained Features
(
Poster
)
link
SlidesLive Video In this work, we study the recently discovered neural collapse (NC) phenomenon, which is prevalent in training overparameterized deep neural networks for classification tasks. Existing work has shown that any optimal solution of the trained problem for classification tasks is an NC solution and has a benign landscape under the unconstrained feature model. However, these results do not provide an answer to the question of how quickly gradient descent can find an NC solution. To answer this question, under the unconstrained feature model we prove an error bound property of the trained loss, which refers to the inequality that bounds the distance of a point in the optimal solution set by the norm of its gradient. Using this error bound, we can show linear convergence of gradient descent for finding an NC solution. 
Peng Wang · Huikang Liu · Can Yaras · Laura Balzano · Qing Qu 🔗 


The Solution Path of the Group Lasso
(
Poster
)
link
SlidesLive Video We prove continuity of the solution path for the group lasso, a popular method of computing group-sparse models. Unlike the more classical lasso method, the group lasso solution path is nonlinear and cannot be determined in closed form. To circumvent this, we first characterize the group lasso solution set and then show how to construct an implicit function for the min-norm path. We prove our implicit representation is continuous almost everywhere, and extend this to continuity everywhere when the group lasso solution is unique. Our work can be viewed as extending solution path analyses from the lasso setting to the group lasso, and implies that grid search is a sensible approach to hyperparameter selection. Our results also have applications to convex reformulations of neural networks and so are deeply connected to solution paths for shallow neural networks. 
Aaron Mishkin · Mert Pilanci 🔗 


Conditional gradient-based method for bilevel optimization with convex lower-level problem
(
Poster
)
link
SlidesLive Video
In this paper, we study simple bilevel optimization problems, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory as they are either asymptotic for the upper-level objective, or the convergence rates are slow and suboptimal. To address this issue, in this paper, we introduce a conditional gradient-based (CG-based) method to solve the considered problem. The main idea is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a CG-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is nonconvex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem.

Ruichen Jiang · Nazanin Abolfazli · Aryan Mokhtari · Erfan Yazdandoost Hamedani 🔗 
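For reference, the CG-type update at the heart of the method is the classical Frank-Wolfe iteration; the minimal sketch below runs it over the probability simplex and omits the paper's cutting-plane approximation of the lower-level solution set.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, steps=500):
    """Vanilla conditional-gradient (Frank-Wolfe) iteration over the probability
    simplex. Each step calls a linear minimization oracle (here: pick the vertex
    with the smallest gradient coordinate) and moves toward it."""
    x = x0.copy()
    for t in range(steps):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0            # linear minimization oracle over the simplex
        x += 2.0 / (t + 2.0) * (s - x)   # standard diminishing step size
    return x
```

Because each update is a convex combination of simplex vertices, feasibility is maintained without any projection step.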


Sufficient conditions for non-asymptotic convergence of Riemannian optimization methods
(
Poster
)
link
SlidesLive Video Motivated by energy-based analyses for descent methods in the Euclidean setting, we investigate a generalisation of such analyses for descent methods over Riemannian manifolds. In doing so, we find that it is possible to derive curvature-free guarantees for such descent methods. This also enables us to give the first known guarantees for a Riemannian cubic-regularised Newton algorithm over g-convex functions, which extends the guarantees by Agarwal et al. for an adaptive Riemannian cubic-regularised Newton algorithm over general nonconvex functions. This analysis motivates us to study acceleration of Riemannian gradient descent in the g-convex setting, and we improve on an existing result by Alimisis et al., albeit with a curvature-dependent rate. Finally, extending the analysis by Ahn and Sra, we attempt to provide some sufficient conditions for the acceleration of Riemannian descent methods in the strongly geodesically convex setting. 
Vishwak Srinivasan · Ashia Wilson 🔗 


A Neural Tangent Kernel Perspective on Function-Space Regularization in Neural Networks
(
Poster
)
link
SlidesLive Video Loss regularization can help reduce the gap between training and test error by systematically limiting model complexity. Popular regularization techniques such as L2 weight regularization act directly on the network parameters, but do not explicitly take into account how the interplay between the parameters and the network architecture may affect the induced predictive functions. To address this shortcoming, we propose a simple technique for effective function-space regularization. Drawing on the result that fully-trained wide multilayer perceptrons are equivalent to kernel regression under the Neural Tangent Kernel (NTK), we propose to approximate the norm of neural network functions by the reproducing kernel Hilbert space norm under the NTK and use it as a function-space regularizer. We prove that neural networks trained using this regularizer are arbitrarily close to kernel ridge regression solutions under the NTK. Furthermore, we provide a generalization error bound under the proposed regularizer and empirically demonstrate improved generalization and state-of-the-art performance on downstream tasks where effective regularization on the induced space of functions is essential. 
Zonghao Chen · Xupeng Shi · Tim G. J. Rudner · Qixuan Feng · Weizhong Zhang · Tong Zhang 🔗 


Annealed Training for Combinatorial Optimization on Graphs
(
Poster
)
link
SlidesLive Video Learning neural networks for combinatorial optimization (CO) problems is notoriously difficult: the hardness of CO problems hinders collecting solutions for supervised learning, and training gets trapped easily at local optima. We propose a simple but effective unsupervised annealed training framework for CO problems in this work. In particular, we transform CO problems into unbiased energy-based models (EBMs). We carefully select the penalty terms to make the EBMs as smooth as possible. Then we train graph neural networks to approximate the EBMs, and we introduce an annealed loss function to prevent the training from being stuck at local optima near the initialization. An experimental evaluation demonstrates that our annealed training framework obtains substantial improvements. On four types of CO problems, our method achieves performance substantially better than other unsupervised neural methods on both synthetic and real-world graphs. 
Haoran Sun · Etash Guha · Hanjun Dai 🔗 


Clairvoyant Regret Minimization: Equivalence with Nemirovski’s Conceptual Prox Method and Extension to General Convex Games
(
Poster
)
link
SlidesLive Video A recent paper by Piliouras et al. introduces an uncoupled learning algorithm for normal-form games called Clairvoyant MWU (CMWU). In this paper we show that CMWU is equivalent to the conceptual prox method described by Nemirovski. This connection immediately shows that it is possible to extend the CMWU algorithm to any convex game, a question left open by Piliouras et al. We call the resulting algorithm, again equivalent to the conceptual prox method, Clairvoyant OMD. At the same time, we show that our analysis yields an improved regret bound compared to the original bound by Piliouras et al., in that the regret of CMWU scales only with the square root of the number of players, rather than with the number of players themselves. 
Gabriele Farina · Christian Kroer · ChungWei Lee · Haipeng Luo 🔗 


Parameter-Free Dual Averaging: Optimizing Lipschitz Functions in a Single Pass
(
Poster
)
link
SlidesLive Video
Both gradient descent and dual averaging for convex Lipschitz functions have convergence rates that are highly dependent on the choice of learning rate. Even when the Lipschitz constant is known, setting the learning rate to achieve the optimal convergence rate requires knowing a bound $D$ on the distance from the initial point to the solution set. A number of approaches are known that relax this requirement, but they either require line searches, restarting (hyperparameter grid search), or do not derive from the gradient descent or dual averaging frameworks (coin betting). In this work we describe a single-pass method, with no backtracking or line searches, derived from dual averaging, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions.

Aaron Defazio · Konstantin Mishchenko 🔗 
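For contrast, a textbook dual-averaging loop with the classical step-size schedule is sketched below; note how the schedule needs the distance bound $D$, which is exactly the knowledge the paper's method dispenses with. Names and constants are illustrative.

```python
import numpy as np

def dual_averaging(grad, x0, G, D, T):
    """Classical dual averaging for a G-Lipschitz convex function: accumulate
    (sub)gradients and step from the initial point with eta_t = D / (G * sqrt(t)).
    Returns the running average of the iterates."""
    x = x0.copy()
    s = np.zeros_like(x0)    # accumulated subgradients
    avg = np.zeros_like(x0)  # running average of the iterates
    for t in range(1, T + 1):
        s += grad(x)
        eta = D / (G * np.sqrt(t))
        x = x0 - eta * s
        avg += (x - avg) / t
    return avg
```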


Differentially Private Adaptive Optimization with Delayed Preconditioners
(
Poster
)
link
SlidesLive Video Privacy costs may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and nonconvex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data. 
Tian Li · Manzil Zaheer · Ken Liu · Sashank Reddi · H. Brendan McMahan · Virginia Smith 🔗 
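A toy sketch of the delayed-preconditioner idea (not the paper's exact DP^2 algorithm; the clipping threshold, noise scale, and delay below are illustrative): clipped, noised gradients are accumulated, and the Adagrad-style preconditioner is refreshed only every few steps, so it is built from averaged and hence less noisy statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp2_sketch(grad, x0, steps=300, lr=0.1, clip=1.0, sigma=0.1, delay=20):
    """Private adaptive training with a delayed preconditioner (illustrative)."""
    x = x0.copy()
    acc = np.zeros_like(x0)        # running sum of squared noisy gradients
    precond = np.ones_like(x0)     # stale preconditioner, refreshed every `delay`
    for t in range(1, steps + 1):
        g = grad(x)
        g = g / max(1.0, np.linalg.norm(g) / clip)    # clip to norm <= clip
        g = g + sigma * rng.standard_normal(g.shape)  # privatize with Gaussian noise
        acc += g ** 2
        if t % delay == 0:                            # delayed refresh from averages
            precond = np.sqrt(acc / t) + 1e-8
        x -= lr * g / precond
    return x
```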


Differentially Private Adaptive Optimization with Delayed Preconditioners
(
Oral
)
link
Privacy costs may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP^2), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and nonconvex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP^2 across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data. 
Tian Li · Manzil Zaheer · Ken Liu · Sashank Reddi · H. Brendan McMahan · Virginia Smith 🔗 


A Lightspeed Linear Program Solver for Personalized Recommendation with Diversity Constraints
(
Poster
)
link
SlidesLive Video We study a structured linear program (LP) that emerges from the need to rank candidates or items in personalized recommender systems. Since the candidate set is only known in real time, the LP also needs to be solved in real time. Latency and user experience are major considerations, requiring the LP to be solved within just a few milliseconds. Although typical instances of the problem are not very large in size, this stringent time limit is beyond the capability of most existing commercial LP solvers, which can take 20 milliseconds or more to find a solution. Thus, reliable methods that address the real-world complication of latency become necessary. In this paper, we propose a fast specialized LP solver for a structured problem with diversity constraints. Our method solves the dual problem, making use of the piecewise affine structure of the dual objective function, with an additional screening technique that helps reduce the dimensionality of the problem as the algorithm progresses. Experiments reveal that our method can solve the problem within roughly 1 millisecond, yielding a 20x improvement in speed over the most performant standard LP solvers. This speedup can help improve the quality of recommendations without affecting user experience, highlighting how optimization can provide solid orthogonal value to machine-learned recommender systems. 
Miao Cheng · Haoyue Wang · Aman Gupta · Rahul Mazumder · Sathiya Selvaraj · Kinjal Basu 🔗 


adaStar: A Method for Adapting to Interpolation
(
Poster
)
link
Stochastic convex optimization methods are much faster at minimizing interpolation problems—problems where all sample losses share a common minimizer—than noninterpolating problems. However, stochastic gradient methods need to use step sizes tailored for the interpolation setting, which are suboptimal for noninterpolating problems, to attain these fast rates. This is problematic because verifying whether a problem is interpolating, without minimizing it, is difficult. Moreover, because interpolation is not a stable property—the addition of a single datapoint can transform an interpolating dataset into a noninterpolating one—we would like our methods to get the fast interpolation rate when they can while being robust to these perturbations. We address these problems by presenting an adaptive stochastic gradient method, termed adaStar, which attains the optimal, fast rate on smooth interpolation problems (up to log factors) and gracefully degrades with the minimal objective value for noninterpolating problems. We use this method as a building block to construct another stochastic gradient method, termed adaStarG, which adapts to interpolation and growth conditions, getting even faster rates. 
Gary Cheng · John Duchi 🔗 


PyEPO: A PyTorch-based End-to-End Predict-then-Optimize Library with Linear Objective Function
(
Poster
)
link
SlidesLive Video In many practical settings, some parameters of an optimization problem may be a priori unknown but can be estimated from historical data. Recently, end-to-end predict-then-optimize has emerged as an attractive alternative to the two-stage approach of separately fitting a predictive model for the unknown parameters, then optimizing. In this work, we present the PyEPO package, a PyTorch-based end-to-end predict-then-optimize library in Python for linear and integer programming. It provides two base algorithms: the first is based on the convex surrogate loss function from the seminal work of Elmachtoub & Grigas (2021), and the second is based on the differentiable black-box solver approach of Vlastelica et al. (2019). PyEPO provides a simple interface for the definition of new optimization problems, the implementation of state-of-the-art predict-then-optimize training algorithms, the use of custom neural network architectures, and the comparison of end-to-end approaches with the two-stage approach. 
Bo Tang · Elias Khalil 🔗 


Accelerating Perturbed Stochastic Iterates in Asynchronous Lock-Free Optimization
(
Poster
)
link
SlidesLive Video We show that stochastic acceleration can be achieved under the perturbed iterate framework (Mania et al., 2017) in asynchronous lock-free optimization, which leads to the optimal incremental gradient complexity for finite-sum objectives. We prove that our new accelerated method requires the same linear speedup condition as existing non-accelerated methods. Our key algorithmic discovery is a new accelerated SVRG variant with sparse updates. Empirical results are presented to verify our theoretical findings. 
Kaiwen Zhou · Anthony ManCho So · James Cheng 🔗 


Neural DAG Scheduling via One-Shot Priority Sampling
(
Poster
)
link
We consider the problem of scheduling operations/nodes, the dependency among which is characterized by a Directed Acyclic Graph (DAG). Due to its NP-hard nature, heuristic algorithms were traditionally used to acquire reasonably good solutions, and more recent works have proposed Machine Learning (ML) heuristics that can generalize to unseen graphs and outperform the non-ML heuristics. However, it is computationally costly to generate solutions using existing ML schedulers since they adopt the episodic reinforcement learning framework that necessitates multi-round neural network processing. We propose a novel ML scheduler that uses a one-shot neural network encoder to sample node priorities which are converted by list scheduling to the final schedules. Since the one-shot encoder can efficiently sample the priorities in parallel, our algorithm runs significantly faster than existing ML baselines and has comparable run time with the fast traditional heuristics. We empirically show that our algorithm generates better schedules than both non-neural and neural baselines across various real-world and synthetic scheduling tasks. 
Wonseok Jeon · Mukul Gagrani · Burak Bartan · Weiliang Zeng · Harris Teague · Piero Zappi · Christopher Lott 🔗 


Stochastic Gradient Estimator for Differentiable NAS
(
Poster
)
link
Neural architecture search (NAS) has recently attracted more attention due to its ability to design deep neural networks automatically. Differentiable NAS methods have predominated due to their search efficiency. However, differentiable NAS methods consistently adopt approximate gradient-based methods to solve bilevel optimization problems. While second-derivative approximation optimizes Jacobian and/or Hessian vector computation, it is imprecise and time-consuming in practice. In this paper, we revisit the hypergradient of bilevel optimization problems in NAS, then propose a new optimizer based on a stochastic gradient estimator (SGE) for the computation of the Jacobian matrix in the hypergradient. The SGE is adaptable to previous differentiable NAS methods and eliminates the second-order computation in the optimization process. In experiments on common differentiable NAS benchmarks, the proposed SGE-NAS algorithm outperforms the baseline algorithm. The test results demonstrate that the proposed SGE-NAS can effectively reduce search time and find models with higher classification performance. 
Libin Hou · Linyuan Wang · Qi Peng · Bin Yan 🔗 


Near-optimal decentralized algorithms for network dynamic optimization
(
Poster
)
link
SlidesLive Video
We study dynamic decision-making problems in networks under stochastic uncertainty about future payoffs. The network has a bounded degree, and each node takes a discrete decision at each period, leading to a per-period payoff that is a sum of three parts: node rewards for individual node decisions, temporal interactions between individual node decisions from the current and previous periods, and spatial interactions between decisions from pairs of neighboring nodes. The objective is to maximize the expected total payoffs over a finite horizon. We propose a decentralized algorithm whose computational requirement is linear in the graph size and planning horizon, and characterize sufficient conditions under which our decentralized algorithm achieves near optimality compared to the centralized global optimum. The class of decentralized algorithms is parameterized by a locality parameter $L$. An $L$-local algorithm makes its decision at each node $v$ based on current and (simulated) future payoffs only up to $L$ periods ahead, and only in an $L$-radius neighborhood around $v$. Given any permitted error $\epsilon > 0$, with $L = O(\log(1/\epsilon))$, we show that the $L$-local algorithm has an average per-node per-period optimality loss of up to $\epsilon$ when temporal and spatial interactions are relatively small compared to the randomness in the node rewards and the graph degree.

Judy Gan · Yashodhan Kanoria · Xuan Zhang 🔗 


A Second-order Regression Model Shows Edge of Stability Behavior
(
Poster
)
link
SlidesLive Video Recent studies of learning algorithms have shown that there is a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). We consider a class of predictive models that are quadratic in the parameters, which we call second-order regression models. This is in contrast with the neural tangent kernel regime, where the predictive function is linear in the parameters. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits both progressive sharpening and edge of stability behavior. We then show that in higher dimensions, the model shows this behavior generically without the structure of a neural network, due to a nonlinearity induced in the learning dynamics. Finally, we show that edge of stability behavior in neural networks is correlated with the behavior in quadratic regression models. 
Fabian Pedregosa · Atish Agarwala · Jeffrey Pennington 🔗 
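The two-dimensional setting in the abstract is easy to simulate: the snippet below runs gradient descent on f(a, b) = (ab - y)^2 / 2, a model quadratic in its two parameters, and records the sharpness (largest Hessian eigenvalue) at each step. With the small, illustrative learning rate used here, one observes progressive sharpening toward the limiting value (about 2 for y = 1) while staying below the 2/η cutoff.

```python
import numpy as np

def run_gd(eta=0.02, steps=2000):
    """GD on f(a, b) = 0.5 * (a*b - y)^2, tracking the sharpness over time."""
    a, b, y = 0.3, 0.1, 1.0
    sharps = []
    for _ in range(steps):
        r = a * b - y
        a, b = a - eta * r * b, b - eta * r * a   # simultaneous GD update
        # Hessian of f: [[b^2, 2ab - y], [2ab - y, a^2]]
        H = np.array([[b * b, 2 * a * b - y], [2 * a * b - y, a * a]])
        sharps.append(np.linalg.eigvalsh(H).max())
    return np.array(sharps)
```

With a larger learning rate (so that 2/η falls below the limiting sharpness), this same model is the kind of example where the sharpness instead hovers at the cutoff.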


Online Min-max Optimization: Nonconvexity, Nonstationarity, and Dynamic Regret
(
Poster
)
link
SlidesLive Video Online min-max optimization has recently gained considerable interest due to its rich applications in game theory, multi-agent reinforcement learning, online robust learning, etc. Theoretical understanding in this field has mainly focused on convex-concave settings. Online min-max optimization with nonconvex geometries, which captures various online deep learning problems, has yet to be studied. In this paper, we make the first effort and investigate online nonconvex-strongly-concave min-max optimization in the nonstationary environment. We first introduce a natural notion of dynamic Nash equilibrium (NE) regret, and then propose a novel algorithm, coined SODA, to achieve the optimal regret. We further generalize our study to the setting with stochastic first-order feedback, and show that a variation of SODA can also achieve the same optimal regret in expectation. Our theoretical results and the superior performance of the proposed method are further validated by empirical experiments. To our best knowledge, this is the first exploration of efficient online nonconvex min-max optimization. 
Yu Huang · Yuan Cheng · Yingbin Liang · Longbo Huang 🔗 


Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization
(
Poster
)
link
SlidesLive Video Modern deep learning models are overparameterized, where the optimization setup strongly affects the generalization performance. A key element of reliable optimization for these systems is the modification of the loss function. Sharpness-Aware Minimization (SAM) modifies the underlying loss function to guide descent methods towards flatter minima, which arguably have better generalization abilities. In this paper, we focus on a variant of SAM known as mSAM, which, during training, averages the updates generated by adversarial perturbations across several disjoint shards of a mini-batch. Recent work suggests that mSAM can outperform SAM in terms of test accuracy. However, a comprehensive empirical study of mSAM is missing from the literature; previous results have mostly been limited to specific architectures and datasets. To that end, this paper presents a thorough empirical evaluation of mSAM on various tasks and datasets. We provide a flexible implementation of mSAM and compare the generalization performance of mSAM to the performance of SAM and vanilla training on different image classification and natural language processing tasks. We also conduct careful experiments to understand the computational cost of training with mSAM, its sensitivity to hyperparameters and its correlation with the flatness of the loss landscape. Our analysis reveals that mSAM yields superior generalization performance and flatter minima, compared to SAM, across a wide range of tasks without significantly increasing computational costs. 
Kayhan Behdin · Qingquan Song · Aman Gupta · Sathiya Selvaraj · David Durfee · Ayan Acharya · Rahul Mazumder 🔗 
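For readers unfamiliar with SAM, one update looks as follows; mSAM would average this update over m disjoint shards of the mini-batch, while this single-shard sketch with illustrative constants omits that averaging.

```python
import numpy as np

def sam_step(w, grad, lr=0.1, rho=0.05):
    """One SAM update: perturb the weights along the normalized gradient
    (the adversarial ascent step), then descend using the gradient
    evaluated at the perturbed point."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # adversarial perturbation
    return w - lr * grad(w + eps)                # descend from the perturbed point
```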


Reducing Communication in Nonconvex Federated Learning with a Novel Single-Loop Variance Reduction Method
(
Poster
)
link
SlidesLive Video
In Federated Learning (FL), inter-client heterogeneity causes two types of errors: (i) \emph{client drift error}, which is induced by multiple local updates, and (ii) \emph{client sampling error}, due to partial participation of clients at each communication round. While several solutions have been offered for the former, there is still much room for improvement on the latter. We provide a fundamental solution to this client sampling error. The key is a novel single-loop variance reduction algorithm, SLEDGE (Single-Loop mEthoD for Gradient Estimator), which does not require periodic computation of a full gradient but achieves optimal gradient complexity in the nonconvex finite-sum setting. While sampling a small number of clients at each communication round, the proposed FL algorithm, FLEDGE, requires provably fewer or at least equivalent communication rounds compared to any existing method, for finding first- and even second-order stationary points in the general nonconvex setting, and under the PL condition. Moreover, under less Hessian heterogeneity between clients, the required number of communication rounds approaches $\tilde{\Theta}(1)$.

Kazusato Oko · Shunta Akiyama · Tomoya Murata · Taiji Suzuki 🔗 


Fast Convergence of Random Reshuffling under Interpolation and the Polyak-Łojasiewicz Condition
(
Poster
)
link
SlidesLive Video Modern machine learning models are often overparameterized and as a result they can interpolate the training data. Under such a scenario, we study the convergence properties of a without-replacement sampling variant of Stochastic Gradient Descent (SGD), known as Random Reshuffling (RR). Unlike SGD, which samples data with replacement at every iteration, RR chooses a random permutation of the data at the beginning of each epoch. For underparameterized models, it has been recently shown that RR converges faster than SGD when the number of epochs is larger than the condition number (κ) of the problem, under standard assumptions like strong convexity. However, previous works do not show that RR outperforms SGD under interpolation for strongly convex objectives. Here, we show that for the class of Polyak-Łojasiewicz (PL) functions that generalizes strong convexity, RR can outperform SGD as long as the number of samples (n) is less than the parameter (ρ) of a strong growth condition (SGC). 
Chen Fan · Christos Thrampoulidis · Mark Schmidt 🔗 
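The difference between RR and with-replacement SGD is just where the randomness enters the loop, as in this minimal sketch (the step size and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_reshuffling(grads, w0, lr=0.1, epochs=50):
    """Random Reshuffling: one pass per epoch over a fresh permutation of the
    n sample gradients, in contrast with with-replacement SGD sampling."""
    w = w0
    n = len(grads)
    for _ in range(epochs):
        for i in rng.permutation(n):  # fresh permutation each epoch
            w = w - lr * grads[i](w)
    return w
```

Under interpolation every sample loss shares the same minimizer, so each without-replacement step still contracts toward it.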


ZerO Initialization: Initializing Neural Networks with only Zeros and Ones
(
Poster
)
link
Deep neural networks are usually initialized with random weights, with adequately selected initial variance to ensure stable signal propagation during training. However, selecting the appropriate variance becomes challenging, especially as the number of layers grows. In this work, we replace random weight initialization with a fully deterministic initialization scheme, viz., ZerO, which initializes the weights of networks with only zeros and ones (up to a normalization factor), based on identity and Hadamard transforms. Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In addition, ZerO has many benefits, such as training ultra-deep networks (without batch normalization), exhibiting low-rank learning trajectories that result in low-rank and sparse solutions, and improving training reproducibility. 
Jiawei Zhao · Florian Schaefer · Anima Anandkumar 🔗 
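A minimal sketch of a ZerO-flavored deterministic initializer, using the identity for matching shapes and a normalized Hadamard transform otherwise; this is illustrative and not the paper's exact per-layer recipe.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def zero_style_init(fan_in, fan_out):
    """Deterministic weights built only from 0s and +/-1s (up to normalization):
    identity when shapes match, a normalized Hadamard slice otherwise."""
    if fan_in == fan_out:
        return np.eye(fan_out, fan_in)
    n = 1 << (max(fan_in, fan_out) - 1).bit_length()  # next power of two
    H = hadamard(n) / np.sqrt(n)                      # orthonormal rows
    return H[:fan_out, :fan_in]
```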


Differentially Private Federated Learning with Normalized Updates
(
Poster
)
link
SlidesLive Video The customary approach for client-level differentially private federated learning (FL) is to add Gaussian noise to the average of the clipped client updates. Clipping is associated with the following issue: as the client updates fall below the clipping threshold, they get drowned out by the added noise, inhibiting convergence. To mitigate this issue, we propose replacing clipping with normalization, where we use only a scaled version of the unit vector along the client updates. Normalization ensures that the noise does not drown out the client updates even when the original updates are small. We theoretically show that the resulting normalization-based private FL algorithm attains better convergence than its clipping-based counterpart on convex objectives in overparameterized settings. 
Rudrajit Das · Abolfazl Hashemi · Sujay Sanghavi · Inderjit Dhillon 🔗 
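The clipping-versus-normalization distinction can be seen in a few lines; this sketch of the private aggregation step uses illustrative names and omits the surrounding federated training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_aggregate(updates, c, sigma, normalize=True):
    """Clip vs. normalize client updates before Gaussian-noised averaging.
    Clipping leaves small updates small (so noise can drown them out);
    normalization rescales every update to norm c, keeping signal
    comparable to the noise."""
    out = []
    for u in updates:
        norm = np.linalg.norm(u)
        if normalize:
            out.append(c * u / (norm + 1e-12))   # scaled unit vector
        else:
            out.append(u / max(1.0, norm / c))   # standard clipping
    noise = sigma * c * rng.standard_normal(updates[0].shape)
    return np.mean(out, axis=0) + noise / len(updates)
```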


Adaptive Inexact Sequential Quadratic Programming via Iterative Randomized Sketching
(
Poster
)
link
SlidesLive Video We consider solving nonlinear optimization problems with equality constraints. We propose a randomized algorithm based on sequential quadratic programming (SQP) with a differentiable exact augmented Lagrangian as the merit function. In each SQP iteration, we solve the Newton system inexactly via iterative randomized sketching. The accuracy of the inexact solution and the penalty parameter of the augmented Lagrangian are adaptively controlled in the algorithm to ensure that the inexact random search direction is a descent direction of the augmented Lagrangian. This allows us to establish global convergence almost surely. Moreover, we show that a unit stepsize is admissible for the inexact search direction provided the iterate lies in a neighborhood of the solution. Based on this result, we show that the proposed algorithm exhibits local linear convergence. We apply the algorithm to benchmark nonlinear problems in the CUTEst test set and to constrained logistic regression with datasets from LIBSVM to demonstrate its superior performance. 
Ilgee Hong · Sen Na · Mladen Kolar 🔗 


Enhanced Index Tracking via Differentiable Assets Sorting
(
Poster
)
link
Enhanced index tracking (EIT) aims to achieve better performance over a target equity index while maintaining a relatively low tracking error. It can be formulated as a quadratic programming problem, but remains challenging when several practical constraints exist, especially the fixed number of assets in the portfolio. In this work, we propose a new method for enhanced index tracking, subject to common practical constraints, including cardinality, which is based on a novel reparametrisation of portfolio weights integrated with a stochastic optimisation. It can simultaneously tackle asset selection and capital allocation, while being optimised by vanilla gradient descent effectively and efficiently. The proposed method is backtested with S&P 500 and Russell 1000 indices data for over a decade. Empirical results demonstrate its superiority over widely used alternatives. 
Yuanyuan Liu · Yongxin Yang 🔗 
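A minimal sketch of the weight-reparametrisation idea: portfolio weights are softmax(theta), so nonnegativity and the budget constraint hold by construction, and plain gradient descent minimizes the tracking error. The differentiable sorting step that enforces cardinality in the paper is omitted, and all names and constants are illustrative.

```python
import numpy as np

def track_index(R, index_returns, steps=500, lr=0.5):
    """Minimize mean squared tracking error over T periods with w = softmax(theta),
    so the weights are nonnegative and sum to one by construction."""
    T, n = R.shape
    theta = np.zeros(n)
    for _ in range(steps):
        e = np.exp(theta - theta.max())
        w = e / e.sum()
        resid = R @ w - index_returns
        grad_w = R.T @ resid / T
        # chain rule through softmax: J @ g = w * (g - w.g)
        theta -= lr * (w * (grad_w - w @ grad_w))
    e = np.exp(theta - theta.max())
    return e / e.sum()
```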


Learning deep neural networks by iterative linearisation
(
Poster
)
link
SlidesLive Video The excellent real-world performance of deep neural networks has received increasing attention. Despite the capacity to overfit significantly, such large models work better than smaller ones. This phenomenon is often referred to as the scaling law by practitioners. It is of fundamental interest to study why the scaling law exists and how it avoids/controls overfitting. One approach has been looking at infinite width limits of neural networks (e.g., Neural Tangent Kernels, Gaussian Processes); however, in practice, these do not fully explain finite networks, as their infinite counterparts do not learn features. Furthermore, the empirical kernel for finite networks (i.e., the inner product of feature vectors) changes significantly during training, in contrast to infinite width networks. In this work we derive an iterative linearised training method. We justify iterative linearisation as an interpolation between finite analogs of the infinite width regime, which do not learn features, and standard gradient descent training, which does. We show some preliminary results where iterative linearised training works well, noting in particular how much feature learning is required to achieve comparable performance. We also provide novel insights into the training behaviour of neural networks. 
Adrian Goldwaser · Hong Ge 🔗 


Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability
(
Poster
)
link
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "stable" and the training loss decreases monotonically. However, Cohen et al. (2021) recently observed two important phenomena. The first, \emph{progressive sharpening}, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/\eta$. The second, \emph{edge of stability}, is that the sharpness hovers at $2/\eta$ for the remainder of training while the loss non-monotonically decreases. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in the direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call \emph{self-stabilization}, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows \emph{projected} gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.

Alex Damian · Eshaan Nichani · Jason Lee 🔗 


Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization
(
Poster
)
link
SlidesLive Video In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study of why RL helps is still at an early stage. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Secretary Problem, we formally prove that distribution shift is reduced exponentially with curriculum learning, even if the curriculum is randomly generated. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Secretary Problem and Online Knapsack to verify our findings. 
Runlong Zhou · Yuandong Tian · YI WU · Simon Du 🔗 
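For readers unfamiliar with the Secretary Problem mentioned above, here is a minimal NumPy simulation of the classical $1/e$ stopping rule — a standard baseline, not the paper's curriculum-trained policy.

```python
import numpy as np

def secretary_success_rate(n=50, trials=10000, seed=0):
    """Classic 1/e rule: skip the first n/e candidates, then accept the
    first candidate better than everything seen so far."""
    rng = np.random.default_rng(seed)
    cutoff = int(n / np.e)
    wins = 0
    for _ in range(trials):
        vals = rng.permutation(n)       # candidate qualities in arrival order
        best_seen = vals[:cutoff].max()
        pick = None
        for v in vals[cutoff:]:
            if v > best_seen:
                pick = v
                break
        wins += (pick == n - 1)         # success iff we picked the overall best
    return wins / trials

rate = secretary_success_rate()
print(rate)   # close to the asymptotic optimum 1/e ≈ 0.368
```

An RL policy for the online problem has to learn a comparable stopping threshold from experience, which is where the curriculum discussed in the abstract comes in.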


Solving Constrained Variational Inequalities via a First-order Interior Point-based Method
(Poster)
link
SlidesLive Video
We focus on the open problem of developing a first-order method that can solve constrained Variational Inequality (cVI) problems with general constraints. We generalize the \textit{alternating direction method of multipliers} (ADMM) method and combine it with interior-point approaches, yielding a first-order method that we refer to as the ADMM-based interior-point method for cVIs (ACVI). We provide convergence guarantees for ACVI in two general classes of problems: (i) when the operator is $\xi$-monotone, and (ii) when it is monotone, some constraints are active, and the game is not purely rotational. When the operator is, in addition, $L$-Lipschitz in the latter case, we match the known lower bounds on rates for the gap function of $\mathcal{O}(1/\sqrt{K})$ and $\mathcal{O}(1/K)$ for the last and average iterate, respectively. To our knowledge, this is the first \emph{first-order} interior-point method for the general cVI problem that has a global convergence guarantee. Empirical analyses demonstrate clear advantages of ACVI over common first-order methods. In particular, (i) cyclical behavior is notably reduced as our method approaches the solution from the analytic center, and (ii) unlike projection-based methods that zigzag when near a constraint, ACVI efficiently handles the constraints.

Tong Yang · Michael Jordan · Tatjana Chavdarova 🔗 


Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence
(Poster)
link
SlidesLive Video
The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of this algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
Diyuan Wu · Vyacheslav Kungurtsev · Marco Mondelli 🔗 
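The SHB update analyzed above is, concretely, gradient descent with Polyak momentum. A minimal sketch on a toy least-squares problem (deterministic gradients and illustrative constants; the paper's mean-field analysis concerns wide neural networks, not this toy):

```python
import numpy as np

def shb(grad_fn, w0, eta=0.1, beta=0.9, steps=500):
    """Heavy ball / Polyak momentum:
       v_{t+1} = beta * v_t + grad(w_t);  w_{t+1} = w_t - eta * v_{t+1}."""
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        v = beta * v + grad_fn(w)
        w = w - eta * v
    return w

# Toy objective 0.5 * ||A w - b||^2 with a known minimizer
A = np.diag([1.0, 2.0])
b = np.array([1.0, -1.0])
grad_fn = lambda w: A.T @ (A @ w - b)
w_star = np.linalg.solve(A.T @ A, A.T @ b)   # [1.0, -0.5]

w = shb(grad_fn, np.zeros(2))
print(w, w_star)   # the iterate reaches the minimizer
```

With stochastic minibatch gradients in place of `grad_fn`, this is exactly the SHB recursion whose wide-network limit the paper studies.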


A Stochastic Prox-Linear Method for CVaR Minimization
(Poster)
link
SlidesLive Video
We develop an instance of the stochastic prox-linear method for minimizing the Conditional Value-at-Risk (CVaR) objective. CVaR is a risk measure focused on minimizing worst-case performance, defined as the average of the top quantile of the losses. In machine learning, such a risk measure is useful to train more robust models. Although the stochastic subgradient method (SGM) is a natural choice for minimizing the CVaR objective, we show that the prox-linear algorithm can be used to better exploit the structure of the objective, while still providing a convenient closed-form update. We then specialize a general convergence theorem for the prox-linear method to our setting, and show that it allows for a wider selection of step sizes compared to SGM. We support this theoretical finding experimentally, by showing that the performance of stochastic prox-linear is more robust to the choice of step size compared to SGM.
Si Yi Meng · Vasileios Charisopoulos · Robert Gower 🔗 
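The CVaR objective described here — the average of the top quantile of losses — can equivalently be written in the Rockafellar-Uryasev form, whose composite structure prox-linear-type methods exploit. A small sketch (our own, with illustrative data) checking that the two expressions agree:

```python
import numpy as np

def cvar_topk(losses, alpha):
    """CVaR_alpha as the average of the worst (1 - alpha) fraction of losses."""
    k = max(1, int(np.ceil(len(losses) * (1 - alpha))))
    return np.sort(losses)[-k:].mean()

def cvar_ru(losses, alpha):
    """Rockafellar-Uryasev form: min_t  t + E[max(loss - t, 0)] / (1 - alpha),
    minimized here by brute force over the observed loss values."""
    vals = [t + np.maximum(losses - t, 0.0).mean() / (1 - alpha) for t in losses]
    return min(vals)

losses = np.arange(1.0, 11.0)          # losses 1..10
print(cvar_topk(losses, 0.8))          # mean of top 20%: (9 + 10) / 2 = 9.5
print(cvar_ru(losses, 0.8))            # matches the top-quantile average: 9.5
```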


Counterfactual Explanations Using Optimization With Constraint Learning
(Poster)
link
SlidesLive Video
Counterfactual explanations have invaluable potential to make model predictions more sensible to users. To increase their adoption in practice, several criteria that counterfactual explanations should adhere to have been put forward in the literature. We propose counterfactual explanations using optimization with constraint learning (CE-OCL), a generic and flexible approach that addresses all these criteria and allows room for further extensions. Specifically, we discuss how we can leverage an optimization with constraint learning framework for the generation of counterfactual explanations, and how components of this framework readily map to the criteria. We also propose two novel modeling approaches to address data manifold closeness and diversity, which are two key criteria for practical counterfactual explanations. We test CE-OCL on several datasets and present our results in a case study. Compared against current state-of-the-art methods, CE-OCL allows for more flexibility and has an overall superior performance in terms of several evaluation metrics proposed in related work.
Donato Maragno · Tabea E. Röber · Ilker Birbil 🔗 


Accelerated Single-Call Methods for Constrained Min-Max Optimization
(Poster)
link
SlidesLive Video
We study first-order methods for constrained min-max optimization. Existing methods either require two gradient calls or two projections in each iteration, which may be costly in applications. In this paper, we first show that the \emph{Optimistic Gradient (OG)} method, a \emph{single-call single-projection} algorithm, has an $O(\frac{1}{\sqrt{T}})$ convergence rate for inclusion problems with operators that satisfy the weak Minty variational inequality (MVI). Our second result is the first single-call single-projection algorithm, the \emph{Accelerated Reflected Gradient (ARG)} method, that achieves the \emph{optimal $O(\frac{1}{T})$} convergence rate for inclusion problems that satisfy negative comonotonicity. Both the weak MVI and negative comonotonicity are well-studied assumptions and capture a rich set of nonconvex-nonconcave min-max optimization problems. Finally, we show that the \emph{Reflected Gradient (RG)} method, another \emph{single-call single-projection} algorithm, has an $O(\frac{1}{\sqrt{T}})$ last-iterate convergence rate for constrained convex-concave min-max optimization, answering an open problem of Hsieh et al. (2019).

Yang Cai · Weiqiang Zheng 🔗 
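The "single-call" property means one fresh gradient evaluation per iteration, with the previous gradient reused. A minimal sketch of the Optimistic Gradient update on a toy bilinear game — a monotone instance, far simpler than the weak-MVI problems the paper targets, and unconstrained so no projection is needed:

```python
import numpy as np

# min_x max_y f(x, y) = x * y   (toy bilinear game; equilibrium at the origin)
grad = lambda x, y: (y, x)              # (df/dx, df/dy)

def optimistic_gd(x, y, eta=0.1, steps=2000):
    """Optimistic Gradient: one gradient call per step,
       z_{t+1} = z_t -/+ eta * (2 g_t - g_{t-1})."""
    gx_prev, gy_prev = grad(x, y)
    for _ in range(steps):
        gx, gy = grad(x, y)             # the single fresh gradient call
        x = x - eta * (2 * gx - gx_prev)   # descent in x
        y = y + eta * (2 * gy - gy_prev)   # ascent in y
        gx_prev, gy_prev = gx, gy
    return x, y

x, y = optimistic_gd(1.0, 1.0)
print(np.hypot(x, y))   # shrinks toward the equilibrium (0, 0)
```

On this game, plain simultaneous gradient descent-ascent spirals outward; the extrapolation term `2 g_t - g_{t-1}` is what restores convergence.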


Accelerated Algorithms for Monotone Inclusion and Constrained Nonconvex-Nonconcave Min-Max Optimization
(Poster)
link
SlidesLive Video
We study monotone inclusions and monotone variational inequalities, as well as their generalizations to nonmonotone settings. We first show that the \emph{Extra Anchored Gradient (EAG)} algorithm, originally proposed by [Yoon and Ryu, 2021] for unconstrained convex-concave min-max optimization, can be applied to solve the more general problem of Lipschitz monotone inclusion. More specifically, we prove that EAG solves Lipschitz monotone inclusion problems with an \emph{accelerated convergence rate} of $O(\frac{1}{T})$, which is \emph{optimal among all first-order methods} [Diakonikolas, 2020; Yoon and Ryu, 2021]. Our second result is an accelerated forward-backward splitting algorithm (AS), which not only achieves the accelerated $O(\frac{1}{T})$ convergence rate for all monotone inclusion problems, but also exhibits the same accelerated rate for a family of general (nonmonotone) inclusion problems that concern negative comonotone operators. As a special case of our second result, AS enjoys the $O(\frac{1}{T})$ convergence rate for solving a nontrivial class of nonconvex-nonconcave min-max optimization problems. Our analyses are based on simple potential function arguments, which might be useful for analysing other accelerated algorithms.

Yang Cai · Argyris Oikonomou · Weiqiang Zheng 🔗 


Accelerated Riemannian Optimization: Handling Constraints to Bound Geometric Penalties
(Poster)
link
SlidesLive Video
We propose a globally accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically convex functions in Hadamard manifolds. Our algorithm enjoys the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set that we define. Prior fully accelerated works resort to assuming that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods, whose applicability is limited to local optimization and to spaces of constant curvature, respectively. Achieving global and general Riemannian acceleration without assuming that the iterates stay in the feasible set was posed as an open question in (Kim & Yang, 2022), which we solve for Hadamard manifolds. In our solution, we show that we can use a linearly convergent algorithm for constrained strongly g-convex smooth problems to implement a Riemannian inexact proximal point operator that we use as a subroutine, which is of independent interest.
David MartinezRubio · Sebastian Pokutta 🔗 


Gradient Descent: Robustness to Adversarial Corruption
(Poster)
link
SlidesLive Video
Optimization using gradient descent (GD) is a ubiquitous practice in various machine learning problems, including training large neural networks. Noise-free GD and stochastic GD, corrupted by random noise, have been extensively studied in the literature, but less attention has been paid to the adversarial setting, in which the gradient values are subject to adversarial corruptions. In this work, we analyze the performance of GD under a proposed general adversarial framework. For the class of functions satisfying the Polyak-Łojasiewicz condition, we derive finite-time bounds on the minimax optimization error. Based on this bound, we provide a guideline on the choice of learning rate sequence, with theoretical guarantees on the robustness of GD against adversarial corruption.
FuChieh Chang · Farhang Nabiei · PeiYuan Wu · Alexandru Cioba · Sattar Vakili · Alberto Bernacchia 🔗 


Boosting as Frank-Wolfe
(Poster)
link
SlidesLive Video
Some boosting algorithms, such as LPBoost, ERLPBoost, and C-ERLPBoost, aim to solve the soft margin optimization problem with $\ell_1$-norm regularization. LPBoost rapidly converges to an $\epsilon$-approximate solution in practice, but it is known to take $\Omega(m)$ iterations in the worst case, where $m$ is the sample size. On the other hand, ERLPBoost and C-ERLPBoost are guaranteed to converge to an $\epsilon$-approximate solution in $O(\frac{1}{\epsilon^2} \ln \frac{m}{\nu})$ iterations. However, the computation per iteration is very high compared to LPBoost. To address this issue, we propose a generic boosting scheme that combines the Frank-Wolfe algorithm with any secondary algorithm and switches between them iteratively. We show that the scheme retains the same convergence guarantee as ERLPBoost and C-ERLPBoost. One can incorporate any secondary algorithm to improve performance in practice. This scheme comes from a unified view of boosting algorithms for soft margin optimization. More specifically, we show that LPBoost, ERLPBoost, and C-ERLPBoost are instances of the Frank-Wolfe algorithm. In experiments on real datasets, one of the instances of our scheme exploits the better updates of the secondary algorithm and performs comparably with LPBoost.

Ryotaro Mitsuboshi · Kohei Hatano · Eiji Takimoto 🔗 
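In the Frank-Wolfe view of boosting sketched above, the weak learner plays the role of the linear minimization oracle (LMO). A minimal vanilla Frank-Wolfe loop over the probability simplex, with an illustrative quadratic objective standing in for the soft-margin objective:

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, n, steps=500):
    """Vanilla Frank-Wolfe over the probability simplex.
    The LMO argmin_{s in simplex} <grad, s> is always a vertex: the
    coordinate with the smallest partial derivative (in boosting, this
    oracle call corresponds to asking for the best single hypothesis)."""
    p = np.full(n, 1.0 / n)
    for t in range(steps):
        g = grad_fn(p)
        s = np.zeros(n)
        s[np.argmin(g)] = 1.0          # LMO: best vertex of the simplex
        gamma = 2.0 / (t + 2.0)        # standard open-loop step size
        p = (1 - gamma) * p + gamma * s
    return p

# Illustrative objective: ||p - c||^2 with c inside the simplex, so p* = c
c = np.array([0.2, 0.3, 0.5])
p = frank_wolfe_simplex(lambda p: 2 * (p - c), len(c))
print(p)   # approaches c = [0.2, 0.3, 0.5] at the O(1/t) Frank-Wolfe rate
```

The iterate stays feasible by construction (a convex combination of vertices), which is the reason Frank-Wolfe needs no projection step.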


RandProx: Primal-Dual Optimization Algorithms with Randomized Proximal Updates
(Poster)
link
SlidesLive Video
Proximal splitting algorithms are well suited for large-scale nonsmooth optimization problems. We propose a primal-dual algorithm in which the dual update is randomized, with the proximity operator of one of the functions replaced by a stochastic oracle. A nonsmooth variance-reduction technique is implemented so that the algorithm finds an exact minimizer of the general problem. We derive linear convergence results in the presence of strong convexity. Several existing randomized algorithms, like Point-SAGA, are recovered as particular cases. Randomness helps to obtain faster algorithms; this has long been known for stochastic-gradient-type algorithms, and our work shows that it fully applies in the more general primal-dual setting as well.
Laurent Condat · Peter Richtarik 🔗 


A Unified Framework to Understand Decentralized and Federated Optimization Algorithms: A Multi-Rate Feedback Control Perspective
(Poster)
link
SlidesLive Video
Distributed algorithms have been playing an increasingly important role in many applications, such as machine learning, signal processing, and control. In this work, we provide a fresh perspective to understand, analyze, and design distributed optimization algorithms. Through the lens of multi-rate feedback control, we show that a wide class of distributed algorithms, including popular decentralized/federated schemes such as decentralized gradient descent, gradient tracking, and federated averaging, can be viewed as discretizing a certain continuous-time feedback control system, possibly with multiple sampling rates. This key observation not only allows us to develop a generic framework to analyze the convergence of the entire algorithm class; more importantly, it also leads to an interesting way of designing new distributed algorithms. We develop the theory behind our framework and provide examples to highlight how the framework can be used in practice.
xinwei zhang · Nicola Elia · Mingyi Hong 🔗 


Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods
(Poster)
link
SlidesLive Video
Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. The success of the method has led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications, and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach, we develop several new variants of SGDA, such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although variants of these methods were known for minimization problems, they were never considered for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments.
Aleksandr Beznosikov · Eduard Gorbunov · Hugo Berard · Nicolas Loizou 🔗 
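For reference, the base SGDA update that the unified analysis covers, sketched on a toy strongly-convex-strongly-concave saddle problem with additive gradient noise — one simple instance of a parametric assumption on the stochastic estimates; all constants are illustrative.

```python
import numpy as np

# Toy saddle problem: min_x max_y f(x, y) = 0.5*x^2 + x*y - 0.5*y^2
# (strongly convex in x, strongly concave in y; saddle point at the origin)
def sgda(x, y, eta=0.05, sigma=0.01, steps=2000, seed=0):
    """Simultaneous stochastic gradient descent-ascent with additive
    zero-mean noise of standard deviation sigma on each gradient."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        gx = x + y + sigma * rng.standard_normal()   # noisy df/dx
        gy = x - y + sigma * rng.standard_normal()   # noisy df/dy
        x, y = x - eta * gx, y + eta * gy            # simultaneous update
    return x, y

x, y = sgda(1.0, 1.0)
print(np.hypot(x, y))   # small: iterates settle in an O(sigma)-neighborhood of 0
```

With a constant step size the iterates reach a noise-dominated neighborhood of the saddle point; the unified analysis in the paper makes such neighborhood/rate trade-offs precise across many SGDA variants.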


Optimization using Parallel Gradient Evaluations on Multiple Parameters
(Poster)
link
SlidesLive Video
We propose a first-order method for convex optimization, where instead of being restricted to the gradient from a single parameter, gradients from multiple parameters can be used during each step of gradient descent. This setup is particularly useful when a few processors are available that can be used in parallel for optimization. Our method uses gradients from multiple parameters in synergy to update these parameters together towards the optimum. While doing so, it is ensured that the computational and memory complexity is of the same order as that of gradient descent. Empirical results demonstrate that even using gradients from as few as \textit{two} parameters, our method can often obtain significant acceleration and provide robustness to hyperparameter settings. We remark that the primary goal of this work is less theoretical, and is instead aimed at exploring the understudied case of using multiple gradients during each step of optimization.
Yash Chandak · Shiv Shankar · Venkata Gandikota · Philip Thomas · Arya Mazumdar 🔗 


An Accuracy Guaranteed Online Solver for Learning in Dynamic Feature Space
(Poster)
link
SlidesLive Video
We study the problem of adding or deleting features of data from machine learning models trained using empirical risk minimization. We focus on online algorithms that can handle more general regularization terms, and provide practical guidance for two classical regularizers, i.e., the group Lasso and the $\ell_p$-norm regularizer. Across a variety of benchmark datasets, our algorithm improves upon the runtime of prior methods while maintaining the \emph{same} generalization accuracy.

Diyang Li · Bin Gu 🔗 


Adaptive Methods for Nonconvex Continual Learning
(Poster)
link
SlidesLive Video
One of the objectives of continual learning is to prevent catastrophic forgetting when learning multiple tasks sequentially, and the existing solutions have been driven by the conceptualization of the plasticity-stability dilemma. However, the convergence of continual learning for each sequential task has been less studied so far. In this paper, we provide a convergence analysis of memory-based continual learning with stochastic gradient descent and empirical evidence that training on the current task causes the cumulative degradation of previous tasks. We propose an adaptive method for nonconvex continual learning (NCCL), which adjusts the step sizes of both previous and current tasks with the gradients. The proposed method can achieve the same convergence rate as SGD when the catastrophic forgetting term, which we define in the paper, is suppressed at each iteration. Further, we demonstrate that the proposed algorithm improves the performance of continual learning over existing methods for several image classification tasks.
Seungyub Han · Yeongmo Kim · Taehyun Cho · Jungwoo Lee 🔗 


Random initialisations performing above chance and how to find them
(Poster)
link
Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation and averaging their random, but suitably permuted initialisation performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large learning rate regime, SGD seems to discover diverse modes. 
Frederik Benzing · Simon Schug · Robert Meier · Johannes von Oswald · Yassir Akram · Nicolas Zucchet · Laurence Aitchison · Angelika Steger 🔗 
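A hedged sketch of the kind of unit matching described above: align the hidden units of two networks by greedily pairing rows of their (normalized) weight matrices by cosine similarity. The paper's actual algorithm may differ; this only demonstrates that a planted permutation is recoverable.

```python
import numpy as np

def match_units(W_a, W_b):
    """Greedy unit matching: pair each row of W_a with its most similar
    unused row of W_b under cosine similarity."""
    A = W_a / np.linalg.norm(W_a, axis=1, keepdims=True)
    B = W_b / np.linalg.norm(W_b, axis=1, keepdims=True)
    sim = A @ B.T                      # sim[i, j] = cosine(row i of A, row j of B)
    perm = np.zeros(len(W_a), dtype=int)
    used = set()
    for i in range(len(W_a)):
        j = max((j for j in range(len(W_b)) if j not in used),
                key=lambda j: sim[i, j])
        perm[i] = j
        used.add(j)
    return perm

# Sanity check: recover a planted row permutation of a random weight matrix
rng = np.random.default_rng(0)
W_a = rng.standard_normal((8, 16))
shuffle = rng.permutation(8)
W_b = W_a[shuffle]
perm = match_units(W_a, W_b)
print(np.array_equal(W_b[perm], W_a))   # True: permutation recovered
```

Once such a permutation is found, one can permute one network's units accordingly and then interpolate or average parameters, as in the experiments the abstract describes.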


On the Complexity of Finding Small Subgradients in Nonsmooth Optimization
(Poster)
link
SlidesLive Video
We study the oracle complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/\delta\epsilon^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that this rate can be derandomized for smooth functions with merely a logarithmic dependence on the smoothness parameter. Moreover, we establish several lower bounds for this task which hold for any randomized algorithm, with or without convexity. Finally, we show how the convergence rate of finding $(\delta,\epsilon)$-stationary points can be improved when the function is convex, a setting which we motivate by proving that in general no finite-time algorithm can produce points with small subgradients even for convex functions.

Guy Kornowski · Ohad Shamir 🔗 


On the Complexity of Finding Small Subgradients in Nonsmooth Optimization
(Oral)
link
We study the oracle complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/\delta\epsilon^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that this rate can be derandomized for smooth functions with merely a logarithmic dependence on the smoothness parameter. Moreover, we establish several lower bounds for this task which hold for any randomized algorithm, with or without convexity. Finally, we show how the convergence rate of finding $(\delta,\epsilon)$-stationary points can be improved when the function is convex, a setting which we motivate by proving that in general no finite-time algorithm can produce points with small subgradients even for convex functions.

Guy Kornowski · Ohad Shamir 🔗 


Solving a Special Type of Optimal Transport Problem by a Modified Hungarian Algorithm
(Poster)
link
SlidesLive Video
We observe that computing the empirical Wasserstein distance in the independence test is an optimal transport (OT) problem with a special structure. This observation inspires us to study this special type of OT problem and propose a modified Hungarian algorithm to solve it exactly. For an OT problem between marginals with $m$ and $n$ atoms ($m\geq n$), the computational complexity of the proposed algorithm is $\mathcal{O}(m^2n)$. Computing the empirical Wasserstein distance in the independence test requires solving this special type of OT problem, where we have $m=n^2$. The associated computational complexity of our algorithm is $\mathcal{O}(n^5)$, while that of the classic Hungarian algorithm is $\mathcal{O}(n^6)$. Numerical experiments validate our theoretical analysis. Broader applications of the proposed algorithm are discussed at the end.

Yiling Xie · Yiling Luo · Xiaoming Huo 🔗 


BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach
(Poster)
link
SlidesLive Video
Bilevel optimization (BO) is useful for solving a variety of important machine learning problems, including but not limited to hyperparameter optimization, meta-learning, continual learning, and reinforcement learning. Conventional BO methods need to differentiate through the lower-level optimization process with implicit differentiation, which requires expensive calculations related to the Hessian matrix. There has been a recent quest for first-order methods for BO, but the methods proposed to date tend to be complicated and impractical for large-scale deep learning applications. In this work, we propose a simple first-order BO algorithm that depends only on first-order gradient information, requires no implicit differentiation, and is practical and efficient for large-scale nonconvex functions in deep learning. We provide a non-asymptotic convergence analysis of the proposed method to stationary points for nonconvex objectives and present empirical results that show its superior practical performance.
Mao Ye · Bo Liu · Stephen Wright · Peter Stone · Qiang Liu 🔗 


On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly-Communicating MDPs
(Poster)
link
We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, \& Sutton, 2021a) and RVI Q-learning (Abounadi, Bertsekas, \& Borkar, 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs for which a learning algorithm with a single stream of experience can guarantee obtaining a policy achieving the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true for weakly-communicating MDPs. To the best of our knowledge, our results are the first showing that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, \& Sutton (2021b) converge if the Semi-MDP induced by the options is weakly communicating.
Yi Wan · Richard Sutton 🔗 


Optimizing the Performative Risk under Weak Convexity Assumptions
(Poster)
link
SlidesLive Video
In performative prediction, a predictive model impacts the distribution that generates future data, a phenomenon that is ignored in classical supervised learning. In this closed-loop setting, the natural measure of performance, named the performative risk ($\mathrm{PR}$), captures the expected loss incurred by a predictive model \emph{after} deployment. The core difficulty of using the performative risk as an optimization objective is that the data distribution itself depends on the model parameters. This dependence is governed by the environment and not under the control of the learner. As a consequence, even the choice of a convex loss function can result in a highly non-convex $\mathrm{PR}$ minimization problem. Prior work has identified a pair of general conditions on the loss and the mapping from model parameters to distributions that implies convexity of the performative risk. In this paper, we relax these assumptions and focus on obtaining weaker notions of convexity, without sacrificing the amenability of the $\mathrm{PR}$ minimization problem to iterative optimization methods.

Yulai Zhao 🔗 


Quantization-based Optimization: Alternative Stochastic Approximation of Global Optimization
(Poster)
link
SlidesLive Video
In this study, we propose a global optimization algorithm based on quantizing the energy level of an objective function in an NP-hard problem. Under the white noise hypothesis for a quantization error with a dense and uniform distribution, we can regard the quantization error as i.i.d. white noise. By stochastic analysis, the proposed algorithm converges weakly only under conditions satisfying Lipschitz continuity, instead of local convergence properties such as the Hessian constraint of the objective function. This shows that the proposed algorithm ensures global optimization by Laplace's condition. Numerical experiments show that the proposed algorithm outperforms conventional learning methods in solving NP-hard optimization problems such as the traveling salesman problem.
Jinwuk Seok · Changsik Cho 🔗 


Completing the Model Optimization Process by Correcting Patterns of Failure in Regression Tasks
(Poster)
link
SlidesLive Video
Model selection and hyperparameter optimization sometimes prove to be complex and costly processes with unfinished outcomes. In fact, a so-called optimized model can still suffer from patterns of failure when predicting on new data, affecting the generalization error. In this paper, we focus on regression tasks and introduce an additional stage to the model optimization process in order to render it more reliable. This new step aims to correct error patterns when the model makes predictions on unlabeled data. To that end, our method includes two techniques. AutoCorrect Rules leverage the model's under/over-estimation bias and apply simple rules to adjust predictions. AutoCorrect Model is a supervised approach which exploits different representations to predict residuals in order to revise model predictions. We empirically prove the relevance of our method on the outcome of an AutoML tool using different time budgets, and on a specific optimization case leveraging a pre-trained model for an image regression task.
Thomas Bonnier 🔗 
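One way to read the AutoCorrect Rules idea is as a per-bin bias correction fitted on held-out data. The sketch below is our own illustration; the binning scheme, function names, and constants are assumptions, not the paper's exact rules.

```python
import numpy as np

def fit_bias_rules(preds, targets, n_bins=4):
    """Estimate the mean residual (pred - target) within quantile bins of
    the predictions, on held-out labeled data."""
    edges = np.quantile(preds, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.searchsorted(edges, preds, side="right") - 1, 0, n_bins - 1)
    bias = np.array([(preds[which == b] - targets[which == b]).mean()
                     if np.any(which == b) else 0.0 for b in range(n_bins)])
    return edges, bias

def apply_rules(preds, edges, bias):
    """Subtract the per-bin bias from new predictions."""
    which = np.clip(np.searchsorted(edges, preds, side="right") - 1,
                    0, len(bias) - 1)
    return preds - bias[which]

# Toy model that systematically overestimates by 2.0
targets = np.linspace(0.0, 10.0, 200)
preds = targets + 2.0
edges, bias = fit_bias_rules(preds, targets)
corrected = apply_rules(preds, edges, bias)
print(np.abs(corrected - targets).mean())   # near zero: the bias is removed
```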


Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses & Extension to Non-Convex Losses
(Poster)
link
SlidesLive Video
We study differentially private (DP) stochastic optimization (SO) with data containing outliers and loss functions that are (possibly) not Lipschitz continuous. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz over data (i.e., stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it is often unrealistic: in many practical problems, the loss function may not be uniformly Lipschitz. Even when the loss function is Lipschitz continuous, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work does not require the loss function to be uniformly Lipschitz. Instead, building on a recent line of work (Wang et al., 2020; Kamath et al., 2022), we make the weaker assumption that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on DP Lipschitz SO, our excess risk scales with the $k$-th moment bound instead of the Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to prior works, our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.

Andrew Lowy · Meisam Razaviyayn 🔗 


Sufficient Conditions for Nonasymptotic Convergence of Riemannian Optimization Methods
(Oral)
link
Motivated by energy-based analyses for descent methods in the Euclidean setting, we investigate a generalisation of such energy-based analyses for descent methods over Riemannian manifolds. In doing so, we find that it is possible to derive curvature-free guarantees for such descent methods, improving on work by Zhang and Sra [2016]. This analysis allows us to study the acceleration of Riemannian gradient descent in the geodesically-convex setting, and improve on an existing result by Alimisis et al. [2021]. Finally, extending the analysis of Ahn and Sra [2020], we attempt to provide some sufficient conditions for the acceleration of Riemannian descent methods in the strongly geodesically convex setting.
Vishwak Srinivasan · Ashia Wilson 🔗 