Workshop
OPT 2023: Optimization for Machine Learning
Cristóbal Guzmán · Courtney Paquette · Katya Scheinberg · Aaron Sidford · Sebastian Stich
Hall D2 (level 1)
Fri 15 Dec, 7 a.m. PST
Optimization lies at the heart of many machine learning algorithms and enjoys great interest in our community. Indeed, this intimate relation of optimization with ML is the key motivation for the OPT series of workshops. We aim to foster discussion, discovery, and dissemination of stateoftheart research in optimization relevant to ML.
To foster the spirit of innovation and collaboration, a goal of this workshop, OPT 2023 will focus the contributed talks on research in "Optimization in the Wild"; this title is meant to encompass the new challenges that traditional optimization theory and algorithms face with the growth and variety of novel ML applications.
Successful applications of both theory and algorithms from optimization to ML frequently require a profound redesign or even entirely new approaches. This becomes apparent in settings where the classical (empirical) risk minimization approach is no longer sufficient to address the challenges of learning. As motivating examples, we consider the case of learning under (group or individual) fairness in distributed scenarios, learning under differential privacy, robustness, multitask and transfer learning, as well as sampling from logconcave distributions. On the other hand, novel neural network architectures (such as transformers) require exploiting its structures for efficient optimization in crucial ways. For these models and problems: What is the role of optimization? What synergies can be exploited with the insights coming from these particular areas towards more efficient and reliable solutions? We will foster discussions directed at developing understanding of these challenges, and raising awareness of the capabilities and risks of using optimization in each of these areas.
Schedule
Fri 7:00 a.m.  7:01 a.m.

Opening Remarks
(
Opening remarks
)
>
SlidesLive Video 
Cristóbal Guzmán 🔗 
Fri 7:00 a.m.  7:30 a.m.

DoG is SGD’s best friend: toward tuningfree stochastic optimization, Yair Carmon
(
Plenary speaker
)
>
SlidesLive Video Abstract: While stochastic optimization methods drive continual improvements in machine learning, choosing the optimization parameters—and particularly the learning rate (LR)—remains a difficulty. In this talk, I will describe our work on removing LR tuning from stochastic gradient descent (SGD), culminating in a tuningfree dynamic SGD step size formula, which we call Distance over Gradients (DoG). We show that DoG removes the need to tune learning rate both theoretically (obtaining strong parameterfree convergence guarantees) and empirically (performing nearly as well as expensivelytuned SGD on neural network training tasks). 
Yair Carmon 🔗 
Fri 7:30 a.m.  8:00 a.m.

Contributed Talks 1: *Escaping mediocrity: how twolayer networks learn hard generalized linear models* and *Last Iterate Convergence of Popov Method for Nonmonotone Stochastic Variational Inequalities*
(
Contributed talks
)
>
SlidesLive Video Two 15 minute talks:

Bruno Loureiro · Daniil Vankov · Courtney Paquette 🔗 
Fri 8:00 a.m.  9:00 a.m.

Poster Session 1
(
Break
)
>
Posters in this session

Egor Shulgin · Mingzhen He · Hanmin Li · Thibault Lahire · Eric Zelikman · Damien Scieur · Rajat Vadiraj Dwaraknath · Gene Li · Zhanhong Jiang · Rahul Jain · Zihan Zhou · Tianyue Zhang · Ilyas Fatkhullin · Frederik Kunstner · Utkarsh Singhal · Bruno Loureiro · Krishna C Kalagarla · Kai Liu · Michal Derezinski · Ross Clarke · Dimitri Papadimitriou · Mo Zhou · Jörg Franke · Chandler Smith · Darshan Chakrabarti · Trang H. Tran · Mokhwa Lee · Wei Kuang · Vincent Roulet · John Lazarsfeld · Donghyun Oh · Yihe Deng · Fu Wang · Junchi YANG · Dániel Rácz · Jeffrey Flanigan · Aaron Mishkin · Luca Scharr · Robert Gower · Chaoyue Liu · Yushen Huang · Nicholas Recker

Fri 9:00 a.m.  9:30 a.m.

Contributed Talks 2: *An Algorithm with Optimal DimensionDependence for ZeroOrder Nonsmooth Nonconvex Stochastic Optimization* and *Practical Principled Policy Optimization for Finite MDPs*
(
Contributed talks
)
>
SlidesLive Video Two 15 minute talks:

Guy Kornowski · Michael Lu · Aaron Sidford 🔗 
Fri 9:30 a.m.  10:00 a.m.

Aiming towards the minimizers: fast convergence of SGD for overparameterized problems, Dmitriy Drusvyatskiy
(
Plenary speaker
)
>
SlidesLive Video Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worstcase iteration complexity as the deterministic gradient method, while using only a small minibatch of sampled gradients in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer. 
Dmitriy Drusvyatskiy 🔗 
Fri 10:00 a.m.  12:00 p.m.

Lunch
(
Break
)
>

🔗 
Fri 12:00 p.m.  12:30 p.m.

Evaluating LargeScale Learning Systems, Virginia Smith
(
Plenary speaker
)
>
SlidesLive Video Abstract: To deploy machine learning models in practice it is critical to have a way to reliably evaluate their effectiveness. Unfortunately, the scale and complexity of modern machine learning systems makes it difficult to provide faithful evaluations and gauge performance across potential deployment scenarios. In this talk I discuss our work addressing challenges in largescale ML evaluation. First, I explore the problem of hyperparameter optimization in federated networks of devices, where issues of device subsampling, heterogeneity, and privacy can introduce noise in the evaluation process and make it challenging to effectively perform optimization. Second, I present ReLM, a system for validating and querying large language models (LLMs). Although LLMs have been touted for their ability to generate naturalsounding text, there is a growing need to evaluate the behavior of LLMs in light of issues such as data memorization, bias, and inappropriate language. ReLM poses LLM validation queries as regular expressions to enable faster and more effective LLM evaluation. 
Virginia Smith 🔗 
Fri 12:30 p.m.  1:00 p.m.

Contributed Talks 3: *Dueling Optimization with a Monotone Adversary* and *HighDimensional Prediction for Sequential Decision Making*
(
Contributed talks
)
>
SlidesLive Video Two 15 minute talks:

Naren Manoj · Georgy Noarov · Cristóbal Guzmán 🔗 
Fri 1:00 p.m.  2:00 p.m.

Poster Session 2
(
Break
)
>
Posters in this session:

XiaoYang Liu · Guy Kornowski · Philipp Dahlinger · Abbas Ehsanfar · Binyamin Perets · David MartinezRubio · Sudeep Raja Putta · Runlong Zhou · Connor Lawless · Julian J Stier · Chen Fan · Michal Šustr · James Spann · Jung Hun Oh · Yao Xie · Qi Zhang · Krishna Acharya · Sourabh Medapati · Sharan Vaswani · Sruthi Gorantla · Mohamed Elsayed · Hongyang Zhang · Reza Asad · Viktor Pavlovic · Betty Shea · Georgy Noarov · Chuan He · Daniil Vankov · Taoan Huang · Michael Lu · Anant Mathur · Konstantin Mishchenko · Stanley Wei · Francesco Faccio · Yuchen Zeng · Tianyue Zhang · Chris Junchi Li · Aaron Mishkin · Sina Baharlouei · Chen Xu · Sasha Abramowitz · Sebastian Stich · Felix Dangel

Fri 2:00 p.m.  2:30 p.m.

Sharply predicting the behavior of complex iterative algorithms with random data, Ashwin Pananjady
(
Plenary speaker
)
>
SlidesLive Video Abstract: Iterative algorithms are the workhorses of modern signal processing and statistical learning, and are widely used to fit complex models to random data. While the choice of an algorithm and its hyperparameters determines both the speed and fidelity of the learning pipeline, it is common for this choice to be made heuristically, either by expensive trialanderror or by comparing upper bounds on convergence rates of various candidate algorithms. Motivated by these issues, we develop a principled framework that produces sharp, iteratebyiterate characterizations of solution quality for complex iterative algorithms on several nonconvex modelfitting problems with random data. Such sharp predictions can provide precise separations between families of algorithms while also revealing nonstandard convergence phenomena. We will showcase the general framework on several canonical models in statistical machine learning. 
Ashwin Pananjady 🔗 
Fri 2:30 p.m.  3:00 p.m.

Provable Feature Learning in Gradient Descent, Jason Lee
(
Plenary speaker
)
>
SlidesLive Video
**Abstract:** We focus on the task of learning a single index model $\sigma(w* x)$ with respect to the isotropic Gaussian distribution in d dimensions, including the special case when $\sigma$ is a kth order hermite which corresponds to the Gaussian analog of parity learning. Prior work has shown that the sample complexity of learning w* is governed by the *information exponent* k* of the link function \sigma, which is defined as the index of the first nonzero Hermite coefficient of $\sigma$. Prior upper bounds have shown that n > d^{k*1} samples suffice for learning w* and that this is tight for online SGD (Ben Arous et al., 2020). However, the CSQ lower bound for gradient based methods only shows that n > d^{k*/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns w* with n > d^{k*/2}$ samples.
Next, we turn to the problem of learning multi index models f(x) = g(Ux), where U encodes a latent representation of low dimension. Significant prior work has established that neural networks trained by gradient descent behave like kernel methods, despite significantly worse empirical performance of kernel methods. However, in this work we demonstrate that for this large class of functions that there is a large gap between kernel methods and gradient descent on a twolayer neural network, by showing that gradient descent learns representations relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f*(x)=g(Ux) where U is d by r. When the degree of f* is p, it is known that n≍dp samples are necessary to learn f* in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f*. This results in an improved sample complexity of n≍d^2r+drp. Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation U but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of d.

Jason Lee 🔗 
Fri 3:00 p.m.  3:01 p.m.

Closing Remarks
(
Closing
)
>

Cristóbal Guzmán 🔗 


DetCGD: Compressed Gradient Descent with Matrix Stepsizes for NonConvex Optimization
(
Poster
)
>
link
This paper introduces a new method for minimizing matrixsmooth nonconvex objectives through the use of novel Compressed Gradient Descent (CGD) algorithms enhanced with a matrixvalued stepsize. The proposed algorithms are theoretically analyzed first in the singlenode and subsequently in the distributed settings. Our theoretical results reveal that the matrix stepsize in CGD can capture the objective’s structure and lead to faster convergence compared to a scalar stepsize. As a byproduct of our general results, we emphasize the importance of selecting the compression mechanism and the matrix stepsize in a layerwise manner, taking advantage of model structure. Moreover, we provide theoretical guarantees for free compression, by designing specific layerwise compressors for the nonconvex matrix smooth objectives. Our findings are supported with empirical evidence. 
Hanmin Li · Avetik Karagulyan · Peter Richtarik 🔗 


Accelerated Methods for Riemannian MinMax Optimization Ensuring Bounded Geometric Penalties
(
Poster
)
>
link
In this work, we study optimization problems of the form $\min_x \max_y f(x, y)$, where $f(x, y)$ is defined on a product Riemannian manifold $\mathcal{M} \times \mathcal{N}$ and is $\mu_x$strongly geodesically convex (gconvex) in $x$ and $\mu_y$strongly gconcave in $y$, for $\mu_x, \mu_y \geq 0$. We design accelerated methods when $f$ is $(L_x, L_y, L_{xy})$smooth and $\mathcal{M}$, $\mathcal{N}$ are Hadamard. To that aim we introduce new gconvex optimization results, of independent interest: we show global linear convergence for metricprojected Riemannian gradient descent and improve existing accelerated methods by reducing geometric constants. Additionally, we complete the analysis of two previous works applying to the Riemannian minmax case by removing an assumption about iterates staying in a prespecified compact set.

David MartinezRubio · Christophe Roux · Christopher Criscitiello · Sebastian Pokutta 🔗 


Risk Bounds of Accelerated SGD for Overparameterized Linear Regression
(
Poster
)
>
link
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning. While existing optimization theory can explain its faster convergence, they fall short in explaining its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression. We establish an instancedependent excess risk bound for ASGD within each eigensubspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our resultsuggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. 
Xuheng Li · Yihe Deng · Jingfeng Wu · Dongruo Zhou · Quanquan Gu 🔗 


Follow the flow: Proximal flow inspired multistep methods
(
Poster
)
>
link
We investigate a family of MultiStep Proximal Point Methods, the Backwards Differentiation Formulas, which are inspired by implicit linear discretization of gradient flow. The resulting methods are multistep proximal point methods, with similar computational cost in each update as theproximal point method. We explore several optimization methods where applying an approximatemultistep proximal points method results in improved convergence behavior. We argue that this isthe result of the lowering of truncation error in approximating gradient flow. 
Yushen Huang · Yifan Sun 🔗 


A Predicting Clipping Asynchronous Stochastic Gradient Descent Method in Distributed Learning
(
Poster
)
>
link
In this paper, we propose a new algorithm, termed Predicting Clipping Asynchronous Stochastic Gradient Descent (aka, PCASGD) to address the issue of staleness and time delay in asynchronous distributed learning settings. Specifically, PCASGD has two steps  the predicting step leverages the gradient prediction using Taylor expansion to reduce the staleness of the outdated weights whilethe clipping step selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the effects between these two steps. We theoretically present the convergence rate considering the effects of delay of the proposed algorithm with constant step size when the smooth objective functions are nonconvex. For empirical validation, we demonstrate the performance of the algorithm with two deep neural network architectures on two benchmark datasets. 
Haoxiang Wang · Zhanhong Jiang · Chao Liu · Soumik Sarkar · Dongxiang Jiang · Young Lee 🔗 


Last Iterate Convergence of Popov Method for Nonmonotone Stochastic Variational Inequalities
(
Oral
)
>
link
This paper focuses on nonmonotone stochastic variational inequalities (SVIs) that may not have a unique solution. A commonly used efficient algorithm to solve VIs is the Popov method, which is known to have the optimal convergence rate for VIs with Lipschitz continuous and strongly monotone operators. We introduce a broader class of structured nonmonotone operators, namely *$p$quasisharp* operators *$p> 0$*, which allows tractably analyzing convergence behavior of algorithms. We show that the stochastic Popov method converges \emph{almost surely} to a solution for all operators from this class under a *linear growth*. In addition, we obtain the last iterate convergence rate (in expectation) for the method under a *linear growth* condition for $2$quasisharp operators. Based on our analysis, we refine the results for smooth $2$quasisharp and $p$quasisharp operators (on a compact set), and obtain the optimal convergence rates.

Daniil Vankov · Angelia Nedich · Lalitha Sankar 🔗 


Generalisable Agents for Neural Network Optimisation
(
Poster
)
>
link
Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO)a multiagent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome. 
Kaleab Tessera · Callum R. Tilbury · Sasha Abramowitz · Ruan John de Kock · Omayma Mahjoub · Benjamin Rosman · Sara Hooker · Arnu Pretorius 🔗 


Accelerated gradient descent: A guaranteed bound for a heuristic restart strategy
(
Poster
)
>
link
The $\mathcal{O}(1/k^2)$ convergence rate in function value of accelerated gradient descent is optimal, but there are many modifications that have been used to speed up convergence in practice. Among these modifications are restarts, that is, starting the algorithm anew with the current iteration being considered as the initial point. We focus on the adaptive restart techniques introduced by O'Donoghue and Candes, specifically their gradient restart strategy. While the gradient restart strategy is a heuristic in general, we prove that applying gradient restarts preserves and in fact improves the $\mathcal{O}(1/k^2)$ bound, hence establishing function value convergence, for onedimensional functions.Applications of our results to separable and nearly separable functions are presented.

Walaa Moursi · Stephen Vavasis · Viktor Pavlovic 🔗 


Adagrad Promotes Diffuse Solutions In Overparameterized Regimes
(
Poster
)
>
link
With the high use of overparameterized data in deep learning, the choice of optimizer in training plays a big role in a model's generalization ability due to solution selection bias. This work focuses on the adaptive gradient optimizer Adagrad, in the overparameterized leastsquares regime. We empirically find that when using sufficiently small step sizes, Adagrad promotes diffuse solutions in the sense of uniformity among the coordinates of the solution. Additionally, we theoretically show that Adagrad's solution, under the same conditions, exhibits greater diffusion compared to the solution obtained through gradient descent (GD) by analyzing the ratio of their updates. Lastly, we empirically compare the performance of Adagrad and GD on generated datasets. We observe a consistent trend that Adagrad promotes more diffused solutions, which aligns with our theoretical analysis. 
Andrew Rambidis · Jiayi Wang 🔗 


ModelFree, RegretOptimal Best Policy Identification in Online CMDPs
(
Poster
)
>
link
This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are modelfree, have low regret, and identify an optimal policy with a high probability. Existing modelfree algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy and provide only average performance guarantees when a policy is uniformly sampled at random from all previously used policies. In this paper, we develop a new algorithm, named PruningRefinementIdentification (PRI), based on a fundamental structural property of CMDPs we discover, called limited stochasticity. The property says for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which step and in which state a stochastic decision has to be taken and then finetunes the distributions of these stochastic decisions. PRI achieves trio objectives: (i) PRI is a modelfree algorithm; and (ii) it outputs a nearoptimal policy with a high probability at the end of learning; and (iii) in the tabular setting, PRI guarantees $\tilde{\mathcal{O}}(\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound $\tilde{\mathcal{O}}(K^{\frac{4}{5}})$ under a modelfree algorithm, where $K$ is the total number of episodes.

Zihan Zhou · Honghao Wei · Lei Ying 🔗 


Reducing Predict and Optimize to Convex Feasibility
(
Poster
)
>
link
Numerous applications in operations research and computer science require a combination of prediction and optimization  use historical data to predict the parameters of an optimization problem, and solve the optimization problem to output a decision. Addressing these two challenges independently results in the \emph{predictthenoptimize} problem. This approach can result in discrepancies between the prediction error, minimized during training, and the ultimate objective of minimizing the decision error. Consequently, recent work has focused on the \emph{predict and optimize} (PO) framework which focuses on training an endtoend model from the data to the decisions. We focus on linear programs (LPs) within the PO framework where the main challenge is handling the nondifferentiability of LPs. For a linear prediction model, we present a novel reduction from PO to a convex feasibility problem. This reduction enables us to use alternating projections onto convex sets for solving the PO problem, resulting in a computationallyefficient and theoretically principled algorithm. Finally, we validate the effectiveness of our approach on synthetic shortest path and fractional knapsack problems, demonstrating improved performance compared to the prior work. 
Saurabh Mishra · Sharan Vaswani 🔗 


Diversityadjusted adaptive step size
(
Poster
)
>
link
Optimizing machine learning models often requires careful tuning of parameters, especially the learning rate. Traditional methods involve exhaustive searches or adopting preestablished rates, both with drawbacks. The former is computationally intensive, a concern amplified by the trend toward larger models like large language models (LLM). The latter risks suboptimal model training. Consequently, there’s growing research on adaptive and parameterfree approaches to reduce reliance on manual step size tuning. While adaptive gradient methods like AdaGrad, RMSProp, and Adam aim to adjust learning rates dynamically, they are still reliant on learning rate parameters dependent on problemspecific characteristics. Our work explores the interplay between step size and gradient dissimilarity, introducing a ”Diversity adjusted adaptive step” that adapts to different levers of dissimilarity in sampled gradients within the SGD algorithm. We also provide approximate algorithms to compute this step size efficiently while maintaining performance. 
Parham Yazdkhasti · Xiaowen Jiang · Sebastian Stich 🔗 


Global CFR: MetaLearning in SelfPlay Regret Minimization
(
Poster
)
>
link
In realworld situations, players often encounter a distribution of similar but distinct games, like poker games with different public cards or trading varied correlated stock market assets. While these games exhibit related equilibria, current literature mainly delves into single games or their repeated versions. (Sychrovsky et al. 2023) recently introduced offline metalearning to accelerate equilibrium discovery for such distributions in a singleplayer online setting. We build upon this, extending to a twoplayer zerosum selfplay setting. Our method uniquely integrates information for next strategy selection for both players across all decision states, promoting global communication as opposed to the traditional local regret decomposition. Evaluations on distributions of matrix and sequential games reveal our metalearned algorithms surpass their nonmetalearned variants. 
David Sychrovský · Michal Sustr · Michael Bowling · Martin Schmid 🔗 


Noise Injection Irons Out Local Minima and Saddle Points
(
Poster
)
>
link
Nonconvex optimization problems are ubiquitous in machine learning, especially in Deep Learning. It has been observed in practice, that injecting artificial noise into stochastic gradient descent (SGD) can sometimes improve training and generalization performance.In this work, we formalize noise injection as a smoothing operator and (review and derive) convergence guarantees of SGD under smoothing. We empirically found that Gaussian smoothing works really well for training twolayer neural networks, but these findings to not translate to deeper nets. We would like to use this contribution to stimulate a discussion in the community to further investigate the impact of noise in training machine learning models. 
Konstantin Mishchenko · Sebastian Stich 🔗 


How to Guess a Gradient
(
Poster
)
>
link
What can you say about the gradient of a neural network without \emph{computing a loss} or \emph{knowing the label?} This may sound like a strange question: surely the answer is ``very little.'' However, in this paper, we show that gradients are more structured than previously thought. They lie in a predictable lowdimensional subspace which depends on the network architecture and incoming features.Exploiting this structure can significantly improve gradientfree optimization schemes based on directional derivatives, which until now have struggled to scale beyond small networks trained on MNIST. We study how to narrow the gap in optimization performance between methods that calculate exact gradients and those that use directional derivatives, demonstrate new phenomena that occur when using these methods, and highlight new challenges in scaling these methods. 
Utkarsh Singhal · Brian Cheung · Kartik Chandra · Jonathan RaganKelley · Josh Tenenbaum · Tomaso Poggio · Stella X. Yu 🔗 


Stochastic FISTA Step Search Algorithm for Convex Optimization
(
Poster
)
>
link
In this paper, we propose an accelerated stochastic step search algorithm which combines an accelerated method with a fully adaptive step size parameter for convex problems in (Scheinberg et. al., 2014) with stochastic step search analysis in (Paquette and Scheinberg, 2020). Under appropriate conditions on the accuracy of the estimates of gradient and function value our algorithm achieves expected iteration complexity of $\mathcal{O}(1/\sqrt{\epsilon})$ to reach an $\epsilon$accurate solution which satisfies $f(x)  f_* \leq \epsilon$. This complexity matches with the iteration complexity of deterministic Nesterov's accelerated and FISTA algorithms (Nesterov, 1983, Beck and Teboulle, 2009). This paper continues the line of work on stochastic adaptive algorithms studied in (Berahas et. al., 2021, Blanchet et. al., 2019, Paquette and Scheinberg, 2020) and is the first one to develop an accelerated gradient descent type algorithm in this domain.

Trang H. Tran · Lam Nguyen · Katya Scheinberg 🔗 


KSpin Ising Model for Combinatorial Optimizations over Graphs: An Reinforcement Learning Approach
(
Poster
)
>
link
Many combinatorial optimization problems are defined on graphs and are difficult to solve due to the existence of many local optima. In this paper, we propose a Kspin Ising model for graphbased combinatorial optimization (GCO) problems inspired by the perspective of curriculum learning, which guide the policy neural networks to converge to highquality solutions. First, we propose a \textit{Kspin Ising model} to minimize its \textit{Hamiltonian} in reinforcement learning (RL) over multiple steps on graphs, which is generic for GCO problems and easy to be used. Second, we provide general Kspin Ising formulations for NPcomplete problems, and present detailed formulations together with interpretations for the classical graph maxcut problem. Third, we evaluate our RL approach based on both synthetic and benchmark datasets in the graph maxcut problem. In the synthetic dataset, our approach has a litter better (or the same) performance than (as) the commercial solver Gurobi \cite{2023gurobi} in smallscale scenarios (e.g., $100 \sim 3,000$ nodes) with hundreds of speedup, and outperforms Gurobi (with the improvement of $10\%$) in largescale scenarios (e.g., $5,000 \sim 10,000$ nodes) with tens of speedup. In the benchmark dataset, our approach obtains better (or the same) performance than five comparison approaches.

XiaoYang Liu · Ming Zhu 🔗 


ParameterAgnostic Optimization under Relaxed Smoothness
(
Poster
)
>
link
In training machine learning models, the tuning of hyperparameters such as the stepsize is both timeconsuming and intricate. To address this challenge, many adaptive optimization algorithms have been developed to achieve nearoptimal complexities, even when stepsizes are independent of problem parameters, provided the function is $L$smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$smoothness, all current convergence results still necessitate tuning the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum can achieve a nearoptimal complexity without prior knowledge of any problem parameter, though this introduces an exponential term dependent on $L_1$. We further establish that this term is inescapable to such schemes. Interestingly, in deterministic settings, this exponential factor can be negated using Gradient Descent with a Backtracking Line Search. To our knowledge, these represent the first parameteragnostic convergence result for this generalized smoothness paradigm.

Florian Hübler · Junchi YANG · Xiang Li · Niao He 🔗 


Escaping mediocrity: how twolayer networks learn hard generalized linear models
(
Oral
)
>
link
This study explores the sample complexity for twolayer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is wellestablished that in this scenario $n=O(d\log d)$ samples are typically needed. However, we provide precise results concerning the prefactors in highdimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.

Luca Arnaboldi · Florent Krzakala · Bruno Loureiro · Ludovic Stephan 🔗 


The Expressive Power of LowRank Adaptation
(
Poster
)
>
link
*LowRank Adaptation* (LoRA), a parameterefficient finetuning method that leverages lowrank adaptation of weight matrices, has emerged as a prevalent technique for finetuning pretrained models such as large language models and diffusion models.Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRArank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRArank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank$(\frac{\text{embedding size}}{2})$ LoRA adapters.

Yuchen Zeng · Kangwook Lee 🔗 


FaDE: Fast DARTS Estimator on Hierarchical NAS Spaces
(
Poster
)
>
link
Vast search spaces and expensive architecture evaluations make neural architecture search a challenging problem.Hierarchical search spaces allow for comparatively cheap evaluations on neural network sub modules to serve as surrogate for architecture evaluations.Yet, sometimes the hierarchy is too restrictive or the surrogate fails to generalize. We present FaDE that uses differentiable architecture search to iteratively obtain relative performance predictions on finite regions of a hierarchical neural architecture search space.The relative nature of these ranks calls for a memoryless, batchwise outer search algorithm which we provide in form of an evolutionary approach that features a pseudogradient descent.FaDE is especially suited on deep hierarchical respectively multicell search spaces which it can explore by linear instead of exponential cost and therefore eliminates the need for a proxy search space.FaDE solely trains on the neural architecture search space, not on any space of neural architecture sub modules.Our experiments show that firstly, FaDEranks on finite regions of the search space correlate with corresponding architecture performances and secondly, the ranks can empower a pseudogradient evolutionary search on the complete neural architecture search space. 
Simon Neumeyer · Julian J Stier · Michael Granitzer 🔗 


Nesterov Meets Robust Multitask Learning Twice
(
Poster
)
>
link
In this paper, we study temporal multitask learning problem where we impose smoothness constraint on timeseries weights. Besides, to select important features, group lasso is introduced. Moreover, the regression loss in each time frame is nonsquared to alleviate the influence of various scales of noise in each task, in addition to the nuclear norm for lowrank property. We first formulate the objective as a maxmin problem, where the dual variable can be optimized via accelerated dual ascent method, while the primal variable can be solved via smoothed Fast Iterative ShrinkageThresholding Algorithm (SFISTA). We provide convergence analysis of the proposed method and experiments demonstrate its effectiveness. 
Yifan Kang · Kai Liu 🔗 


On the Interplay Between Stepsize Tuning and Progressive Sharpening
(
Poster
)
>
link
Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a \emph{fixed} stepsize [Cohen et al, 2022]. We investigate empirically how the sharpness evolves when using stepsizetuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch may be well explained by its tendency to everincrease the sharpness of the objective in the full or large batch regimes. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, while outperforming its Armijo and constant stepsizes counterparts. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness. 
Vincent Roulet · Atish Agarwala · Fabian Pedregosa 🔗 


Why Adam Outperforms Gradient Descent on Language Models: A HeavyTailed Class Imbalance Problem
(
Poster
)
>
link
We show that the heavytailed class imbalance found in language modeling tasks leads to difficulties in optimization dynamics. When training with gradient descent, the loss associated with lowfrequency classes decreases slower than the loss associated with high frequency classes. Under theheavytailed class imbalance found in language modeling tasks, most samples are from classes oflow relative frequency, leading to overall slow decreasing on the average loss. Signbased optimizerssuch as Adam and sign descent do not suffer from this problem, and lead to decrease on all classes.We give evidence of this behavior on training for a 2layer transformer on language data, a linearmodel on synthetic data whose only property is a heavytailed class distribution, and a convolutionalnetwork on a modified MNIST dataset made to exhibit heavytailed class imbalance. 
Robin Yadav · Frederik Kunstner · Mark Schmidt · Alberto Bietti 🔗 


Level Set Teleportation: the Good, the Bad, and the Ugly
(
Poster
)
>
link
We study level set teleportation, an optimization subroutine which seeks to accelerate gradient methods by maximizing the gradient along the levelcurve of parameters with the same objective value. Since the descent lemma implies that gradient descent decreases the objective proportional to the squared norm of the gradient, levelset teleportation maximizes the onestep progress guarantee. We prove that levelset teleportation neither improves nor worsens the convergence of gradient descent for strongly convex functions, while for convex functions teleportation can move iterates arbitrarily far from the global minimizers. To evaluate teleportation in practice, we develop a projectedgradienttype method requiring only Hessianvector products. We use this method to show that initializing gradient methods using level set teleportation slightly underperforms standard initializations for both convex and nonconvex optimization problems. As a result, we report a mixed picture: teleportation can be efficiently evaluated, but it appears to offer marginal gains. 
Aaron Mishkin · Alberto Bietti · Robert Gower 🔗 


An alternative approach to train neural networks using monotone variational inequality
(
Poster
)
>
link
We investigate an alternative approach to neural network training, which is a nonconvex optimization problem, through the lens of another convex problem — to solve a monotone variational inequality (MVI)  inspired by the work of [Juditsky and Nemirovsky, 2019]. MVI solutions can be found by computationally efficient procedures, with performance guarantee of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery and prediction accuracy under the theoretical setting of training a singlelayer linear neural network. We study the use of MVI for training multilayer neural networks by proposing a practical and completely general algorithm called \textit{stochastic variational inequality} (\texttt{SVI}). We demonstrate its applicability in training networks with various architectures (\texttt{SVI} is completely general for training any network). We show the competitive or better performance of \texttt{SVI} compared to the widelyused stochastic gradient descent method (SGD) on both synthetic and real data prediction tasks regarding various performance metrics, especially in the improved efficiency in the early stage of training.

Chen Xu · Xiuyuan Cheng · Yao Xie 🔗 


Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation
(
Poster
)
>
link
The model in constrained Markov decision processes (CMDPs) is often unknown and must be learned online while still ensuring the constraint is met, or at least the violation is bounded with time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions such as the knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds as well as empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our algorithm is based on a primaldual approach. We establish a sublinear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{\mathcal{S}^2 \mathcal{A} K} \right)$ upper bound on the Bayesian reward objective regret along with a \emph{bounded}, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint violation regret over $K$ episodes for an $\mathcal{S}$state, $\mathcal{A}$action, and horizon $H$ CMDP.

Krishna C Kalagarla · Rahul Jain · Pierluigi Nuzzo 🔗 


AverageConstrained Policy Optimization
(
Poster
)
>
link
Reinforcement Learning (RL) with constraints is becoming an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average criterionconstrained MDPs remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new policy optimization with function approximation algorithm for constrained MDPs with the average criterion. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging MuJoCo environments, show the superior performance of the algorithm when compared to other stateoftheart algorithms adapted for the average CMDP setting. 
Akhil Agnihotri · Rahul Jain · Haipeng Luo 🔗 


A novel analysis of gradient descent under directional smoothness
(
Poster
)
>
link
We develop new suboptimality bounds for gradient descent that depend on the conditioning of the objective along the path of optimization, rather than on global, worstcase constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upperbounds on the objective. Minimizing these upperbounds requires solving an implicit equation to obtain an adapted stepsize; we show that this equation is straightforward to solve for convex quadratics and leads to new guarantees for a classical stepsize sequence. For general functions, we prove that exponential search can be used to obtain a pathdependent convergence guarantee with only a loglog dependency on the global smoothness constant. Experiments on quadratic functions showcase the utility of our theory and connections to the edgeofstability phenomenon. 
Aaron Mishkin · Ahmed Khaled · Aaron Defazio · Robert Gower 🔗 


The Sharp Power Law of Local Search on Expanders
(
Poster
)
>
link
Local search is a powerful heuristic in optimization and computer science, the complexity of which has been studied in the white box and black box models. In the black box model, we are given a graph $G = (V,E)$ and oracle access to a function $f : V \to \mathbb{R}$. The local search problem is to find a vertex $v$ that is a local minimum, i.e. with $f(v) \leq f(u)$ for all $(u,v) \in E$, using as few queries to the oracle as possible. The query complexity is well understood on the grid and the hypercube, but much less is known beyond. We show that the query complexity of local search on $d$regular expanders with constant degree is $\Omega\left(\frac{\sqrt{n}}{\log{n}}\right)$, where $n$ is the number of vertices of the graph. This matches within a logarithmic factor the upper bound of $\mathcal{O}(\sqrt{n})$ for constant degree graphs from \cite{aldous1983minimization}, implying that steepest descent with a warm start is essentially an optimal algorithm for expanders.We obtain this result by considering a broader framework of graph features such as vertex congestion and separation number. We show that for each graph, the randomized query complexity of local search is $\Omega\left(\frac{n^{1.5}}{g}\right)$, where $g$ is the vertex congestion of the graph; and $\Omega\left(\sqrt[4]{\frac{s}{\Delta}}\right)$, where $s$ is the separation number and $\Delta$ is the maximum degree. For separation number the previous bound was $\Omega\left(\sqrt[8]{\frac{s}{\Delta}} /\log{n}\right)$, given by \cite{santha2004quantum} for {quantum} and randomized algorithms.To prove these results, we design a variant of the relational adversary method from \cite{Aaronson06}. Our variant is asymptotically at least as strong as the version in \cite{Aaronson06} for all randomized algorithms, as well as strictly stronger on some problems and easier to apply in our setting.

Nicholas Recker · Simina Branzei · Davin Choo 🔗 


Regret Bounds for Optimistic Follow The Leader: Applications in Portfolio Selection and Linear Regression
(
Poster
)
>
link
FollowTheLeader (FTL) is a simple online learning algorithm that is often overlooked. This paper investigates the FTL algorithm and two of its variant. The Optimistic FTL (OFTL) algorithm in the context of online learning with hints and the Follow The Approximate Leader (FTAL) with adaptive curvature. We provide a general regret inequality for OFTL that explicitly captures the effect of the hints and the curvature of the cost functions. This directly leads to a regret bound for FTAL. We generalize prior regret bounds of FTAL by incorporating adaptive curvature and movement of the iterates. We demonstrate the applicability of our results by deriving regret bounds for the online portfolio selection problem using FTAL with adaptive curvature. We further show the applicability of OFTL by obtaining a uniform regret bound for online linear regression. Our analysis contributes to a better understanding of FTL and its variants in various online learning scenarios. 
Sudeep Raja Putta · Shipra Agrawal 🔗 


BanditDriven Batch Selection for Robust Learning under Label Noise
(
Poster
)
>
link
We introduce a novel approach for batch selection in Stochastic Gradient Descent (SGD) training, leveraging combinatorial bandit algorithms. Our methodology focuses on optimizing the learning process in the presence of label noise, a prevalent issue in realworld datasets. Experimental evaluations on the CIFAR10 dataset reveal that our approach consistently outperforms existing methods across various levels of label corruption. Importantly, we achieve this superior performance without incurring the computational overhead commonly associated with auxiliary neural network models. This work presents a balanced tradeoff between computational efficiency and model efficacy, offering a scalable solution for complex machine learning applications. 
Michal Lisicki · Mihai Nica · Graham Taylor 🔗 


Practical Principled Policy Optimization for Finite MDPs
(
Oral
)
>
link
We consider (stochastic) softmax policy gradient (PG) methods for finite Markov Decision Processes (MDP). While the PG objective is not concave, recent research has used smoothness and gradient dominance to achieve convergence to an optimal policy. However, these results depend on having extensive knowledge of the environment, such as the optimal action or the true mean reward vector, to configure the algorithm parameters. This makes the resulting algorithms impractical in real applications. To alleviate this problem, we propose PG methods that employ an Armijo linesearch in the deterministic setting and an exponentially decreasing stepsize in the stochastic setting. We demonstrate that these proposed algorithms offer similar theoretical guarantees as previous works but now do not require the knowledge of oraclelike quantities. Furthermore, we apply the similar techniques to develop practical, theoretically sound entropyregularized methods for both deterministic and stochastic settings. Finally, we empirically compare the proposed methods with previous approaches in singlestate MDP environments. 
Michael Lu · Matin Aghaei · Anant Raj · Sharan Vaswani 🔗 


Adaptive Gradient Methods at the Edge of Stability
(
Poster
)
>
link
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the fullbatch and sufficiently large batch settings. Specifically, we empirically demonstrate that during fullbatch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value  the stability threshold of a gradient descent algorithm. For Adam with step size η and β1=0.9, this stability threshold is 38/η. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the ``Adaptive Edge of Stability'' (AEoS), their behavior in this regime differs in a significant way from that of nonadaptive methods at the EoS. Whereas nonadaptive algorithms at the EoS are blocked from entering highcurvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into highcurvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning. 
Jeremy M Cohen · Behrooz Ghorbani · Shankar Krishnan · Naman Agarwal · Sourabh Medapati · Michal Badura · Daniel Suo · Zachary Nado · George Dahl · Justin Gilmer 🔗 


NonUniform Sampling and Adaptive Optimizers in Deep Learning
(
Poster
)
>
link
Stochastic gradient descent samples uniformly the training set to build an unbiased gradient estimate with a limited number of samples. However, at a given step of the training process, some data are more helpful than others to continue learning. Importance sampling for training deep neural networks has been widely studied to propose sampling schemes yielding better performance than the uniform sampling scheme. After recalling the theory of importance sampling for deep learning, this paper focuses on the interplay between the sampling scheme and the optimizer used. We show that the sampling proportional to the persample gradient norms is not optimal for adaptive optimizers, although it is the case for stochastic gradient descent in its standard form. This implies that new sampling schemes have to be designed with respect to the optimizer used. Thus, using approximations of the persample gradient norms scheme with adaptive optimizers is likely to yield unsatisfying results. 
Thibault Lahire 🔗 


Largescale Nonconvex Stochastic Constrained Distributionally Robust Optimization
(
Poster
)
>
link
Distributionally robust optimization (DRO) is a powerful framework for training robust models against data distribution shifts. This paper focuses on constrained DRO, which has an explicit characterization of the robustness level. Existing studies on constrained DRO mostly focus on convex loss function, and exclude the practical and challenging case with nonconvex loss function, e.g., neural network. This paper develops a stochastic algorithm and its performance analysis for nonconvex constrained DRO. The computational complexity of our stochastic algorithm at each iteration is independent of the overall dataset size, and thus is suitable for largescale applications. We focus on the general CressieRead family divergence defined uncertainty set which includes $\chi^2$divergences as a special case. We prove that our algorithm finds an $\epsilon$stationary point with an improved computational complexity than existing methods. Our method also applies to the smoothed conditional value at risk (CVaR) DRO.

Qi Zhang · Shaofeng Zou · Yi Zhou · Lixin Shen · Ashley PraterBennette 🔗 


InformationTheoretic Trust Regions for Stochastic GradientBased Optimization
(
Poster
)
>
link
Stochastic gradientbased optimization is crucial to optimize neural networks. While popular approaches heuristically adapt the step size and direction by rescaling gradients, a more principled approach to improve optimizers requires secondorder information. Such methods precondition the gradient using the objective’s Hessian. Yet, computing the Hessian is usually expensive and effectively using secondorder information in the stochastic gradient setting is nontrivial. We propose using InformationTheoretic Trust Region Optimization (arTuRO) for improved updates with uncertain secondorder information. By modeling the network parameters as a Gaussian distribution and using a KullbackLeibler divergencebased trust region, our approach takes bounded steps accounting for the objective’s curvature and uncertainty in the parameters. Before each update, it solves the trust region problem for an optimal step size, resulting in a more stable and faster optimization process. We approximate the diagonal elements of the Hessian from stochastic gradients using a simple recursive least squares approach, constructing a model of the expected Hessian over time using only firstorder information. We show that arTuRO combines the fast convergence of adaptive momentbased optimization with the generalization capabilities of SGD. 
Philipp Dahlinger · Philipp Becker · Maximilian Hüttenrauch · Gerhard Neumann 🔗 


Decentralized Learning Dynamics in the Gossip Model
(
Poster
)
>
link
We study a distributed multiarmed bandit setting among a population of $n$ memoryconstrained nodes in the gossip model: at each round, every node locally adopts one of $m$ arms, observes a reward drawn from the arm’s (adversarially chosen) distribution, and then communicates with a randomly sampled neighbor, exchanging information to determine its policy in the next round. We introduce and analyze several families of dynamics for this task that are *decentralized*: each node's decision is entirely local and depends only on its most recently obtained reward and that of the neighbor it sampled. We show a connection between the global evolution of these decentralized dynamics with a certain class of *“zerosum” multiplicative weight update* algorithms, and we develop a general framework for analyzing the populationlevel regret of these natural protocols. Using this framework, we derive sublinear regret bounds under a wide range of parameter regimes (i.e., the size of $m$ and $n$) in an adversarial reward setting (where the mean of each arm's distribution can vary over time), when the number of rounds $T$ is at most logarithmic in $n$. Further, we show that these protocols can approximately optimize convex functions over the simplex when the reward distributions are generated from a stochastic gradient oracle.

John Lazarsfeld · Dan Alistarh 🔗 


Almost multisecant BFGS quasiNewton method
(
Poster
)
>
link
QuasiNewton (QN) methods provide an alternative to secondorder techniques for solving minimization problems by approximating curvature. This approach reduces computational complexity as it relies solely on firstorder information, and satisfying the secant condition. This paper focuses on multisecant (MS) extensions of QN, which enhances the Hessian approximation at low cost. Specifically, we use a lowrank perturbation strategy to construct an almostsecant QN method that maintains positive definiteness of the Hessian estimate, which in turn helps ensure constant descent (and reduces method divergence). Our results show that careful tuning of the updates greatly improve stability and effectiveness of multisecant updates. 
Mokhwa Lee · Yifan Sun 🔗 


From 6235149080811616882909238708 to 29: Vanilla Thompson Sampling Revisited
(
Poster
)
>
link
In this work, we derive a new problemdependent regret bound for Thompson Sampling with Gaussian priors (Algorithm 2 in [Agrawal and Goyal, 2017]), a classical stochastic bandit algorithm that has demonstrated excellent empirical performance and has been widely deployed in realworld applications. The existing regret bound is $\sum\limits_{i \in \mathcal{A}: \Delta_i >0}\frac{288 \left(e^{64}+6 \right) \ln \left(T\Delta_i^2 + e^{32} \right)}{\Delta_i} + \frac{10.5}{\Delta_i} + \Delta_i$, where $\mathcal{A}$ is the arm set, $\Delta_i$ denotes the single round performance loss when pulling a suboptimal arm $i$ instead of the optimal arm, and $T$ is the learning time horizon. Since realworld learning tasks care about the performance with a finite learning time horizon $T$, the existing regret bound is only nonvacuous when $T > 288 \cdot e^{64}$, which may not be practical. Our new regret bound is $ \sum\limits_{i \in \mathcal{A}: \Delta_i >0} \frac{1252 \ln \left(T \Delta_i^2 + 100^{\frac{1}{3}}\right)}{\Delta_i} +\frac{18 \ln \left(T\Delta_i^2 \right)}{\Delta_i} + \frac{182.5}{\Delta_i}+ \Delta_i$, which tightens the leading term's coefficient significantly. Despite having made some improvements, we would like to emphasize that the goal of this work is to deepen the understanding of Thompson Sampling from a theoretical perspective to unlock the full potential of this classical learning algorithm for solving challenging realworld learning problems.

Bingshan Hu · Tianyue Zhang 🔗 


Utilitybased Perturbed Gradient Descent: An Optimizer for Continual Learning
(
Poster
)
>
link
Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utilitybased Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We adopt the challenging setup of streaming learning as the testing ground and design continual learning problems with hundreds of nonstationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses all methods in all problems, being demonstrably capable of addressing both issues. 
Mohamed Elsayed · Rupam Mahmood 🔗 


Revisiting Random Weight Perturbation for Efficiently Improving Generalization
(
Poster
)
>
link
Improving the generalization ability of modern deep neural networks (DNNs) is a fundamental problem in machine learning. Two branches of methods have been proposed to seek flat minima and improve generalization: one led by sharpnessaware minimization (SAM) minimizes the worstcase neighborhood loss through adversarial weight perturbation (AWP), and the other minimizes the expected Bayes objective with random weight perturbation (RWP). Although RWP has advantages in training time and is closely linked to AWP on a mathematical basis, its empirical performance always lags behind that of AWP. In this paper, we revisit RWP and analyze its convergence properties. We find that RWP requires a much larger perturbation magnitude than AWP, which leads to convergence issues. To resolve this, we propose mRWP that incorporates the original loss objective to aid convergence, significantly lifting the performance of RWP. Compared with SAM, mRWP is more efficient since it enables parallel computing of the two gradient steps and faster convergence, with comparable or even better performance. We will release the code for reproducibility. 
Tao Li · Weihao weihao · Qinghua Tao · Zehao Lei · Yingwen Wu · Kun Fang · Mingzhen He · Xiaolin Huang 🔗 


MSL: An Adaptive Momentembased Stochastic Linesearch Framework
(
Poster
)
>
link
Various adaptive step sizes have been proposed recently to reduce the amount of tedious manualtuning. A popular example is backtracking linesearch based on a stochastic Armijo condition.But the success of this strategy relies crucially on the search direction being a descent direction.Importantly, this condition is violated by both SGD with momentum (SGDM) and Adam, which arecommon choices in deepnet training. Adaptively choosing the step size in this setting is thus nontrivial and less explored despite its practical relevance. In this work, we propose two frameworks,namely, momentum correction and restart, that allow the use of stochastic linesearch in conjunctionwith a generalized Armijo condition, and apply them to both SGDM and Adam. We empiricallyverify that the proposed algorithms are robust to the choice of the momentum parameter and otherhyperparameters. 
Chen Fan · Sharan Vaswani · Christos Thrampoulidis · Mark Schmidt 🔗 


Noise Stability Optimization for Flat Minima with Tight Rates
(
Poster
)
>
link
Generalization properties are a central aspect of the design and analysis of learning algorithms. One notion that has been considered in many previous works as leading to good generalization is flat minima, which informally describes a loss surface that is insensitive to noise perturbations. However, the design of efficient algorithms (that are easy to analyze) to find them is relatively underexplored. In this paper, we propose a new algorithm to address this issue, which minimizes a stochastic optimization objective that averages noise perturbations injected into the weights of a function. This algorithm is shown to enjoy both theoretical and empirical advantages compared to existing algorithms involving worstcase perturbations. Theoretically, we show tight convergence rates of our algorithm to find firstorder stationary points of the stochastic objective. Empirically, the algorithm induces a penalty on the trace of the Hessian, leading to iterates that are flatter than SGD and other alternatives, with tighter generalization gaps. Altogether, this work contributes a provable and practical algorithm to find flat minima by optimizing the noise stability properties of a function. 
Haotian Ju · Dongyue Li · Hongyang Zhang 🔗 


Dueling Optimization with a Monotone Adversary
(
Oral
)
>
link
We introduce and study the problem of \textit{dueling optimization with a monotone adversary}, which is a generalization of (noiseless) dueling convex optimization. The goal is to design an online algorithm to find a minimizer $\mathbf{x}^{\star}$ for a function $f\colon \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^d$. In each round, the algorithm submits a pair of guesses, i.e., $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, and the adversary responds with \textit{any} point in the space that is at least as good as both guesses. The cost of each query is the suboptimality of the worse of the two guesses; i.e., ${\max} \left( f(\mathbf{x}^{(1)}), f(\mathbf{x}^{(2)}) \right)  f(\mathbf{x}^{\star})$. The goal is to minimize the number of iterations required to find an $\varepsilon$optimal point and to minimize the total cost (regret) of the guesses over many rounds. Our main result is an efficient randomized algorithm for several natural choices of the function $f$ and set $\mathcal{X}$ that incurs cost $O(d)$ and iteration complexity $O(d\log(1/\varepsilon)^2)$. Moreover, our dependence on $d$ is asymptotically optimal, as we show examples in which any randomized algorithm for this problem must incur $\Omega(d)$ cost and iteration complexity.

Avrim Blum · Meghal Gupta · Gene Li · Naren Manoj · Aadirupa Saha · Yuanyuan Yang 🔗 


Noiseadaptive (Accelerated) Stochastic HeavyBall Momentum
(
Poster
)
>
link
We analyze the convergence of stochastic heavy ball (SHB) momentum in the smooth, stronglyconvex setting. Kidambi et al. showed that SHB (with small minibatches) cannot attain an accelerated rate of convergence even for quadratics, and conjecture that the practical gains of SHB are a byproduct of minibatching. We substantiate this claim by showing that SHB can obtain an accelerated rate when the minibatch size is larger than some threshold. In particular, for stronglyconvex quadratics with condition number $\kappa$, we prove that SHB with the standard stepsize and momentum parameters results in an $O\left(\exp(\frac{T}{\sqrt{\kappa}}) + \sigma\right)$ convergence rate, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. To ensure convergence to the minimizer, we propose a multistage approach that results in a noiseadaptive $O\left(\exp\left(\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{T}\right)$ rate. For general stronglyconvex functions, we use the averaging interpretation of SHB along with exponential stepsizes to prove an $O\left(\exp\left(\frac{T}{\kappa}\right) + \frac{\sigma^2}{T}\right)$ convergence to the minimizer. Finally, we empirically demonstrate the effectiveness of the proposed algorithms.

Anh Dang · Reza Babanezhad Harikandeh · Sharan Vaswani 🔗 


Unnormalized Density Estimation with Root Sobolev Norm Regularization
(
Poster
)
>
link
Density estimation is one of the most central problems in statistical learning.In this paper we introduce a new approach to nonparametric density estimation that is statistically consistent, is provably different from Kernel Density Estimation, makes the inductive bias of the model clear and interpretable, and performs well on a variety of relatively high dimensional problems.One of the key points of interest in terms of optimization is our use of natural gradients (in Hilbert Spaces). The optimization problem we solve is nonconvex, and standard gradient methods do not perform well. However, we show that the problem is convex on a certain positive cone, and natural gradient steps preserve this cone. The standard gradient steps, on the other hand, tend to lose positivity. This is one of the few cases in the literature where the reasons for the practical preference for the natural gradient are clear.In more detail, our approach is based on regularizing a version of a Sobolev norm of the density, and there are several core components that enable the method. First, while there is no closed analytic form for the associated kernel, we show that one can approximate it using sampling. Second, appropriate initialization and natural gradients are used as discussed above. Finally, while the approach produces unnormalized densities, which prevents the use of crossvalidation, we show that one can instead adopt the Fisher Divergencebased Score Matching methods for this task. We evaluate the resulting method on a comprehensive recent tabular anomaly detection benchmark suite which contains more than 15 healthcare and biologyoriented data sets (ADBench), and find that it ranks second best, among more than 15 algorithms. 
Mark Kozdoba · Binyamin Perets · Shie Mannor 🔗 


Accelerating Inexact HyperGradient Descent for Bilevel Optimization
(
Poster
)
>
link
We present a method for solving general nonconvexstronglyconvex bilevel optimization problems. Our methodthe Restarted Accelerated HyperGradient Descent (RAHGD) methodfinds an $\epsilon$firstorder stationary point of the objective with $\tilde{\mathcal{O}}(\kappa^{3.25}\epsilon^{1.75})$ oracle complexity, where $\kappa$ is the condition number of the lowerlevel objective and $\epsilon$ is the desired accuracy. We also propose a perturbed variant of RAHGD for finding an $\big(\epsilon,\mathcal{O}(\kappa^{2.5}\sqrt{\epsilon}\ )\big)$secondorder stationary point within the same order of oracle complexity. Our results achieve the bestknown theoretical guarantees for finding stationary points in bilevel optimization and also improve upon the existing upper complexity bound for finding secondorder stationary points in nonconvexstronglyconcave minimax optimization problems, setting a new stateoftheart benchmark. Empirical studies are conducted to validate the theoretical results in this paper.

Yang Haikuo · Luo Luo · Chris Junchi Li · Michael Jordan · Maryam Fazel 🔗 


High Dimensional Unbiased Estimation for Sequential Decision Making
(
Oral
)
>
link
We study the problem of making predictions of an adversarially chosen high dimensional state that are \emph{unbiased} subject to an arbitrary collection of conditioning events, with the goal of tailoring these events to downstream decision makers. We give efficient algorithms for solving this problem, as well as a number of applications that stem from choosing the set of conditioning events appropriately. For example, we can efficiently produce predictions targeted at any polynomial number of decision makers, such that if they best respond to our predictions, each of them has diminishing swap regret at the optimal rate. We then generalize this to the online combinatorial optimization problem, where the decision makers have large action spaces corresponding to structured subsets of a set of base actions: We give the first algorithms that can guarantee (to any polynomial number of decision makers) regret to the best fixed action, not just overall, but on any polynomial number of subsequences that can depend on the actions chosen as well as any external context. We show how playing in an extensive form game can be cast into this framework, and use these results to give efficient algorithms for obtaining subsequence regret in extensive form games, which gives a new family of efficiently obtainable regret guarantees that captures and generalizes previously studied notions like regret to informed causal deviations, and is generally incomparable to other known families of efficiently obtainable guarantees.We then turn to uncertainty quantification in machine learning, and consider the problem of producing \emph{prediction sets} for multiclass and multilabel classification problems. We show how to produce class scores that have \emph{transparent coverage guarantees}  they can be used to produce prediction sets that cover the true labels at the rate that they would have had the scores been true conditional probabilities. Moreover, we show how to do this such that the scores have improved Brier score (or crossentropy loss) compared to any collection of benchmark models. Compared to conformal prediction techniques, this both gives increased flexibility and eliminates the need to choose a nonconformity score. 
Georgy Noarov · Ramya Ramalingam · Aaron Roth · Stephan Xie 🔗 


Efficient Learning in Polyhedral Games via Best Response Oracles
(
Poster
)
>
link
We study online learning and equilibrium computation in games with polyhedral decision sets with only firstorder oracle and bestresponse oracle access. Our approach achieves constant regret in zerosum games and $O(T^{1/4})$ in generalsum games, while using only $O(\log t)$ bestresponse queries at a given iteration $t$. This convergence occurs at a linear rate, though with a conditionnumber dependence. Our algorithm also achieves bestiterate convergence at a rate of $O(1/\sqrt{T})$ without such a dependence. Our algorithm uses a linearlyconvergent variant of FrankWolfe (FW) whose linear convergence depends on a condition number of the polytope known as the facial distance. We show two broad new results, characterizing the condition number when the polyhedral sets satisfy a certain structure.

Darshan Chakrabarti · Gabriele Farina · Christian Kroer 🔗 


On the Convergence of Local SGD Under ThirdOrder Smoothness and Hessian Similarity
(
Poster
)
>
link
Local SGD (i.e., Federated Averaging without client sampling) is widely used for solving federated optimization problems in the presence of heterogeneous data.However, there is a gap between the existing convergence rates for Local SGD and its observed performance on realworld problems. It seems that current rates do not correctly capture the effectiveness Local SGD. We first show that the existing rates for Local SGD in heterogeneous setting cannot recover the correct rate when the global function is a quadratic. Then we first derive a new rate for the case that the global function is a general strongly convex function depending on thirdorder smoothness and Hessian similarity. These additional parameters allow us to capture the problem in a more refined way and to overcome some of the limitations of the previous worstcase results derived under the standard assumptions. Then we show a rate for Local SGD when all clients are nonconvex quadratic functions with identical Hessians. 
Ali Zindari · Ruichen Luo · Sebastian Stich 🔗 


Adam through a SecondOrder Lens
(
Poster
)
>
link
Research into optimisation for deep learning is characterised by a tension between the computational efficiency of firstorder, gradientbased methods (such as SGD and Adam) and the theoretical efficiency of secondorder, curvaturebased methods (such as quasiNewton methods and KFAC). We seek to combine the benefits of both approaches into a single computationallyefficient algorithm. Noting that secondorder methods often depend on stabilising heuristics (such as LevenbergMarquardt damping), we propose AdamQLR: an optimiser combining damping and learning rate selection techniques from KFAC with the update directions proposed by Adam, inspired by considering Adam through a secondorder lens. We evaluate AdamQLR on a range of regression and classification tasks at various scales, achieving competitive generalisation performance vs runtime. 
Ross Clarke · Baiyu Su · José Miguel HernándezLobato 🔗 


How OverParameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization
(
Poster
)
>
link
This paper rigorously shows how overparameterization dramatically changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown lowrank groundtruth matrix from nearisotropic linear measurements.First, we consider the symmetric setting with the symmetric parameterization where $M^* \in \mathbb{R}^{n \times n}$ is a positive semidefinite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix.We give a novel $\Omega\left(1/T^2\right)$ lower bound of randomly initialized GD for the overparameterized case ($k >r$) where $T$ is the number of iterations.This is in stark contrast to the exactparameterization scenario ($k=r$) where the convergence rate is $\exp\left(\Omega\left(T\right)\right)$.Next, we study asymmetric setting where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1,n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$ where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$.We give the first global exact convergence result of randomly initialized GD for the exactparameterization case ($k=r$) with an $\exp\left(\Omega\left(T\right)\right)$ rate.Furthermore, we give the first global exact convergence result for the overparameterization case ($k>r$) with an $\exp\left(\Omega\left(\alpha^2 T\right)\right)$ rate where $\alpha$ is the initialization scale.This linear convergence result in the overparameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega\left(1/T^2\right)$ to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising is our analysis that highlights the importance of imbalance between $F$ and $G$. This is in sharp contrast to prior works which emphasize balance. We further give an example showing the dependency on $\alpha$ in the convergence rate is unavoidable in the worst case. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate in the exactparameterization case. We provide empirical studies to verify our theoretical findings.

Nuoya Xiong · Lijun Ding · Simon Du 🔗 


Exploring Modern Evolution Strategies in Portfolio Optimization
(
Poster
)
>
link
Blackbox optimization (BBO) techniques are often the core engine used in combinatorial optimization problems which include multiasset class portfolio construction. The computational complexity of such evolutionary algorithms, however, is excessively high to the point that finding optimal portfolios in large search spaces becomes intractable. Additionally, the learning dynamics of BBO methods are typically heuristic as they do not use gradient evolutions. To alleviate these challenges, in this paper, we set out to leverage advances in metalearningbased evolution strategy (ES), Adaptive ESActive Subspaces, and fastmoving natural ES to improve highdimensional and large search space portfolio optimization. Using such modern ES algorithms in a series of riskaware passive and active asset allocation problems, we obtain one to three orders of magnitude efficiency in finding optimal portfolios compared to vanilla BBO methods. Moreover, as we increase the number of asset classes, our modern suite of BBOs finds better local optima resulting in better financial advice quality. 
Ramin Hasani · Abbas Ehsanfar · Greg Banis · Rusty Bealer · Amir Ahmadi 🔗 


Greedy Newton: Newton's Method with Exact Line Search
(
Poster
)
>
link
A defining characteristic of Newton's method is local superlinear convergence. In successively smaller neighborhoods around a strict minimizer, the objective becomes increasingly quadratic, the Newton direction gets closer to optimal, and the method accelerates to the solution. Outside this neighborhood, however, Newton's method converges slowly or even diverges. The issue of nonconvergence is usually dealt with by (a) modifications to the Hessian for positive definiteness, and (b) using damped Newton steps. But these approaches change the nature of Newton's method. The former obscures secondorder information and changes the update direction arbitrarily. The latter, normally implemented as an Armijolike line search with an initial stepsize of one, is inexact and could unnecessarily restrict stepsizes to lie between zero and one. In this paper, we analyze Newton's method under an exact line search, which we call "Greedy Newton" (GN). Many problem structures allow for efficient and precise line searches but, as far as we know, a superlinear convergence proof does not exist for this setting. We show that GN not only retains the local superlinear convergence rate, but could also improve the global rate. We experimentally show that GN may work better than damped Newton by allowing stepsizes to deviate significantly from one. Our experiments also suggest that GN combined with Hessian modifications is a powerful optimization method that works in both convex and nonconvex settings. 
Betty Shea · Mark Schmidt 🔗 


A proximal augmented Lagrangian based algorithm for federated learning with constraints
(
Poster
)
>
link
This paper considers federated learning (FL) with constraints where the central server and all local clients collectively minimize a sum of local objective functions subject to inequality constraints. To train the model without moving local data at clients to the central server, we propose an FL framework that each local client performs multiple updates using the local objective and local constraints, while the central server handles the global constraints and performs aggregation based on the updated local models. In particular, we develop a proximal augmented Lagrangian (AL) based algorithm, where the subproblems are solved by an inexact alternating direction method of multipliers (ADMM) in a federated fashion. Under mild assumptions, we establish the worstcase complexity bounds of the proposed algorithm. Our numerical experiments demonstrate the practical advantages of our algorithm in solving linearly constrained quadratic programming and performing NeymanPearson classification in the context of FL. 
Chuan He · Le Peng · Ju Sun 🔗 


Structured InverseFree Natural Gradient: MemoryEfficient & NumericallyStable KFAC for Large Neural Nets
(
Poster
)
>
link
Secondorder methods for deep learning—such as KFAC—can be useful for neural network training.However, they are often memoryinefficient and numerically unstable for lowprecision training since their preconditioning Kronecker factors are dense, and require highprecision matrix inversion or decomposition. Thus, such methods are not widely used for training large neural networks such as transformerbased models. We address these two issues by (i) formulating an inversefree update of KFAC and (ii) imposing structures in each of the Kronecker factors, resulting in a method we term structured inversefree natural gradient descent (SINGD). On large modern neural networks, we show that, in contrast to KFAC, SINGD is memory efficient and numerically robust. 
Wu Lin · Felix Dangel · Runa Eschenhagen · Kirill Neklyudov · Agustinus Kristiadi · Richard Turner · Alireza Makhzani 🔗 


Statistical Inference of Adaptive Inexact Stochastic Newton Method
(
Poster
)
>
link
We aim to study the practical statistical inference of the online secondorder Newton method for general unconstrained stochastic optimization problems under the fixed dimension setting. We consider the adaptive inexact stochastic Newton method, which is reduced from an existing stochastic sequential programming (StoSQP) method to the unconstrained setting. Based on the asymptotic normality of the last iteration, we propose a weighted sample covariance matrix, which is a consistent covariance matrix estimator. With this estimator, we are able to conduct statistical inference on the solution of the stochastic optimization problem in practice. The update of the estimator is entirely online and efficient in computation and memory. We demonstrate the empirical performance through numerical experiments on linear regression models. 
Wei Kuang · Sen Na · Mihai Anitescu 🔗 


$f$FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization
(
Poster
)
>
link
Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term ``stochastic'' refers to the ability of the algorithm to work with small minibatches of data. Motivated by the limitation of existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on $f$divergence measures ($f$FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of fairnessaccuracy tradeoffs offered by $f$FERM for almost all batch sizes (ranging from fullbatch to batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from training to the test data. Our extension is based on a distributionally robust optimization reformulation of $f$FERM objective under $\ell_p$ norms as uncertainty sets. Again in this distributionally robust setting, $f$FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in the tasks involving distribution shifts. An efficient stochastic implementation of $f$FERM is publicly available.

Sina Baharlouei · Shivam Patel · Meisam Razaviyayn 🔗 


Oracle Efficient Algorithms for Groupwise Regret
(
Poster
)
>
link
We study the problem of online prediction, in which at each time step $t \in {1,2, \cdots T}$, an individual $x_t$ arrives, whose label we must predict. Each individual is associated with various groups, defined based on their features such as age, sex, race etc., which may intersect. Our goal is to make predictions that have regret guarantees not just overall but also simultaneously on each subsequence comprised of the members of any single group. Previous work such as [Blum & Lykouris][1] and [Lee et al][2] provide attractive regret guarantees for these problems; however, these are computationally intractable on large model classes (e.g., the set of all linear models, as used in linear regression). We show that a simple modification of the sleeping experts technique of [Blum & Lykouris][1] yields an efficient reduction to the wellunderstood problem of obtaining diminishing external regret absent group considerations. Our approach gives similar regret guarantees compared to [Blum & Lykouris][1]; however, we run in time linear in the number of groups, and are oracleefficient in the hypothesis class. This in particular implies that our algorithm is efficient whenever the number of groups is polynomially bounded and the externalregret problem can be solved efficiently, an improvement on [Blum & Lykouris][1]'s stronger condition that the model class must be small. Our approach can handle online linear regression and online combinatorial optimization problems like online shortest paths. Beyond providing theoretical regret bounds, we evaluate this algorithm with an extensive set of experiments on synthetic data and on two real data sets  Medical costs and the Adult income dataset, both instantiated with intersecting groups defined in terms of race, sex, and other demographic characteristics. We find that uniformly across groups, our algorithm gives substantial error improvements compared to running a standard online linear regression algorithm with no groupwise regret guarantees.

Krishna Acharya · Eshwar Ram Arunachaleswaran · Juba Ziani · Aaron Roth · Sampath Kannan 🔗 


(Un)certainty selection methods for Active Learning on Label Distributions
(
Poster
)
>
link
Some supervised learning problems can require predicting a probability distribution over possible answers than one (set of) answer(s). In such cases, a major scaling issue is the amount of labels needed, since compared to their single or multilabel counterparts, distributional labels are typically (1) harder to learn and (2) more expensive to obtain for training and testing. In this paper, we explore the use of active learning to alleviate this bottleneck. We progressively train a label distribution learning model by selectively labeling data and, achieving the minimum error rate with fifty percent fewer data items than nonactive learning strategies. Our experiments show that certaintybased query strategies outperform uncertaintybased ones on the label distribution learning problems we study. 
James Spann · Christopher Homan 🔗 


SGD batch saturation for training wide neural networks
(
Poster
)
>
link
The performance of the minibatch stochastic gradient method strongly depends on the batchsize that is used. In the classical convex setting with interpolation, prior work showed that increasing the batch size linearly increases the convergence speed, but only up to a point; when the batch size is larger than a certain threshold (the critical batchsize), further increasing the batch size only leads to negligible improvement. The goal of this work is to investigate the relationship between the batchsize and convergence speed for a broader class of nonconvex problems. Building on recent improved convergence guarantees for SGD, we prove that a similar linear scaling and batchsize saturation phenomenon occurs for training sufficiently wide neural networks. We conduct a number of numerical experiments on benchmark datasets, which corroborate our findings. 
Chaoyue Liu · Dmitriy Drusvyatskiy · Misha Belkin · Damek Davis · Yian Ma 🔗 


Stochastic VarianceReduced Newton: Accelerating FiniteSum Minimization with Large Batches
(
Poster
)
>
link
Stochastic variance reduction has proven effective at accelerating firstorder algorithms for solving convex finitesum optimization tasks such as empirical risk minimization. Yet, the benefits of variance reduction for firstorder methods tend to vanish in the largebatch setting, i.e., when stochastic gradients are computed from very large minibatches to leverage parallelization of modern computing architectures.On the other hand, incorporating secondorder information via Newtontype methods has proven successful in improving the performance of largebatch algorithms. In this work, we show that, in the presence of secondorder information, variance reduction in the gradient can provide significant convergence acceleration even when using extremely largebatch gradient estimates. To demonstrate this, we study a finitesum minimization algorithm we call Stochastic VarianceReduced Newton (SVRN). We show that SVRN provably accelerates existing stochastic Newtontype methods (such as Subsampled Newton), while retaining their parallelizable largebatch operations: The number of passes over the data is reduced from $O(\alpha\log(1/\epsilon))$ to $O\big(\frac{\log(1/\epsilon)}{\log(n)}\big)$, i.e., by a factor of $O(\alpha\log(n))$, where $n$ is the number of sum components and $\alpha$ is the approximation factor in the Hessian estimate. Surprisingly, this acceleration gets more significant the larger thedata size $n$, and can be achieved with a unit step size.

Michal Derezinski 🔗 


Enhancing the Misreport Network for Optimal Auction Design
(
Poster
)
>
link
Optimal auction mechanism design has long been a focus in computer science and economics. While substantial progress has been made in singleitem auctions, optimal design for multiitem auctions has yet to be derived. Recent years have seen a surge in deriving nearoptimal auctions through deep learning. As one of the approaches, ALGNet models the bidding process as a twoplayer game. The ALGNet model however adopted a rather simple design for generating optimal misreports to derive the regret of the trained auction mechanisms. We show that this design can be improved both in network structure and the testing method. Specifically, we train a misreport network tailored for each individual bidder which leads to better misreports. This approach is especially effective when the auctions are asymmetric. By studying misreport, we can get a more accurate estimate of the regret in the auction mechanism thus enhancing its robustness. Experimental results demonstrate that our approach can detect misreport more effectively than previous methods resulting in an increase in regret values as large as 70%. The new misreport network can also be applied to train auction mechanisms, allowing for a better description of the auction process. 
Haiying Wu · shuyuan you · Zhiqiang Zhuang · Kewen Wang · Zhe Wang 🔗 


Towards a Better Theoretical Understanding of Independent Subnetwork Training
(
Poster
)
>
link
Modern advancements in largescale machine learning would be impossible without the paradigm of dataparallel distributed computing. Since distributed computing with largescale models imparts excessive pressure on communication channels, significant recent research has been directed toward codesigning communication compression strategies and training algorithms with the goal of reducing communication costs. While pure data parallelism allows better data scaling, it suffers from poor model scaling properties. Indeed, compute nodes are severely limited by memory constraints, preventing further increases in model size. For this reason, the latest achievements in training giant neural network models also rely on some form of model parallelism. In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), which is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model. 
Egor Shulgin · Peter Richtarik 🔗 


Adaptive QuasiNewton and Anderson Acceleration Framework with Explicit Global (Accelerated) Convergence Rates
(
Poster
)
>
link
Despite the impressive numerical performance of quasiNewton and Anderson/nonlinearacceleration methods, their global convergence rates have remained elusive for over 50 years. This paper addresses this longstanding question by introducing a framework that derives novel and adaptive quasiNewton or nonlinear/Anderson acceleration schemes. Under mild assumptions, the proposed iterative methods exhibit explicit, nonasymptotic convergence rates that blend those of gradient descent and Cubic Regularized Newton's method. The proposed approach also includes an accelerated version for convex functions. Notably, these rates are achieved adaptively, without prior knowledge of the function's smoothness parameter. The framework presented in this paper is generic, and algorithms such as Newton's method with random subspaces, finite difference, or lazy Hessian can be seen as special cases of this paper's algorithm. Numerical experiments demonstrate the efficiency of the proposed framework, even compared to the LBFGS algorithm with Wolfe line search. 
Damien Scieur 🔗 


Sion's Minimax Theorem in Geodesic Metric Spaces and a Riemannian Extragradient Algorithm
(
Poster
)
>
link
Deciding whether saddle points exist or are approximable for nonconvexnonconcave problems is usually intractable. We take a step toward understanding a broad class of nonconvexnonconcave minimax problems that do remain tractable. Specifically, we study minimax problems in geodesic metric spaces. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and broadly accessible as it relies on the finite intersection property alone. The second main result is a specialization to geodesically complete Riemannian manifolds, for which we analyze firstorder methods for smooth minimax problems. 
Peiyuan Zhang · Jingzhao Zhang · Suvrit Sra 🔗 


Cup Curriculum: Curriculum Learning on Model Capacity
(
Poster
)
>
link
Curriculum learning (CL) aims to increase the performance of a learner on a given task by applying a specialized learning strategy.This strategy focuses on either the dataset, the task, or the model.There is little to no work analysing the possibilities to apply CL on the model capacity in natural language processing.To close this gap, we propose the cup curriculum.In a first phase of training we use a variation of iterative magnitude pruning to reduce model capacity.These weights are reintroduced in a second phase, resulting in the model capacity to show a cupshaped curve over the training iterations.We empirically evaluate different strategies of the cup curriculum and show that it outperforms early stopping reliably while exhibiting a high resilience to overfitting. 
Luca Scharr · Vanessa Toborek 🔗 


An Algorithm with Optimal DimensionDependence for ZeroOrder Nonsmooth Nonconvex Stochastic Optimization
(
Oral
)
>
link
We study the complexity of producing $(\delta,\epsilon)$stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations.Recent works proposed several stochastic zeroorder algorithms that solve this task, all of which suffer from a dimensiondependence of $\Omega(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal. We refute this conjecture by providing a faster algorithm that has complexity $O(d\delta^{1}\epsilon^{3})$, which is optimal (up to numerical constants) with respect to $d$ and also optimal with respect to the accuracy parameters $\delta,\epsilon$, thus solving an open question due to Lin et al. (NeurIPS'22). Moreover, the convergence rate achieved by our algorithm is also optimal for smooth objectives, proving that in the nonconvex stochastic zeroorder setting, *nonsmooth optimization is as easy as smooth optimization*. We provide algorithms that achieve the aforementioned convergence rate in expectation as well as with high probability.Our analysis is based on a simple yet powerful geometric lemma regarding the Goldsteinsubdifferential set, which allows utilizing recent advancements in firstorder nonsmooth nonconvex optimization.

Guy Kornowski · Ohad Shamir 🔗 


Fair Minimum Representation Clustering
(
Poster
)
>
link
Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to realworld constructs (e.g., electoral districts) whose benefit can only be attained by groups when they reach a minimum level of representation (e.g., 50\% to elect their desired candidate). This paper considers the problem of performing kmeans clustering while ensuring groups (e.g., demographic groups) have that minimum level of representation in a specified number of clusters. We formulate the problem through a mixedinteger optimization framework and present an alternating minimization algorithm, called MiniReL, that directly incorporates the fairness constraints. While incorporating the fairness criteria leads to an NPHard assignment problem within the algorithm, we provide computational approaches that make the algorithm practical even for large datasets. Numerical results show that the approach is able to create fairer clusters with practically no increase in the clustering cost across standard benchmark datasets. 
Connor Lawless · Oktay Gunluk 🔗 


Fair Representation in Submodular Subset Selection: A Pareto Optimization Approach
(
Poster
)
>
link
In this paper, we study a novel multiobjective combinatorial optimization problem called Submodular Maximization with Fair Representation (SMFR), which selects subsets of bounded costs from a ground set such that a submodular (utility) function $f$ is maximized while a set of $d$ submodular (representativeness) functions $g_1, \ldots, g_d$ are also maximized.SMFR can find applications in machine learning problems where utility and representativeness objectives should be considered simultaneously, such as social advertising, recommendation, and feature selection.We show that the maximization of $f$ and $g_1, \ldots, g_d$ might conflict with each other, so that no single solution can approximate all of them at the same time.Therefore, we propose a Pareto optimization approach to SMFR, which finds a set of solutions to approximate all Pareto optimal solutions with different tradeoffs between these objectives.Specifically, it converts an instance of SMFR into several submodular cover instances by adjusting the weights of objective functions and provides approximate solutions by running the greedy algorithm on each submodular cover instance.Finally, we demonstrate the effectiveness of SMFR and our proposed approach in a realworld problem.

Adriano Fazzone · Yanhao Wang · Francesco Bonchi 🔗 


New Horizons in Parameter Regularization: A Constraint Approach
(
Poster
)
>
link
This work presents constrained parameter regularization (CPR), an alternative to traditional weight decay. Instead of applying a constant penalty uniformly to all parameters, we enforce an upper bound on a statistical measure (e.g., the L2norm) of individual parameter groups. This reformulates learning as a constrained optimization problem. To solve this, we utilize an adaptation of the augmented Lagrangian method. Our approach allows for varying regularization strengths across different parameter groups, removing the need for explicit penalty coefficients in the regularization terms. CPR only requires two hyperparameters and introduces no measurable runtime overhead. We offer empirical evidence of CPR's effectiveness through experiments in the "grokking" phenomenon, object detection, and language modeling. Our findings show that CPR can counteract the effects of grokking, and it consistently matches or surpasses the performance of traditional weight decay. 
Jörg Franke · Michael Hefenbrock · Gregor Koehler · Frank Hutter 🔗 


Continually Adapting Optimizers Improve MetaGeneralization
(
Poster
)
>
link
Metalearned optimizers increasingly outperform analytical handcrafted optimizers such as SGD and Adam. On some tasks, however, they fail to generalize strongly, underperforming handcrafted methods. Then one can fall back on handcrafted methods through a guard, to combine the efficiency benefits of learned optimizers and the guarantees of analytical methods. At some point in the iterative optimization process, however, such guards may make the learned optimizer incompatible with the remaining optimization, and thus useless for further progress. Our novel method Meta Guard keeps adapting the learned optimizer to the target optimization problem. It experimentally outperforms other baselines, adapting to new tasks during training. 
Wenyi Wang · Louis Kirsch · Francesco Faccio · Mingchen Zhuge · Jürgen Schmidhuber 🔗 


Surrogate Minimization: An Optimization Algorithm for Training Large Neural Networks with Model Parallelism
(
Poster
)
>
link
Optimizing large memoryintensive neural networks requires distributing its layers across multiple GPUs (referred to as model parallelism). We develop a framework that allows decomposing a neural network layerwise and train it by optimizing layerwise local losses in parallel. By using the resulting framework with GPipe [11] (an effective pipelining strategy for model parallelism), we propose the Surrogate Minimization (SM) algorithm. SM allows for multiple parallel updates to the layerwise parameters of a distributed neural network and consequently improves the GPU utilization of GPipe. Our framework ensures that the sum of local losses is a global upperbound on theneural network loss, and can be minimized efficiently. Under mild technical assumptions, we prove that SM requires O(1/ε) iterations in order to guarantee convergence to an εneighbourhood of a stationary point of the neural network loss. Finally, our experimental results on MLPs demonstrate that SM leads to faster convergence compared to competitive baselines. 
Reza Asad · Reza Babanezhad Harikandeh · Issam Hadj Laradji · Nicolas Le Roux · Sharan Vaswani 🔗 


On the Parallel Complexity of Multilevel Monte Carlo in Stocahstic Gradient Descent
(
Poster
)
>
link
In the stochastic gradient descent (SGD) for sequential simulations such as the neural stochastic differential equations, the Multilevel Monte Carlo (MLMC) method is known to offer better theoretical computational complexity compared to the naive Monte Carlo approach.However, in practice, MLMC scales poorly on massively parallel computing platforms such as modern GPUs, because of its large parallel complexity which is equivalent to that of the naive Monte Carlo method.To cope with this issue, we propose the delayed MLMC gradient estimator that drastically reduces the parallel complexity of MLMC by recycling previously computed gradient components from earlier steps. The proposed estimator provably reduces the average parallel complexity per iteration at the cost of a slightly worse periteration convergence rate.In our numerical experiments, we employ an example of deep hedging to demonstrate the superior parallel complexity of our method compared to the standard MLMC in SGD. 
Kei Ishikawa 🔗 


Pruning Neural Networks with VelocityConstrained Optimization
(
Poster
)
>
link
Pruning has gained prominence as a way to compress overparameterized neural networks. Whilepruning can be understood as solving a sparsityconstrained optimization problem, pruning by directly solving this problem has been relatively underexplored. In this paper, we propose a method toprune neural networks using the MJ algorithm, which interprets constrained optimization using theframework of velocityconstrained optimization. The experimental results show that our methodcan prune VGG19 and ResNet32 networks by more than 90% while preserving the high accuracyof the dense network. 
Donghyun Oh · Jinseok Chung · Namhoon Lee 🔗 


Feature Selection in Generalized Linear models via the Lasso: To Scale or Not to Scale?
(
Poster
)
>
link
The Lasso regression is a popular regularization method for feature selection in statistics. Prior to computing the Lasso estimator in both linear and generalized linear models, it is common to conduct a preliminary rescaling of the feature matrix to ensure that all the features are standardized. Without this standardization, it is argued, the Lasso estimate will unfortunately depend on the units used to measure the features. We propose a new type of iterative rescaling of the features in the context of generalized linear models. Whilst existing Lasso algorithms perform a single scaling as a preprocessing step, the proposed rescaling is applied iteratively throughout the Lasso computation until convergence. We provide numerical examples, with both real and simulated data, illustrating that the proposed iterative rescaling can significantly improve the statistical performance of the Lasso estimator without incurring any significant additional computational cost. 
Anant Mathur · Sarat Moka 🔗 


DIRECT Optimisation with Bayesian Insights: Assessing Reliability Under Fixed Computational Budgets
(
Poster
)
>
link
We introduce a method for probabilistically evaluating the reliability of direct optimisation under a constrained computational budget, a context frequently encountered in various applications. By interpreting the slope data gathered during the optimisation process as samples from the objective function's derivative, we utilise Bayesian posterior prediction to derive a confidence score for the optimisation outcomes. We validated our approach using numerical experiments on four multidimensional test functions, and the results highlight the practicality and efficacy of our approach. 
Fu Wang · Zeyu Fu · Xiaowei Huang · Wenjie Ruan 🔗 


Understanding the Role of Optimization in Double Descent
(
Poster
)
>
link
The phenomenon of modelwise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice \citep{Belkin2018ReconcilingMM}. While double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the perspective of optimization and propose a simple optimizationbased explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to modelwise double descent are unified from the viewpoint of optimization: modelwise double descent is observed if and only if the optimizer is able to find a sufficiently lowloss minimum.We conduct a series of controlled experiments on random feature models and twolayer neural networks under various optimization settings demonstrating this optimizationbased unified view.Our results suggest the following implication: double descent is unlikely to be a problem for realworld machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups. 
Chris Liu · Jeffrey Flanigan 🔗 


Variance Reduced Model Based Methods: New rates and adaptive step sizes
(
Poster
)
>
link
Variance reduced gradients methods were introduced to control the variance of SGD (Stochastic Gradient Descent). Modelbased methods are able to make use of a known lower bound on the loss, for instance, most loss functions are positive. We show how these two classes of methods can be seamlessly combined. As an example we present a Modelbased Stochastic Average Gradient method MSAG, which results from using a truncated model together with the SAG method. At each iteration MSAG computes an adaptive learning rate based on a given known lower bound. When given access to the optimal objective as the lower bound, MSAG has several favorable convergence properties, including monotonic iterates, and convergence in the nonsmooth, smooth and strongly convex setting. This shows that we can essentially tradeoff knowing the smoothness constant $L_{\max}$ for knowing the optimal objective to achieve the favourable convergence of variance reduced gradient methods.

Robert Gower · Frederik Kunstner · Mark Schmidt 🔗 


On the convergence of warped proximal iterations for solving nonmonotone inclusions and applications
(
Poster
)
>
link
In machine learning, tackling fairness, robustness, and safeness requires to solve nonconvex optimization problems with various constraints. In this paper, we investigate the warped proximal iterations for solving the nonmonotone inclusions and its application to nonconvex QP with equality constraints. 
Dimitri Papadimitriou · Bang Cong Vu 🔗 


On the Synergy Between Label Noise and Learning Rate Annealing in Neural Network Training
(
Poster
)
>
link
In the past decade, stochastic gradient descent (SGD) has emerged as one of the most dominant algorithms in neural network training, with enormous success in different application scenarios. However, the implicit bias of SGD with different training techniques is still underexplored. Some of the common heuristics in practice include 1) using large initial learning rates and decaying it as the training progresses, and 2) using minibatch SGD instead of fullbatch gradient descent. In this work, we show that under certain data distributions, these two techniques are both necessary to obtain good generalization on neural networks. We consider minibatch SGD with label noise, and at the heart of our analysis lies the concept of feature learning order, which has previously been characterized theoretically by Li et al. (2019) and Abbe et al. (2021). Notably, we use this to give the first concrete separations in generalization guarantees, between training neural networks with both label noise SGD and learning rate annealing and training with one of these elements removed. 
Stanley Wei · Tongzheng Ren · Simon Du 🔗 


Optimizing GroupFair PlackettLuce Ranking Models for Relevance and ExPost Fairness
(
Poster
)
>
link
In learningtorank (LTR), optimizing only the relevance (or the expected ranking utility) can cause representational harm to certain categories of items. We propose a novel objective that maximizes expected relevance only over those rankings that satisfy given representation constraints to ensure expost fairness.Building upon recent work on an efficient sampler for expost groupfair rankings, we propose a groupfair PlackettLuce model and show that it can be efficiently optimized for our objective in the LTR framework. Experiments on three realworld datasets show that our algorithm guarantees fairness alongside usually having better relevance compared to the LTR baselines. In addition, our algorithm also achieves better relevance than postprocessing baselines which also ensure expost fairness. Further, when implicit bias is injected into the training data, our algorithm typically outperforms existing LTR baselines in relevance. 
Sruthi Gorantla · Eshaan Bhansali · Amit Deshpande · Anand Louis 🔗 


Contrastive PredictandSearch for Mixed Integer Linear Programs
(
Poster
)
>
link
Mixed integer linear programs (MILP) are flexible and powerful tool for modeling and solving many difficult realworld combinatorial optimization problems. In this paper, we propose a novel machine learningbased framework ConPaS that learns to predict solutions to MILPs with contrastive learning. For training, we collect highquality solutions as positive samples and lowquality or infeasible solutions as negative samples. We then learn to make discriminative predictions by contrasting the positive and negative samples. During test time, we predict assignments for a subset of integer variables of a MILP and then solve the resulting reduced MILP to construct highquality solutions. Empirically, we show that ConPaS achieves stateoftheart results compared to other MLbased approaches in terms of the quality of and the speed at which the solutions are found. 
Taoan Huang · Aaron Ferber · Arman Zharmagambetov · Yuandong Tian · Bistra Dilkina 🔗 


Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle
(
Poster
)
>
link
Recent advances in deep learning have given us some verypromising results on the generalization ability of deep neural networks, howeverliterature still lacks a comprehensive theory explaining why heavilyoverparametrized models are able to generalize well while fitting the trainingdata. In this paper we propose a PAC type bound on the generalization error offeedforward ReLU networks via estimating the Rademacher complexity of the set ofnetworks available from an initial parameter vector via gradient descent. Thekey idea is to bound the sensitivity of the network's gradient to perturbationof the input data along the optimization trajectory. The obtained bound doesnot explicitly depend on the depth of the network. Our results areexperimentally verified on the MNIST and CIFAR10 datasets. 
Dániel Rácz · Mihaly Petreczky · Balint Daroczy 🔗 


Riemannian Optimization for Euclidean Distance Geometry
(
Poster
)
>
link
The Euclidean distance geometry (EDG) problem is a crucial machine learning task that appears in many applications. Utilizing the pairwise Euclidean distance information of a given point set, EDG reconstructs the configuration of the point system. When only partial distance information is available, matrix completion techniques can be incorporated to fill in the missing pairwise distances. In this paper, we propose a novel dual basis Riemannian gradient descent algorithm, coined RieEDG, for the EDG completion problem. The numerical experiments verify the effectiveness of the proposed algorithm. In particular, we show that RieEDG can precisely reconstruct various datasets consisting of 2 and 3dimensional points by accessing a small fraction of pairwise distance information. 
Chandler Smith · Samuel Lichtenberg · HanQin Cai · Abiy Tasissa 🔗 


GUC: Unsupervised nonparametric Global Clustering and Anomaly Detection
(
Poster
)
>
link
Anomaly Detection is a crucial task in the fields of optimization and Machine Learning,with the ability of detecting global anomalies being of particular importance. In this paper,we propose a novel nonparametric algorithm for automatically detecting global anomaliesin an unsupervised manner. Our algorithm is both effective and efficient, requiring no priorassumptions or domain knowledge to be applied. It features two modes that utilize thedistance from the dataset’s center for grouping data points together. The first mode splitsthe dataset into global clusters where each cluster signifies proximity from the center. Thesecond mode employs a threshold value for splitting the points into outliers and inliers. Weevaluate our proposal against other prominent methods using synthetic and real datasets.Our experiments demonstrate that the proposed algorithm achieves stateoftheart performance with minimum computational cost, and can successfully be applied to a wide rangeof Machine Learning applications. 
Chris Solomou 🔗 


Testing Approximate Stationarity Concepts for Piecewise Smooth Functions
(
Poster
)
>
link
We study various aspects of the fundamental computational problem of the stationarity test for piecewise smooth functions. Our contributions are:* Hardness: We show that checking firstorder approximate stationarity concepts in term of different subdifferential constructions for a piecewise linear function is strongly coNPhard. As a corollary, we proved that testing FOM for absnorm form is computational hard, confirming a conjecture by Griewank and Walther [14, SIAM J. Optim., 29 (2019), pp. 284].* Regularity: We establish a necessary and sufficient condition for the validity of an equalitytype subdifferential sum rule for the Sum of DifferenceofMax (SDM) functions by using tools from generalized differentiation theory and polytope theory. This SDM class is underlying the analysis of many important nonsmooth piecewise differentiable functions, including the empirical loss of twolayer ReLU networks.In particular, we show that this condition is efficiently checkable when the subdifferential set is of a zonotope type.* Robust algorithms: We introduce the first oraclepolynomialtime algorithm to test nearapproximate stationarity for a SDM function. When specialized to the empirical loss of twolayer ReLU networks, our new algorithm is the first practical and robust stationarity test approach, tackling an open issue in the work of Yun et al. [29, ICLR (2019)]. 
Lai Tian · Anthony ManCho So 🔗 


Multihead CLIP: Improving CLIP with Diverse Representations and Flat Minima
(
Poster
)
>
link
Contrastive LanguageImage Pretraining (CLIP) has shown remarkable success in the field of multimodal learning by enabling joint understanding of text and images. In this paper, we introduce a novel method called Multihead CLIP, inspired by Stein Variational Gradient Descent (SVGD) and Sharpnessaware Minimization (SAM). Our approach aims to enhance CLIP's learning capability by encouraging the model to acquire diverse features while also promoting convergence towards a flat loss region, resulting in improved generalization performance. We conduct extensive experiments on two benchmark datasets, YFCC15M and CC3M, to evaluate the effectiveness of our proposed method. The experimental results consistently demonstrate that multihead CLIP outperforms both the original CLIP architecture and CLIP with the SAM optimizer. 
Mo Zhou · Xiong Zhou · Erran Li Li · Stefano Ermon · Rong Ge 🔗 


DynaLay: An Introspective Approach to Dynamic Layer Selection for Deep Networks
(
Poster
)
>
link
Deep learning models have increasingly become computationally intensive, necessitating specialized hardware and significant runtime for both training and inference. In this work, we introduce DynaLay, a versatile and dynamic neural network architecture that employs a reinforcement learning agent to adaptively select which layers to execute for a given input. Our approach introduces an element of introspection into neural network architectures by enabling the model to recompute the results on more difficult inputs during inference, balancing the amount of expelled computation, optimizing for both performance and efficiency. The system comprises a main model constructed with FixedPoint Iterative (FPI) layers, which can approximate complex functions with high fidelity, and an agent that chooses among these layers or a nooperation (NOP) action. Unique to our approach is a multifaceted reward function that combines classification accuracy, computational time, and a penalty for redundant layer selection, thereby ensuring a harmonious tradeoff between performance and cost. Experimental results demonstrate that DynaLay achieves comparable accuracy to conventional deep models while significantly reducing computational overhead. Our approach represents a significant step toward creating more efficient, adaptable, and universally applicable deep learning systems. 
Mrinal Mathur · Sergey Plis 🔗 


Optimal Transport for Kernel Gaussian Mixture Models
(
Poster
)
>
link
The Wasserstein distance from optimal mass transport (OMT) is a powerful mathematical tool with numerous applications that provides a natural measure of the distance between two probability distributions. Several methods to incorporate OMT into widely used probabilistic models, such as Gaussian or Gaussian mixture, have been developed to enhance the capability of modeling complex multimodal densities of real datasets. However, very few studies have explored the OMT problems in a reproducing kernel Hilbert space (RKHS), wherein the kernel trick is utilized to avoid the need to explicitly map input data into a highdimensional feature space. In the current study, we propose a Wassersteintype metric to compute the distance between two Gaussian mixtures in a RKHS via the kernel trick, i.e., kernel Gaussian mixture models. 
Jung Hun Oh · Rena Elkin · Anish Simhal · Jiening Zhu · Joseph Deasy · Allen Tannenbaum 🔗 


Stochastic Optimization under Hidden Convexity
(
Poster
)
>
link
In this work, we consider stochastic nonconvex constrained optimization problems under hidden convexity, i.e., those that admit a convex reformulation via a black box (nonlinear, but invertible) map $c: \mathcal{X} \rightarrow \mathcal{U}$. A number of nonconvex problems ranging from optimal control, revenue and inventory management, to convex reinforcement learning all admit such a hidden convex structure. Unfortunately, in the majority of considered applications, the map $c(\cdot)$ is unavailable and therefore, the reduction to solving a convex optimization is not possible. On the other hand, the (stochastic) gradients with respect to the original variable $x\in \mathcal{X}$ are often easy to obtain. Motivated by these observations, we consider the projected stochastic (sub) gradient methods under hidden convexity and provide the first sample complexity guarantees for global convergence in smooth and nonsmooth settings. Additionally, we improve our results to the last iterate function value convergence in the smooth setting using the momentum variant of projected stochastic gradient descent.

Ilyas Fatkhullin · Niao He · Yifan Hu 🔗 


On Optimization Formulations of Finite Horizon MDPs
(
Poster
)
>
link
In this paper, we extend the connection between linear programming formulations of MDPs and policy gradient methods for infinite horizon MDPs presented in (Ying, L., & Zhu, Y., 2020) to finite horizon MDPs. The main tool we use for this extension is a reduction from optimization formulations of finite horizon MDPs to infinite horizon MDPs. Additionally, we show using a reparameterization argument that the KKT conditions for the nonconvex policy optimization problem for the finite horizon setting are sufficient for global optimality. Further, we use the reduction to extend the QuasiNewton policy gradient algorithm of (Li et. al 2021) to the finite horizon case and achieve performance competitive with value iteration by exploiting backward induction for policy evaluation. To our knowledge, this serves as the first policy gradientbased method for finite horizon MDPs that is competitive with value iterationbased approaches. 
Rajat Vadiraj Dwaraknath · Lexing Ying 🔗 


SelfTaught Optimizer (STOP): Recursively SelfImproving Code Generation
(
Poster
)
>
link
Several recent advances in AI systems (e.g., TreeofThoughts and ProgramAided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a languagemodelinfused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. Afterward, we analyze the variety of selfimprovement strategies proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive selfimprovement. Nonetheless, it demonstrates that a modern language model, GPT4 in our proofofconcept experiments, is capable of writing code that can call itself to improve itself. We critically consider concerns around the development of selfimproving technologies and evaluate the frequency with which the generated code bypasses a sandbox. 
Eric Zelikman · Eliana Lorch · Lester Mackey · Adam Tauman Kalai 🔗 


Learning MultiObjective Optimization Problem Through Online Learning
(
Poster
)
>
link
We investigate the problem of learning parameters (i.e., objective functions or constraints) of a multiobjective optimization problem, based on a set of sequentially arrived solutions. In particular, these solutions might not be exact and possibly carry measurement noise or are generated with the bounded rationality of decision makers. In this paper, we propose a general online learning framework to deal with this learning problem using inverse multiobjective optimization, and prove that this framework converges at a rate of under certain regularity conditions. More precisely, we develop two online learning algorithms with implicit update rules which can handle noisy data. Numerical results with both synthetic and real world datasets show that both algorithms can learn the parameters of a multiobjective program with great accuracy and are robust to noise. 
Chaosheng Dong · Yijia Wang · Bo Zeng 🔗 