Contributed Talks 1: Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tail Class Imbalance & Muon Optimizes Under Spectral Norm Constraints
Abstract
Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tail Class Imbalance, Speaker: Robin Yadav
Abstract: Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages of adaptive algorithms with structured preconditioners, e.g., coordinate-wise or layer-wise, and steepest descent methods w.r.t. non-Euclidean norms, e.g., the \ell_\infty norm or spectral norm, over GD. However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we aim to analyze the benefit of \ell_\infty-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the \ell_\infty norm) over normalized GD (steepest descent w.r.t. the \ell_2 norm) in the presence of heavy-tailed class imbalance.
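The two steepest-descent updates compared in the talk can be sketched in a few lines. This is a minimal illustration of the update rules themselves (not the talk's minimal model); the function names and step-size convention are our own.

```python
import numpy as np

def sign_descent_step(w, grad, lr):
    # Steepest descent w.r.t. the l_inf norm: every coordinate moves
    # by exactly lr, in the direction opposite its gradient's sign.
    return w - lr * np.sign(grad)

def normalized_gd_step(w, grad, lr):
    # Steepest descent w.r.t. the l_2 norm: a step of length lr along
    # the (Euclidean-)unit vector opposite the gradient.
    return w - lr * grad / np.linalg.norm(grad)
```

The contrast matters under heavy-tailed class imbalance: normalized GD's step is dominated by the few large-gradient coordinates, while sign descent moves all coordinates, including those tied to rare classes, at the same rate.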
Muon Optimizes Under Spectral Norm Constraints, Speaker: Jonathan Li
Abstract: The pursuit of faster optimization algorithms remains an active and important research direction in deep learning. Recently, the Muon optimizer has demonstrated promising empirical performance, but its theoretical foundation remains less understood. In this paper, we bridge this gap and provide a theoretical analysis of Muon by placing it within the Lion-K family of optimizers. Specifically, we show that Muon corresponds to Lion-K when equipped with the nuclear norm, and we leverage the theoretical results of Lion-K to establish that Muon (with decoupled weight decay) implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective not only demystifies the implicit regularization effects of Muon but also leads to natural generalizations through varying the choice of convex map K, allowing for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
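The core of Muon's update, which the spectral-norm analysis builds on, replaces a weight matrix's gradient by its orthogonalized version UV^T. A minimal sketch, using an exact SVD for clarity (Muon in practice approximates this with Newton-Schulz iterations) and a hypothetical function name:

```python
import numpy as np

def muon_like_step(W, grad, lr, weight_decay=0.0):
    # Orthogonalize the gradient: with grad = U diag(s) V^T, the matrix
    # U V^T is the steepest-descent direction under the spectral norm.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    # Decoupled weight decay, applied multiplicatively to W; per the
    # Lion-K analysis this is what enforces the implicit spectral-norm
    # constraint on the iterates.
    return W * (1.0 - lr * weight_decay) - lr * (U @ Vt)
```

Note that every nonzero singular value of the update direction UV^T equals 1, so each step perturbs all singular directions of W by the same magnitude lr.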