Extending $\mu$P: Spectral Conditions for Feature Learning Across Optimizers
Akshita Gupta · Marieme Ngom · Sam Foreman · Venkatram Vishwanath
Abstract
Tuning hyperparameters (HPs) for large language models (LLMs) is computationally expensive. Maximal update parameterization ($\mu$P) provides width-aware scaling rules under which optimal HPs remain stable across model widths, but prior derivations for SGD and Adam rely on tensor programs, which are difficult to extend. Building on recent work that introduced spectral conditions as an alternative to tensor programs, we propose a framework to derive $\mu$P for a broader class of optimizers, including AdamW, ADOPT, LAMB, and Sophia. We validate our derivations on NanoGPT and further provide empirical insights into depth-scaling parameterization for these optimizers.
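For context, a commonly stated form of the spectral condition that this line of work builds on is sketched below. This is an illustration of the prior-work precondition for feature learning, not the paper's optimizer-specific rules; the layer widths $n_{\ell-1}, n_\ell$, the per-layer learning rate $\eta_\ell$, and the weight and update matrices $W_\ell, \Delta W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are notation introduced here for illustration.

$$
\|W_\ell\|_{*} = \Theta\!\left(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}}\right),
\qquad
\|\Delta W_\ell\|_{*} = \Theta\!\left(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}}\right),
$$

where $\|\cdot\|_{*}$ denotes the spectral norm. For a normalized, sign-like update such as Adam's, the entries of $\Delta W_\ell$ are $O(\eta_\ell)$; if the update is effectively low-rank and well aligned, then $\|\Delta W_\ell\|_{*} \approx \|\Delta W_\ell\|_{F} \approx \eta_\ell \sqrt{n_\ell\, n_{\ell-1}}$, and the condition forces $\eta_\ell = \Theta(1/n_{\ell-1})$, i.e. the hidden-layer learning rate must shrink like $1/\mathrm{fan\text{-}in}$ as width grows, recovering the familiar $\mu$P rule for Adam-style optimizers. Repeating this bookkeeping with each optimizer's characteristic update magnitude is, roughly, the style of derivation the abstract refers to.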