Extending $\mu$P: Spectral Conditions for Feature Learning Across Optimizers
Akshita Gupta · Marieme Ngom · Sam Foreman · Venkatram Vishwanath
Abstract
Tuning hyperparameters (HPs) for large language models (LLMs) is computationally expensive. Maximal update parameterization ($\mu$P) provides width-aware scaling rules under which optimal HPs remain stable across model widths, but prior derivations for SGD and Adam rely on tensor programs, which are difficult to extend. Building on recent work that introduced spectral conditions as an alternative to tensor programs, we propose a framework to derive $\mu$P for a broader class of optimizers, including AdamW, ADOPT, LAMB, and Sophia. We validate our derivations on NanoGPT and further provide empirical insights into depth-scaling parameterization for these optimizers.
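For context, a commonly stated form of the spectral condition that this line of work builds on is sketched below. This is an illustration of the prior-work precondition for feature learning, not the paper's optimizer-specific rules; the layer widths $n_{\ell-1}, n_\ell$, the per-layer learning rate $\eta_\ell$, and the weight and update matrices $W_\ell, \Delta W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are notation introduced here for illustration.

$$
\|W_\ell\|_{*} = \Theta\!\left(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}}\right),
\qquad
\|\Delta W_\ell\|_{*} = \Theta\!\left(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}}\right),
$$

where $\|\cdot\|_{*}$ denotes the spectral norm. For a normalized, sign-like update such as Adam's, the entries of $\Delta W_\ell$ are $O(\eta_\ell)$; if the update is effectively low-rank and well aligned, then $\|\Delta W_\ell\|_{*} \approx \|\Delta W_\ell\|_{F} \approx \eta_\ell \sqrt{n_\ell\, n_{\ell-1}}$, and the condition forces $\eta_\ell = \Theta(1/n_{\ell-1})$, i.e. the hidden-layer learning rate must shrink like $1/\mathrm{fan\text{-}in}$ as width grows, recovering the familiar $\mu$P rule for Adam-style optimizers. Repeating this bookkeeping with each optimizer's characteristic update magnitude is, roughly, the style of derivation the abstract refers to.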