M+Adam: Stable Low-Precision Training with Combined Adam--Madam Updates
Abstract
The scaling of frontier Large Language Models (LLMs) has led to impressive gains in language understanding, but it is unsustainable in terms of hardware requirements. Quantizing weights and activations can help reduce costs, but training models in low precision is challenging and often unstable. Further, standard optimization methods like Adam still require keeping master copies of weights and optimizer states in full precision. Is it possible to train LLMs directly in low precision, without storing full-precision copies, and still reach optimal perplexity? We answer this affirmatively by designing a new optimization method that combines standard Adam with multiplicative updates: \textit{M{+}Adam} combines Adam and Multiplicative Adam (Madam) to update the mantissa and exponent of each floating-point parameter in parallel (Adam for the mantissa; Madam for the exponent). This dual-track update mechanism exploits the complementary strengths of the two optimizers: Adam provides fine-grained control over relative changes in the mantissa, whereas Madam supplies robust multiplicative updates that directly adjust the exponent, enabling the larger jumps often required under limited precision. Empirically, this reduces update variance in low precision, damping extreme fluctuations, enabling training entirely in low precision without storing full-precision master weights or optimizer states, and easing (often eliminating) extensive hyperparameter tuning. We further prove that M{+}Adam is a descent method: under standard smoothness assumptions, the parallel Adam+Madam updates guarantee a monotonic decrease in the composite loss. Under BF16 execution (BF16 for both runtime operations and optimizer states, with no full-precision master weights), Adam often fails, diverging or yielding high perplexity, whereas M{+}Adam trains stably and nearly matches the perplexity of Adam trained with full-precision master weights at higher batch sizes. This opens up avenues for training AI models fully in low precision, leading to higher hardware efficiency.
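
To make the dual-track mechanism concrete, below is a minimal, illustrative Python sketch of one scalar update: an additive Adam-style step adjusts the mantissa, while a multiplicative Madam-style step shifts the magnitude (and hence the exponent). The function name, learning rates, and exact update rules are assumptions for illustration only, not the paper's published algorithm.

\begin{verbatim}
import math

def dual_track_step(w, grad, state, lr_mantissa=1e-3,
                    lr_exponent=1e-2, betas=(0.9, 0.999), eps=1e-8):
    """Illustrative dual-track update on a scalar parameter w > 0."""
    # Adam-style moment estimates (bias correction omitted for brevity).
    m = betas[0] * state["m"] + (1 - betas[0]) * grad
    v = betas[1] * state["v"] + (1 - betas[1]) * grad ** 2
    state["m"], state["v"] = m, v
    adam_dir = m / (math.sqrt(v) + eps)

    # Split w = mantissa * 2**exponent, with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(w)

    # Additive, Adam-like fine-grained step on the mantissa.
    mantissa -= lr_mantissa * adam_dir

    # Multiplicative, Madam-like step on the magnitude: scaling by
    # exp(-lr * sign(w) * adam_dir) shifts the exponent, allowing the
    # larger jumps needed under limited precision.
    scale = math.exp(-lr_exponent * math.copysign(1.0, w) * adam_dir)

    return math.ldexp(mantissa, exponent) * scale

# Toy usage: minimize f(w) = (w - 4)^2 starting from w = 0.5.
state = {"m": 0.0, "v": 0.0}
w = 0.5
for _ in range(2000):
    w = dual_track_step(w, 2 * (w - 4), state)
print(f"w = {w:.2f}")  # hovers near the minimizer at 4
\end{verbatim}

Note that the multiplicative scale is an additive shift in log-magnitude, i.e., on the exponent, which is what lets the exponent track take large steps while the mantissa track makes fine corrections.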
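
The descent-method claim can be read through the standard smoothness argument; the following is a minimal sketch under assumed conditions (the paper's actual assumptions, constants, and composite-loss definition are not reconstructed here). For an $L$-smooth loss $f$, if the combined Adam+Madam direction $d_t$ satisfies $\langle \nabla f(w_t), d_t \rangle \ge c \lVert d_t \rVert^2$ for some $c > 0$, then a step $w_{t+1} = w_t - \eta d_t$ with $\eta \le 2c/L$ gives
\[
f(w_{t+1})
\le f(w_t) + \langle \nabla f(w_t),\, w_{t+1} - w_t \rangle
   + \frac{L}{2}\,\lVert w_{t+1} - w_t \rVert^2
= f(w_t) - \eta\,\langle \nabla f(w_t), d_t \rangle
   + \frac{L\eta^2}{2}\,\lVert d_t \rVert^2
\le f(w_t) - \eta\Bigl(c - \frac{L\eta}{2}\Bigr)\lVert d_t \rVert^2
\le f(w_t),
\]
so the loss is monotonically non-increasing along the iterates.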