

Poster in Workshop: Optimization for ML Workshop

Memory Efficient Adaptive Stochastic Optimization via Subset-Norm

Thien H Nguyen · Huy Nguyen


Abstract: As deep neural networks grow larger, memory efficiency becomes crucial, with the optimizer states of popular algorithms like Adam consuming substantial memory. This paper generalizes existing high-probability convergence analyses for AdaGrad and AdaGrad-Norm to arbitrary parameter partitions, encompassing both algorithms as special cases. We reveal a trade-off between coordinate-noise density and the dimensional dependency of the convergence rate, suggesting an optimal grouping between the full coordinate-wise version (AdaGrad) and the scalar version (AdaGrad-Norm). This insight leads to a principled compression approach called Subset-Norm, which targets the coordinate-wise second-moment term in AdaGrad, RMSProp, and Adam. We demonstrate the empirical effectiveness of subset-norm step sizes in LLM pre-training tasks on LLaMA models, showing performance competitive with baselines like Adam while significantly reducing the memory usage of the optimizer state from O(d) to O(√d) and introducing no additional hyperparameters.
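To make the idea concrete, below is a minimal sketch of an AdaGrad-style update with subset-norm step sizes, assuming contiguous subsets of size roughly √d and a single accumulator scalar per subset; the function and parameter names are illustrative, not the authors' implementation.

```python
import numpy as np

def subset_norm_adagrad_step(param, grad, accum, lr=1e-2, eps=1e-8, subset_size=None):
    """One AdaGrad-style update with a shared (subset-norm) step size per subset.

    param, grad : 1-D arrays of length d (flattened parameters / gradients).
    accum       : running sum of squared subset norms, one scalar per subset.
    subset_size : size of each contiguous subset; ~sqrt(d) if not given (assumption).
    """
    d = param.size
    if subset_size is None:
        subset_size = max(1, int(np.sqrt(d)))  # heuristic subset size of ~sqrt(d)
    n_subsets = int(np.ceil(d / subset_size))

    for i in range(n_subsets):
        sl = slice(i * subset_size, min((i + 1) * subset_size, d))
        # Accumulate the squared norm of the gradient restricted to this subset.
        accum[i] += np.sum(grad[sl] ** 2)
        # All coordinates in the subset share one adaptive step size, so only
        # one accumulator scalar is stored per subset (~sqrt(d) scalars total
        # instead of d coordinate-wise second-moment entries).
        param[sl] -= lr * grad[sl] / (np.sqrt(accum[i]) + eps)
    return param, accum
```

With subset_size = 1 this reduces to coordinate-wise AdaGrad, and with subset_size = d it reduces to AdaGrad-Norm; intermediate subset sizes interpolate between the two, which is the grouping trade-off the abstract describes.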
