Poster

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful

Martin Marek ⋅ Sanae Lotfi ⋅ Aditya Somasundaram ⋅ Andrew Wilson ⋅ Micah Goldblum
2025 Poster

Abstract
