Partial Parameter Updates for Efficient Distributed Training
Anastasiia Filippova · Angelos Katharopoulos · David Grangier · Ronan Collobert
Abstract
We propose a memory- and compute-efficient method for low-communication distributed training. As in prior approaches, we synchronize gradients infrequently, performing multiple local updates between communication rounds. In our approach, each node updates only a fixed, node-specific subset of parameters, keeping the remainder frozen. This restricted backpropagation lowers peak memory per device and total training FLOPs, while maintaining the same communication budget as existing low-communication methods. We demonstrate that, when training a $1.3$B-parameter language model across $32$ nodes, our method achieves perplexity comparable to prior low-communication approaches under identical token and bandwidth constraints, while using $15$–$21\%$ fewer FLOPs and $20$–$41\%$ less memory. Moreover, in bandwidth-constrained settings, our method achieves faster convergence than Distributed Data Parallel (DDP) with every-step synchronization.
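The following is a minimal PyTorch-style sketch of the general idea described in the abstract: each node keeps gradients and optimizer state only for a node-specific subset of parameters and synchronizes infrequently. The round-robin partition, the helper names (`freeze_to_node_subset`, `synchronize_parameters`), and the local-step count are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch (not the authors' code): node-specific parameter freezing
# combined with infrequent parameter averaging across nodes.
import torch
import torch.distributed as dist
from torch import nn


def freeze_to_node_subset(model: nn.Module, rank: int, world_size: int) -> None:
    """Keep gradients only for parameters assigned to this node (round-robin, assumed)."""
    for i, param in enumerate(model.parameters()):
        param.requires_grad = (i % world_size == rank)


def synchronize_parameters(model: nn.Module) -> None:
    """Average all parameters across nodes (one communication round)."""
    if not (dist.is_available() and dist.is_initialized()):
        return  # single-process run: nothing to synchronize
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        param.data /= world_size


def train(model, loader, rank=0, world_size=1, local_steps=64, lr=1e-3):
    freeze_to_node_subset(model, rank, world_size)
    # Optimizer state exists only for the trainable subset, reducing memory.
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()  # gradients are computed only for the unfrozen subset
        opt.step()
        opt.zero_grad(set_to_none=True)
        if (step + 1) % local_steps == 0:
            synchronize_parameters(model)  # infrequent communication round
```

In this sketch the communication volume per round matches parameter averaging in existing low-communication schemes, while the backward pass and optimizer state cover only the node's assigned subset.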