Partial Parameter Updates for Efficient Distributed Training
Anastasiia Filippova · Angelos Katharopoulos · David Grangier · Ronan Collobert
Abstract
We propose a memory- and compute-efficient method for low-communication distributed training. As in prior approaches, we synchronize gradients infrequently, performing multiple local updates between communication rounds. In our approach, each node updates only a fixed, node-specific subset of parameters, keeping the remainder frozen. This restricted backpropagation lowers peak memory per device and total training FLOPs, while maintaining the same communication budget as existing low-communication methods. We demonstrate that, when training a $1.3$B-parameter language model across $32$ nodes, our method achieves perplexity comparable to prior low-communication approaches under identical token and bandwidth constraints, while using $15$–$21\%$ fewer FLOPs and $20$–$41\%$ less memory. Moreover, in bandwidth-constrained settings, our method achieves faster convergence than Distributed Data Parallel (DDP) with every-step synchronization.
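The following is a minimal PyTorch-style sketch of the general idea described in the abstract: each node keeps gradients and optimizer state only for a node-specific subset of parameters and synchronizes infrequently. The round-robin partition, the helper names (`freeze_to_node_subset`, `synchronize_parameters`), and the local-step count are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch (not the authors' code): node-specific parameter freezing
# combined with infrequent parameter averaging across nodes.
import torch
import torch.distributed as dist
from torch import nn


def freeze_to_node_subset(model: nn.Module, rank: int, world_size: int) -> None:
    """Keep gradients only for parameters assigned to this node (round-robin, assumed)."""
    for i, param in enumerate(model.parameters()):
        param.requires_grad = (i % world_size == rank)


def synchronize_parameters(model: nn.Module) -> None:
    """Average all parameters across nodes (one communication round)."""
    if not (dist.is_available() and dist.is_initialized()):
        return  # single-process run: nothing to synchronize
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        param.data /= world_size


def train(model, loader, rank=0, world_size=1, local_steps=64, lr=1e-3):
    freeze_to_node_subset(model, rank, world_size)
    # Optimizer state exists only for the trainable subset, reducing memory.
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()  # gradients are computed only for the unfrozen subset
        opt.step()
        opt.zero_grad(set_to_none=True)
        if (step + 1) % local_steps == 0:
            synchronize_parameters(model)  # infrequent communication round
```

In this sketch the communication volume per round matches parameter averaging in existing low-communication schemes, while the backward pass and optimizer state cover only the node's assigned subset.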