DenseMixer: Improving MoE Post-Training with Precise Router Gradient
Abstract
Mixture-of-Experts (MoE) models are notoriously harder to train than dense models. Existing approaches either rely on imprecise router gradients or freeze router parameters entirely, limiting training effectiveness. We introduce DenseMixer, a novel MoE post-training technique that trades one extra forward pass on inactive experts for a more precise estimate of the router gradient. Our method consistently outperforms conventional approaches across MoE scales (7B, 14B, 30B), architectures (with/without shared experts), pre-training recipes (trained from scratch/up-cycled), and post-training data types (instruction/long CoT data). DenseMixer applies to any MoE with Top-K routing and works in a plug-and-play manner: it is compatible with existing training libraries and parameter-efficient methods such as LoRA, and it requires no changes at inference time. We provide comprehensive empirical validation showing that DenseMixer improves MoE post-training quality at a practical computational cost.
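To make the core idea concrete, below is a minimal PyTorch sketch of a Top-K MoE layer with a DenseMixer-style training path. The module structure, names (e.g. `DenseMixerMoE`), and the straight-through gate formulation are illustrative assumptions for exposition, not the paper's reference implementation; the inference path is ordinary Top-K routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMixerMoE(nn.Module):
    """Illustrative Top-K MoE layer with a dense-gradient training path for the router."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)               # (tokens, n_experts)
        _, topk_idx = probs.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(probs).scatter(-1, topk_idx, 1.0)

        if self.training:
            # Extra forward pass: run *all* experts. Inactive experts still get
            # zero gradient here because their gate value is zero; an efficient
            # implementation would make this pass forward-only.
            all_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d)
            # Straight-through gate: its forward value is the sparse Top-K
            # weighting, but its gradient w.r.t. the router follows the full
            # (dense) probabilities, so the router's gradient is informed by
            # every expert's output.
            gate = (mask * probs).detach() + probs - probs.detach()
            return torch.einsum("te,ted->td", gate, all_out)

        # Inference path is unchanged: only the selected Top-K experts run.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (topk_idx == e).any(dim=-1)
            if sel.any():
                out[sel] += probs[sel, e].unsqueeze(-1) * expert(x[sel])
        return out
```

In this sketch, the layer's output value during training is identical to the sparse Top-K mixture, so only the backward signal changes: the router receives a gradient computed as if all experts contributed, at the cost of one extra forward pass over the inactive experts, while expert parameters are updated exactly as in standard Top-K training.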