DenseMixer: Improving MoE Post-Training with Precise Router Gradient
Abstract
Mixture-of-Experts (MoE) models are notoriously harder to train than dense models. Existing approaches either rely on imprecise router gradients or freeze router parameters entirely, limiting training effectiveness. We introduce DenseMixer, a novel MoE post-training technique that trades one extra forward pass on inactive experts for a more precise estimate of the router gradient. Our method consistently outperforms conventional approaches across MoE scales (7B, 14B, 30B), architectures (with/without shared experts), pre-training recipes (trained from scratch/up-cycled), and post-training data types (instruction/long CoT data). DenseMixer applies to any MoE with Top-K routing and works in a plug-and-play manner: it is compatible with existing training libraries and parameter-efficient methods such as LoRA, and it requires no changes at inference time. We provide comprehensive empirical validation showing that DenseMixer improves MoE post-training quality at a practical computational cost.
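To make the core idea concrete, below is a minimal PyTorch sketch of a Top-K MoE layer with a DenseMixer-style training path. The module structure, names (e.g. `DenseMixerMoE`), and the straight-through gate formulation are illustrative assumptions for exposition, not the paper's reference implementation; the inference path is ordinary Top-K routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseMixerMoE(nn.Module):
    """Illustrative Top-K MoE layer with a dense-gradient training path for the router."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)               # (tokens, n_experts)
        _, topk_idx = probs.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(probs).scatter(-1, topk_idx, 1.0)

        if self.training:
            # Extra forward pass: run *all* experts. Inactive experts still get
            # zero gradient here because their gate value is zero; an efficient
            # implementation would make this pass forward-only.
            all_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d)
            # Straight-through gate: its forward value is the sparse Top-K
            # weighting, but its gradient w.r.t. the router follows the full
            # (dense) probabilities, so the router's gradient is informed by
            # every expert's output.
            gate = (mask * probs).detach() + probs - probs.detach()
            return torch.einsum("te,ted->td", gate, all_out)

        # Inference path is unchanged: only the selected Top-K experts run.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (topk_idx == e).any(dim=-1)
            if sel.any():
                out[sel] += probs[sel, e].unsqueeze(-1) * expert(x[sel])
        return out
```

In this sketch, the layer's output value during training is identical to the sparse Top-K mixture, so only the backward signal changes: the router receives a gradient computed as if all experts contributed, at the cost of one extra forward pass over the inactive experts, while expert parameters are updated exactly as in standard Top-K training.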