

Poster in Workshop: Machine Learning for Systems

TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation

Reza Yazdani Aminabadi · Connor Holmes · Samyam Rajbhandari · Zhewei Yao · Yuxiong He


Abstract: The Mixture of Experts (MoE) model is a powerful architecture that dynamically selects a subset of experts for each input, enabling the model to scale efficiently. However, the gating mechanism, which determines the assignment of tokens to experts, introduces 4-dimensional (S×E×C×M) computational complexity due to its reliance on sparse representations, which results in wasteful dense computation. In this work, we present TurboMoE, a novel approach to accelerate MoE model training by optimizing the gating logic through smart kernel fusion and data-layout transformations. Our method addresses the computational bottlenecks of the gating process by introducing three specialized kernels. The first kernel efficiently computes expert scores and performs top-k expert selection, while the second kernel scatters input tokens into expert-specific buffers, minimizing the need for sparse operations. Furthermore, we introduce a third MoE-Gather kernel, which replaces the traditional sparse matrix multiplication and streamlines the combination of expert outputs. By integrating these kernels, TurboMoE achieves substantial end-to-end speedups over the state-of-the-art solution, MegaBlocks, with 55% faster training for top-1 selection and a 41% improvement for top-2 selection configurations. These optimizations reduce the computational overhead of the gating functionality from O(S×E×C×M) to O(S×M). TurboMoE demonstrates that by removing the reliance on sparse computation, MoE models can achieve unprecedented training efficiency, reaching 460 TFLOPS on 32 NVIDIA H100 GPUs for a 32-expert MoE architecture with a top-2 gating configuration, paving the way for more scalable and effective applications.
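The sketch below is an illustrative, unfused PyTorch reference for the three-stage gating pipeline the abstract describes (score + top-k selection, scatter into per-expert buffers, gather/combine). It is not the TurboMoE implementation, which fuses these stages into custom kernels; the shapes (S tokens, E experts, C capacity, M model dimension) follow the abstract, while the function names, capacity heuristic, and over-capacity token dropping are hypothetical choices made only for this example.

```python
# Minimal sketch of dense top-k MoE gating with scatter/gather, assuming
# per-expert capacity C and simple token dropping when a buffer is full.
import torch

def moe_gating_reference(x, gate_w, expert_ffns, k=2):
    """x: (S, M) tokens; gate_w: (M, E) gating weights; expert_ffns: list of E modules."""
    S, M = x.shape
    E = gate_w.shape[1]
    C = (k * S + E - 1) // E                              # hypothetical capacity heuristic

    # Stage 1 (conceptually, TurboMoE's first kernel): expert scores and top-k selection.
    scores = torch.softmax(x @ gate_w, dim=-1)            # (S, E)
    topk_scores, topk_experts = scores.topk(k, dim=-1)    # (S, k)

    # Stage 2 (conceptually, the second kernel): scatter tokens into per-expert buffers.
    buffers = x.new_zeros(E, C, M)
    slot_of = torch.full((S, k), -1, dtype=torch.long)
    fill = [0] * E
    for s in range(S):
        for j in range(k):
            e = topk_experts[s, j].item()
            if fill[e] < C:                               # drop tokens over capacity
                buffers[e, fill[e]] = x[s]
                slot_of[s, j] = fill[e]
                fill[e] += 1

    # Dense expert computation on the packed buffers.
    expert_out = torch.stack([expert_ffns[e](buffers[e]) for e in range(E)])  # (E, C, M)

    # Stage 3 (conceptually, the MoE-Gather kernel): gather expert outputs and
    # combine them weighted by gate scores, instead of a sparse matmul.
    y = x.new_zeros(S, M)
    for s in range(S):
        for j in range(k):
            e, slot = topk_experts[s, j].item(), slot_of[s, j].item()
            if slot >= 0:
                y[s] += topk_scores[s, j] * expert_out[e, slot]
    return y

# Example usage with hypothetical sizes (S=128, M=64, E=8, top-2):
# experts = [torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.ReLU(),
#                                torch.nn.Linear(256, 64)) for _ in range(8)]
# y = moe_gating_reference(torch.randn(128, 64), torch.randn(64, 8), experts, k=2)
```

The explicit Python loops make the scatter and gather steps easy to read but are exactly the overhead that kernel fusion and data-layout transformation aim to eliminate: once tokens are packed densely per expert, the expert computation itself is a plain batched matmul with no sparse bookkeeping.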
