

Poster in Workshop: Machine Learning for Systems

TurboMoE: Enhancing MoE Model Training with Smart Kernel-Fusion and Data Transformation

Reza Yazdani Aminabadi · Connor Holmes · Samyam Rajbhandari · Zhewei Yao · Yuxiong He


Abstract: The Mixture of Experts (MoE) model is a powerful architecture that dynamically selects a subset of experts for each input, enabling the model to scale efficiently. However, the gating mechanism, which determines the assignment of tokens to experts, introduces 4-dimensional (S×E×C×M) computational complexity due to its reliance on sparse representations, which results in wasteful dense computation. In this work, we present TurboMoE, a novel approach to accelerate MoE model training by optimizing the gating logic through smart kernel fusion and data-layout transformations. Our method addresses the computational bottlenecks of the gating process by introducing three specialized kernels. The first kernel efficiently computes expert scores and performs top-k expert selection, while the second kernel scatters input tokens into expert-specific buffers, minimizing the need for sparse operations. Furthermore, we introduce a third MoE-Gather kernel, which replaces the traditional sparse matrix multiplication and streamlines the combination of expert outputs. By integrating these kernels, TurboMoE achieves substantial end-to-end speedups over the state-of-the-art solution, MegaBlocks, with 55% faster training for top-1 selection and a 41% improvement for top-2 selection configurations. These optimizations reduce the computational overhead of the gating functionality from O(S×E×C×M) to O(S×M). TurboMoE demonstrates that by removing the reliance on sparse computation, MoE models can achieve unprecedented training efficiency, reaching 460 TFLOPS on 32 NVIDIA H100 GPUs for a 32-expert MoE architecture with a top-2 gating configuration, paving the way for more scalable and effective applications.
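The sketch below is an illustrative, unfused PyTorch reference for the three-stage gating pipeline the abstract describes (score + top-k selection, scatter into per-expert buffers, gather/combine). It is not the TurboMoE implementation, which fuses these stages into custom kernels; the shapes (S tokens, E experts, C capacity, M model dimension) follow the abstract, while the function names, capacity heuristic, and over-capacity token dropping are hypothetical choices made only for this example.

```python
# Minimal sketch of dense top-k MoE gating with scatter/gather, assuming
# per-expert capacity C and simple token dropping when a buffer is full.
import torch

def moe_gating_reference(x, gate_w, expert_ffns, k=2):
    """x: (S, M) tokens; gate_w: (M, E) gating weights; expert_ffns: list of E modules."""
    S, M = x.shape
    E = gate_w.shape[1]
    C = (k * S + E - 1) // E                              # hypothetical capacity heuristic

    # Stage 1 (conceptually, TurboMoE's first kernel): expert scores and top-k selection.
    scores = torch.softmax(x @ gate_w, dim=-1)            # (S, E)
    topk_scores, topk_experts = scores.topk(k, dim=-1)    # (S, k)

    # Stage 2 (conceptually, the second kernel): scatter tokens into per-expert buffers.
    buffers = x.new_zeros(E, C, M)
    slot_of = torch.full((S, k), -1, dtype=torch.long)
    fill = [0] * E
    for s in range(S):
        for j in range(k):
            e = topk_experts[s, j].item()
            if fill[e] < C:                               # drop tokens over capacity
                buffers[e, fill[e]] = x[s]
                slot_of[s, j] = fill[e]
                fill[e] += 1

    # Dense expert computation on the packed buffers.
    expert_out = torch.stack([expert_ffns[e](buffers[e]) for e in range(E)])  # (E, C, M)

    # Stage 3 (conceptually, the MoE-Gather kernel): gather expert outputs and
    # combine them weighted by gate scores, instead of a sparse matmul.
    y = x.new_zeros(S, M)
    for s in range(S):
        for j in range(k):
            e, slot = topk_experts[s, j].item(), slot_of[s, j].item()
            if slot >= 0:
                y[s] += topk_scores[s, j] * expert_out[e, slot]
    return y

# Example usage with hypothetical sizes (S=128, M=64, E=8, top-2):
# experts = [torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.ReLU(),
#                                torch.nn.Linear(256, 64)) for _ in range(8)]
# y = moe_gating_reference(torch.randn(128, 64), torch.randn(64, 8), experts, k=2)
```

The explicit Python loops make the scatter and gather steps easy to read but are exactly the overhead that kernel fusion and data-layout transformation aim to eliminate: once tokens are packed densely per expert, the expert computation itself is a plain batched matmul with no sparse bookkeeping.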
