

# FlashMoE: Fast Distributed MoE in a Single Kernel

Osayamen Jonathan Aimuyo\*, Byungsoo Oh, Rachee Singh

**November 6, 2025** 



# Existing Distributed MoE implementations leave significant performance on the table!

**Claim** 

# Challenge 1: GPUs sit idle up to 90% of the time on average!

Forward Pass | E = 64 | k = 2 | 2 A100s | ↑ is better















# Non-fused CPU-Driven Flow

Network



# Kernel Fusion To Tackle DMoE inefficiencies

Network

GPU-Driven Flow



### **Kernel fusion:**

- eliminates kernel launch overheads.
- unlocks fine-grained overlap of communication and computation.
- exploits data locality by eliding unnecessary HBM round trips.
- enables low-latency, high-bandwidth GPU-initiated communication.
- expands the design space for communication and compute optimizations.
- but offloads task dependency management to the GPU, which makes implementation very difficult.



Key Challenge: How do we implement lightweight task management for a completely fused DMoE kernel?

# FlashMoE



# The Novelty of FlashMoE

# All at blazing speed!

- First to fuse all DMoE communication and computation into a single kernel
- First in-kernel, actor-style OS with work-conserving scheduling
- Formalizes a task abstraction for tile-level parallelism
- Introduces a provably correct, non-blocking layout for inter-GPU PGAS
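To make "work-conserving scheduling" concrete, here is a minimal host-side simulation in Python. It is only a sketch of the general idea (no processor stays idle while a task is ready), not FlashMoE's in-kernel scheduler. The function name `work_conserving_schedule`, the task costs, and the processor count are all hypothetical.

```python
from collections import deque

def work_conserving_schedule(task_costs, num_processors):
    """Simulate a work-conserving scheduler: a ready task is always handed
    to the processor that becomes idle first, so no processor waits while
    work is available.

    task_costs: durations of ready tile-level tasks (hypothetical time units).
    Returns (makespan, assignment), where assignment maps task id -> processor.
    """
    ready = deque(enumerate(task_costs))  # ready queue of (task_id, cost)
    free_at = [0] * num_processors        # time at which each processor goes idle
    assignment = {}
    while ready:
        task_id, cost = ready.popleft()
        # Earliest-idle processor takes the next ready task (ties: lowest index).
        proc = min(range(num_processors), key=lambda p: free_at[p])
        assignment[task_id] = proc
        free_at[proc] += cost
    return max(free_at), assignment

makespan, assignment = work_conserving_schedule([3, 1, 2, 2], num_processors=2)
```

In this toy run all tasks are ready at time zero, so the greedy choice of the earliest-idle processor is enough to keep both processors busy.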



### Algorithm 1: Flash Distributed MoE Fused Kernel

```
Input: A, O ∈ ℝ^{S×H}, E ∈ ℝ^{L×H×P}, N
begin
    T, G_φ ← FusedGate(A)
    if blockId + 1 < N then
        Dispatch(T, A)
        processor::start()
    else
        if warpID == 0 then
            scheduler::start()
        else
            subscriber::start(E, O)
        end if
    end if
end
```
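The branch structure of Algorithm 1 can be mirrored on the host to see which role each warp takes. The Python sketch below assumes that `N` is the number of thread blocks in the grid, which the slide does not state. The helper names `assign_role` and `role_table` are hypothetical, and the returned strings only label the actors named in the algorithm.

```python
def assign_role(block_id, warp_id, num_blocks):
    """Mirror the branches of Algorithm 1 for a single warp.

    Assumption: num_blocks plays the role of N in the algorithm.
    """
    if block_id + 1 < num_blocks:
        return "processor"   # block calls Dispatch(T, A), then processor::start()
    if warp_id == 0:
        return "scheduler"   # warp 0 of the last block runs scheduler::start()
    return "subscriber"      # remaining warps of the last block run subscriber::start(E, O)

def role_table(num_blocks, warps_per_block):
    """Role of every (block, warp) pair in a hypothetical launch configuration."""
    return [[assign_role(b, w, num_blocks) for w in range(warps_per_block)]
            for b in range(num_blocks)]

roles = role_table(num_blocks=4, warps_per_block=4)
```

Under this reading, every block except the last becomes a processor, and the last block is split between one scheduler warp and several subscriber warps.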

# Task Abstraction



# Symmetric Tensor Layout

# Non-blocking indexing



|       | $R_0$ |       | $R_1$ |       |
|-------|-------|-------|-------|-------|
|       | $B_0$ | $B_1$ | $B_0$ | $B_1$ |
| $P_0$ | $E_0$ | $E_0$ | $E_0$ | $E_0$ |
|       | $E_1$ | $E_1$ | $E_1$ | $E_1$ |
| $P_1$ | $E_2$ | $E_0$ | $E_0$ | $E_2$ |
|       | $E_3$ | $E_1$ | $E_1$ | $E_3$ |

|       | $R_0$ |       | $R_1$ |       |
|-------|-------|-------|-------|-------|
|       | $B_0$ | $B_1$ | $B_0$ | $B_1$ |
| $P_0$ | $E_0$ | $E_2$ | $E_2$ | $E_0$ |
|       | $E_1$ | $E_3$ | $E_3$ | $E_1$ |
| $P_1$ | $E_2$ | $E_2$ | $E_2$ | $E_2$ |
|       | $E_3$ | $E_3$ | $E_3$ | $E_3$ |

**Theorem 1.1.** L is write-write conflict-free.
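The slide states Theorem 1.1 without proof. The Python sketch below only illustrates one way a layout can be write-write conflict-free: each writer owns a disjoint slice indexed by its own rank. It assumes that `P` indexes the source peer, that `R`, `B`, and `E` are the remaining dimensions shown in the tables, and that `C` is a per-expert slot count added for illustration. The row-major ordering and the names `flat_offset` and `writes_from` are hypothetical.

```python
from itertools import product

def flat_offset(p, r, b, e, c, dims):
    """Row-major offset of cell (p, r, b, e, c) in a layout of shape dims."""
    P, R, B, E, C = dims
    assert 0 <= p < P and 0 <= r < R and 0 <= b < B and 0 <= e < E and 0 <= c < C
    return (((p * R + r) * B + b) * E + e) * C + c

def writes_from(src_peer, dims):
    """Offsets a source peer may write on a remote peer.

    Assumption: a writer only ever touches the P-slice matching its own rank.
    """
    _, R, B, E, C = dims
    return {flat_offset(src_peer, r, b, e, c, dims)
            for r, b, e, c in product(range(R), range(B), range(E), range(C))}

dims = (2, 2, 2, 2, 4)  # hypothetical sizes for P, R, B, E, C
conflicts = writes_from(0, dims) & writes_from(1, dims)
```

Because the offset is injective in `p`, the write sets of two different source peers never intersect, so no synchronization between writers is needed in this toy model.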

# Evaluation

# **Experimental Setup**

- 4 baselines: COMET, Megatron-LM (CUTLASS and TE), FasterMoE
- Precision: FlashMoE runs in FP32; baselines run in FP16.
- Testbed: a RunPod VM with 8 NVLink-connected H100 GPUs.
- All results are averaged across 32 runs, preceded by 32 warmup runs.
- Forward pass only.
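The measurement protocol above (32 warmups, then the average of 32 timed runs) can be sketched as a small harness. This is a generic Python illustration, not the authors' benchmarking code. The `benchmark` helper is hypothetical, and it uses wall-clock CPU timing as a stand-in for GPU timing.

```python
import time

def benchmark(fn, warmups=32, runs=32):
    """Mean wall-clock latency of fn() in milliseconds.

    The first `warmups` calls are untimed; the next `runs` calls are timed
    and averaged.
    """
    for _ in range(warmups):
        fn()
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return 1e3 * total / runs

calls = []
mean_ms = benchmark(lambda: calls.append(1))  # trivial stand-in workload
```

For a real GPU workload, the timed callable would also need to wait for the device to finish before the clock is read; that step is omitted here.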

# What we Evaluate



- E2E latency
- Expert scalability
- Communication efficiency

Table 1: Implementation metrics of *FlashMoE* using inlined NVSHMEM 3.2.5 on SM 80

| Metric                          | Value          |
|---------------------------------|----------------|
| Total lines of code (CUDA/C++)  | 6820           |
| Kernel stack frame size         | 0 B            |
| Spill stores (per thread)       | 0              |
| Spill loads (per thread)        | 0              |
| Shared memory usage (per block) | 46 KB          |
| Registers per thread            | 255            |
| Max active blocks per SM        | 2              |
| Compilation time                | 53 seconds     |
| Binary size                     | 29 MB          |

### FlashMoE eliminates DMoE launch overheads!

| Works                        | Launched GPU Ops |
|------------------------------|------------------|
| FlashMoE (ours)              | 1                |
| COMET                        | 33               |
| Megatron-LM CUTLASS          | 85               |
| Megatron-LM TE               | 261              |
| Megatron-LM + DeepEP         | 432              |
| DeepSpeedMoE                 | 550              |

Table 2: **Kernel Fusion Comparison.** We report the number of launched GPU operations, obtained from detailed profiling with Nsight Systems. Operations were counted over an MoE forward pass across 2 GPUs with 64 total experts.
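To see how launch counts translate into overhead, the Python sketch below applies a simple linear model to the numbers in Table 2. It assumes every launched operation pays a fixed, serialized launch cost. The value of `PER_LAUNCH_US` is a hypothetical placeholder rather than a measurement, and the helper names are illustrative.

```python
# Launched GPU ops per MoE forward pass, copied from Table 2.
LAUNCHED_OPS = {
    "FlashMoE": 1,
    "COMET": 33,
    "Megatron-LM CUTLASS": 85,
    "Megatron-LM TE": 261,
    "Megatron-LM + DeepEP": 432,
    "DeepSpeedMoE": 550,
}

def launch_overhead_us(num_ops, per_launch_us):
    """Total launch overhead if each op pays a fixed, serialized launch cost."""
    return num_ops * per_launch_us

def extra_launches(name, baseline="FlashMoE"):
    """Launches that a system issues beyond the baseline."""
    return LAUNCHED_OPS[name] - LAUNCHED_OPS[baseline]

PER_LAUNCH_US = 5.0  # hypothetical placeholder, NOT a measured value
overheads = {name: launch_overhead_us(n, PER_LAUNCH_US)
             for name, n in LAUNCHED_OPS.items()}
```

Under this model the overhead grows linearly with the number of launches, which is why a single fused kernel removes almost all of it regardless of the actual per-launch cost.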

# FlashMoE achieves 9x higher GPU Utilization!



# FlashMoE is 4.8x faster on 4 GPUs



### FlashMoE is 6x faster on 8 GPUs!



### FlashMoE has uniform latency as experts increase!




# FlashMoE gives > 89% Communication Efficiency, 4x higher than baselines!



# Conclusion

### **Complete DMoE Kernel Fusion**

### FlashMoE gives:

- 9x higher GPU utilization
- 6x faster E2E latency
- Uniform latency as the number of experts increases
- 4x better communication efficiency



Figure panels: GPU SM Utilization; Scaling Tokens (4 GPUs); Scaling Experts (4 GPUs); Scaling GPUs; Scaling Tokens (8 GPUs); Scaling Experts (8 GPUs).