StructMoE: Structured Mixture of Experts Using Low Rank Experts
Zain Sarwar · Ashwinee Panda · Benjamin Thérien · Stephen Rawls · Anirban Das · Kartik Balasubramaniam · Berkcan Kapusuzoglu · Shixiong Zhang · Sambit Sahu · Milind Naphade · Supriyo Chakraborty
Keywords: Efficient Architectures
Abstract
We introduce StructMoE, a method to scale MoE architectures by augmenting experts with dynamic capacity using structured matrices we call Low Rank Experts (LoRE). These LoREs are selected on a per-expert and per-token basis using a secondary router specific to each expert, and are entangled with the main expert's up-projection before the activation function. Empirically, we find that this approach outperforms a standard MoE baseline in terms of loss on a held-out validation set.
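To make the described structure concrete, below is a minimal PyTorch sketch of a single expert augmented with routed LoREs. It fills in details the abstract leaves open (number of LoREs, their rank, top-k selection, and a GELU activation), so all names and hyperparameters here (StructMoEExpert, num_lores, lore_rank, top_k) are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of one StructMoE-style expert: a secondary router picks
# low-rank experts (LoREs) per token, and their output is added to the
# main up-projection before the activation.  Names and defaults are
# assumptions; this is not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructMoEExpert(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_lores: int = 4,
                 lore_rank: int = 8, top_k: int = 1):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        # Secondary router, specific to this expert, scores the LoREs per token.
        self.lore_router = nn.Linear(d_model, num_lores, bias=False)
        # Each LoRE is a rank-`lore_rank` factorization mapping d_model -> d_ff.
        self.lore_down = nn.Parameter(torch.randn(num_lores, d_model, lore_rank) * 0.02)
        self.lore_up = nn.Parameter(torch.randn(num_lores, lore_rank, d_ff) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- tokens already routed to this expert.
        scores = F.softmax(self.lore_router(x), dim=-1)            # (T, num_lores)
        weights, idx = scores.topk(self.top_k, dim=-1)             # (T, k)

        # Low-rank contribution of the selected LoREs, weighted by the router.
        a = self.lore_down[idx]                                    # (T, k, d_model, r)
        b = self.lore_up[idx]                                      # (T, k, r, d_ff)
        h = torch.einsum('td,tkdr->tkr', x, a)                     # (T, k, r)
        lore_out = torch.einsum('tkr,tkrf->tkf', h, b)             # (T, k, d_ff)
        lore_out = (weights.unsqueeze(-1) * lore_out).sum(dim=1)   # (T, d_ff)

        # Entangle with the main up-projection before the activation function.
        hidden = F.gelu(self.up_proj(x) + lore_out)
        return self.down_proj(hidden)


if __name__ == "__main__":
    expert = StructMoEExpert(d_model=512, d_ff=2048)
    out = expert(torch.randn(16, 512))   # 16 tokens routed to this expert
    print(out.shape)                     # torch.Size([16, 512])
```

Because the LoRE output is added inside the up-projection rather than to the expert's final output, the low-rank paths interact with the main expert weights through the nonlinearity, which is one way to read the abstract's "entangled" phrasing.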