
Mixtape: Breaking the Softmax Bottleneck Efficiently
Zhilin Yang · Thang Luong · Russ Salakhutdinov · Quoc V Le

Wed Dec 11 05:00 PM -- 07:00 PM (PST) @ East Exhibition Hall B + C #107

The softmax bottleneck has been shown to limit the expressiveness of neural language models. Mixture of Softmaxes (MoS) is an effective approach to address this theoretical limitation, but it is expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently using three novel techniques: logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks spanning language modeling and machine translation, the Mixtape layer improves efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network at vocabulary sizes of 10K to 30K, and it outperforms softmax in perplexity and translation quality.
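
To make the core idea of logit space vector gating concrete, the sketch below shows one possible way such an output layer could look in PyTorch. It is only an illustrative reconstruction from the abstract, not the authors' implementation: all class, parameter, and dimension names (MixtapeSketch, n_components, d_gate, etc.) are hypothetical, and for simplicity the gate prior is computed with a softmax over components rather than the paper's sigmoid tree decomposition or gate sharing. The key contrast with MoS is that the K component logits are mixed per token before a single softmax over the vocabulary, instead of mixing K full softmax distributions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtapeSketch(nn.Module):
    # Minimal sketch of logit space vector gating (hypothetical, simplified).
    def __init__(self, d_model, vocab_size, n_components=4, d_gate=64):
        super().__init__()
        self.K = n_components
        # K component-specific projections of the context vector
        self.component_proj = nn.Linear(d_model, n_components * d_model)
        # token embeddings used as the output projection
        self.token_emb = nn.Parameter(torch.randn(vocab_size, d_model) * 0.02)
        # small bottleneck producing per-token, per-component gate priors
        self.gate_context = nn.Linear(d_model, n_components * d_gate)
        self.gate_token = nn.Parameter(torch.randn(vocab_size, d_gate) * 0.02)

    def forward(self, h):
        # h: (batch, d_model) context vectors
        B, D = h.shape
        K = self.K
        # component-specific context vectors: (B, K, D)
        h_k = torch.tanh(self.component_proj(h)).view(B, K, D)
        # component logits for every token: (B, K, V)
        logits_k = torch.einsum('bkd,vd->bkv', h_k, self.token_emb)
        # per-token gate priors over the K components: (B, K, V)
        # (the paper computes these more cheaply via a sigmoid tree
        #  and shares gates across rare tokens; a softmax is used here
        #  only to keep the sketch short)
        g = self.gate_context(h).view(B, K, -1)
        gate_logits = torch.einsum('bkg,vg->bkv', g, self.gate_token)
        pi = F.softmax(gate_logits, dim=1)
        # mix in logit space, then apply a single softmax over the vocabulary
        mixed_logits = (pi * logits_k).sum(dim=1)  # (B, V)
        return F.log_softmax(mixed_logits, dim=-1)

Because only one vocabulary-sized softmax is computed, the cost of the mixture is paid in the (much cheaper) gate bottleneck rather than in K full softmaxes, which is where the claimed efficiency gain over MoS would come from under these assumptions.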

Author Information

Zhilin Yang (Recurrent AI)
Thang Luong (Google Brain)
Russ Salakhutdinov (Carnegie Mellon University)
Quoc V Le (Google)
