Models with Conditional Computation Learn Suboptimal Solutions
Mohammed Muqeeth · Haokun Liu · Colin Raffel
Event URL: https://openreview.net/forum?id=s9wWlWOUVF9

Sparsely-activated neural networks with conditional computation learn to route their inputs through different subnetworks, providing a strong structural prior and reducing computational costs. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely-activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely-activated models with non-differentiable discrete routing decisions. To test this hypothesis, we evaluate the performance of sparsely-activated models trained with various gradient estimation techniques in three settings where a high-quality heuristic routing strategy can be designed. Our experiments reveal that learned routing reaches substantially worse solutions than heuristic routing in various settings. As a first step towards remedying this gap, we demonstrate that supervising the routing decision on a small fraction of the examples is sufficient to help the model to learn better routing strategies. Our results shed light on the difficulties of learning effective routing and set the stage for future work on conditional computation mechanisms and training techniques.

Author Information

Mohammed Muqeeth (University of North Carolina at Chapel Hill)
Mohammed Muqeeth

I am interested in applying machine learning to solve NLP tasks efficiently.

Haokun Liu (Department of Computer Science, University of North Carolina, Chapel Hill)
Colin Raffel (UNC Chapel Hill and Hugging Face)
