The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to 128 tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over 22% improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
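To make the non-smoothness issue concrete, here is a minimal sketch (not the paper's DSelect-k method, and not its released implementation) of a standard Top-k softmax gate for MoE. The hard top-k selection step is piecewise constant in the gating logits, so the resulting gate is not continuously differentiable, which is the problem the abstract describes.

```python
# Illustrative sketch of a standard Top-k MoE gate (assumed example,
# not the DSelect-k gate): softmax over expert logits, followed by a
# hard selection of the k largest weights. The argsort-based selection
# is discontinuous in `logits`, so gradients do not flow through which
# experts get chosen -- only through the weights of the chosen ones.
import numpy as np

def top_k_gate(logits, k):
    """Return sparse expert weights: softmax, keep top-k, renormalize."""
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    mask = np.zeros_like(weights)
    mask[np.argsort(weights)[-k:]] = 1.0      # hard, non-smooth selection
    gated = weights * mask
    return gated / gated.sum()                # renormalize over selected experts

logits = np.array([2.0, 1.0, 0.5, -1.0])
gate = top_k_gate(logits, k=2)
# Exactly two experts receive nonzero weight; the rest are zeroed out.
```

A tiny perturbation of the logits near a tie can swap which experts are selected, changing the output discontinuously — the kind of behavior that a smooth gate like DSelect-k is designed to avoid.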
Author Information
Hussein Hazimeh (Google)
Zhe Zhao (Google)
Aakanksha Chowdhery (Google)
Maheswaran Sathiamoorthy (Google AI)
Yihua Chen (Google)
Rahul Mazumder (MIT)
Lichan Hong (Google)
Ed Chi (Google Inc.)
Ed H. Chi is a Principal Scientist at Google, leading several machine learning research teams focusing on neural modeling, inclusive ML, reinforcement learning, and recommendation systems on the Google Brain team. He has delivered significant improvements for YouTube, News, Ads, and the Google Play Store at Google, with more than 325 product launches in the last 6 years. With 39 patents and over 120 research articles, he is also known for research on user behavior in web and social media. Prior to Google, he was the Area Manager and a Principal Scientist at Palo Alto Research Center's Augmented Social Cognition Group, where he led the team in understanding how social systems help groups of people remember, think, and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years at the University of Minnesota. Recognized as an ACM Distinguished Scientist and elected into the CHI Academy, he recently received a 20-year Test of Time award for research in information visualization. He has been featured and quoted in the press, including the Economist, Time Magazine, the LA Times, and the Associated Press. An avid swimmer, photographer, and snowboarder in his spare time, he also has a black belt in Taekwondo.
More from the Same Authors
- 2021 Spotlight: Efficiently Identifying Task Groupings for Multi-Task Learning »
  Chris Fifty · Ehsan Amid · Zhe Zhao · Tianhe Yu · Rohan Anil · Chelsea Finn
- 2021: Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance »
  Shibal Ibrahim · Natalia Ponomareva · Rahul Mazumder
- 2022 Poster: Transformer Memory as a Differentiable Search Index »
  Yi Tay · Vinh Tran · Mostafa Dehghani · Jianmo Ni · Dara Bahri · Harsh Mehta · Zhen Qin · Kai Hui · Zhe Zhao · Jai Gupta · Tal Schuster · William Cohen · Donald Metzler
- 2022: Network Pruning at Scale: A Discrete Optimization Approach »
  Wenyu Chen · Riade Benbaki · Xiang Meng · Rahul Mazumder
- 2022: A Light-speed Linear Program Solver for Personalized Recommendation with Diversity Constraints »
  Miao Cheng · Haoyue Wang · Aman Gupta · Rahul Mazumder · Sathiya Selvaraj · Kinjal Basu
- 2022: Improved Deep Neural Network Generalization Using m-Sharpness-Aware Minimization »
  Kayhan Behdin · Qingquan Song · Aman Gupta · Sathiya Selvaraj · David Durfee · Ayan Acharya · Rahul Mazumder
- 2022: Towards Companion Recommendation Systems »
  Konstantina Christakopoulou · Yuyan Wang · Ed Chi · Minmin Chen
- 2022 Spotlight: Improving Multi-Task Generalization via Regularizing Spurious Correlation »
  Ziniu Hu · Zhe Zhao · Xinyang Yi · Tiansheng Yao · Lichan Hong · Yizhou Sun · Ed Chi
- 2022: Invited Talk by Aakanksha Chowdhery »
  Aakanksha Chowdhery
- 2022 Poster: Improving Multi-Task Generalization via Regularizing Spurious Correlation »
  Ziniu Hu · Zhe Zhao · Xinyang Yi · Tiansheng Yao · Lichan Hong · Yizhou Sun · Ed Chi
- 2022 Poster: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models »
  Jason Wei · Xuezhi Wang · Dale Schuurmans · Maarten Bosma · Brian Ichter · Fei Xia · Ed Chi · Quoc V. Le · Denny Zhou
- 2022 Poster: Pushing the limits of fairness impossibility: Who's the fairest of them all? »
  Brian Hsu · Rahul Mazumder · Preetam Nandy · Kinjal Basu
- 2021 Poster: Sparse is Enough in Scaling Transformers »
  Sebastian Jaszczur · Aakanksha Chowdhery · Afroz Mohiuddin · Lukasz Kaiser · Wojciech Gajewski · Henryk Michalewski · Jonni Kanerva
- 2021 Poster: Efficiently Identifying Task Groupings for Multi-Task Learning »
  Chris Fifty · Ehsan Amid · Zhe Zhao · Tianhe Yu · Rohan Anil · Chelsea Finn
- 2021 Poster: Improving Calibration through the Relationship with Adversarial Robustness »
  Yao Qin · Xuezhi Wang · Alex Beutel · Ed Chi
- 2020: Invited Speaker: Ed Chi »
  Ed Chi
- 2020 Poster: Fairness without Demographics through Adversarially Reweighted Learning »
  Preethi Lahoti · Alex Beutel · Jilin Chen · Kang Lee · Flavien Prost · Nithum Thain · Xuezhi Wang · Ed Chi
- 2018: Poster Session (All Posters) »
  Stephen Macke · Hongzi Mao · Caroline Lemieux · Saim Salman · Rishikesh Jha · Hanrui Wang · Shoumik P. Palkar · Tianqi Chen · Thomas Pumir · Vaishnav Janardhan · Adit Bhardwaj · Ed Chi
- 2017: Ed Chi (Google) on Learned Deep Retrieval for Recommenders »
  Ed Chi