Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
Dongkuan (DK) Xu · Subhabrata Mukherjee · Xiaodong Liu · Debadeepta Dey · Wenhui Wang · Xiang Zhang · Ahmed Awadallah · Jianfeng Gao

Thu Dec 01 09:00 AM -- 11:00 AM (PST) @ Hall J #231

Traditional knowledge distillation (KD) methods manually design student architectures to compress large models given a pre-specified computational cost. This requires several trials to find viable students, and the process must be repeated whenever the computational budget changes. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model. Existing NAS methods train a single SuperLM consisting of millions of subnetworks with weight-sharing, resulting in interference between subnetworks of different sizes. Additionally, many of these works are task-specific, requiring task labels for SuperLM training. Our framework AutoDistil addresses the above challenges with the following steps: (a) Incorporates inductive bias and heuristics to partition the Transformer search space into K compact sub-spaces (e.g., K=3 can generate typical student sizes of base, small and tiny); (b) Trains one SuperLM for each sub-space using a task-agnostic objective (e.g., self-attention distillation) with weight-sharing of students; (c) Performs a lightweight search for the optimal student without re-training. Task-agnostic training and search allow students to be reused for fine-tuning on any downstream task. Experiments on the GLUE benchmark demonstrate AutoDistil to outperform state-of-the-art KD and NAS methods with up to 3x reduction in computational cost and negligible loss in task performance. Code and model checkpoints are available at https://github.com/microsoft/autodistil.
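Step (a) above can be illustrated with a minimal sketch: partition a Transformer architecture search space into K = 3 compact sub-spaces and sample weight-shared student configurations from each. All dimension ranges, names, and the sampling helper below are illustrative assumptions for exposition, not the paper's actual search-space values.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SubSpace:
    """One compact region of the Transformer search space (illustrative)."""
    name: str
    layers: range   # candidate numbers of Transformer layers
    hidden: range   # candidate hidden sizes (step gives the granularity)
    heads: range    # candidate numbers of attention heads

# K = 3 sub-spaces loosely matching "base", "small", and "tiny" student
# sizes; the concrete ranges here are assumed, not taken from the paper.
SUB_SPACES = [
    SubSpace("base",  range(9, 13), range(512, 769, 64), range(8, 13, 2)),
    SubSpace("small", range(5, 9),  range(384, 577, 64), range(6, 9, 2)),
    SubSpace("tiny",  range(2, 5),  range(128, 385, 64), range(2, 7, 2)),
]

def sample_student(space: SubSpace, rng: random.Random) -> dict:
    """Sample one student configuration from a sub-space.

    In AutoDistil, all students sampled from the same sub-space share
    weights inside that sub-space's SuperLM; here we only sample configs.
    """
    return {
        "sub_space": space.name,
        "layers": rng.choice(space.layers),
        "hidden": rng.choice(space.hidden),
        "heads": rng.choice(space.heads),
    }

rng = random.Random(0)
students = [sample_student(s, rng) for s in SUB_SPACES]
for cfg in students:
    print(cfg)
```

Keeping each sub-space compact means subnetworks trained under the same SuperLM are close in size, which is how the paper mitigates interference between very differently sized students.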

Author Information

Dongkuan (DK) Xu (North Carolina State University)
Dongkuan (DK) Xu

Dongkuan (DK) Xu is an Assistant Professor in the CS Department at North Carolina State University and leads the NCSU Efficient & Intelligent Computing Lab. His research interest is resource-efficient deep learning for AI at scale, investigating how to improve the efficiency of deep learning systems to achieve Pareto optimality between computing resources and model performance. DK's research has been published multiple times in top conferences and journals in AI, NLP, and other fields. He serves as the Column Editor for the ACM SIGAI Newsletter, and will chair The First Workshop on DL-Hardware Co-Design for AI Acceleration at AAAI 2023. He served as session chair for New Deep Learning Architectures and for Scalable & Trustable AI at KDD 2022, and has served as a (senior) PC member and regular reviewer for over 30 major conferences and 15 journals. In addition, he has launched the Machine Learning Algorithms & Natural Language Processing community, which has over 500,000 followers worldwide. DK also has extensive research experience in industry: he has interned at Microsoft Research Redmond, Moffett AI, and NEC Labs America, and holds 8 US patents/applications. DK's long-term research goal is to democratize AI to serve a broader range of populations and domains.

Subhabrata Mukherjee (Microsoft)
Xiaodong Liu (Microsoft)
Debadeepta Dey (Microsoft Research)

I am a researcher in the Adaptive Systems and Interaction (ASI) group led by Dr. Eric Horvitz at Microsoft Research, Redmond, USA. I finished my PhD at the Robotics Institute, Carnegie Mellon University, USA, where I was advised by Prof. J. Andrew (Drew) Bagnell. I do fundamental as well as applied research in machine learning, control, and computer vision, with applications to autonomous agents in general and robotics in particular. My interests include decision-making under uncertainty, reinforcement learning, artificial intelligence, and machine learning. As of January 2019 I am also serving as Affiliate Assistant Professor at the School of Computer Science and Engineering, University of Washington, Seattle, USA. I regularly review for NeurIPS, ICLR, ICML, ICRA, IROS, IJRR, and JFR, and on occasion for CVPR, ECCV, ICCV, and Autonomous Robots.

Wenhui Wang (Microsoft Research)
Xiang Zhang (The Pennsylvania State University)
Ahmed Awadallah (Microsoft)
Jianfeng Gao (Microsoft Research, Redmond, WA)