Knowledge distillation is an effective approach for learning compact models (students) under the supervision of large, strong models (teachers). Since the performance of the student is empirically strongly correlated with that of the teacher, it is commonly believed that a high-performing teacher is preferable. Consequently, practitioners tend to use a well-trained network, or an ensemble of such networks, as the teacher. In this paper, we make the intriguing observation that an intermediate model, i.e., a checkpoint from the middle of the training procedure, often serves as a better teacher than the fully converged model, even though the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained and fully converged models when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information with the input, and thus contain more "dark knowledge" for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability.
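To make the setup concrete, the following is a minimal PyTorch sketch of distilling a student from a teacher restored at an intermediate checkpoint, using the standard temperature-scaled soft-target loss. The toy architectures, the checkpoint path, and the hyperparameters (T, alpha) are illustrative assumptions, not the paper's configuration, and the paper's mutual-information-based checkpoint selection algorithm is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard (Hinton-style) KD loss: KL divergence between
    temperature-softened teacher and student distributions, plus
    the usual cross-entropy on the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy stand-ins for the teacher and student architectures.
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(32, 10))

# Key idea from the abstract: load an *intermediate* checkpoint as the
# teacher rather than the fully converged weights. The path below is a
# hypothetical placeholder, so the line is left commented out.
# teacher.load_state_dict(torch.load("checkpoints/epoch_120.pt"))
teacher.eval()

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
x = torch.randn(8, 32)            # dummy input batch
y = torch.randint(0, 10, (8,))    # dummy labels

with torch.no_grad():             # teacher provides fixed soft targets
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

A snapshot-ensemble teacher, as discussed in the abstract, would simply average the softened logits of several checkpoints from the same training run before computing the same loss.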
Author Information
Chaofei Wang (Tsinghua University)
Qisen Yang (Department of Automation, Tsinghua University)
Rui Huang (Tsinghua University)
Shiji Song (Department of Automation, Tsinghua University)
Gao Huang (Cornell University)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Efficient Knowledge Distillation from Model Checkpoints »
More from the Same Authors
- 2022 Poster: Contrastive Language-Image Pre-Training with Knowledge Graphs »
  Xuran Pan · Tianzhu Ye · Dongchen Han · Shiji Song · Gao Huang
- 2022 Poster: Provable General Function Class Representation Learning in Multitask Bandits and MDP »
  Rui Lu · Andrew Zhao · Simon Du · Gao Huang
- 2022 Poster: A Mixture Of Surprises for Unsupervised Reinforcement Learning »
  Andrew Zhao · Matthieu Lin · Yangguang Li · Yong-jin Liu · Gao Huang
- 2022: Boosting Offline Reinforcement Learning via Data Resampling »
  Yang Yue · Bingyi Kang · Xiao Ma · Zhongwen Xu · Gao Huang · Shuicheng Yan
- 2023: Facilitating Battery Swapping Services for Freight Trucks with Spatial-Temporal Demand Prediction »
  Linyu Liu · Zhen Dai · Shiji Song · Xiaocheng Li · Guanting Chen
- 2023 Poster: Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL »
  Yang Yue · Rui Lu · Bingyi Kang · Shiji Song · Gao Huang
- 2023 Poster: Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning »
  Shenzhi Wang · Qisen Yang · Jiawei Gao · Matthieu Lin · HAO CHEN · Liwei Wu · Ning Jia · Shiji Song · Gao Huang
- 2022 Spotlight: Lightning Talks 4A-4 »
  Yunhao Tang · LING LIANG · Thomas Chau · Daeha Kim · Junbiao Cui · Rui Lu · Lei Song · Byung Cheol Song · Andrew Zhao · Remi Munos · Łukasz Dudziak · Jiye Liang · Ke Xue · Kaidi Xu · Mark Rowland · Hongkai Wen · Xing Hu · Xiaobin Huang · Simon Du · Nicholas Lane · Chao Qian · Lei Deng · Bernardo Avila Pires · Gao Huang · Will Dabney · Mohamed Abdelfattah · Yuan Xie · Marc Bellemare
- 2022 Spotlight: Provable General Function Class Representation Learning in Multitask Bandits and MDP »
  Rui Lu · Andrew Zhao · Simon Du · Gao Huang
- 2022 Spotlight: Lightning Talks 1B-3 »
  Chaofei Wang · Qixun Wang · Jing Xu · Long-Kai Huang · Xi Weng · Fei Ye · Harsh Rangwani · Shrinivas Ramasubramanian · Yifei Wang · Qisen Yang · Xu Luo · Lei Huang · Adrian G. Bors · Ying Wei · Xinglin Pan · Sho Takemori · Hong Zhu · Rui Huang · Lei Zhao · Yisen Wang · Kato Takashi · Shiji Song · Yanan Li · Rao Anwer · Yuhei Umeda · Salman Khan · Gao Huang · Wenjie Pei · Fahad Shahbaz Khan · Venkatesh Babu R · Zenglin Xu
- 2022 Poster: Latency-aware Spatial-wise Dynamic Networks »
  Yizeng Han · Zhihang Yuan · Yifan Pu · Chenhao Xue · Shiji Song · Guangyu Sun · Gao Huang
- 2021 Poster: Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition »
  Yulin Wang · Rui Huang · Shiji Song · Zeyi Huang · Gao Huang
- 2020 Poster: Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification »
  Yulin Wang · Kangchen Lv · Rui Huang · Shiji Song · Le Yang · Gao Huang
- 2019 Poster: Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning »
  Wenjie Shi · Shiji Song · Hui Wu · Ya-Chu Hsu · Cheng Wu · Gao Huang
- 2019 Poster: Implicit Semantic Data Augmentation for Deep Networks »
  Yulin Wang · Xuran Pan · Shiji Song · Hong Zhang · Gao Huang · Cheng Wu
- 2016 Poster: Supervised Word Mover's Distance »
  Gao Huang · Chuan Guo · Matt J Kusner · Yu Sun · Fei Sha · Kilian Weinberger
- 2016 Oral: Supervised Word Mover's Distance »
  Gao Huang · Chuan Guo · Matt J Kusner · Yu Sun · Fei Sha · Kilian Weinberger