In this paper, a progressive low rank decomposition (LRD) method is used to compress large-scale pre-trained transformer-based language models. To this end, each fully-connected layer of the transformer modules is decomposed into two consecutive smaller ones using a progressive Singular Value Decomposition technique. In contrast to many state-of-the-art compression methods, where intensive pre-training of the compressed model is necessary, progressive LRD can provide promising performance by compressing the model during the fine-tuning stage. Furthermore, current state-of-the-art compression techniques usually face a limitation in their compression ratio, as the accuracy gap becomes significant at compression ratios higher than 2×. We show that in the later steps of the iterative compression, where the decomposed model becomes much smaller than the original (compression factors larger than 8×), Knowledge Distillation can also be used to improve performance.
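To illustrate the basic building block described above, the following is a minimal PyTorch sketch of factorizing one fully-connected layer into two consecutive smaller layers via truncated SVD. It is an assumption-based illustration, not the paper's exact recipe: the function name `decompose_linear`, the rank value, and the rank schedule are hypothetical, and the progressive/iterative aspect and Knowledge Distillation are not shown.

```python
import torch
import torch.nn as nn

def decompose_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Factorize a fully-connected layer into two smaller consecutive layers
    using truncated SVD (illustrative sketch, not the paper's exact method)."""
    W = layer.weight.data                      # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Keep only the top-`rank` singular values/vectors.
    U_r = U[:, :rank]                          # (out_features, rank)
    S_r = S[:rank]                             # (rank,)
    Vh_r = Vh[:rank, :]                        # (rank, in_features)

    # First layer maps input -> rank, second maps rank -> output,
    # so second(first(x)) approximates the original layer(x).
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = torch.diag(S_r) @ Vh_r # (rank, in_features)
    second.weight.data = U_r                   # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()

    return nn.Sequential(first, second)

# Example (hypothetical sizes): a 768x3072 feed-forward projection at rank 96
# drops from ~2.36M to ~0.37M weights, roughly a 6x reduction for that layer.
fc = nn.Linear(768, 3072)
compressed = decompose_linear(fc, rank=96)
```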
Author Information
Habib Hajimolahoseini (Huawei Toronto Research Centre)
Mehdi Rezagholizadeh (Huawei Technologies)
Vahid Partovi Nia (Huawei Noah's Ark Lab)
Marzieh Tahaei (Huawei Noah's Ark Lab)
Omar Mohamed Awad (Huawei Technologies)
Yang Liu (Huawei Canada)
More from the Same Authors
- 2021 : A Short Study on Compressing Decoder-Based Language Models »
  Tianda Li · Yassir El Mesbahi · Ivan Kobyzev · Ahmad Rashid · Atif Mahmud · Nithin Anchuri · Habib Hajimolahoseini · Yang Liu · Mehdi Rezagholizadeh
- 2021 : Kronecker Decomposition for GPT Compression »
  Ali Edalati · Marzieh Tahaei · Ahmad Rashid · Vahid Partovi Nia · James J. Clark · Mehdi Rezagholizadeh
- 2022 : Strategies for Applying Low Rank Decomposition to Transformer-Based Models »
  Habib Hajimolahoseini · Walid Ahmed · Mehdi Rezagholizadeh · Vahid Partovi Nia · Yang Liu
- 2022 : DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low Rank Adaptation »
  Mojtaba Valipour · Mehdi Rezagholizadeh · Ivan Kobyzev · Ali Ghodsi
- 2022 : Improved Knowledge Distillation by Utilizing Backward Pass Knowledge in Neural Networks »
  Aref Jafari · Mehdi Rezagholizadeh · Ali Ghodsi
- 2022 : Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement »
  Heitor Guimarães · Arthur Pimentel · Anderson R. Avila · Mehdi Rezagholizadeh · Tiago H Falk
- 2022 : Attribute Controlled Dialogue Prompting »
  Runcheng Liu · Ahmad Rashid · Ivan Kobyzev · Mehdi Rezagholizadeh · Pascal Poupart
- 2022 Poster: Is Integer Arithmetic Enough for Deep Learning Training? »
  Alireza Ghaffari · Marzieh S. Tahaei · Mohammadreza Tayaranian · Masoud Asgharian · Vahid Partovi Nia
- 2021 Workshop: Efficient Natural Language and Speech Processing (Models, Training, and Inference) »
  Mehdi Rezagholizadeh · Lili Mou · Yue Dong · Pascal Poupart · Ali Ghodsi · Qun Liu
- 2021 Poster: Demystifying and Generalizing BinaryConnect »
  Tim Dockhorn · Yaoliang Yu · Eyyüb Sari · Mahdi Zolnouri · Vahid Partovi Nia
- 2021 Poster: S$^3$: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks »
  Xinlin Li · Bang Liu · Yaoliang Yu · Wulong Liu · Chunjing XU · Vahid Partovi Nia