A Short Study on Compressing Decoder-Based Language Models
Tianda Li · Yassir El Mesbahi · Ivan Kobyzev · Ahmad Rashid · Atif Mahmud · Nithin Anchuri · Habib Hajimolahoseini · Yang Liu · Mehdi Rezagholizadeh

Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks. State-of-the-art (SOTA) PLMs, however, are too large to be used on edge devices. As a result, the topic of model compression has attracted increasing attention in the NLP community. Most existing work focuses on compressing encoder-based models (TinyBERT, DistilBERT, DistilRoBERTa, etc.); to the best of our knowledge, however, the compression of decoder-based models (such as GPT-2) has not been investigated much. Our paper aims to fill this gap. Specifically, we explore two directions: 1) we employ current SOTA knowledge distillation techniques to improve the fine-tuning of DistilGPT-2; 2) we pre-train a compressed GPT-2 model using layer truncation and compare it against distillation-based methods. The training time of our compressed model is significantly less than that of DistilGPT-2, yet it can achieve better performance when fine-tuned on downstream tasks. We also demonstrate the impact of data cleaning on model performance.
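
As a rough illustration of what layer truncation involves, the sketch below builds a shallower "student" parameter set by keeping a subset of a deeper model's transformer blocks and renumbering them. It is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not say which layers are kept or how the truncated model is initialized, the function name `truncate_layers` is hypothetical, the `h.<index>.<rest>` key naming is assumed, and placeholder strings stand in for weight tensors.

```python
import re
from typing import Dict, Iterable, TypeVar

T = TypeVar("T")

# Assumed parameter naming: "h.<layer index>.<rest>" for transformer blocks.
_LAYER_KEY = re.compile(r"^h\.(\d+)\.(.+)$")


def truncate_layers(params: Dict[str, T], keep: Iterable[int]) -> Dict[str, T]:
    """Keep only the blocks listed in `keep`, renumbered 0..len(keep)-1."""
    new_index = {old: new for new, old in enumerate(sorted(set(keep)))}
    student: Dict[str, T] = {}
    for name, value in params.items():
        match = _LAYER_KEY.match(name)
        if match is None:
            # Not a transformer block (e.g. embeddings): copy unchanged.
            student[name] = value
            continue
        old = int(match.group(1))
        if old in new_index:
            # Retained block: store it under its new, contiguous index.
            student[f"h.{new_index[old]}.{match.group(2)}"] = value
    return student


# Toy 12-layer "teacher" with placeholder strings instead of weight tensors.
teacher = {"wte.weight": "emb", "ln_f.weight": "final_norm"}
for i in range(12):
    teacher[f"h.{i}.attn.weight"] = f"attn{i}"
    teacher[f"h.{i}.mlp.weight"] = f"mlp{i}"

# Assumption: keep the first 6 blocks; the source does not state which are kept.
student = truncate_layers(teacher, keep=range(6))
```

Passing a different `keep` list (for example every other layer) changes the selection strategy without touching the rest of the code; which choice works best is an empirical question that this sketch does not address.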

Author Information

Tianda Li (Noah's Ark Lab (Montreal))
Yassir El Mesbahi (Huawei)
Ivan Kobyzev (Huawei Noah's Ark Lab)
Ahmad Rashid (Huawei Technologies)
Atif Mahmud (Huawei Noah's Ark Lab)
Nithin Anchuri (Huawei Noah's Ark Lab)
Habib Hajimolahoseini (Huawei Toronto Research Centre)
Yang Liu (Huawei Canada)
Mehdi Rezagholizadeh (Huawei Technologies)
