Deep Compression of Pre-trained Transformer Models
Naigang Wang · Chi-Chun (Charlie) Liu · Swagath Venkataramani · Sanchari Sen · Chia-Yu Chen · Kaoutar El Maghraoui · Vijayalakshmi (Viji) Srinivasan · Leland Chang

Wed Nov 30 09:00 AM -- 11:00 AM (PST) @ Hall J #630

Pre-trained transformer models have achieved remarkable success in natural language processing (NLP) and have recently become competitive alternatives to Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) in vision and speech tasks, respectively. Owing to their excellent computational efficiency and scalability, transformer models can be trained on exceedingly large amounts of data; however, model sizes can grow tremendously. As high-performance, large-scale, pre-trained transformer models become available for users to download and fine-tune for customized downstream tasks, deploying these models becomes challenging because of the vast number of operations and the large memory footprint. To address this challenge, we introduce methods to deeply compress pre-trained transformer models across three major application domains: NLP, speech, and vision. Specifically, we quantize transformer backbones down to 4-bit and further achieve 50% fine-grained structural sparsity on pre-trained BERT, Wav2vec2.0, and Vision Transformer (ViT) models, reaching 16x compression while maintaining model accuracy. This is achieved by identifying the critical initialization for quantization/sparsity-aware fine-tuning, as well as novel techniques including quantizers with a zero-preserving format and scheduled dropout. These hardware-friendly techniques need to be applied only in the fine-tuning phase for downstream tasks; hence, they are especially suitable for acceleration and deployment of pre-trained transformer models.
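The 16x figure in the abstract can be read as the product of two factors: going from a 32-bit baseline to 4-bit weights (8x) and storing only half of the weights (2x). The sketch below illustrates that arithmetic together with toy versions of the two ingredients. It is not the authors' implementation: the 32-bit baseline, the 2-out-of-4 pruning pattern, the symmetric quantizer with levels in [-7, 7], and all function names are assumptions made for illustration, and the ratio ignores any metadata needed to store the sparsity pattern.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude entries in each full group of four.

    Assumption: "50% fine-grained structural sparsity" is modeled here as a
    2-out-of-4 pattern; the abstract does not specify the exact pattern.
    """
    pruned = list(weights)
    full = len(pruned) - len(pruned) % 4  # leave an incomplete tail group untouched
    for start in range(0, full, 4):
        group = range(start, start + 4)
        keep = sorted(group, key=lambda i: abs(pruned[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                pruned[i] = 0.0
    return pruned


def quantize_int4(weights):
    """Symmetric 4-bit quantization to integer levels in [-7, 7].

    Assumption: a symmetric grid is one simple way to keep real 0.0 exactly
    representable (level 0), so pruned weights stay zero after quantization.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    levels = [max(-7, min(7, round(w / scale))) for w in weights]
    return levels, scale


def compression_ratio(baseline_bits=32, quant_bits=4, density=0.5):
    """Ratio of baseline storage to compressed storage, ignoring index metadata."""
    return baseline_bits / (quant_bits * density)


weights = [0.7, -0.1, 0.05, -0.4, 0.2, 0.0, -0.6, 0.3]
pruned = prune_2_of_4(weights)
levels, scale = quantize_int4(pruned)
ratio = compression_ratio()
print(pruned, levels, ratio)
```

Pruning before quantizing keeps the example simple; in the paper both are applied during quantization/sparsity-aware fine-tuning, which this sketch does not attempt to reproduce.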

Author Information

Naigang Wang (IBM T. J. Watson Research Center)
Chi-Chun (Charlie) Liu (IBM Research)
Swagath Venkataramani (IBM Research)
Sanchari Sen (International Business Machines)
Chia-Yu Chen (IBM Research)

My research areas focus on: accelerator architecture; compiler design and library development; machine learning and neural networks; VLSI and nano devices.

Kaoutar El Maghraoui (IBM Research)

Dr. Kaoutar El Maghraoui is a principal research scientist at the IBM T. J. Watson Research Center, where she focuses on innovations at the intersection of systems and artificial intelligence (AI). She leads the research agenda of the End-Use experimental AI testbed of the IBM Research AI Hardware Center, a global research hub focusing on enabling next-generation accelerators and systems for AI workloads. She co-led IBM’s Global Technology Outlook in 2017, where she contributed to creating IBM’s vision for the future of IT across global labs and business units, focusing on IBM’s AI leadership. Kaoutar has co-authored several patents and conference and journal publications in the areas of systems research, distributed systems, high performance computing, and AI. Kaoutar holds a Ph.D. degree from Rensselaer Polytechnic Institute, USA. She has received several awards, including the Robert McNaughton Award for best thesis in computer science, IBM’s Eminence and Excellence award for leadership in increasing women’s presence in science and technology, and two IBM outstanding technical accomplishments. Kaoutar is global vice-chair of the Arab Women in Computing organization and an avid supporter of and volunteer for several women in science and technology initiatives.

Vijayalakshmi (Viji) Srinivasan (IBM TJ Watson)
Leland Chang (IBM, International Business Machines)

More from the Same Authors