Skip to yearly menu bar Skip to main content


ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Zhewei Yao · Reza Yazdani Aminabadi · Minjia Zhang · Xiaoxia Wu · Conglong Li · Yuxiong He

Hall J (level 1) #234

Keywords: [ GPT-Neox-20B ] [ GPT-3 ] [ Layer-by-Layer Knowledge Distillation ] [ BERT ] [ post-training quantization ]


How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as \OURS. \OURS is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained hardware-friendly quantization scheme for both weight and activations; (2) a novel affordable layer-by-layer knowledge distillation algorithm (\lwd) even without the original training data access;(3) a highly-optimized quantization system backend support to remove the quantization/dequantization overhead.As such, we are able to show that:(1) \OURS can reduce the precision for weight and activations to INT8 in a cost-free way for both \bert and \gpt-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on \bert/\gpt-style models compared to FP16 inference, separately;(2) \OURS plus \lwd can affordably quantize the weights in the fully-connected module to INT4 along with INT8 weights in the attention module and INT8 activations, resulting in 3x memory footprint reduction compared to the FP16 model;(3) \OURS can be directly applied to two of the largest open-sourced language models, including \gptneox, for which our INT8 model achieves similar accuracy as the FP16 model but achieves 5.2x better efficiency.Our code is open-sourced at~\cite{code_compression}.

Chat is not available.