Timezone: »
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Song Han
Fri Dec 02 08:30 AM -- 09:05 AM (PST) @
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy due to outliers or do not run efficiently on hardware. I’ll present SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs, including OPT-175B, BLOOM-176B and GLM-130B, achieving faster inference speed with half the number of GPUs. We hope SmoothQuant can inspire economic deployment of LLMs in the future.
Author Information
Song Han (MIT)
More from the Same Authors
-
2022 Poster: Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models »
Muyang Li · Ji Lin · Chenlin Meng · Stefano Ermon · Song Han · Jun-Yan Zhu -
2022 Poster: On-Device Training Under 256KB Memory »
Ji Lin · Ligeng Zhu · Wei-Ming Chen · Wei-Chen Wang · Chuang Gan · Song Han -
2021 Poster: Memory-efficient Patch-based Inference for Tiny Deep Learning »
Ji Lin · Wei-Ming Chen · Han Cai · Chuang Gan · Song Han -
2021 Poster: Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning »
Ligeng Zhu · Hongzhou Lin · Yao Lu · Yujun Lin · Song Han -
2020 Poster: MCUNet: Tiny Deep Learning on IoT Devices »
Ji Lin · Wei-Ming Chen · Yujun Lin · john cohn · Chuang Gan · Song Han -
2020 Spotlight: MCUNet: Tiny Deep Learning on IoT Devices »
Ji Lin · Wei-Ming Chen · Yujun Lin · john cohn · Chuang Gan · Song Han -
2020 Poster: Differentiable Augmentation for Data-Efficient GAN Training »
Shengyu Zhao · Zhijian Liu · Ji Lin · Jun-Yan Zhu · Song Han -
2020 Poster: TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning »
Han Cai · Chuang Gan · Ligeng Zhu · Song Han -
2019 : Hardware-aware Neural Architecture Design for Small and Fast Models: from 2D to 3D »
Song Han -
2019 Poster: Park: An Open Platform for Learning-Augmented Computer Systems »
Hongzi Mao · Parimarjan Negi · Akshay Narayan · Hanrui Wang · Jiacheng Yang · Haonan Wang · Ryan Marcus · Ravichandra Addanki · Mehrdad Khani Shirkoohi · Songtao He · Vikram Nathan · Frank Cangialosi · Shaileshh Venkatakrishnan · Wei-Hung Weng · Song Han · Tim Kraska · Dr.Mohammad Alizadeh -
2019 Poster: Deep Leakage from Gradients »
Ligeng Zhu · Zhijian Liu · Song Han -
2019 Poster: Point-Voxel CNN for Efficient 3D Deep Learning »
Zhijian Liu · Haotian Tang · Yujun Lin · Song Han -
2019 Spotlight: Point-Voxel CNN for Efficient 3D Deep Learning »
Zhijian Liu · Haotian Tang · Yujun Lin · Song Han -
2018 : Panel disucssion »
Max Welling · Tim Genewein · Edwin Park · Song Han -
2018 : Prof. Song Han »
Song Han -
2018 : Bandwidth efficient deep learning by model compression »
Song Han