Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for the feed-forward and attention projection layers in transformers, which cuts the memory needed for inference in half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication to quantize most of the features. For the emergent outliers, however, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while more than 99.9% of the values are still multiplied in 8-bit. Using LLM.int8(), we show empirically that it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open source our software.
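The abstract describes a two-part procedure: vector-wise Int8 quantization for the bulk of the features, plus a 16-bit path for the emergent outlier feature dimensions. The sketch below illustrates that decomposition in PyTorch under stated assumptions: the function name, the outlier threshold value, and the simulated (float-accumulated) Int8 matmul are illustrative choices of ours, not the released implementation, which uses fused Int8 GEMM kernels with Int32 accumulation.

```python
# Minimal sketch of the two-part LLM.int8() idea: mixed-precision decomposition
# for outlier feature dimensions + vector-wise Int8 quantization for the rest.
# Names and the threshold are illustrative assumptions, not the official API.
import torch

def llm_int8_matmul_sketch(X: torch.Tensor, W: torch.Tensor, threshold: float = 6.0):
    """Approximate X @ W, where X is [seq, hidden] and W is [hidden, out]."""
    # 1) Outlier feature dimensions: hidden dims where any activation magnitude
    #    exceeds the threshold (emergent outlier features).
    outlier_cols = (X.abs() > threshold).any(dim=0)
    regular_cols = ~outlier_cols

    # 2) Outlier part: kept in 16-bit in the real method (computed in float here).
    out_hi = X[:, outlier_cols].float() @ W[outlier_cols, :].float()

    # 3) Regular part: vector-wise quantization with separate normalization
    #    constants per row of X and per column of W.
    Xr, Wr = X[:, regular_cols].float(), W[regular_cols, :].float()
    cx = Xr.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)  # per-row constants
    cw = Wr.abs().amax(dim=0, keepdim=True).clamp_min(1e-8)  # per-column constants
    Xq = torch.round(Xr / cx * 127).to(torch.int8)
    Wq = torch.round(Wr / cw * 127).to(torch.int8)

    # Int8 GEMM (simulated via float accumulation for clarity), then dequantize
    # with the outer product of the normalization constants.
    acc = Xq.to(torch.float32) @ Wq.to(torch.float32)
    out_lo = acc * (cx @ cw) / (127 * 127)

    # 4) Combine both parts.
    return out_lo + out_hi

# Example: hidden states with a few artificially large outlier dimensions.
X = torch.randn(4, 512, dtype=torch.float16)
X[:, :3] *= 20                       # emulate emergent outlier features
W = torch.randn(512, 256, dtype=torch.float16)
approx = llm_int8_matmul_sketch(X, W)
exact = X.float() @ W.float()
print((approx - exact).abs().mean())  # small quantization error
```

Splitting out the handful of outlier columns is what preserves performance in this scheme: the Int8 path only ever sees features whose per-row and per-column ranges are well covered by the 127-level grid, while the rare large-magnitude dimensions stay in higher precision.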
Author Information
Tim Dettmers (University of Washington)
Mike Lewis (FAIR)
Younes Belkada (École Normale Supérieure)
Luke Zettlemoyer (University of Washington and Facebook)
More from the Same Authors
- 2023 Poster: Stable and low-precision training for large-scale vision-language models »
  Mitchell Wortsman · Tim Dettmers · Luke Zettlemoyer · Ari Morcos · Ali Farhadi · Ludwig Schmidt
- 2023 Poster: Toolformer: Language Models Can Teach Themselves to Use Tools »
  Timo Schick · Jane Dwivedi-Yu · Roberto Dessi · Roberta Raileanu · Maria Lomeli · Eric Hambro · Luke Zettlemoyer · Nicola Cancedda · Thomas Scialom
- 2023 Poster: Distributed Inference and Fine-tuning of Large Language Models Over The Internet »
  Alexander Borzunov · Dmitry Baranchuk · Tim Dettmers · Max Ryabinin · Younes Belkada · Artem Chumachenko · Pavel Samygin · Colin Raffel
- 2023 Poster: QLoRA: Efficient Finetuning of Quantized LLMs »
  Tim Dettmers · Artidoro Pagnoni · Ari Holtzman · Luke Zettlemoyer
- 2023 Poster: LIMA: Less Is More for Alignment »
  Chunting Zhou · Pengfei Liu · Puxin Xu · Srinivasan Iyer · Jiao Sun · Yuning Mao · Xuezhe Ma · Avia Efrat · Ping Yu · LILI YU · Susan Zhang · Gargi Ghosh · Mike Lewis · Luke Zettlemoyer · Omer Levy
- 2023 Poster: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers »
  LILI YU · Daniel Simig · Colin Flaherty · Armen Aghajanyan · Luke Zettlemoyer · Mike Lewis
- 2023 Oral: Toolformer: Language Models Can Teach Themselves to Use Tools »
  Timo Schick · Jane Dwivedi-Yu · Roberto Dessi · Roberta Raileanu · Maria Lomeli · Eric Hambro · Luke Zettlemoyer · Nicola Cancedda · Thomas Scialom
- 2023 Oral: QLoRA: Efficient Finetuning of Quantized LLMs »
  Tim Dettmers · Artidoro Pagnoni · Ari Holtzman · Luke Zettlemoyer
- 2023: Interactive Panel Discussion »
  Tanya Roosta · Tim Dettmers · Minjia Zhang · Nazneen Rajani
- 2023: Keynote Talk 2 »
  Luke Zettlemoyer
- 2022: Training Language Models to Negotiate in the Game of Diplomacy »
  Mike Lewis
- 2022: Petals: Collaborative Inference and Fine-tuning of Large Models »
  Alexander Borzunov · Dmitry Baranchuk · Tim Dettmers · Max Ryabinin · Younes Belkada · Artem Chumachenko · Pavel Samygin · Colin Raffel
- 2022: 8-bit Methods for Efficient Deep Learning »
  Tim Dettmers
- 2022 Poster: Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models »
  Kushal Tirumala · Aram Markosyan · Luke Zettlemoyer · Armen Aghajanyan
- 2022 Poster: Improving Policy Learning via Language Dynamics Distillation »
  Victor Zhong · Jesse Mu · Luke Zettlemoyer · Edward Grefenstette · Tim Rocktäschel
- 2021: Panel Discussion »
  Pascal Poupart · Ali Ghodsi · Luke Zettlemoyer · Sameer Singh · Kevin Duh · Yejin Choi · Lu Hou
- 2021: Toward Efficient Training of Large Language Models with Balanced Conditional Compute »
  Luke Zettlemoyer
- 2021 Poster: Luna: Linear Unified Nested Attention »
  Xuezhe Ma · Xiang Kong · Sinong Wang · Chunting Zhou · Jonathan May · Hao Ma · Luke Zettlemoyer
- 2021: Training Transformers Together »
  Alexander Borzunov · Max Ryabinin · Tim Dettmers · Quentin Lhoest · Lucile Saulnier · Michael Diskin · Yacine Jernite · Thomas Wolf
- 2021 Poster: SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark »
  Victor Zhong · Austin W. Hanjie · Sida Wang · Karthik Narasimhan · Luke Zettlemoyer
- 2020: Invited talk - De-noising Sequence-to-Sequence Pre-training »
  Luke Zettlemoyer
- 2020 Poster: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks »
  Patrick Lewis · Ethan Perez · Aleksandra Piktus · Fabio Petroni · Vladimir Karpukhin · Naman Goyal · Heinrich Küttler · Mike Lewis · Wen-tau Yih · Tim Rocktäschel · Sebastian Riedel · Douwe Kiela
- 2020 Poster: Pre-training via Paraphrasing »
  Mike Lewis · Marjan Ghazvininejad · Gargi Ghosh · Armen Aghajanyan · Sida Wang · Luke Zettlemoyer
- 2019 Poster: Hierarchical Decision Making by Generating and Following Natural Language Instructions »
  Hengyuan Hu · Denis Yarats · Qucheng Gong · Yuandong Tian · Mike Lewis
- 2017: End-to-end Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond »
  Luke Zettlemoyer
- 2008 Poster: Multi-Agent Filtering with Infinitely Nested Beliefs »
  Luke Zettlemoyer · Brian Milch · Leslie Kaelbling
- 2008 Spotlight: Multi-Agent Filtering with Infinitely Nested Beliefs »
  Luke Zettlemoyer · Brian Milch · Leslie Kaelbling