[Paper-Oral 8] LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
in
Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants
Abstract
Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning (Dettmers et al., 2023). In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in downstream-task performance between full fine-tuning and the quantization-plus-LoRA-fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision model and significantly improves generalization on downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed-precision regimes. We will release our code.
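The sketch below illustrates one way such a LoRA-aware initialization could be obtained, by alternately quantizing the residual weight and refitting the low-rank adapters via truncated SVD so that Q + A Bᵀ approximates the full-precision weight W. It is a minimal illustration, not the released implementation: the function names (`simulated_quantize`, `loftq_style_init`), the simple uniform quantizer, and the hyperparameters are assumptions made for this example; the paper's actual quantizers and training setup may differ.

```python
import torch


def simulated_quantize(W, bits=2):
    # Hypothetical uniform quantizer, used only to make the sketch runnable.
    # A real deployment would use an NF4/NF2-style quantizer instead.
    levels = 2 ** bits
    w_min, w_max = W.min(), W.max()
    scale = (w_max - w_min) / (levels - 1)
    return torch.round((W - w_min) / scale) * scale + w_min


def loftq_style_init(W, rank=16, bits=2, num_iters=5):
    # Alternating minimization of ||W - (Q + A @ B.T)||_F:
    # 1) quantize the residual W - A B^T to get the low-bit backbone Q,
    # 2) refit the low-rank factors A, B from an SVD of W - Q.
    A = torch.zeros(W.shape[0], rank)
    B = torch.zeros(W.shape[1], rank)
    for _ in range(num_iters):
        Q = simulated_quantize(W - A @ B.T, bits=bits)
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]
        B = Vh[:rank, :].T
    return Q, A, B


# Usage on a single (toy) weight matrix: Q initializes the quantized backbone,
# A and B initialize the LoRA adapters before fine-tuning.
W = torch.randn(512, 256)
Q, A, B = loftq_style_init(W, rank=16, bits=2)
print(torch.norm(W - (Q + A @ B.T)))  # approximation error after initialization
```

Compared with the common default of quantizing W and starting LoRA from zero (or random) adapters, this joint initialization reduces the initial mismatch between the quantized model plus adapters and the original full-precision weights, which is the discrepancy the abstract refers to.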