Simultaneous Fine-Tuning and Pruning of LLMs
Abstract
Fine-tuning large language models (LLMs) for specific downstream tasks enables exceptional performance; unfortunately, their vast size hinders deployment in hardware-constrained environments. Hence, small, domain-specific models are created for such scenarios, usually in a two-stage process: first pruning an LLM and then fine-tuning (FT) it. However, performing these two steps jointly may yield better results, as FT and pruning can then adapt to each other. Motivated by this potential, we propose a method based on constrained optimization that uses augmented Lagrangian methods to simultaneously fine-tune and prune (SFP) LLMs to a target sparsity. Our approach is directly compatible with parameter-efficient fine-tuning (PEFT) techniques and can be applied to structures of different granularities. We evaluate the effectiveness of the method against state-of-the-art pruning techniques and show similar or better performance. Specifically, SFP can prune a 7-billion-parameter model to 50\% sparsity and achieve 1.88 times faster inference with negligible performance degradation.
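To make the constrained-optimization idea concrete, here is a minimal sketch (ours, not the paper's implementation) of how an augmented Lagrangian term can drive a differentiable sparsity proxy toward a target during fine-tuning, with dual ascent on the multiplier. All names (soft_sparsity, scores, lam, mu) are illustrative assumptions; the sketch uses sigmoid gates over pruning scores, which may differ from the parameterization the method actually uses.

```python
import torch

def soft_sparsity(scores: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Differentiable proxy for the fraction of pruned weights (or structures).

    Each score is mapped through a sigmoid gate: gate ~ 1 keeps the weight,
    gate ~ 0 prunes it, so 1 - mean(gates) is the expected sparsity.
    """
    gates = torch.sigmoid(scores / temperature)
    return 1.0 - gates.mean()

def augmented_lagrangian(task_loss: torch.Tensor,
                         scores: torch.Tensor,
                         target_sparsity: float,
                         lam: float,
                         mu: float):
    """Augment the task loss with a penalty on the sparsity constraint.

    The constraint residual c = target - current sparsity is zero when the
    target is met; the linear term (multiplier) and quadratic term together
    form the standard augmented Lagrangian.
    """
    c = target_sparsity - soft_sparsity(scores)
    loss = task_loss + lam * c + 0.5 * mu * c ** 2
    return loss, c

# Schematic training-loop usage (hypothetical):
#   loss, c = augmented_lagrangian(task_loss, scores, 0.5, lam, mu)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   lam = lam + mu * float(c.detach())   # dual ascent on the multiplier
```

Because the penalty acts on gate scores rather than on the fine-tuning weights directly, such a formulation is naturally compatible with PEFT adapters and with gates defined at different structural granularities, as the abstract claims for SFP.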