

Poster

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You · Yipin Guo · Yichao Fu · Wei Zhou · Huihong Shi · Xiaofan Zhang · Souvik Kundu · Amir Yazdanbakhsh · Yingyan (Celine) Lin

East Exhibit Hall A-C #1910
[ Project Page ]
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract: Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and significant latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is often resource-intensive for LLMs. To this end, we propose accelerating pretrained LLMs through a $\textit{post-training}$ shift-and-add reparameterization, yielding efficient multiplication-less LLMs, dubbed $\textbf{ShiftAddLLM}$. Specifically, we quantize and reparameterize each weight matrix in LLMs into a series of binary matrices and associated scaling factor matrices. Each scaling factor, corresponding to a group of weights, is quantized to a power of two. This reparameterization transforms the original multiplications between weights and activations into two steps: (1) bitwise shifts between activations and scaling factors, and (2) queries and additions of these results according to the binary matrices. Further, to reduce the accuracy drop, we present a multi-objective optimization method that minimizes both weight and output activation reparameterization errors. Finally, based on the observation that layers vary in their sensitivity to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of $\textbf{5.6}$ and $\textbf{22.7}$ at comparable or lower latency than the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than $\textbf{80}$% memory and energy reductions over the original LLMs.
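For intuition, the following is a minimal NumPy sketch of the core idea, not the authors' released code or exact procedure: a weight matrix is greedily decomposed into a few binary matrices with power-of-two scaling factors, so that the matrix-vector product needs only sign-flipped additions and bit shifts. The greedy residual fitting, the per-output-row grouping of scaling factors, and the toy sizes are illustrative assumptions and stand in for the paper's multi-objective optimization and automated bit allocation.

```python
import numpy as np

# Sketch (assumed, simplified): approximate W by sum_i 2**e_i * B_i,
# with B_i in {-1, +1} and one power-of-two scale per output row ("group").
# W @ x then reduces to additions of activations plus bit shifts by e_i.

rng = np.random.default_rng(0)
d_out, d_in, n_bits = 8, 16, 3          # toy sizes; real LLM layers are much larger
W = rng.standard_normal((d_out, d_in))  # pretrained weight matrix to approximate
x = rng.integers(-8, 8, size=d_in)      # toy integer activations

# "Reparameterize": greedy residual fit into binary matrices + power-of-two scales.
B, E, residual = [], [], W.copy()
for _ in range(n_bits):
    b = np.sign(residual)                                  # binary matrix in {-1, +1}
    b[b == 0] = 1
    scale = np.abs(residual).mean(axis=1, keepdims=True)   # one scale per output row
    e = np.round(np.log2(np.maximum(scale, 1e-8))).astype(int)  # quantize to 2**e
    B.append(b)
    E.append(e)
    residual -= (2.0 ** e) * b

# "Multiplication-less" evaluation: only +/- additions and shifts by 2**e.
y_shiftadd = np.zeros(d_out)
for b, e in zip(B, E):
    partial = b @ x                                        # signed additions of activations
    y_shiftadd += np.ldexp(partial.astype(float), e.ravel())  # multiply by 2**e, i.e. a shift

y_dense = W @ x
print("max abs error:", np.abs(y_dense - y_shiftadd).max())  # shrinks as n_bits grows
```

In this toy setting the reconstruction error is coarse by design; the paper's method additionally optimizes the decomposition against both weight and output-activation errors and allocates bits per layer, which this sketch does not attempt.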
