

Poster

SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices

Ruslan Svirschevski · Avner May · Zhuoming Chen · Beidi Chen · Zhihao Jia · Max Ryabinin

East Exhibit Hall A-C #4604
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens in the same time as just one token, making it a natural fit for speculative decoding. Unfortunately, existing speculative decoding methods were not designed for this setting and do not scale with a large budget of draft tokens. We study the inefficiencies of large-scale speculative decoding and design SpecExec (Speculative Execution), a simple parallel decoding strategy that can deliver on average up to 20 tokens per iteration for the LLaMA-2 70B models. Using this strategy, we propose a system that runs 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second when using 4-bit quantization or 2-3 tokens per second with 16-bit weights.
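To make the setting concrete, below is a minimal sketch of one speculative-decoding iteration with a large draft budget, in the spirit of the abstract: a small on-GPU draft model proposes many candidate tokens, and a single forward pass of the large (offloaded) target model verifies them all at once. The model names, the chain-shaped draft, and the greedy acceptance rule are illustrative assumptions; SpecExec itself builds a draft tree via parallel search and uses its own offloading engine.

```python
# Illustrative sketch only; not the SpecExec algorithm itself.
# Assumptions: Hugging Face checkpoints below, greedy drafting and verification,
# and accelerate's device_map="auto" standing in for the paper's RAM offloading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_name = "meta-llama/Llama-2-7b-hf"     # assumed small draft model
target_name = "meta-llama/Llama-2-70b-hf"   # assumed large target model
tok = AutoTokenizer.from_pretrained(draft_name)
draft = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.float16, device_map="cuda")
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto")  # offloads overflow to RAM

@torch.no_grad()
def speculative_step(prompt_ids: torch.Tensor, budget: int = 128) -> torch.Tensor:
    """One iteration: draft `budget` tokens cheaply, verify them in a single
    target forward pass, and keep the longest agreeing prefix plus one bonus token."""
    prompt_len = prompt_ids.shape[1]
    # 1) Draft: the small model generates a chain of candidate tokens on-GPU.
    drafted = draft.generate(prompt_ids, max_new_tokens=budget, do_sample=False)
    # 2) Verify: one target pass scores every drafted position at once; with offloaded
    #    weights this costs roughly the same as scoring a single token.
    target_next = target(drafted).logits.argmax(dim=-1)  # target's greedy choice per position
    # 3) Accept the longest prefix where draft and target agree (greedy acceptance rule).
    new_tokens = drafted[:, prompt_len:]
    agree = (new_tokens == target_next[:, prompt_len - 1:-1]).long()
    n_accept = int(agree[0].cumprod(dim=0).sum())
    # Always append one token taken from the target itself after the accepted prefix.
    bonus = target_next[:, prompt_len - 1 + n_accept : prompt_len + n_accept]
    return torch.cat([prompt_ids, new_tokens[:, :n_accept], bonus], dim=1)
```

With a budget of hundreds or thousands of draft tokens, each verification pass can accept many tokens per iteration, which is what makes offloaded inference practical on consumer hardware.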
