Partial Resolution: Optimal Multi-Draft Speculative Sampling in Near-Linear Time
Rahul Thomas · Arka Pal
Abstract
Speculative sampling reduces the latency of autoregressive decoding in LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is the optimal transport (OT), which must be explicitly computed in any multi-draft speculative sampling algorithm for optimal decoding efficiency. However, finding the OT is difficult, as it is the solution of a linear program (OTLP) with over $V^n$ variables, $V$ being the vocabulary size. Thus, two recent theoretical works have sought to reduce OTLP complexity by reframing it as importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so there is still a significant gap in the state-of-the-art. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. Next, with the insight that we only need to solve for some OT parameters, we reduce the OTLP complexity from exponential to linear. This allows us to devise an $O(2^{1.5n} n^2 + 2^n V + V \log V)$ algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from one draft model. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 97\% acceptance and under 2 ms of overhead per generated token.
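As background for the verification criterion the abstract describes, the standard single-draft case accepts a draft token $x \sim q$ with probability $\min(1, p(x)/q(x))$ and otherwise resamples from the normalized residual $\max(p - q, 0)$; the multi-draft optimal transport generalizes this to $n$ draft tokens. Below is an illustrative Python sketch of that single-draft step (the function name and array-based distributions are our own conventions, not the paper's algorithm):

```python
import numpy as np

def speculative_verify(p, q, x, rng):
    """One step of standard single-draft speculative sampling verification.

    p: target-model distribution over the vocabulary (1-D array summing to 1)
    q: draft-model distribution over the vocabulary (1-D array summing to 1)
    x: draft token index sampled from q
    Returns (token, accepted): the emitted token, and whether x was accepted.
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual distribution max(p - q, 0),
    # renormalized; this guarantees the emitted token is distributed as p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Either branch leaves the marginal distribution of the emitted token exactly $p$, which is why speculative sampling preserves inference quality; the multi-draft OT problem in the paper asks for the acceptance rule over $n$ such draft tokens that maximizes the probability of accepting one of them.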