Foundations of Top-$k$ Decoding for Language Models
Georgy Noarov · Soham Mallick · Tao Wang · Sunay Joshi · Yan Sun · Yangxinyu Xie · Mengxin Yu · Edgar Dobriban
Abstract
Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the $k$ largest next-token probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and that the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider \emph{Bregman decoders} obtained by minimizing a separable Bregman divergence (in both the \emph{primal} and \emph{dual} cases) with $\ell_0$-regularization. Despite the combinatorial nature of this objective, we show it can be efficiently optimized: we prove that (1) the optimal decoding strategies are greedy, and (2) the objective is discretely convex in $k$, so that the optimal $k$ can be found efficiently. We show that top-$k$ decoding arises as a special case for the KL divergence, and we identify new decoding strategies with distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
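For concreteness, below is a minimal sketch of standard top-$k$ decoding as described in the first sentence of the abstract (keep the $k$ largest next-token probabilities, re-normalize, sample). It does not reproduce the paper's Bregman-decoder framework or its procedure for choosing $k$; the array `probs`, the value of `k`, and the helper name `top_k_sample` are illustrative assumptions.

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token index from the top-k truncated, re-normalized distribution."""
    # Indices of the k largest probabilities (order among them does not matter).
    top_idx = np.argpartition(probs, -k)[-k:]
    top_probs = probs[top_idx]
    top_probs = top_probs / top_probs.sum()  # re-normalize to sum to unity
    return int(rng.choice(top_idx, p=top_probs))

# Example: a toy next-token distribution over a 6-token vocabulary.
rng = np.random.default_rng(0)
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
token = top_k_sample(probs, k=3, rng=rng)  # samples only from tokens 0, 1, 2
```

In this baseline, the kept probabilities are re-weighted linearly by a common normalizing constant; the generalized decoders discussed in the abstract can instead re-weight them non-linearly.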