Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
Gonçalo Faria · Noah Smith
Abstract
Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods that use a reward model (RM) often degrade in quality as compute scales, because they over-optimize what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As test-time compute is scaled, QAlign converges to sampling from the optimal aligned distribution for each prompt. By adopting recent advances in Markov chain Monte Carlo (MCMC) for text generation, our method enables better-aligned outputs without modifying the underlying model or requiring logit access. We demonstrate the effectiveness of QAlign: when applied with an RM trained on the Tulu 3 preference dataset, it outperforms direct preference optimization (DPO), best-of-$n$, majority voting, and weighted majority voting across a range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). By providing a practical way to align language models at test time with additional computation and without degradation, our approach expands the capabilities that can be obtained from off-the-shelf language models without further training.
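As a point of reference (not spelled out in the abstract itself), the "optimal aligned distribution" is most naturally read as the standard KL-regularized alignment target: the base model reweighted by the reward. A sketch of that target, where $\pi$ denotes the base language model, $r$ the reward model, and $\beta$ an assumed regularization temperature, is:

$$
\pi^{*}(y \mid x) \;\propto\; \pi(y \mid x)\,\exp\!\big(r(x, y)/\beta\big).
$$

Sampling from $\pi^{*}$, rather than searching for the single highest-reward completion, is what distinguishes this framing from best-of-$n$ style reward maximization and explains why it can avoid reward over-optimization as compute grows.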