Adaptive Inference Scaling via Monte Carlo Sampling
Abstract
LLM inference-time scaling has emerged as an important paradigm for training-free alignment of LLMs with external reward signals. However, central questions regarding practical deployment, such as answer selection and optimal compute allocation, remain poorly understood, with progress driven primarily by empirical heuristics. To address this, we provide a principled framework for analyzing inference-time scaling via Monte Carlo (MC) sampling. This framework treats inference scaling as a statistical estimation problem over a reward-weighted posterior, and introduces principled choices for response-selection and compute-allocation strategies. Experiments on mathematical reasoning benchmarks show that (i) our MC-derived inference-scaling methods outperform baseline strategies, and (ii) our adaptive inference-scaling strategy dynamically adjusts compute on a per-query basis, allocating more compute to challenging prompts.
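As a concrete illustration of this framing, a common form of the reward-weighted posterior is π*(y|x) ∝ π_ref(y|x) · exp(r(x, y)/β), which MC sampling can approximate by drawing candidate responses from the base model and reweighting them by reward. The sketch below is a minimal, hedged example of this idea, not the paper's actual method: `sample_fn`, `reward_fn`, and the temperature `beta` are hypothetical placeholders for a base-model sampler and an external reward signal.

```python
import math
import random

def reward_weighted_select(prompt, sample_fn, reward_fn, n=16, beta=1.0):
    """Sketch: approximate a draw from the reward-weighted posterior
    pi*(y|x) ∝ pi_ref(y|x) * exp(r(x, y) / beta) via self-normalized
    importance sampling over n Monte Carlo candidates."""
    candidates = [sample_fn(prompt) for _ in range(n)]  # y_i ~ pi_ref(.|x)
    # Importance weights: exp(r/beta); the base-model density cancels
    # because the candidates are drawn from pi_ref itself.
    weights = [math.exp(reward_fn(prompt, y) / beta) for y in candidates]
    # random.choices normalizes relative weights internally.
    return random.choices(candidates, weights=weights, k=1)[0]
```

In this sketch, beta → 0 recovers best-of-n (argmax-reward) selection, while a large beta approaches plain ancestral sampling from the base model.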