A Llama Sunk My Battleship! Asking Rational Questions with LLMs via Bayesian Inference
Abstract
One of the hallmarks of an intelligent agent is the ability to ask good questions. While facility with language is clearly a prerequisite, even in simple settings, LLMs can struggle to come up with questions that yield useful information---suggesting a failure of grounded reasoning. We study this phenomenon in a question-asking task based on the classic board game Battleship, where both text-only and multimodal LLMs perform far below human baselines. We propose a Bayesian model that combines an LLM-driven prior over questions with a probabilistic world model to facilitate coherent reasoning. We find that with a surprisingly modest sample budget for “mental computation,” our method is well-calibrated to human performance across varied Battleship board scenarios. Notably, this approach allows much smaller LLMs, such as CodeLlama-7b, to perform on par with GPT-4. These results support the emerging trend toward test-time inference as a scaling route for LLM reasoning, while highlighting the utility of probabilistic world models for grounding and structuring such computations.