Entropy Is Not Enough: Uncertainty Quantification for LLMs Fails under Aleatoric Uncertainty
Abstract
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, existing UQ methods implicitly assume scenarios with no ambiguity. A natural question, therefore, is how they perform under ambiguity. In this work, we demonstrate that current uncertainty estimators only perform well under the restrictive assumption of no aleatoric uncertainty and degrade significantly on ambiguous data. Specifically, we provide theoretical insights into this limitation and introduce two question-answering (QA) datasets with ground-truth answer probabilities. Using these datasets, we show that current uncertainty estimators perform close to random under real-world ambiguity. This highlights a fundamental limitation of existing practices and underscores the urgent need for new uncertainty quantification approaches that account for the ambiguity inherent in language modeling.
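As a minimal illustration of why entropy alone cannot separate the two kinds of uncertainty the abstract contrasts, consider the standard decomposition of predictive entropy into aleatoric and epistemic terms (the notation below is assumed for illustration and is not taken from the paper: $\theta$ denotes model parameters, $\mathcal{D}$ the training data):

% Standard predictive-entropy decomposition (illustrative sketch; notation assumed).
\begin{equation}
\underbrace{\mathcal{H}\!\left[\, p(y \mid x) \,\right]}_{\text{total predictive entropy}}
\;=\;
\underbrace{\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\, \mathcal{H}\!\left[\, p(y \mid x, \theta) \,\right] \,\right]}_{\text{aleatoric: irreducible ambiguity in } y}
\;+\;
\underbrace{\mathcal{I}\!\left[\, y \,;\, \theta \mid x \,\right]}_{\text{epistemic: model uncertainty}}
\end{equation}

A well-calibrated model answering a genuinely ambiguous question (e.g., one whose ground-truth answer probabilities are 0.5/0.5) has high total entropy purely from the aleatoric term, so an entropy-based estimator flags it as uncertain even though the model is behaving correctly, which is the failure mode described above.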