Uncertainty Quantification for Language Models: Standardizing and Evaluating Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Abstract
Large language models often hallucinate, raising trust and safety concerns in high-stakes settings. We present a generation-time, zero-resource framework for hallucination detection that unifies black-box, white-box, and LLM-judge signals. Heterogeneous scorer outputs are standardized to a shared 0-to-1 confidence scale, enabling consistent ranking and thresholding. To add flexibility, we introduce a simple, extensible ensemble whose non-negative weights are tuned on a graded set of LLM responses. Across six QA benchmarks and four generator models, the ensemble outperforms the best individual scorer in 18 of 24 AUROC comparisons and 16 of 24 F1 comparisons. Among individual scorers, entailment-style black-box methods are strong baselines, although they incur higher generation costs and become less effective when sampled responses show little variation. The framework supports practical actions such as blocking low-confidence responses or routing them to human review. We release an open-source Python library providing ready-to-use implementations of all methods.
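As a rough illustration of the ensemble idea summarized above, the sketch below fits non-negative weights over standardized scorer confidences against graded responses and combines them into a single 0-to-1 confidence. It is a minimal example under assumed interfaces; the function names, the use of non-negative least squares, and the normalization step are illustrative choices, not the released library's API.

```python
# Minimal sketch: combine per-scorer confidences (each in [0, 1]) with
# non-negative weights tuned on graded responses. Illustrative only.
import numpy as np
from scipy.optimize import nnls


def fit_ensemble_weights(scores: np.ndarray, grades: np.ndarray) -> np.ndarray:
    """Fit non-negative weights mapping scorer confidences to grades.

    scores: (n_responses, n_scorers) matrix of confidences in [0, 1].
    grades: (n_responses,) graded correctness labels in [0, 1].
    """
    weights, _ = nnls(scores, grades)  # least squares with weights >= 0
    total = weights.sum()
    if total == 0:
        # Fall back to uniform weights if the fit degenerates.
        return np.full(scores.shape[1], 1.0 / scores.shape[1])
    return weights / total  # normalize so the ensemble score stays in [0, 1]


def ensemble_confidence(scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted combination of scorer confidences."""
    return scores @ weights


# Example: three scorers (e.g., black-box, white-box, judge) on five graded responses.
rng = np.random.default_rng(0)
S = rng.uniform(size=(5, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
w = fit_ensemble_weights(S, y)
print(w, ensemble_confidence(S, w))
```

Because the fitted weights are non-negative and normalized, the combined score remains on the same 0-to-1 scale as the individual scorers, so the same thresholding and routing actions apply.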