
Workshop: Attributing Model Behavior at Scale (ATTRIB)

Does It Know?: Probing and Benchmarking Uncertainty in Language Model Latent Beliefs

Brian Huang · Joe Kwon


Understanding a language model's beliefs about its truthfulness is crucial for building more trustworthy, factually accurate large language models. The recent method of Contrast-Consistent Search (CCS) measures this "latent belief" via a linear probe on the intermediate activations of a language model, trained in an unsupervised manner to classify inputs as true or false. As an extension of CCS, we propose Uncertainty-detecting CCS (UCCS), which captures finer-grained notions of truth, such as uncertainty or ambiguity. Concretely, UCCS trains a probe, using only unlabeled data, to classify a model's latent belief about input text as true, false, or uncertain. We find that UCCS is an effective selective classifier trained without supervision: it uses its uncertainty class to filter out low-confidence truth predictions, which improves accuracy across a diverse set of models and tasks. To properly evaluate UCCS predictions of truth and uncertainty, we introduce a toy dataset, Temporally Measured Events (TYMES), which comprises true or falsified facts, paired with timestamps, extracted from news articles published over the past several years. TYMES can be combined with any language model's training cutoff date to systematically produce a subset of data that lies beyond the model's knowledge limitations, that is, events occurring after the cutoff. TYMES serves as a valuable proof of concept for how we can benchmark uncertainty or time-sensitive world knowledge in language models, a setting that includes but extends beyond our UCCS evaluations.
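As a rough illustration of the evaluation setup described above, the following sketch splits timestamped facts at a training cutoff date and scores a three-way (true, false, uncertain) predictor as a selective classifier. It is not the authors' implementation: the `Event` fields, the function names `split_by_cutoff` and `selective_accuracy`, and the use of `None` to represent the uncertain class are all assumptions made for this example, and the actual TYMES schema and UCCS probe may differ.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Event:
    """Hypothetical record layout; the real TYMES schema may differ."""
    statement: str     # the fact, as text
    is_true: bool      # True for a true fact, False for a falsified one
    occurred_on: date  # timestamp paired with the fact


def split_by_cutoff(events: List[Event], cutoff: date) -> Tuple[List[Event], List[Event]]:
    """Split events into those on or before the cutoff and those after it."""
    known = [e for e in events if e.occurred_on <= cutoff]
    beyond = [e for e in events if e.occurred_on > cutoff]
    return known, beyond


def selective_accuracy(
    predictions: List[Optional[bool]], labels: List[bool]
) -> Tuple[Optional[float], float]:
    """Score a selective classifier.

    A prediction of None stands for the "uncertain" class (an abstention).
    Returns (accuracy on answered items, coverage), where coverage is the
    fraction of items that received a true/false prediction.
    """
    answered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(answered) / len(labels) if labels else 0.0
    if not answered:
        return None, coverage
    accuracy = sum(p == y for p, y in answered) / len(answered)
    return accuracy, coverage
```

Under these assumptions, a model with a given training cutoff would ideally be labeled uncertain on the `beyond` subset, and filtering out abstentions trades coverage for accuracy on the statements that remain.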
