Skip to yearly menu bar Skip to main content

Workshop: I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models

Do Language Models Know When They're Hallucinating References?

Ayush Agrawal · Mirac Suzgun · Lester Mackey · Adam Tauman Kalai


State-of-the-art language models (LMs) are famous for "hallucinating'' references. These fabricated article and book titles lead to harms, obstacles to their use, and public backlash. While other types of LM hallucinations are also important, we propose hallucinated references as the "drosophila'' of research on hallucination in large language models (LLMs), as they are particularly easy to study. We show that simple search engine queries reliably identify such hallucinations, which facilitates evaluation. To begin to dissect the nature of hallucinated LM references, we attempt to classify them using black-box queries to the same LM, without consulting any external resources. Consistency checks done with direct queries about whether the generated reference title is real (inspired by Kadavath et al. (2022), Lin et al. (2022) and Manakul (2023)) are compared to consistency checks with indirect queries which ask for ancillary details such as the authors of the work. These consistency checks are found to be partially reliable indicators of whether or not the reference is a hallucination.In particular, we find that LMs often hallucinate differing authors of hallucinated references when queried in independent sessions, while consistently identify authors of real references. This suggests that the hallucination may be more a generation issue than inherent to current training techniques or representation.

Chat is not available.