Language Models use Lookbacks to Track Beliefs
Abstract
How do language models (LMs) represent characters’ beliefs, especially when those beliefs differ from reality? We analyze Llama-3-70B-Instruct on Theory of Mind (ToM) reasoning tasks. Using a dataset of short stories in which characters act on objects under partial visibility, we uncover a pervasive algorithmic pattern that we call the \textit{lookback mechanism}. This mechanism allows the LM to recall information when it is needed: the model binds character–object–state triples via Ordering IDs (OIs) and later retrieves them through pointer–address dereferencing in the residual stream. We identify three key lookbacks: the binding lookback, the answer lookback, and the visibility lookback. Our work provides insight into the LM's belief-tracking mechanism, taking a step toward reverse-engineering ToM reasoning in LMs.
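To make the high-level idea concrete, the toy sketch below (not the mechanism recovered inside the model, and all names are hypothetical) binds character–object–state triples to Ordering IDs and then "dereferences" that address to recall the bound state when a belief question is asked.

```python
# Toy illustration of the binding-and-lookback idea: events are bound to
# Ordering IDs (OIs) as they occur, and a later query dereferences the
# matching OI to retrieve the stored state. This is a conceptual analogy,
# not the residual-stream mechanism described in the paper.

from dataclasses import dataclass


@dataclass(frozen=True)
class BoundTriple:
    character: str
    obj: str
    state: str


def bind_triples(events):
    """Assign each (character, object, state) event an Ordering ID
    (its position in the story), which acts as the 'address'."""
    return {oi: BoundTriple(*event) for oi, event in enumerate(events)}


def lookback(bindings, character, obj):
    """Dereference: find the OI whose bound triple matches the queried
    character and object, and return the state stored at that address."""
    for oi, triple in bindings.items():
        if triple.character == character and triple.obj == obj:
            return triple.state
    return None


if __name__ == "__main__":
    story_events = [
        ("Anne", "bottle", "water"),   # Anne fills the bottle with water
        ("Bob", "cup", "coffee"),      # Bob fills the cup with coffee
    ]
    bindings = bind_triples(story_events)
    # "What does Anne believe is in the bottle?" -> recall via the binding
    print(lookback(bindings, "Anne", "bottle"))  # prints "water"
```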