Language Models use Lookbacks to Track Beliefs
Abstract
How do language models (LMs) represent characters’ beliefs, especially when those beliefs differ from reality? We analyze Llama-3-70B-Instruct on Theory of Mind (ToM) reasoning tasks. Using a dataset of short stories in which characters act on objects under partial visibility, we uncover a pervasive algorithmic pattern that we call the \textit{lookback mechanism}. This mechanism allows the LM to recall information when it is needed: the model binds character–object–state triples via Ordering IDs (OIs) and later retrieves them through pointer–address dereferencing in the residual stream. We identify three key lookbacks: the binding lookback, the answer lookback, and the visibility lookback. Our work provides insight into the LM's belief-tracking mechanism, taking a step toward reverse-engineering ToM reasoning in LMs.
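To make the high-level idea concrete, the toy sketch below (not the mechanism recovered inside the model, and all names are hypothetical) binds character–object–state triples to Ordering IDs and then "dereferences" that address to recall the bound state when a belief question is asked.

```python
# Toy illustration of the binding-and-lookback idea: events are bound to
# Ordering IDs (OIs) as they occur, and a later query dereferences the
# matching OI to retrieve the stored state. This is a conceptual analogy,
# not the residual-stream mechanism described in the paper.

from dataclasses import dataclass


@dataclass(frozen=True)
class BoundTriple:
    character: str
    obj: str
    state: str


def bind_triples(events):
    """Assign each (character, object, state) event an Ordering ID
    (its position in the story), which acts as the 'address'."""
    return {oi: BoundTriple(*event) for oi, event in enumerate(events)}


def lookback(bindings, character, obj):
    """Dereference: find the OI whose bound triple matches the queried
    character and object, and return the state stored at that address."""
    for oi, triple in bindings.items():
        if triple.character == character and triple.obj == obj:
            return triple.state
    return None


if __name__ == "__main__":
    story_events = [
        ("Anne", "bottle", "water"),   # Anne fills the bottle with water
        ("Bob", "cup", "coffee"),      # Bob fills the cup with coffee
    ]
    bindings = bind_triples(story_events)
    # "What does Anne believe is in the bottle?" -> recall via the binding
    print(lookback(bindings, "Anne", "bottle"))  # prints "water"
```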