Skip to yearly menu bar Skip to main content

Workshop: Attributing Model Behavior at Scale (ATTRIB)

Summing Up the Facts: Additive Mechanisms behind Factual Recall in LLMs

Bilal Chughtai · Alan Cooney · Neel Nanda


How do large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of form \tokens{Fact: The Colosseum is in the country of}. We find that the mechanistic story behind factual recall is more complex than previously thought -- We show there exist four distinct and independent mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomena the \textbf{additive motif}: models compute correct answers through adding together multiple independent contributions; the contributions from each mechanism may be insufficient alone, but together they constructively interfere on the correct attribute when summed. In addition, we extend the method of direct logit attribution to attribute a head's output to individual source tokens. We use this technique to unpack what we call `mixed heads' -- which are themselves a pair of two separate additive updates from different source tokens.

Chat is not available.