Language models can associate objects with their features without forming integrated representations
Abstract
Language models (LMs) are adept at using in-context learning to associate objects with their features -- for example, given an input such as 'the table is small and orange and the phone is green and big, so the color of the phone is', they can correctly infer that the next word should be 'green'. How? One possibility is that they rely on integrated representations of objects that jointly encode feature information at specific token positions. Another is that they access disparate sets of feature-specific representations distributed across positions as needed. By applying causal mediation analysis to an LM performing a multi-object, multi-feature association task, we find a small set of upper-layer attention heads that search for and copy feature-specific representations based on the demands of a given query. These heads are necessary and sufficient for our task, suggesting that LMs do not rely on integrated object representations to complete it.
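To make the method concrete, the following is a minimal sketch of causal mediation analysis via activation patching on per-head attention outputs; it is not the paper's code. The toy one-layer model, the name ToyAttention, and the random token sequences standing in for the clean and corrupted prompts are all hypothetical. A head's indirect effect is estimated here as the shift in last-token logits when that head's clean-run output is patched into a run on the corrupted prompt.

```python
# Toy sketch of causal mediation / activation patching on attention heads.
# Everything here (model, prompts, metric) is illustrative, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
torch.set_grad_enabled(False)

class ToyAttention(nn.Module):
    """One multi-head self-attention layer whose per-head outputs can be patched."""
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, patch=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        z = att @ v                       # (B, n_heads, T, d_head): per-head outputs
        if patch is not None:             # overwrite one head with cached activations
            head, cached = patch
            z = z.clone()
            z[:, head] = cached[:, head]
        self.z = z.detach()               # cache per-head outputs for later patching
        return self.out(z.transpose(1, 2).reshape(B, T, D))

d_model, n_heads, vocab = 32, 4, 10
embed = nn.Embedding(vocab, d_model)
attn = ToyAttention(d_model, n_heads)
unembed = nn.Linear(d_model, vocab)

def last_token_logits(tokens, patch=None):
    return unembed(attn(embed(tokens), patch=patch))[:, -1]

clean   = torch.randint(vocab, (1, 6))   # stands in for the clean prompt
corrupt = torch.randint(vocab, (1, 6))   # stands in for the corrupted prompt

_ = last_token_logits(clean)             # clean run populates attn.z
cached = attn.z
corrupted = last_token_logits(corrupt)

# Indirect effect of each head: patch its clean-run output into the corrupted
# run and measure how much the last-token logits change as a result.
for head in range(n_heads):
    patched = last_token_logits(corrupt, patch=(head, cached))
    effect = (patched - corrupted).abs().mean().item()
    print(f"head {head}: indirect effect = {effect:.4f}")
```

In this framing, heads with large indirect effects are the candidate mediators; the abstract's claim corresponds to finding that a small set of upper-layer heads carries essentially all of this effect for the feature-association task.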