Poster in Workshop: CogInterp: Interpreting Cognition in Deep Learning Models

Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

James Michaelov ⋅ Catherine Arnett


Abstract

Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each condition over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
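
As a rough illustration of what disaggregating over conditions can look like in practice, the following minimal sketch computes accuracy per condition at each training checkpoint alongside the pooled accuracy. It is not the authors' code: the function name `disaggregate`, the record layout `(step, condition, correct)`, the condition labels `match` and `mismatch`, and the toy records are all assumptions made for illustration, and the values are placeholders rather than results.

```python
from collections import defaultdict


def disaggregate(records):
    """Group correctness scores by condition and checkpoint.

    `records` is an iterable of (step, condition, correct) tuples, where
    `correct` is True when the model preferred the grammatical form.
    Returns (by_condition, aggregate):
      by_condition: {condition: {step: accuracy}}
      aggregate:    {step: accuracy pooled over all conditions}
    """
    per_condition = defaultdict(lambda: defaultdict(list))
    pooled = defaultdict(list)
    for step, condition, correct in records:
        per_condition[condition][step].append(bool(correct))
        pooled[step].append(bool(correct))

    def mean(values):
        return sum(values) / len(values)

    by_condition = {
        condition: {step: mean(vals) for step, vals in sorted(steps.items())}
        for condition, steps in per_condition.items()
    }
    aggregate = {step: mean(vals) for step, vals in sorted(pooled.items())}
    return by_condition, aggregate


# Hypothetical placeholder records, NOT data from the paper.
# "match": the intervening noun has the same number as the subject.
# "mismatch": the intervening noun differs in number from the subject.
toy_records = [
    (100, "match", True), (100, "match", True),
    (100, "mismatch", False), (100, "mismatch", False),
    (1000, "match", True), (1000, "match", True),
    (1000, "mismatch", True), (1000, "mismatch", False),
]

by_condition, aggregate = disaggregate(toy_records)
for condition, curve in by_condition.items():
    print(condition, curve)
print("aggregate", aggregate)
```

In this toy setup the pooled score at a checkpoint can look intermediate while one condition is at ceiling and another at floor, which is the kind of pattern that only becomes visible once the conditions are separated.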
