Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability
Abstract
Mechanistic interpretability aims to uncover the causal processes that explain language model behaviour by identifying internal representations and circuits. Yet the search space of possible alignment maps (functions linking hidden representations to hypothesised causal variables) remains essentially unconstrained. We show how this lack of structure limits reproducibility, generalisation, and the explanatory adequacy of causal claims. In this position paper, we argue that linguistic theory provides theory-driven templates: explicit, falsifiable constraints on candidate alignment maps that enable a move from ad hoc circuit discovery toward a principled, hypothesis-driven science of causal abstraction. We offer case studies illustrating the restrictions linguistics can place on alignment maps, focusing on phonology and morphology and on the contrasting properties of phoneme-level models, text-trained language models, and multimodal speech-text models. By framing interpretability as a hypothesis-driven, linguistically grounded science, we move closer to a program in which mechanistic claims are cumulative, comparative, and testable across models.