Poster
in
Workshop: CogInterp: Interpreting Cognition in Deep Learning Models

Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability

Suchir Salhan · Konstantinos Voudouris

Project Page [ Slides] [ Poster] [ OpenReview]

Abstract

Mechanistic Interpretability aims to uncover the causal processes that explain language model behaviour by identifying internal representations and circuits. Yet the search space of possible alignment maps—functions linking hidden representations to hypothesized causal variables—remains essentially unconstrained. We show how this lack of structure limits reproducibility, generalisation, and the explanatory adequacy of causal claims. In this position paper, we argue that linguistic theory provides theory-driven templates for explicit, falsifiable constraints on candidate alignment maps to move from ad hoc circuit discovery toward a principled, hypothesis-driven science of causal abstraction. We offer case studies to highlight possible restrictions provided by linguistics on alignment maps, focusing on phonology and morphology and the contrastive properties of phoneme-level models, text-trained language models, and multimodal speech-text models. By framing interpretability as a hypothesis-driven, linguistically grounded science, we move closer to a program where mechanistic claims can be cumulative, comparative, and testable across models.

Chat is not available.