Workshop: Causal Representation Learning

What's your Use Case? A Taxonomy of Causal Evaluations of Post-hoc Interpretability

David Reber · Victor Veitch

Keywords: [ Causality ] [ Post-hoc Interpretability ] [ Evals ] [ Faithfulness ]


Post-hoc interpretability of Large Language Models (LLMs) often aims for mechanistic interpretations—detailed, causal accounts of model behavior. However, human interpreters may lack the capacity or willingness to formulate such intricate models, let alone evaluate them. This paper addresses this challenge by introducing a structured taxonomy grounded in the causal hierarchy. The taxonomy dissects the overarching goal of mechanistic interpretability into constituent claims, each requiring distinct evaluation methods. In doing so, it transforms these evaluation criteria into actionable learning objectives, providing a data-driven pathway to interpretability. This framework enables a methodologically rigorous yet pragmatic approach to evaluating the strengths and limitations of various interpretability tools.