

Poster

Optimal ablation for model internals

Maximilian Li · Lucas Janson


Abstract:

Interpretability work often involves tracing the flow of information through machine learning models to identify the specific model components that perform relevant computations for tasks of interest. One important concept for localizing model behavior is component ablation, or simulating the removal of some model components to isolate specific causal relationships. Previous work simulates component removal with a variety of heuristic causal interventions, such as adding Gaussian noise or setting values to zero or to their means over an input distribution, to quantify the relative importance of various model components for performance on interpretable subtasks. We argue for the adoption of optimal ablation of activations for studying model internals and show that it has theoretical and empirical advantages over popular methods for component ablation. We show that optimal ablation can improve algorithmic circuit discovery (the identification of sparse subnetworks that recover low loss on interpretable subtasks) and produces tools that benefit other use cases related to model internals, including localization of factual recall and prediction with latent representations.
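To make the ablation variants contrasted above concrete, the following is a minimal PyTorch sketch of zero, mean, and Gaussian-noise ablation of a single activation, together with an "optimal" variant that learns a constant replacement vector minimizing the subtask loss. All module and variable names (the toy model, `hidden`, `task` data, learning rate, step count) are illustrative assumptions, not the authors' code, and the optimization shown is only in the spirit of the paper's method.

```python
# Sketch of ablation strategies on one hidden activation (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model; we ablate the output of `hidden` (the "component" under study).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
hidden = model[0]

x = torch.randn(256, 16)            # stand-in input distribution
y = torch.randint(0, 4, (256,))     # stand-in labels for the subtask
loss_fn = nn.CrossEntropyLoss()

def run_with_replacement(replace_fn):
    """Run the model while overwriting `hidden`'s output via `replace_fn`."""
    handle = hidden.register_forward_hook(lambda mod, inp, out: replace_fn(out))
    try:
        return loss_fn(model(x), y)
    finally:
        handle.remove()

# Heuristic ablations discussed in the abstract.
with torch.no_grad():
    mean_act = hidden(x).mean(dim=0)    # mean over the input distribution
    zero_loss = run_with_replacement(lambda a: torch.zeros_like(a)).item()
    mean_loss = run_with_replacement(lambda a: mean_act.expand_as(a)).item()
    noise_loss = run_with_replacement(lambda a: a + torch.randn_like(a)).item()

# "Optimal" ablation (sketch): replace the activation with a single constant
# vector chosen to minimize loss on the subtask, found by gradient descent.
v = mean_act.clone().requires_grad_(True)
opt = torch.optim.Adam([v], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = run_with_replacement(lambda a: v.expand_as(a))
    loss.backward()
    opt.step()

print(f"zero: {zero_loss:.3f}  mean: {mean_loss:.3f}  "
      f"noise: {noise_loss:.3f}  optimal: {loss.item():.3f}")
```

The forward hook leaves the model's weights untouched and only swaps the component's output at run time; circuit-discovery methods typically apply such interventions to many components at once and compare the resulting losses.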
