
Workshop: Attributing Model Behavior at Scale (ATTRIB)

Automatic Discovery of Visual Circuits

Achyuta Rajaram · Neil Chowdhury · Antonio Torralba · Jacob Andreas · Sarah Schwettmann


To date, most discoveries of subnetworks that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting the subgraph of a vision model’s computational graph that underlies a particular capability. In this paper, we formulate capabilities as mappings of human-interpretable visual concepts to intermediate feature representations. We introduce a new method for identifying these subnetworks: we specify a visual concept using a few examples, and then trace the interdependence of neuron activations across layers, that is, their functional connectivity. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models against adversarial attacks.
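As a rough illustration of what tracing functional connectivity across layers could look like, the sketch below scores each unit in an upstream layer by how strongly its activation correlates with a downstream unit over a small set of concept examples. It is not the authors' method: Pearson correlation is an assumed stand-in for the connectivity measure, the activations are synthetic, and the names `cross_layer_correlation` and `top_upstream_units` are hypothetical.

```python
import numpy as np


def cross_layer_correlation(acts_a, acts_b, eps=1e-8):
    """Pearson correlation between every unit in layer A and every unit in layer B.

    acts_a: (n_examples, n_units_a) activations on the concept examples.
    acts_b: (n_examples, n_units_b) activations on the same examples.
    Returns an (n_units_a, n_units_b) matrix.
    """
    # Center each unit, then scale to unit norm so the dot product is a correlation.
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    return a.T @ b


def top_upstream_units(corr, target_unit, k):
    """Indices of the k upstream units most correlated (in magnitude) with target_unit."""
    scores = np.abs(corr[:, target_unit])
    return np.argsort(scores)[::-1][:k]


# Synthetic stand-in for recorded activations (assumption: no real model is used).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 6))
acts_b = np.stack(
    [
        # Downstream unit 0 is driven by upstream units 1 and 4, plus small noise.
        acts_a[:, 1] + acts_a[:, 4] + 0.1 * rng.normal(size=200),
        # Downstream unit 1 is independent noise.
        rng.normal(size=200),
    ],
    axis=1,
)

corr = cross_layer_correlation(acts_a, acts_b)
selected = sorted(top_upstream_units(corr, target_unit=0, k=2).tolist())
print(selected)
```

Repeating the selection layer by layer, starting from a unit tied to the concept, would yield a candidate subgraph. Correlation alone does not establish a causal role; that would have to be checked by ablating or editing the selected units, which is the kind of causal test the abstract reports.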
