Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate methods that use pretrained language models (LMs) to produce code-based and natural language descriptions of function behavior. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built with an off-the-shelf LM augmented with black-box access to functions, can sometimes infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, FIND also reveals that LM-based descriptions capture global function behavior while missing local details. These results suggest that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.
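The hypothesize, experiment, update loop described in the abstract can be illustrated with a toy sketch. This is not the FIND code or the AIA: every name below (`make_noisy_function`, `probe`, `fit_line`, `describe`) is hypothetical, the "agent" is a hand-written heuristic rather than a language model, and the function family (an absolute value composed with an affine map, plus Gaussian noise) is an assumption chosen only to show black-box probing of a noisy, composed numeric function.

```python
import random
from typing import Callable, List, Tuple

Observation = Tuple[float, float]


def make_noisy_function(seed: int = 0, noise_std: float = 0.1) -> Callable[[float], float]:
    """Hypothetical stand-in for a benchmark function: abs(slope * x + offset) plus noise."""
    rng = random.Random(seed)
    slope = rng.choice([2.0, 3.0, 5.0])
    offset = rng.choice([-1.0, 0.0, 1.0])

    def f(x: float) -> float:
        inner = slope * x + offset  # first operation: affine map
        outer = abs(inner)          # second operation, composed with the first
        return outer + rng.gauss(0.0, noise_std)  # additive output noise

    return f


def probe(f: Callable[[float], float], inputs: List[float]) -> List[Observation]:
    """Black-box access: the caller sees only input/output pairs."""
    return [(x, f(x)) for x in inputs]


def fit_line(obs: List[Observation]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(obs)
    mean_x = sum(x for x, _ in obs) / n
    mean_y = sum(y for _, y in obs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in obs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def describe(f: Callable[[float], float], tolerance: float = 1.0) -> str:
    """Form a hypothesis, test it on new inputs, and revise the description."""
    # Experiment 1: probe one region and hypothesize a linear rule.
    slope, intercept = fit_line(probe(f, [1.0, 2.0, 3.0, 4.0]))
    rule = f"{slope:.1f} * x {intercept:+.1f}"
    description = f"approximately {rule}"

    # Experiment 2: test the hypothesis in a region not yet observed.
    left = probe(f, [-4.0, -3.0, -2.0, -1.0])
    linear_err = max(abs(y - (slope * x + intercept)) for x, y in left)
    if linear_err > tolerance:
        # The linear rule fails here; check a revised hypothesis before adopting it.
        abs_err = max(abs(y - abs(slope * x + intercept)) for x, y in left)
        if abs_err <= tolerance:
            description = f"approximately abs({rule})"
        else:
            description = f"approximately {rule} for x >= 1; differs for x <= -1"
    return description


if __name__ == "__main__":
    print(describe(make_noisy_function(seed=0)))
```

With only eight probes, the sketch recovers the overall shape of the function but says nothing about the noise level or about behavior near the kink between the two probed regions, which loosely mirrors the abstract's observation that descriptions can capture global behavior while missing local details.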
Author Information
Sarah Schwettmann (M.I.T.)
Tamar Shaham (Massachusetts Institute of Technology)
Joanna Materzynska (Massachusetts Institute of Technology)
Neil Chowdhury (Massachusetts Institute of Technology)
Shuang Li (Massachusetts Institute of Technology)
Jacob Andreas (MIT)
David Bau (Northeastern University)
Antonio Torralba (MIT)
More from the Same Authors
-
2020 : Horses With Blue Jeans - Creating New Worlds by Rewriting a GAN »
David Bau -
2020 : Latent Compass »
Sarah Schwettmann -
2021 Spotlight: Learning to Compose Visual Relations »
Nan Liu · Shuang Li · Yilun Du · Josh Tenenbaum · Antonio Torralba -
2021 : 3D Neural Scene Representations for Visuomotor Control »
Yunzhu Li · Shuang Li · Vincent Sitzmann · Pulkit Agrawal · Antonio Torralba -
2021 : Learning to Compose Visual Relations »
Nan Liu · Shuang Li · Yilun Du -
2023 : The Consensus Game: Language Model Generation via Equilibrium Search »
Athul Jacob · Yikang Shen · Gabriele Farina · Jacob Andreas -
2023 : Learning Interpretable Libraries by Compressing and Documenting Code »
Gabriel Grand · Catherine Wong · Matthew Bowers · Theo X. Olausson · Muxin Liu · Josh Tenenbaum · Jacob Andreas -
2023 : Testing Language Model Agents Safely in the Wild »
Silen Naihin · David Atkinson · Marc Green · Merwane Hamadi · Craig Swift · Douglas Schonholtz · Adam Tauman Kalai · David Bau -
2023 : Modeling Boundedly Rational Agents with Latent Inference Budgets »
Athul Jacob · Abhishek Gupta · Jacob Andreas -
2023 : An Alternative to Regulation: The Case for Public AI »
Nicholas Vincent · David Bau · Sarah Schwettmann · Joshua Tan -
2023 : MMToM-QA: Multimodal Theory of Mind Question Answering »
Chuanyang Jin · Yutong Wu · Jing Cao · Jiannan Xiang · Yen-Ling Kuo · Zhiting Hu · Tomer Ullman · Antonio Torralba · Josh Tenenbaum · Tianmin Shu -
2023 : Compositional Foundation Models for Hierarchical Planning »
Anurag Ajay · Seungwook Han · Yilun Du · Shuang Li · Abhi Gupta · Tommi Jaakkola · Josh Tenenbaum · Leslie Kaelbling · Akash Srivastava · Pulkit Agrawal -
2023 : Evaluating the Utility of Model Explanations for Model Development »
Shawn Im · Jacob Andreas · Yilun Zhou -
2023 : Automatic Discovery of Visual Circuits »
Achyuta Rajaram · Neil Chowdhury · Antonio Torralba · Jacob Andreas · Sarah Schwettmann -
2023 Poster: The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks »
Ziqian Zhong · Ziming Liu · Max Tegmark · Jacob Andreas -
2023 Oral: The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks »
Ziqian Zhong · Ziming Liu · Max Tegmark · Jacob Andreas -
2023 Poster: Compositional Foundation Models for Hierarchical Planning »
Anurag Ajay · Seungwook Han · Yilun Du · Shuang Li · Abhi Gupta · Tommi Jaakkola · Josh Tenenbaum · Leslie Kaelbling · Akash Srivastava · Pulkit Agrawal -
2023 Poster: 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes »
Haotian Xue · Antonio Torralba · Josh Tenenbaum · Dan Yamins · Yunzhu Li · Hsiao-Yu Tung -
2023 Poster: Structure from Duplicates: Neural Inverse Graphics from a Pile of Objects »
Tianhang Cheng · Wei-Chiu Ma · Kaiyu Guan · Antonio Torralba · Shenlong Wang -
2022 Workshop: LaReL: Language and Reinforcement Learning »
Laetitia Teodorescu · Laura Ruis · Tristan Karch · Cédric Colas · Paul Barde · Jelena Luketina · Athul Jacob · Pratyusha Sharma · Edward Grefenstette · Jacob Andreas · Marc-Alexandre Côté -
2022 Poster: Locating and Editing Factual Associations in GPT »
Kevin Meng · David Bau · Alex Andonian · Yonatan Belinkov -
2022 Poster: Procedural Image Programs for Representation Learning »
Manel Baradad · Richard Chen · Jonas Wulff · Tongzhou Wang · Rogerio Feris · Antonio Torralba · Phillip Isola -
2022 Poster: Learning Neural Acoustic Fields »
Andrew Luo · Yilun Du · Michael Tarr · Josh Tenenbaum · Antonio Torralba · Chuang Gan -
2022 Poster: Pre-Trained Language Models for Interactive Decision-Making »
Shuang Li · Xavier Puig · Chris Paxton · Yilun Du · Clinton Wang · Linxi Fan · Tao Chen · De-An Huang · Ekin Akyürek · Anima Anandkumar · Jacob Andreas · Igor Mordatch · Antonio Torralba · Yuke Zhu -
2022 Poster: ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment »
Joseph DelPreto · Chao Liu · Yiyue Luo · Michael Foshey · Yunzhu Li · Antonio Torralba · Wojciech Matusik · Daniela Rus -
2021 : Q/A Session »
Alice Xiang · Jacob Andreas -
2021 : [IT5] Natural language descriptions of deep features »
Jacob Andreas -
2021 : Spotlights »
Hager Radi · Krishan Rana · Yunzhu Li · Shuang Li · Gal Leibovich · Guy Jacob · Ruihan Yang -
2021 Poster: Learning to Compose Visual Relations »
Nan Liu · Shuang Li · Yilun Du · Josh Tenenbaum · Antonio Torralba -
2021 Poster: Unsupervised Learning of Compositional Energy Concepts »
Yilun Du · Shuang Li · Yash Sharma · Josh Tenenbaum · Igor Mordatch -
2021 Poster: Editing a classifier by rewriting its prediction rules »
Shibani Santurkar · Dimitris Tsipras · Mahalaxmi Elango · David Bau · Antonio Torralba · Aleksander Madry -
2021 Poster: Teachable Reinforcement Learning via Advice Distillation »
Olivia Watkins · Abhishek Gupta · Trevor Darrell · Pieter Abbeel · Jacob Andreas -
2020 Poster: A Benchmark for Systematic Generalization in Grounded Language Understanding »
Laura Ruis · Jacob Andreas · Marco Baroni · Diane Bouchacourt · Brenden Lake -
2020 Poster: Compositional Explanations of Neurons »
Jesse Mu · Jacob Andreas -
2020 Oral: Compositional Explanations of Neurons »
Jesse Mu · Jacob Andreas -
2020 Poster: Compositional Visual Generation with Energy Based Models »
Yilun Du · Shuang Li · Igor Mordatch -
2020 Spotlight: Compositional Visual Generation with Energy Based Models »
Yilun Du · Shuang Li · Igor Mordatch -
2020 Poster: Debiased Contrastive Learning »
Ching-Yao Chuang · Joshua Robinson · Yen-Chen Lin · Antonio Torralba · Stefanie Jegelka -
2020 Spotlight: Debiased Contrastive Learning »
Ching-Yao Chuang · Joshua Robinson · Yen-Chen Lin · Antonio Torralba · Stefanie Jegelka -
2019 : Panel Discussion »
Jacob Andreas · Edward Gibson · Stefan Lee · Noga Zaslavsky · Jason Eisner · Jürgen Schmidhuber -
2019 : Invited Talk - 4 »
Jacob Andreas -
2018 : Panel Discussion »
Antonio Torralba · Douwe Kiela · Barbara Landau · Angeliki Lazaridou · Joyce Chai · Christopher Manning · Stevan Harnad · Roozbeh Mottaghi -
2018 : Antonio Torralba - Learning to See and Hear »
Antonio Torralba -
2018 Poster: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding »
Kexin Yi · Jiajun Wu · Chuang Gan · Antonio Torralba · Pushmeet Kohli · Josh Tenenbaum -
2018 Poster: 3D-Aware Scene Manipulation via Inverse Graphics »
Shunyu Yao · Tzu Ming Hsu · Jun-Yan Zhu · Jiajun Wu · Antonio Torralba · Bill Freeman · Josh Tenenbaum -
2018 Poster: Speaker-Follower Models for Vision-and-Language Navigation »
Daniel Fried · Ronghang Hu · Volkan Cirik · Anna Rohrbach · Jacob Andreas · Louis-Philippe Morency · Taylor Berg-Kirkpatrick · Kate Saenko · Dan Klein · Trevor Darrell -
2018 Spotlight: Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding »
Kexin Yi · Jiajun Wu · Chuang Gan · Antonio Torralba · Pushmeet Kohli · Josh Tenenbaum -
2017 : Afternoon Panel discussion »
Brian Skyrms · Satinder Singh · Jacob Andreas -
2017 : Poster session (and Coffee Break) »
Jacob Andreas · Kun Li · Conner Vercellino · Thomas Miconi · Wenpeng Zhang · Luca Franceschi · Zheng Xiong · Karim Ahmed · Laurent Itti · Tim Klinger · Mostafa Rohaninejad -
2016 Poster: Generating Videos with Scene Dynamics »
Carl Vondrick · Hamed Pirsiavash · Antonio Torralba -
2016 Poster: SoundNet: Learning Sound Representations from Unlabeled Video »
Yusuf Aytar · Carl Vondrick · Antonio Torralba -
2015 Poster: Skip-Thought Vectors »
Jamie Kiros · Yukun Zhu · Russ Salakhutdinov · Richard Zemel · Raquel Urtasun · Antonio Torralba · Sanja Fidler -
2015 Poster: Where are they looking? »
Adria Recasens · Aditya Khosla · Carl Vondrick · Antonio Torralba -
2015 Spotlight: Where are they looking? »
Adria Recasens · Aditya Khosla · Carl Vondrick · Antonio Torralba -
2015 Poster: Learning visual biases from human imagination »
Carl Vondrick · Hamed Pirsiavash · Aude Oliva · Antonio Torralba -
2015 Poster: On the Accuracy of Self-Normalized Log-Linear Models »
Jacob Andreas · Maxim Rabinovich · Michael Jordan · Dan Klein -
2014 Poster: Unsupervised Transcription of Piano Music »
Taylor Berg-Kirkpatrick · Jacob Andreas · Dan Klein -
2014 Spotlight: Unsupervised Transcription of Piano Music »
Taylor Berg-Kirkpatrick · Jacob Andreas · Dan Klein