
Workshop: XAI in Action: Past, Present, and Future Applications

Explaining black box text modules in natural language with language models

Chandan Singh · Aliyah Hsu · Richard Antonello · Shailee Jain · Alexander Huth · Bin Yu · Jianfeng Gao

[ Project Page ]
Sat 16 Dec 12:01 p.m. PST — 1 p.m. PST


Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A text module is any function that maps text to a continuous scalar value, such as a submodule within an LLM or a fitted model of a brain region. Black box indicates that we only have access to the module's inputs/outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity, along with a score for how reliable the explanation is. We study SASC in 2 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals.
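The two-step idea in the abstract can be sketched concretely. Below is a minimal, hypothetical illustration (not the authors' implementation): a "text module" is just a function from text to a scalar, the summarize step keeps the inputs that elicit the highest responses, and the score step measures how much more the module responds to explanation-related text than to baseline text. All names and the toy food-word module are assumptions for illustration only.

```python
# Hypothetical sketch of the text-module abstraction and a
# summarize-and-score loop in the spirit of SASC. Not the paper's code.

def food_module(text: str) -> float:
    """A toy black box text module: maps text to a scalar
    (here, the fraction of words that are food-related)."""
    food_words = {"pizza", "pasta", "bread", "cheese", "apple"}
    words = text.lower().split()
    return sum(w in food_words for w in words) / max(len(words), 1)

def summarize(candidate_texts, module, top_k=3):
    """Summarize step: keep the candidate inputs that elicit the
    module's highest responses (these seed the explanation)."""
    return sorted(candidate_texts, key=module, reverse=True)[:top_k]

def score(explanation_texts, baseline_texts, module):
    """Score step: explanation reliability as the mean response gap
    between explanation-related text and unrelated baseline text."""
    mean = lambda xs: sum(map(module, xs)) / len(xs)
    return mean(explanation_texts) - mean(baseline_texts)
```

A positive score indicates that text matching the candidate explanation drives the module's output more than baseline text does, which is the sense in which the explanation is "reliable."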
