Investigating Gender Bias in Language Models Using Causal Mediation Analysis

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, Stuart Shieber

Spotlight presentation: Orals & Spotlights Track 03: Language/Audio Applications
Mon, Dec 7, 2020, 19:10–19:20 PST
Poster Session 1
Mon, Dec 7, 2020, 21:00–23:00 PST
Abstract: Many interpretation methods for neural models in natural language processing investigate how information is encoded inside hidden representations. However, these methods can only measure whether the information exists, not whether it is actually used by the model. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. The approach enables us to analyze the mechanisms that facilitate the flow of information from input to output through various model components, known as mediators. As a case study, we apply this methodology to analyzing gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are concentrated in specific components of the model that may exhibit highly specialized behavior.
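To make the intervention idea concrete, here is a minimal sketch of the two kinds of interventions the abstract describes: changing the input to measure a total effect, and patching a single model component (a mediator) to measure an indirect effect. This is an illustrative sketch only, assuming GPT-2 via the Hugging Face transformers library; the prompts, the choice of layer and unit, and the ratio-based effect formula are simplified assumptions, not the paper's exact protocol.

```python
# Illustrative causal-mediation sketch on GPT-2 (not the paper's exact setup).
# Assumes the Hugging Face `transformers` library; prompts, LAYER, and UNIT
# below are hypothetical choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

SHE = tok.encode(" she")[0]
HE = tok.encode(" he")[0]

def bias_measure(prompt):
    """Ratio p(she) / p(he) for the token following `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return (probs[SHE] / probs[HE]).item()

# Total effect: intervene on the input itself, replacing the ambiguous
# profession cue with an explicit gender cue, and compare the model's
# gendered continuation preference.
base = bias_measure("The nurse said that")
set_gender = bias_measure("The man said that")
total_effect = set_gender / base - 1.0

# Indirect effect of one hidden unit: record its activation under the
# set-gender input, patch that value into a run on the original input,
# and re-measure. LAYER/UNIT are hypothetical; the paper searches over
# neurons and attention heads as candidate mediators.
LAYER, UNIT = 5, 123
captured = {}

def record(module, inputs, output):
    # output[0] holds the block's hidden states: (batch, seq, hidden)
    captured["act"] = output[0][0, -1, UNIT].clone()

def patch(module, inputs, output):
    # Overwrite the unit's activation at the final position in place.
    output[0][0, -1, UNIT] = captured["act"]

handle = model.transformer.h[LAYER].register_forward_hook(record)
bias_measure("The man said that")              # counterfactual run: record
handle.remove()

handle = model.transformer.h[LAYER].register_forward_hook(patch)
patched = bias_measure("The nurse said that")  # original input, patched unit
handle.remove()

indirect_effect = patched / base - 1.0
print(f"total effect: {total_effect:+.3f}, "
      f"indirect effect: {indirect_effect:+.3f}")
```

Aggregating such per-mediator effects over many prompts and candidate components is what lets the analysis locate the neurons and attention heads where bias effects concentrate, as the abstract describes.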
