Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representations. Removing such concepts is non-trivial because of the complex relationship between the concept, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier, when the concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features, and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
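The abstract's core argument, that a probing classifier trained to detect a concept can rely on correlated but non-concept features and therefore misguide removal, can be illustrated with a small numerical sketch. The example below is an illustrative toy, not the paper's method or data: the two-dimensional representation, the correlation level, and the single nullspace-projection step (in the spirit of post-hoc linear removal methods) are all assumptions made here for demonstration.

```python
# A minimal, illustrative sketch (not the paper's code or data) of why a
# probing classifier can be an unreliable proxy for a concept. We build a toy
# 2-D representation in which coordinate 0 encodes the concept and coordinate 1
# encodes the main-task label, with the two correlated in the data. A linear
# probe for the concept then mixes both coordinates, so projecting its weight
# direction out of the representation (a single nullspace-projection step, in
# the spirit of post-hoc linear removal methods) does not cleanly isolate the
# concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
concept = rng.integers(0, 2, size=n)                          # undesired/sensitive concept
task = np.where(rng.random(n) < 0.8, concept, 1 - concept)    # task label, correlated with the concept

# Toy representation: [concept feature, task feature] plus noise.
X = np.stack([concept + 0.3 * rng.standard_normal(n),
              task + 0.1 * rng.standard_normal(n)], axis=1)

# Probe for the concept. Because concept and task are correlated, its weight
# vector is not aligned with the true concept axis.
probe = LogisticRegression().fit(X, concept)
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe direction (would be [1, 0] for a 'pure' concept probe):", np.round(w, 2))

# Post-hoc "removal": project the representation onto the nullspace of w.
X_removed = X - np.outer(X @ w, w)

# Retrain probes on the projected representation. The concept typically stays
# decodable above chance (removal is incomplete), and repeating such projection
# steps can also erode the task-relevant feature.
for name, Z in [("original", X), ("after projection", X_removed)]:
    concept_acc = LogisticRegression().fit(Z, concept).score(Z, concept)
    task_acc = LogisticRegression().fit(Z, task).score(Z, task)
    print(f"{name:>17}: concept acc = {concept_acc:.2f}, task acc = {task_acc:.2f}")
```

In this toy setting the probe's weight vector mixes the concept and task coordinates; iterating the projection, as practical post-hoc methods do, compounds the problem and can eventually remove task-relevant directions as well, which is the worst case the abstract describes.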
Author Information
Abhinav Kumar (Microsoft Research, India)
Chenhao Tan (University of Chicago)
Amit Sharma (Microsoft Research)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Probing Classifiers are Unreliable for Concept Removal and Detection
More from the Same Authors
- 2022: Pragmatic AI Explanations
  Shi Feng · Chenhao Tan
- 2022: Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Systems
  Parikshit Bansal · Yashoteja Prabhu · Emre Kiciman · Amit Sharma
- 2022: A Causal AI Suite for Decision-Making
  Emre Kiciman · Eleanor Dillon · Darren Edge · Adam Foster · Joel Jennings · Chao Ma · Robert Ness · Nick Pawlowski · Amit Sharma · Cheng Zhang
- 2022: Deep End-to-end Causal Inference
  Tomas Geffner · Javier Antorán · Adam Foster · Wenbo Gong · Chao Ma · Emre Kiciman · Amit Sharma · Angus Lamb · Martin Kukla · Nick Pawlowski · Miltiadis Allamanis · Cheng Zhang
- 2022: Counterfactual Generation Under Confounding
  Abbavaram Gowtham Reddy · Saloni Dash · Amit Sharma · Vineeth N Balasubramanian
- 2022: The Counterfactual-Shapley Value: Attributing Change in System Metrics
  Amit Sharma · Hua Li · Jian Jiao
- 2022 Spotlight: Lightning Talks 1B-1
  Qitian Wu · Runlin Lei · Rongqin Chen · Luca Pinchetti · Yangze Zhou · Abhinav Kumar · Hans Hao-Hsun Hsu · Wentao Zhao · Chenhao Tan · Zhen Wang · Shenghui Zhang · Yuesong Shen · Tommaso Salvatori · Gitta Kutyniok · Zenan Li · Amit Sharma · Leong Hou U · Yordan Yordanov · Christian Tomani · Bruno Ribeiro · Yaliang Li · David P Wipf · Daniel Cremers · Bolin Ding · Beren Millidge · Ye Li · Yuhang Song · Junchi Yan · Zhewei Wei · Thomas Lukasiewicz