Poster in Workshop: CogInterp: Interpreting Cognition in Deep Learning Models

Actual or counterfactual? Asymmetric responsibility attributions in language models

Eric Bigelow ⋅ Yang Xiang ⋅ Tobias Gerstenberg ⋅ Tomer Ullman ⋅ Samuel J Gershman

Abstract

We investigate how language models assign responsibility to collaborators. We instruct ten large language models from three different companies to assign responsibility to agents in a collaborative task, then compare their responses to seven existing cognitive models of responsibility attribution. We find that, while humans use actual and counterfactual effort to assign responsibility to collaborators, LLMs primarily use force. This divergence is asymmetric: it appears when the models evaluate collaboration failures rather than successes. Our results highlight the similarities and differences between LLMs and humans in responsibility attributions and demonstrate the promise of interpreting LLM behavior using cognitive theories.
