Poster in Workshop: CogInterp: Interpreting Cognition in Deep Learning Models

Detecting Motivated Reasoning in the Internal Representations of Language Models

Parsa Mirtaheri ⋅ Misha Belkin


Abstract

Large language models (LLMs) sometimes produce chains of thought (CoT) that do not faithfully reflect their internal reasoning. In particular, a biased context containing a hint can cause a model to change its answer and rationalize the hinted option without acknowledging its reliance on the hint, a form of unfaithful motivated reasoning. We investigate this phenomenon in the Qwen2.5-7B-Instruct model on the MMLU benchmark and show that motivated reasoning can be detected in the model’s internal representations. We train non-linear probes on the model’s residual stream and find that the hinted option is consistently predictable from representations at the end of the CoT. Focusing on cases where the model changes its answer to the hinted option without mentioning the hint, we demonstrate that probes can (i) predict, from internal representations early in the CoT, whether the model will follow the hint, and (ii) determine, from internal representations at the end of the CoT, whether a hint-consistent final answer was counterfactually dependent on the hint.
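
To make the probing setup concrete, the following is a minimal sketch of a non-linear probe that predicts a hinted option from activation vectors. It is not the authors' implementation: the abstract does not specify the probe architecture, the layers or token positions probed, or the training procedure. The sketch assumes a one-hidden-layer MLP trained with full-batch gradient descent, four answer options, and synthetic Gaussian vectors standing in for residual-stream activations. All function names and hyperparameters are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n, d, n_classes, rng):
    """Hypothetical stand-in for residual-stream vectors (not real model data).

    Each example is a class-dependent mean plus Gaussian noise; the class
    plays the role of the hinted option.
    """
    labels = rng.integers(0, n_classes, size=n)
    centers = rng.normal(size=(n_classes, d))
    acts = centers[labels] + 0.5 * rng.normal(size=(n, d))
    return acts, labels


def forward(params, x):
    """Return hidden ReLU activations and class logits."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    return h, logits


def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_probe(x, y, hidden=64, n_classes=4, lr=0.1, steps=300, seed=0):
    """Fit a one-hidden-layer MLP probe with softmax cross-entropy loss."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    params = {
        "W1": rng.normal(scale=1.0 / np.sqrt(d), size=(d, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        h, logits = forward(params, x)
        # Gradient of the mean cross-entropy with respect to the logits.
        grad_logits = (softmax(logits) - onehot) / len(x)
        # Backpropagate through the ReLU before updating any weights.
        grad_h = grad_logits @ params["W2"].T
        grad_h[h <= 0] = 0.0
        params["W2"] -= lr * (h.T @ grad_logits)
        params["b2"] -= lr * grad_logits.sum(axis=0)
        params["W1"] -= lr * (x.T @ grad_h)
        params["b1"] -= lr * grad_h.sum(axis=0)
    return params


def predict(params, x):
    """Return the most likely class index for each row of x."""
    return forward(params, x)[1].argmax(axis=1)


rng = np.random.default_rng(0)
acts, hinted = make_synthetic_activations(n=800, d=32, n_classes=4, rng=rng)
train_x, test_x = acts[:600], acts[600:]
train_y, test_y = hinted[:600], hinted[600:]

probe = train_probe(train_x, train_y)
accuracy = float((predict(probe, test_x) == test_y).mean())
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real experiment, `acts` would be replaced by activations extracted from the model at a chosen layer and token position (for example, early in the CoT or at its end), and held-out accuracy would be compared against the 25% chance level for four options.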
