Poster in Workshop: Socially Responsible Language Modelling Research (SoLaR)

Localizing Lying in Llama: Experiments in Prompting, Probing, and Patching

James Campbell · Phillip Guo · Richard Ren


Abstract:

Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether undesirable outputs are due to a lack of knowledge or to dishonesty. In this paper, we conduct an extensive study of intentional dishonesty in Llama-2-70b-chat by engineering prompts that instruct it to lie and then using mechanistic interpretability approaches to localize where in the network this lying behavior occurs. Using three independent methodologies (probing, patching, and concept erasure), we consistently find five layers in the model that are highly important for lying. We then successfully perform causal interventions on only 46 attention heads (less than 1% of all heads in the network), causing the lying model to act honestly. These interventions work robustly across four prompts and six dataset splits. We hope our work helps in understanding, and thus preventing, lying behavior in LLMs.
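To illustrate the probing methodology named in the abstract, here is a minimal sketch of a linear probe trained to separate "honest" from "lying" conditions from a layer's hidden-state activations. The arrays, shapes, and labels below are synthetic stand-ins, not the paper's data or code; it is only meant to show the general technique, assuming activations have already been extracted from the model.

```python
# Minimal sketch of linear probing for lying-related representations.
# Assumes hidden-state activations were extracted at a fixed layer and token
# position under "honest" and "lying" prompts; the arrays here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 8192  # hidden size of Llama-2-70b, for illustration

# Hypothetical activations: one row per prompt.
honest_acts = rng.normal(0.0, 1.0, size=(200, d_model))
lying_acts = rng.normal(0.5, 1.0, size=(200, d_model))

X = np.vstack([honest_acts, lying_acts])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 0 = honest, 1 = lying

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out accuracy at a given layer suggests that layer linearly
# encodes whether the model is in the lying condition.
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```

In practice, repeating this per layer and comparing held-out accuracies is one way to localize which layers carry the relevant signal, complementing causal methods such as activation patching.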
