Poster in Workshop: Safe Generative AI

INTERPRETABILITY OF LLM DECEPTION: UNIVERSAL MOTIF

Wannan Yang ⋅ Chen Sun ⋅ Gyorgy Buzsaki


Abstract

Conversational large language models (LLMs) are trained to be helpful, honest and harmless (HHH), yet they remain susceptible to hallucinations and misinformation and are capable of deception. A promising avenue for safeguarding against these behaviors is to gain a deeper understanding of their inner workings. Here we ask: what can interpretability tell us about deception, and can it help to control it? First, we introduce a simple yet general protocol to induce 20 large conversational models from different model families (Llama, Gemma, Yi and Qwen) of various sizes (from 1.5B to 70B) to knowingly lie. Second, we characterize three iterative refinement stages of deception from the latent space representation. Third, we demonstrate that these stages are universal across models from different families and sizes. We find that the third-stage progression reliably predicts whether a given model is capable of deception. Furthermore, our patching results reveal that a surprisingly sparse set of layers and attention heads is causally responsible for lying. Importantly, and consistently across all models tested, this sparse set of layers and attention heads is part of the third iterative refinement process. When contrastive activation steering is applied to control model output, only steering these layers from the third stage could effectively reduce lying. Overall, these findings identify a universal motif across deceptive models and provide actionable insights for developing general and robust safeguards against deceptive AI. The code, dataset, visualizations, and an interactive demo notebook are available at https://github.com/safellm-2024/llm_deception.
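
For readers unfamiliar with contrastive activation steering, the idea can be illustrated with a minimal sketch. It assumes the common difference-of-means formulation and uses synthetic NumPy arrays in place of real model activations; the function names, the `alpha` parameter, and the toy data are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def steering_vector(lying_acts, honest_acts):
    """Contrastive direction: mean lying activation minus mean honest activation.

    Both inputs have shape (n_prompts, d_model); the result has shape (d_model,).
    """
    return lying_acts.mean(axis=0) - honest_acts.mean(axis=0)


def steer(activations, direction, alpha=1.0):
    """Shift activations away from the lying direction by alpha * direction."""
    return activations - alpha * direction


# Synthetic stand-ins for one layer's activations (assumption: toy data only).
rng = np.random.default_rng(0)
d_model = 8
offset = np.zeros(d_model)
offset[0] = 3.0  # pretend "lying" shifts the first coordinate

honest_acts = rng.normal(size=(32, d_model))
lying_acts = rng.normal(size=(32, d_model)) + offset

direction = steering_vector(lying_acts, honest_acts)
steered = steer(lying_acts, direction, alpha=1.0)
```

With `alpha=1.0`, the mean of the steered activations coincides with the mean of the honest activations. In a real model the same subtraction would be applied to a chosen layer's hidden states during the forward pass, and the abstract reports that only layers from the third stage respond to this kind of intervention.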
