I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs
Abstract
Large language models (LLMs) can achieve high performance in next-token prediction (NTP) by performing contextual inference: inferring information about the generative process underlying a text and integrating it into their predictions. When an LLM engages in conversation by autoregressively sampling the most likely tokens of a simulated assistant's response, this inference process constitutes the assistant's persona. Post-training methods such as reinforcement learning from human feedback aim to constrain the persona of this simulacrum to be helpful and harmless. Yet the persona is also shaped by a drive for self-consistency: LLMs adopt personas consistent with the behaviour displayed in their context. We demonstrate that LLMs can infer information about past personas from a set of nonsensical but innocuous questions and binary answers in context, and act upon these personas when answering safety-related questions. This occurs despite the questions bearing no semantic relationship to the target misalignment behaviours, and despite each answer providing only one bit of information. By holding the questions fixed and varying only the binary answers across transmitted personas, we isolate the effects of contextual persona inference and self-consistency from subliminal learning via token entanglement during training.