Persona Subgraphs: Discovering and Steering Persona-Specific Circuits in Language Models via Sparse Autoencoder Features
Arul Murugan
Abstract
We present Persona Subgraphs, a methodology for discovering persona-specific computational circuits using sparse autoencoder (SAE) features and attribution graphs. Analyzing 8 diverse personas in Gemma-2-2B, we find clear circuit specialization: personas share only 21.3% of their features on average (pairwise Jaccard similarity 0.159--0.271), and each persona relies on 620--1,369 signature features that causally influence its behavior. Signature features concentrate in layers 21--24, suggesting that personas are high-level behavioral programs built atop lower-level linguistic processing. Steering experiments validate this causal role: amplifying persona-specific features reliably transforms model outputs, with the optimal steering strength falling in $\lambda \in [1.0, 1.5]$. This compositional structure enables new approaches to behavioral guarantees and interpretable AI customization.
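As an illustration of the overlap statistic quoted above, the following minimal sketch computes pairwise Jaccard similarity between per-persona feature sets and extracts per-persona signature features. The persona names and feature ids are made-up toy values, and treating a signature feature as one that is active for exactly one persona is an assumption made for this example, not necessarily the definition used in the paper.

```python
from itertools import combinations


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature_features(active: dict[str, set[int]]) -> dict[str, set[int]]:
    """Features active for exactly one persona (assumed definition, see lead-in)."""
    signatures = {}
    for name, feats in active.items():
        # Union of the feature sets of every other persona.
        others = set().union(*(f for n, f in active.items() if n != name))
        signatures[name] = feats - others
    return signatures


# Hypothetical personas with toy SAE feature ids (not data from the paper).
active = {
    "pirate": {1, 2, 3, 4},
    "scientist": {3, 4, 5, 6},
    "poet": {4, 7, 8},
}

# Jaccard similarity for every unordered pair of personas.
pairwise = {
    (a, b): jaccard(active[a], active[b]) for a, b in combinations(active, 2)
}
mean_overlap = sum(pairwise.values()) / len(pairwise)
signatures = signature_features(active)
```

With real data, `active` would map each persona to the set of SAE feature indices selected from its attribution graph; the mean of `pairwise` then corresponds to an average-overlap figure of the kind reported above.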