A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
Abstract
Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, with little understanding or control over which internal mechanisms are changed. We propose a method for interpretable alignment that identifies and updates only the neurons most responsible for a given behavior. Using sparse autoencoders (SAEs) and linear probes, we isolate the ~3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons via gradient masking. This sparse, targeted intervention requires far less data than conventional methods, avoids distributional shift, and offers direct insight into the internal circuits involved. We demonstrate the approach on reducing sycophantic behavior: on Gemma-2B and 9B models, our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL). Our results show that sparse, neuron-level updates offer a scalable, interpretable, and more principled alternative to full-model fine-tuning.
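To make the neuron-masked update concrete, the sketch below shows one way gradient masking could restrict fine-tuning to a pre-selected set of MLP neurons. The module path, tensor shapes, and the `selected_neuron_indices` variable are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of neuron-level gradient masking (illustrative; module paths,
# shapes, and `selected_neuron_indices` are assumptions, not the paper's code).
import torch

def mask_mlp_gradients(weight: torch.nn.Parameter, neuron_mask: torch.Tensor) -> None:
    """Zero gradients for all MLP neurons outside `neuron_mask`.

    Assumes `weight` is the first MLP projection of a transformer block,
    shaped [num_neurons, hidden_dim], so each row corresponds to one neuron.
    `neuron_mask` is a boolean vector of length num_neurons marking the
    small subset of neurons selected by the SAE / linear-probe analysis.
    """
    row_mask = neuron_mask.to(weight.device, dtype=weight.dtype).unsqueeze(-1)
    # The hook scales the incoming gradient by the mask on every backward pass,
    # so optimizer steps leave unselected neurons' rows untouched.
    weight.register_hook(lambda grad: grad * row_mask)

# Usage sketch (hypothetical layer path and index set):
# weight = model.model.layers[12].mlp.up_proj.weight      # [num_neurons, hidden_dim]
# neuron_mask = torch.zeros(weight.shape[0], dtype=torch.bool)
# neuron_mask[selected_neuron_indices] = True             # indices from the probe step
# mask_mlp_gradients(weight, neuron_mask)
# ...then run a standard fine-tuning loop; only the masked rows receive updates.
```

One caveat with this masking strategy: optimizers with decoupled weight decay (e.g., AdamW) still shrink unselected weights even when their gradients are zero, so in practice weight decay would need to be disabled or the parameter group restricted to the selected rows.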