A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
Abstract
Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, with little understanding or control over which internal mechanisms are changed. We propose a method for interpretable alignment that identifies and updates only the neurons most responsible for a given behavior. Using sparse autoencoders (SAEs) and linear probes, we isolate the ~3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons via gradient masking. This sparse, targeted intervention requires far less data than conventional methods, avoids distributional shift, and offers direct insight into the internal circuits involved. We demonstrate the approach on reducing sycophantic behavior: on Gemma-2B and 9B models, our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL). Our results show that sparse, neuron-level updates offer a scalable, interpretable, and more principled alternative to full-model fine-tuning.
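To make the neuron-masked update concrete, the sketch below shows one way gradient masking could restrict fine-tuning to a pre-selected set of MLP neurons. The module path, tensor shapes, and the `selected_neuron_indices` variable are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of neuron-level gradient masking (illustrative; module paths,
# shapes, and `selected_neuron_indices` are assumptions, not the paper's code).
import torch

def mask_mlp_gradients(weight: torch.nn.Parameter, neuron_mask: torch.Tensor) -> None:
    """Zero gradients for all MLP neurons outside `neuron_mask`.

    Assumes `weight` is the first MLP projection of a transformer block,
    shaped [num_neurons, hidden_dim], so each row corresponds to one neuron.
    `neuron_mask` is a boolean vector of length num_neurons marking the
    small subset of neurons selected by the SAE / linear-probe analysis.
    """
    row_mask = neuron_mask.to(weight.device, dtype=weight.dtype).unsqueeze(-1)
    # The hook scales the incoming gradient by the mask on every backward pass,
    # so optimizer steps leave unselected neurons' rows untouched.
    weight.register_hook(lambda grad: grad * row_mask)

# Usage sketch (hypothetical layer path and index set):
# weight = model.model.layers[12].mlp.up_proj.weight      # [num_neurons, hidden_dim]
# neuron_mask = torch.zeros(weight.shape[0], dtype=torch.bool)
# neuron_mask[selected_neuron_indices] = True             # indices from the probe step
# mask_mlp_gradients(weight, neuron_mask)
# ...then run a standard fine-tuning loop; only the masked rows receive updates.
```

One caveat with this masking strategy: optimizers with decoupled weight decay (e.g., AdamW) still shrink unselected weights even when their gradients are zero, so in practice weight decay would need to be disabled or the parameter group restricted to the selected rows.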