Can sparse autoencoders be used to decompose and interpret steering vectors?
Abstract
Steering vectors offer a promising approach to controlling large language models, demonstrating potential in regulating sycophancy, harmlessness, and refusal. However, their underlying mechanisms are not well understood. One natural approach is to use sparse autoencoders (SAEs), yet recent work found that the resulting decompositions were peculiar, with reconstructed vectors often lacking the steering properties of the original vectors. This paper explores why applying SAEs to steering vectors leads to misleading decompositions, identifying two issues: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs cannot accommodate. These issues fundamentally limit the direct use of SAEs to interpret steering vectors.
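The second issue can be illustrated with a toy example. The sketch below is purely illustrative and assumes nothing about any particular SAE implementation: it builds a minimal tied-weight SAE with an orthonormal feature dictionary and a ReLU activation (a common SAE design), then encodes a hypothetical "steering vector" that points in the negative direction of one feature. Because the ReLU zeroes out negative pre-activations, the reconstruction loses that component entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SAE (illustrative only): orthonormal feature dictionary,
# tied encoder/decoder weights, ReLU activation.
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_enc = Q      # encoder: project activations onto feature directions
W_dec = Q.T    # decoder: reconstruct from feature activations

def sae_reconstruct(x):
    # ReLU clamps negative projections to zero, so the SAE can only
    # represent non-negative components along its feature directions.
    acts = np.maximum(x @ W_enc, 0.0)
    return acts @ W_dec

feature = Q[:, 0]        # one feature direction from the dictionary
steer = -2.0 * feature   # steering vector with a *negative* projection

recon = sae_reconstruct(steer)
print(float(steer @ feature))  # -2.0: the original negative component
print(float(recon @ feature))  #  0.0: lost in the SAE reconstruction
```

In this toy setting the reconstruction of the steering vector is the zero vector, so it would carry none of the original's steering effect, mirroring the failure mode the abstract describes.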