Priors in Time: A Generative View of Sparse Autoencoders for Sequential Representations
Abstract
Sparse Autoencoders (SAEs) are widely used to decompose neural network representations into interpretable concepts. Despite their success, SAEs often fail to capture all relevant concepts, raising the question of which assumptions underlie these limitations. We show that these challenges arise from a mismatch between the true data distribution and the implicit priors encoded in SAE architectures and sparsity regularizers. Taking language model representations as a case study, we demonstrate that their activations exhibit rich temporal structure, such as systematic growth in concept dimensionality, context-dependent correlations, and non-stationarity over time, that conflicts with SAE priors. Through experiments, we highlight how this mismatch leads to characteristic SAE pathologies, including degraded concept recovery and reconstruction quality over time. Our results point toward the need for new SAE designs that incorporate inductive biases aligned with the temporal dynamics of sequential data.