Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans
Abstract
Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM's learned representations align with human representations. In this work, we introduce a novel approach to study representation alignment: we adopt an activation steering method to identify neurons responsible for specific concepts (e.g., "cat") and then analyze the corresponding activation patterns. We find that LLM representations captured this way closely align with human representations inferred from behavioral data, matching inter-human alignment levels. Our approach significantly outperforms the alignment captured by word/sentence embeddings, which have been the focus of prior work on human-LLM alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts: we show that LLMs organize concepts in a way that mirrors human concept organization.
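As a rough illustration of the kind of representational-alignment measurement described above, the sketch below compares concept-level neuron activation patterns against human similarity judgments using a representational-similarity-analysis (RSA)-style comparison. The activation matrix, the human similarity matrix, and the choice of cosine similarity with Spearman correlation are assumptions for illustration only; the abstract does not specify the paper's exact steering procedure or alignment metric.

```python
# Minimal RSA-style alignment sketch (illustrative; not the paper's exact pipeline).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr


def pairwise_similarity(representations: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix for a (n_concepts, n_features) representation matrix."""
    return 1.0 - squareform(pdist(representations, metric="cosine"))


def rsa_alignment(model_reps: np.ndarray, human_sim: np.ndarray) -> float:
    """Spearman correlation between the upper triangles of the two similarity matrices."""
    model_sim = pairwise_similarity(model_reps)
    iu = np.triu_indices_from(model_sim, k=1)
    rho, _ = spearmanr(model_sim[iu], human_sim[iu])
    return rho


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_concepts, n_neurons = 20, 128
    # Hypothetical concept-level neuron activations (e.g., rows for "cat", "dog", ...).
    neuron_activations = rng.standard_normal((n_concepts, n_neurons))
    # Hypothetical human similarity judgments over the same concepts.
    human_similarity = pairwise_similarity(rng.standard_normal((n_concepts, 8)))
    print(f"RSA alignment: {rsa_alignment(neuron_activations, human_similarity):.3f}")
```

In this framing, swapping the neuron-activation matrix for word/sentence embeddings of the same concepts would give the embedding-based baseline that the abstract reports being outperformed.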