Neuron-Level Linguistic Selectivity in LLMs via Minimal Pairs: A Neuroscience-Inspired Framework
Abstract
We introduce a probe-free, neuron-level methodology for localizing linguistic patterns in large language models (LLMs). Leveraging minimal pairs from BLiMP and COMPS, we compute for every neuron in every layer a minimal-pair neuron separability score, which quantifies how reliably a neuron distinguishes the acceptable (positive) from the unacceptable (negative) member of each pair. Concretely, we extract last-token activations for the positive and negative sentences in each paradigm, form paired activation vectors, and define a correlation-based metric over them. Averaging scores across neurons produces a layer-wise curve that reveals where linguistic distinctions are concentrated; neurons from peak layers are then clustered using hierarchical linkage to expose interpretable groups aligned with the targeted linguistic phenomena. Unlike standard probing approaches, our method avoids training auxiliary classifiers and their associated confounds. Applying the framework to the recent Qwen3 model, we observe strong early-layer peaks for syntax and morphology, and mid-to-late-layer peaks for syntax–semantics interfaces and conceptual understanding, consistent with prior findings. Clusters of the most sensitive neurons further reveal both domain-general and domain-specific functional organization within LLMs. This cognitive-neuroscience-inspired method is straightforward to implement yet yields fine-grained, neuron-level maps of linguistic selectivity, offering a powerful tool for targeted interventions and mechanistic analysis.
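To make the pipeline concrete, the sketch below shows one plausible instantiation under stated assumptions: the correlation-based metric is taken to be the point-biserial (Pearson) correlation between a neuron's last-token activation and the pair label, per-layer scores are averaged over neurons, and peak-layer neurons are clustered with Ward linkage. The function names, the specific metric, and the linkage choice are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed instantiation, not the paper's exact code).
# Requires numpy and scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def separability_scores(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Per-neuron minimal-pair separability for one layer.

    pos_acts, neg_acts: (n_pairs, n_neurons) last-token activations for the
    acceptable and unacceptable sentence of each pair. The correlation-based
    metric is illustrated here as the point-biserial (Pearson) correlation
    between a neuron's activation and the pair label (1 = acceptable).
    """
    n_pairs, n_neurons = pos_acts.shape
    acts = np.concatenate([pos_acts, neg_acts], axis=0)      # (2*n_pairs, n_neurons)
    labels = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])
    return np.array([np.corrcoef(labels, acts[:, j])[0, 1] for j in range(n_neurons)])


def layerwise_curve(pos_by_layer, neg_by_layer) -> np.ndarray:
    """Average |score| over neurons in each layer to obtain the layer-wise curve."""
    return np.array([
        np.nanmean(np.abs(separability_scores(p, n)))
        for p, n in zip(pos_by_layer, neg_by_layer)
    ])


def cluster_peak_neurons(peak_layer_acts: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Hierarchical clustering (Ward linkage assumed) of peak-layer neurons,
    treating each neuron's activation profile across sentences as its features."""
    Z = linkage(peak_layer_acts.T, method="ward")             # neurons as observations
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

In this sketch the absolute value of the score is averaged, on the assumption that a neuron counts as selective whether it responds more strongly to the acceptable or to the unacceptable member of a pair; the sign convention and the number of clusters are likewise free choices of the illustration.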