Condition-Dependent Representational Alignment between Whisper and the Human Speech Network
Abstract
Representations in modern speech models often align with human brain activity, but how acoustic degradation alters this alignment remains unclear. Here, we quantify condition-sensitive correspondence between representations in an automatic speech recognition (ASR) model and human cortical activity. Twenty-five participants listened to clean and noisy (−3 dB SNR) sentences while undergoing fMRI. Layer-wise embeddings from Whisper Tiny (an encoder–decoder Transformer) were mapped to voxel time series using ridge-regularized linear encoding to obtain normalized neural predictivity. Under clean speech, alignment peaked for decoder representations in the left middle frontal gyrus (MFG), with additional encoder peaks in the right inferior frontal gyrus (IFG). Under noisy speech, peaks shifted toward encoder layers in the right Heschl’s gyrus and the right IFG pars orbitalis (IFGorb). Moreover, we observed significantly higher neural predictivity for clean than for noisy speech in the right IFG at middle and late encoder layers and in the left MFG at a middle decoder layer. These results demonstrate condition-dependent cortical alignment profiles across model layers and suggest a dynamic reweighting between feedforward acoustic encoding and top-down predictive decoding.
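The sketch below illustrates the layer-wise ridge encoding step described above: layer embeddings (one row per fMRI TR) are mapped to voxel time series with cross-validated ridge regression, and held-out prediction accuracy is divided by a noise ceiling to yield normalized neural predictivity. It uses simulated data in place of extracted Whisper embeddings and BOLD responses; the RidgeCV alpha grid, the five-fold split, and the flat 0.6 noise ceiling are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of voxel-wise ridge encoding with normalized predictivity.
# All data here are simulated; in the actual analysis X would hold Whisper
# Tiny layer embeddings resampled to the fMRI TR and Y the voxel time series.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_trs, n_features, n_voxels = 300, 384, 500        # Whisper Tiny hidden size is 384
X = rng.standard_normal((n_trs, n_features))        # layer embeddings, one row per TR
W_true = rng.standard_normal((n_features, n_voxels))
Y = 0.1 * X @ W_true + rng.standard_normal((n_trs, n_voxels))  # simulated BOLD

def encoding_predictivity(X, Y, alphas=np.logspace(-1, 4, 10), n_splits=5):
    """Cross-validated Pearson r between predicted and observed responses, per voxel."""
    scores = np.zeros(Y.shape[1])
    for train, test in KFold(n_splits=n_splits).split(X):
        model = RidgeCV(alphas=alphas).fit(X[train], Y[train])
        pred = model.predict(X[test])
        # z-score predictions and observations within the held-out fold,
        # so their elementwise product averages to the Pearson correlation
        pred_z = (pred - pred.mean(0)) / pred.std(0)
        obs_z = (Y[test] - Y[test].mean(0)) / Y[test].std(0)
        scores += (pred_z * obs_z).mean(0) / n_splits
    return scores

raw_r = encoding_predictivity(X, Y)
noise_ceiling = np.full(Y.shape[1], 0.6)   # placeholder; estimate e.g. from split-half reliability
normalized_predictivity = np.clip(raw_r, 0.0, None) / noise_ceiling
print(f"mean normalized predictivity: {normalized_predictivity.mean():.3f}")
```

In the study design summarized above, a fit of this kind would be repeated for each model layer and each listening condition (clean vs. noisy), and the resulting voxel-wise maps compared across conditions and regions of interest.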