VISGate: ROI-Conditioned Dual-Head Encoders that Align Visual Features and Brain Responses
Abstract
Foundation models enable image-to-brain encoders that scale across cortical regions and subjects. We present VISGate, a frozen DINOv2 backbone coupled to a lightweight ROI Transformer decoder with dual heads that (i) regress voxel activations and (ii) predict per-ROI caption embeddings. Trained on the Natural Scenes Dataset (NSD), the model yields consistent voxel predictivity across five cortical streams (early, midventral, midlateral, ventral, and lateral) while revealing a systematic divergence between text alignment and voxel predictivity. We evaluate per-voxel correlation, split-half noise ceilings, and normalized accuracy, and we visualize semantic-category profiles for each ROI. Across multiple NSD subjects, ventral and lateral ROIs dominate normalized accuracy, whereas caption alignment peaks elsewhere, quantifying a gap between semantic alignment and neural selectivity.
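To make the architecture concrete, the following is a minimal PyTorch sketch of the design as described in the abstract: a frozen backbone supplying patch tokens, learned per-ROI queries decoded by a Transformer, and two heads for voxel regression and caption-embedding prediction. All module names, dimensions, layer counts, and the learned-ROI-query design are illustrative assumptions, not the authors' implementation; the DINOv2 backbone is stubbed by any patch-token extractor.

import torch
import torch.nn as nn

class VISGateSketch(nn.Module):
    def __init__(self, backbone, feat_dim=768, d_model=256,
                 n_rois=5, voxels_per_roi=(1000,) * 5, caption_dim=512):
        super().__init__()
        self.backbone = backbone  # frozen DINOv2-like patch-token encoder (stub)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(feat_dim, d_model)
        # One learned query per ROI; the decoder cross-attends to patch tokens.
        self.roi_queries = nn.Parameter(torch.randn(n_rois, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Head (i): per-ROI voxel regression.
        self.voxel_heads = nn.ModuleList(
            nn.Linear(d_model, v) for v in voxels_per_roi)
        # Head (ii): per-ROI caption-embedding prediction.
        self.caption_head = nn.Linear(d_model, caption_dim)

    def forward(self, images):
        with torch.no_grad():
            tokens = self.backbone(images)      # (B, n_patches, feat_dim)
        mem = self.proj(tokens)
        q = self.roi_queries.expand(images.size(0), -1, -1)
        roi_feats = self.decoder(q, mem)        # (B, n_rois, d_model)
        voxels = [h(roi_feats[:, i]) for i, h in enumerate(self.voxel_heads)]
        captions = self.caption_head(roi_feats)  # (B, n_rois, caption_dim)
        return voxels, captions

# Smoke test with a stub backbone emitting random patch tokens.
class StubBackbone(nn.Module):
    def forward(self, x):
        return torch.randn(x.size(0), 196, 768)

model = VISGateSketch(StubBackbone())
voxels, captions = model(torch.randn(2, 3, 224, 224))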
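The abstract does not fix the normalization convention for the evaluation metrics. Writing $r_v = \operatorname{corr}(\hat{y}_v, y_v)$ for the per-voxel prediction correlation and $\rho_v$ for the correlation between two halves of the repeated responses, one common convention in the NSD literature (assumed here) is
\[
\mathrm{NC}_v = \frac{2\rho_v}{1 + \rho_v}, \qquad
\mathrm{NormAcc}_v = \frac{r_v^2}{\mathrm{NC}_v},
\]
i.e., the split-half reliability is Spearman-Brown corrected to give the noise ceiling $\mathrm{NC}_v$, and prediction variance is reported as a fraction of that explainable variance.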