Principled probing of foundation models in the auditory modality
Abstract
We leverage ecological theories of sound perception in humans and a carefully designed dataset of perceptually calibrated sounds to develop and carry out principled, fine-grained probing of foundation models in the auditory modality. We show that the internal activations of the state-of-the-art audio foundation model BEATs correlate better with perceptual dimensions than those of a supervised audio classification model and a text-audio multimodal model, and that all three models fail to represent at least one perceptual dimension. We also report preliminary evidence that directions aligning invariantly with a perceptual dimension can be identified within the representation space at the inner layers of BEATs. We briefly discuss future work and potential applications.