Semitone-Aware Fourier Encoding: A Music-Structured Approach to Audio-Text Alignment
Abstract
Conventional audio-text alignment methods predominantly rely on raw spectral features, which insufficiently capture the mathematical and perceptual structures inherent to music. We introduce a representation paradigm grounded in music theory: mapping frequency spectra into the 12-tone equal temperament system (an organization consistent with the logarithmic nature of human pitch perception and widely adopted across musical cultures), followed by Fourier-based feature encoding to capture nonlinear and multi-scale acoustic patterns. This framework enhances interpretability, preserves musically salient tonal structures, improves robustness to noise, and strengthens semantic alignment with textual descriptors. Preliminary experiments indicate that such music-theory-guided representations provide a principled foundation for bridging the audio-text modality gap. We suggest this direction as a promising step toward integrating cognitive insights and domain knowledge into cross-modal representation learning.
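The following minimal Python sketch illustrates the two stages the abstract names: folding a magnitude spectrum into 12-TET pitch-class bins, then expanding the result with sinusoidal (Fourier) features. All concrete choices here (the A4 = 440 Hz reference, nearest-semitone rounding with octave folding, the geometric frequency ladder, and the helper names hz_to_semitone, semitone_chroma, and fourier_features) are illustrative assumptions, not details specified by the paper.

```python
import numpy as np

def hz_to_semitone(freqs_hz, ref_hz=440.0):
    # 12-TET places 12 equal steps per octave on a log-frequency axis, so the
    # continuous semitone index relative to the reference pitch is
    # 12 * log2(f / f_ref).  (Assumed reference: A4 = 440 Hz.)
    return 12.0 * np.log2(np.asarray(freqs_hz, dtype=float) / ref_hz)

def semitone_chroma(magnitudes, freqs_hz, n_bins=12):
    # Fold a magnitude spectrum into 12 pitch-class bins: round each FFT bin
    # to its nearest semitone, then wrap modulo 12 so octave-equivalent
    # frequencies share a bin (bin 0 is the reference pitch class, A).
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    magnitudes = np.asarray(magnitudes, dtype=float)
    valid = freqs_hz > 0  # the DC bin has no defined pitch
    semis = np.round(hz_to_semitone(freqs_hz[valid])).astype(int)
    chroma = np.zeros(n_bins)
    np.add.at(chroma, semis % n_bins, magnitudes[valid])
    return chroma / (chroma.sum() + 1e-9)  # normalize to a distribution

def fourier_features(x, n_freqs=8):
    # Encode each chroma value with sinusoids on a geometric frequency ladder,
    # a common way to expose nonlinear, multi-scale structure to a downstream
    # alignment model.  The ladder depth n_freqs is an assumed hyperparameter.
    ks = 2.0 ** np.arange(n_freqs)
    angles = 2.0 * np.pi * np.outer(np.asarray(x, dtype=float), ks)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

# Toy check: a pure 440 Hz tone should concentrate chroma mass in bin 0 (A).
sr, n = 22050, 4096
t = np.arange(n) / sr
spectrum = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440.0 * t)))
freqs = np.fft.rfftfreq(n, d=1.0 / sr)
chroma = semitone_chroma(spectrum, freqs)
embedding = fourier_features(chroma)  # fixed-length vector for text alignment
```

In this sketch the pitch-class folding supplies the music-theoretic structure, while the sinusoidal expansion produces the fixed-length, multi-scale embedding that a cross-modal alignment objective could consume; the actual system may differ in binning resolution, windowing, and encoder design.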