Evaluating Multimodal Large Language Models on Core Music Perception Tasks
Abstract
Foundation models claim "musical understanding," yet most evaluations conflate listening with score reading. We adapt LogicLM to music and introduce a controlled benchmark that cleanly separates perception from reasoning across three core skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Unlike existing audio benchmarks that focus on surface-level classification, our tasks require relational understanding (recognizing rhythmic displacement, melodic invariance across keys, and harmonic intervals). In our evaluation, models act as Perceptual Formulators, generating machine-checkable symbolic schemas that deterministic solvers execute with self-refinement. Evaluating Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni under a 12-condition matrix reveals a critical modality gap: near-ceiling accuracy on MIDI input but marked drops on audio, especially for rhythm and chords under LogicLM. Our findings show that current systems reason well over symbols but cannot reliably "listen," a fundamental limitation for audio-first music applications.
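To make the formulator-to-solver handoff concrete, the following is a minimal, hypothetical sketch of the pattern the abstract describes: the model emits a machine-checkable symbolic schema, and a deterministic solver scores it, with invalid schemas sent back for self-refinement. The schema fields and the simple off-beat syncopation metric are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical illustration of the Perceptual Formulator -> deterministic solver
# pipeline. Schema fields and the syncopation metric are assumptions for the
# sake of the example, not the benchmark's actual definitions.
from dataclasses import dataclass


@dataclass
class RhythmSchema:
    """Symbolic rhythm description produced by the model (the formulator)."""
    beats_per_bar: int          # e.g. 4 for 4/4
    subdivisions_per_beat: int  # e.g. 2 for an eighth-note grid
    onsets: list[int]           # grid positions carrying note onsets


def validate(schema: RhythmSchema) -> bool:
    """Machine-checkable constraint: every onset must lie on the rhythmic grid."""
    grid_size = schema.beats_per_bar * schema.subdivisions_per_beat
    return all(0 <= pos < grid_size for pos in schema.onsets)


def syncopation_score(schema: RhythmSchema) -> float:
    """Deterministic solver: fraction of onsets falling off the beat."""
    if not schema.onsets:
        return 0.0
    off_beat = sum(1 for pos in schema.onsets
                   if pos % schema.subdivisions_per_beat != 0)
    return off_beat / len(schema.onsets)


# Self-refinement loop (sketched): schemas failing validation would be returned
# to the model with the error message instead of being scored.
schema = RhythmSchema(beats_per_bar=4, subdivisions_per_beat=2,
                      onsets=[0, 3, 5, 6])
if validate(schema):
    print(f"syncopation = {syncopation_score(schema):.2f}")  # prints 0.50
```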