
Workshop: Machine Learning for Audio

MusT3: Unified Multi-Task Model for Fine-Grained Music Understanding

Martin Kukla · Minz Won · Yun-Ning Hung · Duc Le


Recent advances in sequence-to-sequence modelling have enabled powerful new multi-task models in the text, vision, and speech domains. This work attempts to leverage these advances for music. We propose MusT3: Music-To-Tags Transformer, a novel model for fine-grained music understanding. First, we design a unified music-to-tags form, which enables us to cast any music understanding task as a sequence prediction problem. Second, we use a Transformer-based model to predict that sequence given a music representation. Third, we leverage a multi-task learning framework to train a single model for many tasks. We validate our approach on four tasks: beat tracking, chord recognition, key detection, and vocal melody extraction. Our model performs significantly better than the current state-of-the-art models on two of these tasks, while staying competitive on the remaining two. Finally, in a controlled experiment, we demonstrate that our model can reuse knowledge between tasks, leading to better performance on low-resource tasks with limited training data.
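
The abstract does not specify what the music-to-tags form looks like, so the following is only an illustrative sketch of how timed annotations from different tasks could be serialized into one token vocabulary for sequence prediction. The token layout, the `to_tag_sequence` helper, and the `TIME_STEP_SECONDS` resolution are all assumptions made for illustration, not the authors' format.

```python
from typing import List, Tuple

# Hypothetical time resolution for time tokens; not taken from the paper.
TIME_STEP_SECONDS = 0.01


def to_tag_sequence(task: str, events: List[Tuple[float, str]]) -> List[str]:
    """Serialize (onset_seconds, label) events into a flat token list.

    Assumed layout: a task token, then alternating time and tag tokens,
    then an end-of-sequence token.
    """
    tokens = [f"<task:{task}>"]
    for onset, label in sorted(events):
        # Quantize the onset to an integer step so it maps to a discrete token.
        step = round(onset / TIME_STEP_SECONDS)
        tokens.append(f"<time:{step}>")
        tokens.append(f"<tag:{label}>")
    tokens.append("<eos>")
    return tokens


# Two different tasks end up in the same sequence format.
beats = to_tag_sequence("beat", [(0.50, "downbeat"), (1.00, "beat")])
chords = to_tag_sequence("chord", [(0.0, "C:maj"), (2.0, "G:maj")])
print(beats)
print(chords)
```

Under these assumptions, a single sequence model could be trained on targets from every task, with the leading task token indicating which kind of annotation to produce.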
