Poster in Workshop: Machine Learning for Audio

Composing and Validating Large-Scale Datasets for Training Open Foundation Models for Audio

Marianna Nezhurina ⋅ Ke Chen ⋅ Yusong Wu ⋅ Tianyu Zhang ⋅ Haohe Liu ⋅ Yuchen Hui ⋅ Taylor Berg-Kirkpatrick ⋅ Shlomo Dubnov ⋅ Jenia Jitsev

Abstract

Obtaining strong, reproducible language-audio foundation models requires open datasets of sufficient scale and quality. To pre-train a contrastive language-audio model, we compose a large-scale sound effects dataset with detailed text descriptions for each sample. Generating music, as a special type of audio, presents further challenges due to the limited availability of music-text pairs with sufficiently expressive captions. We show here how we combine various composed datasets to pre-train a large-scale contrastive language-audio model (CLAP). We then train, on music samples we collected, a state-of-the-art text-to-music model, MusicLDM, which adapts AudioLDM (itself based on the Stable Diffusion architecture) to the music domain, using the pre-trained CLAP model and the HiFi-GAN vocoder as components. This modelling work validates the composed text-audio and text-music datasets as a strong basis for further studies on language-rooted foundation models for audio at larger scales.
