Skip to yearly menu bar Skip to main content

Workshop: Machine Learning for Audio

Zero-shot audio captioning with audio-language model guidance and audio context keywords

Leonard Salewski · Stefan Fauth · A. Sophia Koepke · Zeynep Akata


Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or those produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose a novel approach for understanding and summarising such general audio signals in a text caption. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text, guided by a pre-trained audio-language model that steers the text generation to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Code will be released upon acceptance.

Chat is not available.