
Workshop: Machine Learning for Audio

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Ge Zhu · Yutong Wen · Marc-André Carbonneau · Zhiyao Duan

Presentation: Sat 16 Dec 2:00 p.m. PST — 2:20 p.m. PST
Workshop session (Machine Learning for Audio): Sat 16 Dec 6:20 a.m. PST — 3:30 p.m. PST


Diffusion models have showcased their capabilities in audio synthesis across a variety of sounds. Existing models often operate in a latent domain and rely on cascaded phase-recovery modules to reconstruct the waveform, which can make it difficult to generate high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, EDMSound matches the Fréchet audio distance (FAD) score of the top-ranked baseline with only 10 steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data. Project page:
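The efficient deterministic sampling the abstract refers to can be illustrated with a minimal NumPy sketch of an EDM-style second-order (Heun) sampler. This is not the authors' implementation: the schedule constants follow common EDM defaults, and `toy_denoise` is a hypothetical stand-in for a trained spectrogram denoiser.

```python
import numpy as np

def edm_sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM rho-spaced noise schedule from sigma_max down to sigma_min."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)  # final step lands exactly at sigma = 0

def heun_sampler(denoise, x, sigmas):
    """Deterministic 2nd-order (Heun) EDM sampler.

    `denoise(x, sigma)` should return an estimate of the clean sample at
    noise level sigma; here it is supplied by the caller.
    """
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, sigma)) / sigma           # ODE derivative dx/dsigma
        x_next = x + (sigma_next - sigma) * d         # Euler (predictor) step
        if sigma_next > 0:                            # 2nd-order correction
            d_next = (x_next - denoise(x_next, sigma_next)) / sigma_next
            x_next = x + (sigma_next - sigma) * 0.5 * (d + d_next)
        x = x_next
    return x

# Toy denoiser: posterior mean under a unit-variance Gaussian data prior
# (a stand-in for a trained spectrogram model).
toy_denoise = lambda x, sigma: x / (1.0 + sigma ** 2)

rng = np.random.default_rng(0)
sigmas = edm_sigma_schedule(n_steps=10)               # 10-step budget, as in the abstract
x0 = rng.standard_normal((2, 64, 64)) * sigmas[0]     # start from pure noise at sigma_max
sample = heun_sampler(toy_denoise, x0, sigmas)
print(sample.shape)
```

With 10 schedule steps the loop calls the denoiser at most twice per step, which is why a deterministic second-order sampler can reach competitive quality with so few evaluations.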
