

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Robin San Roman · Yossi Adi · Antoine Deleforge · Romain Serizel · Gabriel Synnaeve · Alexandre Defossez

Great Hall & Hall B1+B2 (level 1) #604
Wed 13 Dec 3 p.m. PST — 5 p.m. PST


Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate relatively low sampling-rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code are available on the facebookresearch/audiocraft GitHub project. Samples are available on the following link.
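The core idea of the abstract — decomposing a waveform into frequency bands and applying a diffusion process per band — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the band splitting here uses simple Butterworth filters, and `noise_band` shows only the generic forward-diffusion noising step with a linear beta schedule; the function names, cutoff frequencies, and schedule values are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_bands(wav, sr, cutoffs=(1500, 4000, 8000)):
    """Split a waveform into frequency bands with Butterworth filters.

    Produces len(cutoffs) + 1 bands covering [0, c0], [c0, c1], ...,
    [c_last, Nyquist]. Cutoffs are illustrative, not the paper's.
    """
    edges = [0.0, *cutoffs, sr / 2]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:
            sos = butter(8, hi, btype="lowpass", fs=sr, output="sos")
        elif hi >= sr / 2:
            sos = butter(8, lo, btype="highpass", fs=sr, output="sos")
        else:
            sos = butter(8, [lo, hi], btype="bandpass", fs=sr, output="sos")
        bands.append(sosfilt(sos, wav))
    return bands

def noise_band(band, t, num_steps=50, rng=None):
    """Generic forward-diffusion noising of one band at step t.

    Uses a standard linear beta schedule; in a full model a neural
    denoiser conditioned on discrete tokens would invert this process
    independently for each band.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(band.shape)
    return np.sqrt(alpha_bar) * band + np.sqrt(1.0 - alpha_bar) * noise
```

In the full framework, each band is denoised by its own diffusion process conditioned on the compressed discrete tokens, and the generated bands are summed to reconstruct the waveform; see the facebookresearch/audiocraft repository for the actual training and sampling code.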
