Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
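The generalized energy distance described in the abstract can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's exact training loss: the spectrogram parameters (`n_fft`, `hop`) and the L2 distance `d` between magnitude spectrograms are assumptions made for the example. Using distinct sample pairs for the within-batch terms gives an unbiased minibatch estimate, matching the "calculated from minibatches without bias" property claimed above.

```python
import numpy as np

def mag_spectrogram(x, n_fft=256, hop=64):
    """Magnitude spectrogram via a simple framed FFT (illustrative only)."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectral_energy_distance(gen, real):
    """Unbiased minibatch estimate of the generalized energy distance
        2 * E d(x, y) - E d(x, x') - E d(y, y'),
    where x, x' are generated samples, y, y' are real samples, and d is
    (here) an L2 distance between magnitude spectrograms."""
    d = lambda a, b: np.linalg.norm(mag_spectrogram(a) - mag_spectrogram(b))
    # Cross term over all generated/real pairs.
    cross = np.mean([d(x, y) for x in gen for y in real])
    # Within-batch terms over distinct pairs only (keeps the estimate unbiased).
    within_g = np.mean([d(x, xp) for i, x in enumerate(gen) for xp in gen[i + 1:]])
    within_r = np.mean([d(y, yp) for i, y in enumerate(real) for yp in real[i + 1:]])
    return 2 * cross - within_g - within_r
```

In practice the paper trains an implicit generative model by minimizing such a distance between minibatches of generated and ground-truth waveforms; the within-batch repulsive terms are what distinguish a proper energy distance from a plain spectrogram regression loss.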
Author Information
Alexey Gritsenko (Google Research)
Tim Salimans (Google Brain Amsterdam)
Rianne van den Berg (Google Brain)
Jasper Snoek (Google Research, Brain team)
Jasper Snoek is a research scientist at Google Brain. His research has touched a variety of topics at the intersection of Bayesian methods and deep learning. He completed his PhD in machine learning at the University of Toronto. He subsequently held postdoctoral fellowships at the University of Toronto, under Geoffrey Hinton and Ruslan Salakhutdinov, and at the Harvard Center for Research on Computation and Society, under Ryan Adams. Jasper co-founded Whetlab, a startup focused on Bayesian optimization, which was acquired by Twitter. He has served as an Area Chair for NeurIPS, ICML, AISTATS and ICLR, and organized a variety of workshops at ICML and NeurIPS.
Nal Kalchbrenner (Google Brain)
More from the Same Authors
- 2021 : Palette: Image-to-Image Diffusion Models »
  Chitwan Saharia · William Chan · Huiwen Chang · Chris Lee · Jonathan Ho · Tim Salimans · David Fleet · Mohammad Norouzi
- 2021 : Classifier-Free Diffusion Guidance »
  Jonathan Ho · Tim Salimans
- 2022 : Protein structure generation via folding diffusion »
  Kevin Wu · Kevin Yang · Rianne van den Berg · James Zou · Alex X Lu · Ava Soleimany
- 2022 : On Distillation of Guided Diffusion Models »
  Chenlin Meng · Ruiqi Gao · Diederik Kingma · Stefano Ermon · Jonathan Ho · Tim Salimans
- 2022 : Panel »
  Guy Van den Broeck · Cassio de Campos · Denis Maua · Kristian Kersting · Rianne van den Berg
- 2022 Workshop: Machine Learning and the Physical Sciences »
  Atilim Gunes Baydin · Adji Bousso Dieng · Emine Kucukbenli · Gilles Louppe · Siddharth Mishra-Sharma · Benjamin Nachman · Brian Nord · Savannah Thais · Anima Anandkumar · Kyle Cranmer · Lenka Zdeborová · Rianne van den Berg
- 2022 : Panel Discussion »
  Jacob Gardner · Marta Blangiardo · Viacheslav Borovitskiy · Jasper Snoek · Paula Moraga · Carolina Osorio
- 2022 : Invited Talk: Jasper Snoek »
  Jasper Snoek
- 2022 Poster: Video Diffusion Models »
  Jonathan Ho · Tim Salimans · Alexey Gritsenko · William Chan · Mohammad Norouzi · David Fleet
- 2022 Poster: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding »
  Chitwan Saharia · William Chan · Saurabh Saxena · Lala Li · Jay Whang · Emily Denton · Kamyar Ghasemipour · Raphael Gontijo Lopes · Burcu Karagol Ayan · Tim Salimans · Jonathan Ho · David Fleet · Mohammad Norouzi
- 2021 : Invited Talk #3: Rianne van den Berg »
  Rianne van den Berg
- 2021 Poster: Structured Denoising Diffusion Models in Discrete State-Spaces »
  Jacob Austin · Daniel D. Johnson · Jonathan Ho · Daniel Tarlow · Rianne van den Berg
- 2021 Poster: Variational Diffusion Models »
  Diederik Kingma · Tim Salimans · Ben Poole · Jonathan Ho
- 2020 Poster: Hyperparameter Ensembles for Robustness and Uncertainty Quantification »
  Florian Wenzel · Jasper Snoek · Dustin Tran · Rodolphe Jenatton
- 2020 Tutorial: (Track2) Practical Uncertainty Estimation and Out-of-Distribution Robustness in Deep Learning Q&A »
  Dustin Tran · Balaji Lakshminarayanan · Jasper Snoek
- 2020 Session: Orals & Spotlights Track 19: Probabilistic/Causality »
  Julie Josse · Jasper Snoek
- 2020 Tutorial: (Track2) Practical Uncertainty Estimation and Out-of-Distribution Robustness in Deep Learning »
  Dustin Tran · Balaji Lakshminarayanan · Jasper Snoek
- 2019 Workshop: Graph Representation Learning »
  Will Hamilton · Rianne van den Berg · Michael Bronstein · Stefanie Jegelka · Thomas Kipf · Jure Leskovec · Renjie Liao · Yizhou Sun · Petar Veličković
- 2019 Poster: Integer Discrete Flows and Lossless Compression »
  Emiel Hoogeboom · Jorn Peters · Rianne van den Berg · Max Welling
- 2019 Poster: Likelihood Ratios for Out-of-Distribution Detection »
  Jie Ren · Peter Liu · Emily Fertig · Jasper Snoek · Ryan Poplin · Mark Depristo · Joshua Dillon · Balaji Lakshminarayanan
- 2019 Poster: DppNet: Approximating Determinantal Point Processes with Deep Networks »
  Zelda Mariet · Yaniv Ovadia · Jasper Snoek
- 2017 : TBD 2.5 »
  Nal Kalchbrenner
- 2016 Poster: Conditional Image Generation with PixelCNN Decoders »
  Aaron van den Oord · Nal Kalchbrenner · Lasse Espeholt · koray kavukcuoglu · Oriol Vinyals · Alex Graves