Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite this advantage, these parallel TTS models cannot be trained without guidance from autoregressive TTS models, which serve as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches on its own for the most probable monotonic alignment between text and the latent representation of speech. We demonstrate that enforcing hard monotonic alignments enables robust TTS that generalizes to long utterances, and that employing generative flows enables fast, diverse, and controllable speech synthesis. At synthesis, Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model Tacotron 2 with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.
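The dynamic-programming search for a monotonic alignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes a precomputed matrix `log_lik[i, j]` holding the log-likelihood of latent frame `j` under the prior of text token `i`, and it assumes that every token must cover at least one frame and that no token is skipped. The function name, input layout, and tie-breaking rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def monotonic_alignment_search(log_lik):
    """Return the token index assigned to each frame under the best monotonic path.

    log_lik: array of shape (n_tokens, n_frames); an assumed input layout.
    """
    n_tokens, n_frames = log_lik.shape
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # q[i, j]: best cumulative log-likelihood of a path ending at (token i, frame j).
    q = np.full((n_tokens, n_frames), -np.inf)
    q[0, 0] = log_lik[0, 0]
    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, so only i <= j is valid.
        for i in range(min(j + 1, n_tokens)):
            stay = q[i, j - 1]                                # same token as previous frame
            advance = q[i - 1, j - 1] if i > 0 else -np.inf   # move on to the next token
            q[i, j] = max(stay, advance) + log_lik[i, j]

    # Backtrack from the last token at the last frame.
    alignment = np.zeros(n_frames, dtype=np.int64)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        # Ties are broken in favour of advancing (an arbitrary choice).
        if i > 0 and q[i - 1, j - 1] >= q[i, j - 1]:
            i -= 1
    return alignment

# Toy example: two tokens, four frames; the first two frames fit token 0 best.
toy = np.array([[0.0, 0.0, -5.0, -5.0],
                [-5.0, -5.0, 0.0, 0.0]])
toy_alignment = monotonic_alignment_search(toy)
```

The table is filled in time proportional to the number of tokens times the number of frames, and the backtracking step recovers one hard alignment, which is what makes the search practical to run inside each training step.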
Jaehyeon Kim (Kakao Enterprise)
Sungwon Kim (Seoul National University)
Jungil Kong (Kakao Enterprise)
Sungroh Yoon (Seoul National University)
Dr. Sungroh Yoon is an Associate Professor with the Department of Electrical and Computer Engineering, Seoul National University, South Korea. He received the B.S. degree from Seoul National University, South Korea, and the M.S. and Ph.D. degrees from Stanford University, CA, all in electrical engineering. He held research positions with Stanford University, CA, Intel Corporation, Santa Clara, CA, and Synopsys, Inc., Mountain View, CA. He was an Assistant Professor with the School of Electrical Engineering, Korea University, from 2007 to 2012. Prof. Yoon is the recipient of the 2013 IEEE/IEIE Joint Award for Young IT Engineers. His research interests include deep learning, machine learning, data-driven artificial intelligence, and large-scale applications including biomedicine.
Related Events (a corresponding poster, oral, or spotlight)
2020 Poster: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Tue Dec 8th 05:00 -- 07:00 AM Room Poster Session 0
More from the Same Authors
2020 Poster: NanoFlow: Scalable Normalizing Flows with Sublinear Parameter Complexity
Sang-gil Lee · Sungwon Kim · Sungroh Yoon
2017 Poster: Deep Recurrent Neural Network-Based Identification of Precursor microRNAs
Seunghyun Park · Seonwoo Min · Hyun-Soo Choi · Sungroh Yoon
2016 Poster: Neural Universal Discrete Denoiser
Taesup Moon · Seonwoo Min · Byunghan Lee · Sungroh Yoon