Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that the VAE is good at capturing long-range semantic features (e.g., prosody) even with a small model size but suffers from blurry and unnatural results, while the normalizing flow is good at reconstructing frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture; 2) to further compress the model size and memory footprint, we introduce a grouped parameter sharing mechanism to the affine coupling layers in the post-net; and 3) to improve the expressiveness of synthesized speech and reduce the dependency on accurate fine-grained alignment between text and speech, we propose a linguistic encoder with mixture alignment, combining hard word-level alignment and soft phoneme-level alignment, which explicitly extracts word-level semantic information. Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling under subjective and objective evaluation metrics, and shows only a slight performance degradation when its parameters are reduced to 6.7M (about a 4x model-size and 3x runtime-memory compression ratio compared with FastSpeech 2). Our extensive ablation studies demonstrate that each design choice in PortaSpeech is effective.
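To make the grouped parameter sharing idea concrete, the PyTorch sketch below stacks RealNVP-style affine coupling layers in which every group of consecutive layers reuses one transformation network, so the parameter count grows with the number of groups rather than the number of layers. This is a minimal sketch under our own assumptions, not PortaSpeech's released implementation: the class names `AffineCoupling` and `GroupSharedFlow`, the layer/group sizes, and the omission of the post-net's conditional inputs are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One affine coupling layer: split channels in half, predict a
    scale and shift for the second half from the first (RealNVP-style)."""

    def __init__(self, channels, hidden, shared_net=None):
        super().__init__()
        # Reuse a shared transformation network if one is provided,
        # otherwise allocate a private one for this layer.
        self.net = shared_net if shared_net is not None else nn.Sequential(
            nn.Conv1d(channels // 2, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),  # outputs scale & shift
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + t
        # Return the transformed tensor and the per-sample log-determinant.
        return torch.cat([xa, yb], dim=1), log_s.sum(dim=(1, 2))


class GroupSharedFlow(nn.Module):
    """Stack of coupling layers where every `group_size` consecutive
    layers share one parameter set (hypothetical sketch of grouped
    parameter sharing; conditioning inputs are omitted for brevity)."""

    def __init__(self, channels=80, hidden=192, n_layers=12, group_size=4):
        super().__init__()
        shared = [None] * (n_layers // group_size)
        layers = []
        for i in range(n_layers):
            g = i // group_size
            layer = AffineCoupling(channels, hidden, shared_net=shared[g])
            shared[g] = layer.net  # later layers in this group reuse the net
            layers.append(layer)
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        logdet = 0.0
        for layer in self.layers:
            x, ld = layer(x)
            logdet = logdet + ld
            # Reverse the channel order so the untouched half is
            # transformed by the next coupling layer.
            x = x.flip(dims=[1])
        return x, logdet
```

With `n_layers=12` and `group_size=4`, the stack holds only three distinct coupling networks instead of twelve; for example, `x, logdet = GroupSharedFlow()(torch.randn(2, 80, 120))` runs the flow over a batch of two 80-bin, 120-frame mel-spectrograms.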
Author Information
Yi Ren (Zhejiang University)
Jinglin Liu (Zhejiang University)
Zhou Zhao (Zhejiang University)
More from the Same Authors
- 2022 Poster: GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
  Rongjie Huang · Yi Ren · Jinglin Liu · Chenye Cui · Zhou Zhao
- 2022 Poster: Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization
  Yang Zhao · Chen Zhang · Haifeng Huang · Haoyuan Li · Zhou Zhao
- 2022 Poster: Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
  Ziyue Jiang · Zhe Su · Zhou Zhao · Qian Yang · Yi Ren · Jinglin Liu · Zhenhui Ye
- 2022 Poster: M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus
  Lichao Zhang · Ruiqi Li · Shoutong Wang · Liqun Deng · Jinglin Liu · Yi Ren · Jinzheng He · Rongjie Huang · Jieming Zhu · Xiao Chen · Zhou Zhao
- 2023 Poster: Achieving Cross Modal Generalization with Multimodal Unified Representation
  Yan Xia · Hai Huang · Jieming Zhu · Zhou Zhao
- 2023 Poster: Connecting Multi-modal Contrastive Representations
  Zehan Wang · Yang Zhao · Xize Cheng · Haifeng Huang · Jiageng Liu · Aoxiong Yin · Li Tang · Linjun Li · Yongqi Wang · Ziang Zhang · Zhou Zhao
- 2023 Poster: Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
  Haoyi Duan · Yan Xia · Mingze Zhou · Li Tang · Jieming Zhu · Zhou Zhao
- 2023 Poster: PTADisc: A Diverse, Immense, Student-Centered and Cross-Course Dataset for Personalized Learning
  Liya Hu · Zhiang Dong · Jingyuan Chen · Guifeng Wang · Zhihua Wang · Zhou Zhao · Fei Wu
- 2022 Spotlight: Lightning Talks 4B-4
  Ziyue Jiang · Zeeshan Khan · Yuxiang Yang · Chenze Shao · Yichong Leng · Zehao Yu · Wenguan Wang · Xian Liu · Zehua Chen · Yang Feng · Qianyi Wu · James Liang · C.V. Jawahar · Junjie Yang · Zhe Su · Songyou Peng · Yufei Xu · Junliang Guo · Michael Niemeyer · Hang Zhou · Zhou Zhao · Makarand Tapaswi · Dongfang Liu · Qian Yang · Torsten Sattler · Yuanqi Du · Haohe Liu · Jing Zhang · Andreas Geiger · Yi Ren · Long Lan · Jiawei Chen · Wayne Wu · Dahua Lin · Dacheng Tao · Xu Tan · Jinglin Liu · Ziwei Liu · Zhenhui Ye · Danilo Mandic · Lei He · Xiangyang Li · Tao Qin · Sheng Zhao · Tie-Yan Liu
- 2022 Spotlight: Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
  Ziyue Jiang · Zhe Su · Zhou Zhao · Qian Yang · Yi Ren · Jinglin Liu · Zhenhui Ye
- 2022 Spotlight: GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
  Rongjie Huang · Yi Ren · Jinglin Liu · Chenye Cui · Zhou Zhao
- 2022 Spotlight: M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus
  Lichao Zhang · Ruiqi Li · Shoutong Wang · Liqun Deng · Jinglin Liu · Yi Ren · Jinzheng He · Rongjie Huang · Jieming Zhu · Xiao Chen · Zhou Zhao
- 2022 Poster: Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models
  Zijian Zhang · Zhou Zhao · Zhijie Lin
- 2021 Poster: Generalizable Multi-linear Attention Network
  Tao Jin · Zhou Zhao
- 2020 Poster: Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding
  Zhu Zhang · Zhou Zhao · Zhijie Lin · Jieming Zhu · Xiuqiang He
- 2019 Poster: FastSpeech: Fast, Robust and Controllable Text to Speech
  Yi Ren · Yangjun Ruan · Xu Tan · Tao Qin · Sheng Zhao · Zhou Zhao · Tie-Yan Liu
- 2018 Poster: MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models
  Boyuan Pan · Yazheng Yang · Hao Li · Zhou Zhao · Yueting Zhuang · Deng Cai · Xiaofei He