Workshop: Efficient Natural Language and Speech Processing (Models, Training, and Inference)

Efficient Strategies of Few-Shot On-Device Voice Cloning

Tasnima Sadekova · Vladimir Gogoryan · Ivan Vovk · Mikhail Kudinov


Recent advances in neural text-to-speech have made it possible to build multi-speaker systems capable of high-fidelity speech generation. However, it is often desirable to add a new voice to a text-to-speech system based on only a few recordings. In this work, we study several approaches to the design of on-device voice cloning. Starting from a multi-speaker TTS system, we improve its quality for a target speaker by fine-tuning the feature generation module on a small speech sample. We compare a feature generation module based on conventional Tacotron2 with step-wise monotonic attention against modules based on Non-attentive Tacotron and Glow-TTS. We show that Non-attentive Tacotron significantly outperforms the attention-based model and demonstrate that a compact on-device TTS system of good quality can be obtained using only 1 minute of adaptation data and no more than 200 iterations of SGD, corresponding to less than 1.5 hours of on-device training time on a consumer mobile phone.
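The adaptation recipe described in the abstract (start from pretrained multi-speaker weights, then run a capped number of SGD steps on a small target-speaker sample) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "feature generator" is a single affine map rather than Tacotron2, Non-attentive Tacotron, or Glow-TTS, and the names `fine_tune`, `pretrained`, and `adaptation_data`, the learning rate, and the synthetic data are hypothetical. Only the 200-step cap comes from the abstract.

```python
import random

MAX_SGD_STEPS = 200  # iteration cap quoted in the abstract


def predict(params, x):
    """Toy stand-in for the feature generation module: one affine map."""
    w, b = params
    return w * x + b


def mse(params, data):
    """Mean squared error of the toy model over a list of (x, y) pairs."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def fine_tune(pretrained, adaptation_data, lr=0.05, steps=MAX_SGD_STEPS, seed=0):
    """Plain SGD with one randomly drawn sample per step, starting from pretrained weights."""
    rng = random.Random(seed)
    w, b = pretrained
    for _ in range(steps):
        x, y = rng.choice(adaptation_data)
        err = (w * x + b) - y          # prediction error on this sample
        w -= lr * 2 * err * x          # gradient of squared error w.r.t. w
        b -= lr * 2 * err              # gradient of squared error w.r.t. b
    return w, b


# Hypothetical numbers: a "multi-speaker" starting point and a tiny target-speaker set.
pretrained = (1.0, 0.0)
target_w, target_b = 1.5, -0.3
adaptation_data = [(i / 10, target_w * (i / 10) + target_b) for i in range(-10, 11)]

adapted = fine_tune(pretrained, adaptation_data)
loss_before = mse(pretrained, adaptation_data)
loss_after = mse(adapted, adaptation_data)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.6f}")
```

The sketch only conveys the shape of the procedure: no training from scratch, a small adaptation set, and a hard budget on optimizer steps. It says nothing about the speech quality or on-device timing reported by the authors.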
