Skip to yearly menu bar Skip to main content

Affinity Workshop: Women in Machine Learning

Transformers for Synthesized Speech Detection

Emily Bartusiak


As voice synthesis systems and deep learning tools continue to improve, so does the possibility that synthesized speech can be used for nefarious purposes. We need methods that can analyze an audio signal and determine if it is synthesized. In this paper, we investigate three transformers for synthesized speech detection: Compact Convolutional Transformer (CCT), Patchout faSt Spectrogram Transformer (PaSST), and Self-Supervised Audio Spectrogram Transformer (SSAST). We show that each transformer independently detects synthesized speech well. Finally, we explore pretraining a transformer on a large-scale audio classification dataset and finetuning it for synthesized speech detection. We demonstrate that pretraining on a large dataset of audio signals that includes both speech and non-speech signals (such as music and animal noises) can improve synthesized speech detection. Evaluated on the ASVspoof2019 dataset, our approach successfully detects synthesized speech and achieves 92% or higher for all metrics considered.

Chat is not available.