Multilingual Spoken Words Corpus
Mark Mazumder · Sharad Chitlangia · Colby Banbury · Yiping Kang · Juan Ciro · Keith Achorn · Daniel Galvez · Mark Sabini · Peter Mattson · David Kanter · Greg Diamos · Pete Warden · Josh Meyer · Vijay Janapa Reddi

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people. It is intended for academic research and commercial applications in keyword spotting and spoken term search, and is licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). It has many use cases, ranging from voice-enabled consumer devices to call center automation. We generate this dataset by applying forced alignment to crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset. We provide a detailed analysis of the contents of the data and contribute methods for detecting potential outliers. We report baseline accuracy metrics for keyword spotting models trained on our dataset, compared to models trained on a manually recorded keyword dataset. We conclude with our plans for dataset maintenance, updates, and open-sourced code.
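The extraction step described above, turning per-word timing estimates into 1-second examples, can be illustrated with a minimal sketch. The abstract does not specify how each clip is positioned or padded, so the function name `extract_keyword_clip`, the centering of the clip on the aligned word, and the zero-padding at the edges are all assumptions made here for illustration, not the authors' pipeline.

```python
from typing import List, Sequence


def extract_keyword_clip(
    samples: Sequence[float],
    sample_rate: int,
    start_s: float,
    end_s: float,
    clip_s: float = 1.0,
) -> List[float]:
    """Cut a fixed-length clip around one aligned word.

    Hypothetical helper: centering the clip on the word and zero-padding
    past the edges of the recording are assumptions of this sketch.
    """
    clip_len = int(round(clip_s * sample_rate))
    # Midpoint of the aligned word, expressed as a sample index.
    center = int(round((start_s + end_s) / 2 * sample_rate))
    first = center - clip_len // 2
    clip: List[float] = []
    for i in range(first, first + clip_len):
        # Indices outside the recording are filled with silence.
        clip.append(samples[i] if 0 <= i < len(samples) else 0.0)
    return clip


# Example: a 2-second recording at 16 kHz with a word aligned at 0.1 s to 0.3 s.
recording = [1.0] * 32000
clip = extract_keyword_clip(recording, 16000, start_s=0.1, end_s=0.3)
```

Every clip produced this way has the same length regardless of word duration, which is what makes the examples directly usable as fixed-size model inputs. As a consistency check on the stated totals, 23.4 million 1-second examples amount to 6,500 hours, in line with the "over 6,000 hours" figure.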

Author Information

Mark Mazumder (Harvard University)
Sharad Chitlangia (BITS Pilani)
Colby Banbury (Harvard University)
Yiping Kang
Juan Ciro (Harvard University)
Keith Achorn (Intel)
Daniel Galvez (NVIDIA)
Mark Sabini
Peter Mattson (Google)

Leads ML Performance Metrics team at Google Brain. General Chair of MLPerf. Ph.D. Stanford University.

David Kanter (MLCommons)
Greg Diamos (Landing AI)
Pete Warden (Google)
Josh Meyer
Vijay Janapa Reddi (Harvard University)
