This workshop focuses on recent advances in end-to-end methods for speech and more general audio processing. Deep learning has transformed the state of the art in speech recognition, and audio analysis in general. In recent developments, new deep learning architectures have made it possible to integrate the entire inference process into an end-to-end system. This involves solving problems of an algorithmic nature, such as search over time alignments between different domains, and dynamic tracking of changing input conditions. Topics include automatic speech recognition (ASR) and other audio processing systems that subsume front-end adaptive microphone array processing and source separation as well as back-end constructs such as phonetic context dependency, dynamic time alignment, or phoneme-to-grapheme modeling. Other end-to-end audio applications include speaker diarization, source separation, and music transcription. A variety of architectures have been proposed for such systems, ranging from shift-invariant convolutional pooling to connectionist temporal classification (CTC) and attention-based mechanisms, or other novel dynamic components. However, there has so far been little comparison in the literature of the relative merits of the different approaches. This workshop delves into questions about how different approaches handle trade-offs between modularity and integration, and between representation and generalization. This is an exciting new area and we expect significant interest from the machine learning and speech and audio processing communities.
Sat 12:20 a.m. - 12:30 a.m.
John Hershey & Philemon Brakel
(Opening Remarks)
Sat 12:30 a.m. - 1:00 a.m.
Oriol Vinyals: From Speech to Text and Back: Neural Sequence Models for Speech Processing
(Talk)
In my talk, I will present recent advances in neural sequence models that our group has focused on. I will describe efforts on speech synthesis using WaveNets, a model which can generate waveforms at 16 kHz, as well as "Listen/Watch, Attend and Spell" sequence-to-sequence models for speech recognition and audio-visual lip reading.
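The "Listen, Attend and Spell" family of models couples an acoustic encoder with an attention-based character decoder. As a rough illustration of that coupling, the sketch below implements a single attention step; the shapes and the use of simple dot-product scoring are assumptions made for illustration, not the talk's exact formulation.

```python
# A bare-bones sketch of the attention step in "Listen, Attend and Spell"-style
# models (illustrative dot-product attention only): the decoder state attends over
# the encoder outputs to form a context vector used to predict the next character.
import numpy as np

def attention_context(decoder_state, encoder_outputs):
    """decoder_state: (d,); encoder_outputs: (T, d) -> context vector of shape (d,)."""
    scores = encoder_outputs @ decoder_state        # similarity of state to each frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over the T time steps
    return weights @ encoder_outputs                # attention-weighted sum of encodings

# Toy usage with random values standing in for real network activations.
rng = np.random.default_rng(0)
context = attention_context(rng.standard_normal(4), rng.standard_normal((10, 4)))
```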
Sat 1:00 a.m. - 1:30 a.m.
Jan Chorowski: Improving sequence to sequence models for speech recognition
(Talk)
Sequence-to-sequence neural networks for speech recognition provide an interesting alternative to HMM-based approaches. The networks combine language and acoustic modeling capabilities and can be discriminatively trained to directly maximize the probability of a transcript conditioned on the recorded utterance. Decoding of new utterances is easily accomplished using beam search. One of the challenges to their successful application is the overconfidence of their predictions: the best results are often obtained with very narrow beams, and increasing the beam size can actually degrade the results. I will show how to control the over-confidence of the networks and ensure good decoding results, even when using external language models.
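Two generic ways to temper over-confident sequence models are label smoothing during training and flattening the acoustic scores during decoding. The sketch below illustrates both in simplified form; the epsilon, temperature, and LM-weight values are placeholders, and the talk's actual techniques may differ.

```python
# A minimal sketch of two common remedies for over-confident sequence models
# (illustrative only, not the talk's implementation): label smoothing of the
# training targets, and temperature/LM-weighted scoring of beam hypotheses.
import numpy as np

def label_smoothed_targets(labels, vocab_size, eps=0.1):
    """Replace one-hot targets with a smoothed distribution over the vocabulary."""
    one_hot = np.eye(vocab_size)[labels]                 # (T, vocab_size)
    return (1.0 - eps) * one_hot + eps / vocab_size      # spread eps mass uniformly

def hypothesis_score(acoustic_logprob, lm_logprob, temperature=1.2, lm_weight=0.5):
    """Score a beam hypothesis: flattened acoustic score plus weighted external LM."""
    return acoustic_logprob / temperature + lm_weight * lm_logprob

# Toy usage: smoothed targets for the label sequence [2, 0, 1] over a 4-symbol vocab.
targets = label_smoothed_targets(np.array([2, 0, 1]), vocab_size=4)
```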
Sat 1:30 a.m. - 2:00 a.m.
Coffee Break
(Break)
Sat 2:00 a.m. - 2:30 a.m.
Andrew Maas: Lexicon-free conversational speech recognition by reasoning entirely at the character level
(Talk)
I will present an approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure. Our approach builds on the connectionist temporal classification (CTC) speech recognition work of Graves & Jaitly, but reasons entirely at the character level. We demonstrate our approach on the Switchboard telephone conversation transcription task and show that reasoning at the character level enables natural handling of out-of-vocabulary words and partial word fragments. Finally, we analyze qualitative differences between the transcripts and alignments of our system and those of standard HMM-based recognizers.
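As a rough illustration of the two ingredients combined in this approach, the sketch below shows the CTC collapse rule that turns per-frame character outputs into a transcript, and a character-level language-model score folded into hypothesis ranking. The bigram table and weights are toy placeholders standing in for the character LM actually used, not the authors' system.

```python
# An illustrative sketch (not the authors' system) of character-level CTC decoding:
# collapse repeated frame labels and blanks into a transcript, then rank hypotheses
# with acoustic log-probability plus a weighted character language-model score.
BLANK = "_"

def ctc_collapse(frame_chars):
    """Merge repeated characters and drop blanks, e.g. 'hh_e_ll_lo' -> 'hello'."""
    out, prev = [], None
    for c in frame_chars:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

def char_lm_logprob(text, bigram_logprobs, unk=-10.0):
    """Toy character bigram LM standing in for a real character language model."""
    return sum(bigram_logprobs.get(pair, unk) for pair in zip(" " + text, text))

def combined_score(acoustic_logprob, text, bigram_logprobs, lm_weight=0.6):
    """Combined score used to rank beam-search hypotheses."""
    return acoustic_logprob + lm_weight * char_lm_logprob(text, bigram_logprobs)

print(ctc_collapse("hh_e_ll_lo"))  # -> "hello"
```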
Sat 2:30 a.m. - 3:00 a.m.
Florian Metze: End-to-end learning for language universal speech recognition
(Talk)
One of the great successes of end-to-end learning strategies such as Connectionist Temporal Classification in automatic speech recognition is the ability to train very powerful models that map directly from features to characters or context-independent phones. Traditional hybrid models, and even GMM-based systems, usually require context-dependent states and a Hidden Markov Model in order to achieve good performance. With CTC, by contrast, it becomes possible to train a multi-lingual RNN that can directly predict phones in multiple languages (multi-task training), language-independent articulatory features, and language-universal phones, allowing for the recognition of speech in languages for which no acoustic training data is available.
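The multi-task setup described above can be pictured as one shared acoustic encoder feeding several output layers. The sketch below is a toy rendering of that structure; the layer sizes, phone-set sizes, and the single-layer encoder are all assumptions made purely for illustration.

```python
# A toy sketch of the multi-task architecture described above (all sizes assumed):
# a shared acoustic encoder with separate per-language CTC output layers and an
# additional language-universal phone layer.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden = 40, 128
output_sizes = {"english_phones": 45, "german_phones": 50, "universal_phones": 90}

W_shared = 0.01 * rng.standard_normal((feat_dim, hidden))   # shared encoder weights
heads = {name: 0.01 * rng.standard_normal((hidden, n))      # task-specific layers
         for name, n in output_sizes.items()}

def per_frame_logits(features, task):
    """features: (T, feat_dim) -> (T, n_task) per-frame CTC logits for the task."""
    shared = np.tanh(features @ W_shared)    # language-independent representation
    return shared @ heads[task]

logits = per_frame_logits(rng.standard_normal((200, feat_dim)), "universal_phones")
```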
Sat 3:00 a.m. - 5:00 a.m.
Lunch
Sat 5:00 a.m. - 5:30 a.m.
Tara Sainath: Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition
(Talk)
Automatic Speech Recognition systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this talk, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel model.
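One way to picture the joint enhancement described above is a learned filter-and-sum layer sitting in front of the acoustic model: each microphone channel is convolved with its own trainable FIR filter and the filtered channels are summed. The sketch below illustrates only that front-end operation, with assumed channel counts and filter lengths; in the actual system such filters are learned jointly with the recognizer.

```python
# An illustrative filter-and-sum front end (assumed sizes, not the talk's model):
# convolve each microphone channel with its own FIR filter, then sum the results,
# so that spatial filtering can be trained jointly with the acoustic model.
import numpy as np

num_channels, filter_len = 8, 25
rng = np.random.default_rng(0)
filters = 0.01 * rng.standard_normal((num_channels, filter_len))  # learnable in practice

def filter_and_sum(waveforms):
    """waveforms: (num_channels, num_samples) -> single enhanced waveform."""
    return sum(np.convolve(waveforms[c], filters[c], mode="same")
               for c in range(num_channels))

enhanced = filter_and_sum(rng.standard_normal((num_channels, 16000)))  # 1 s at 16 kHz
```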
Sat 5:30 a.m. - 6:00 a.m.
Li Deng: Seven Years of Deep Speech Recognition Revolution and Five Areas with Potential New Breakthrough
(Talk)
Seven years ago, the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications (Whistler, BC, Canada, Dec. 2009, organized by Deng, Yu, and Hinton) and the collaboration between Microsoft Research and the University of Toronto around that time ignited the spark of deep learning for speech recognition in academic settings. Within 18 months of the workshop, deep neural network technology was dramatically scaled up and successfully deployed in large-scale speech recognition applications, initially at Microsoft and rapidly followed by IBM, Google, Apple, Nuance, Facebook, iFlytek, Baidu, etc. This spectacular development also helped create a full suite of voice recognition startups worldwide. This talk will briefly review how deep learning revolutionized the entire speech recognition field and then survey its state of the art as of today. The outlook for likely future breakthrough areas, in both the short and long term, will be described and analyzed. These areas include: 1) better modeling for end-to-end and other specialized architectures capable of disentangling mixed acoustic variability factors; 2) better integrated signal processing and neural learning to combat difficult far-field acoustic environments, especially with mixed speakers; 3) use of neural language understanding to model long-span dependency for semantic and syntactic consistency in speech recognition outputs, and use of semantic understanding in spoken dialogue systems (now called conversational AI bots) to provide feedback that makes acoustic speech recognition easier; 4) use of naturally available multimodal “labels” such as images, printed text, and handwriting to supplement the current way of providing text labels to synchronize with the corresponding acoustic utterances; and 5) development of ground-breaking deep unsupervised learning methods, not only for fast self-adaptation but, more importantly, for exploitation of potentially unlimited amounts of naturally found acoustic speech data without the otherwise prohibitively high cost of labeling required by the current deep supervised learning paradigm.
Sat 6:30 a.m. - 8:30 a.m.
Spotlights and Posters
(Spotlights and Poster Session)
Spotlights:
15:30: AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
15:40: Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
15:50: Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation
16:00 - 16:30: Posters
16:30: Inverted HMM - a Proof of Concept
16:40: Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
16:50: Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition
17:00 - 17:30: Posters
Florian Metze
Author Information
John Hershey (MERL)
Philemon Brakel (University of Montreal)