Li Deng: Seven Years of Deep Speech Recognition Revolution and Five Areas with Potential New Breakthroughs
in
Workshop: End-to-end Learning for Speech and Audio Processing
Abstract
Seven years ago, the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications (Whistler, BC, Canada, Dec. 2009; organized by Deng, Yu, and Hinton) and the contemporaneous collaboration between Microsoft Research and the University of Toronto ignited the spark of deep learning for speech recognition in academic settings. Within 18 months of the workshop, deep neural network technology was dramatically scaled up and successfully deployed in large-scale speech recognition applications, initially at Microsoft and rapidly followed by IBM, Google, Apple, Nuance, Facebook, iFlytek, Baidu, and others. This spectacular development also helped create a full suite of voice recognition startups worldwide. This talk will briefly review how deep learning revolutionized the entire speech recognition field and then survey the state of the art today. The outlook for likely future breakthrough areas, in both the short and long term, will be described and analyzed. These areas include: 1) better modeling for end-to-end and other specialized architectures capable of disentangling mixed acoustic variability factors; 2) better-integrated signal processing and neural learning to combat difficult far-field acoustic environments, especially those with mixed speakers; 3) use of neural language understanding to model long-span dependencies for semantic and syntactic consistency in speech recognition outputs, and use of semantic understanding in spoken dialogue systems (now called conversational AI bots) to provide feedback that makes acoustic speech recognition easier; 4) use of naturally available multimodal “labels” such as images, printed text, and handwriting to supplement the current practice of providing text labels synchronized with the corresponding acoustic utterances; and 5) development of ground-breaking deep unsupervised learning methods, not only for fast self-adaptation but, more importantly, for exploitation of potentially unlimited amounts of naturally occurring speech audio without the otherwise prohibitively high labeling cost of the current deep supervised learning paradigm.