For decades, the general architecture of the classical, state-of-the-art statistical approach to automatic speech recognition (ASR) went largely unchallenged. This approach is based on Bayes' decision rule, a separation of acoustic and language modeling, hidden Markov models (HMMs), and a search organization based on dynamic programming and hypothesis pruning. Even when deep neural networks began to boost ASR performance considerably, the general architecture of state-of-the-art ASR systems changed little. The hybrid DNN/HMM approach, together with recurrent LSTM neural network language modeling, currently marks the state of the art on many tasks covering a wide range of training set sizes. However, more and more alternative approaches are now emerging, moving gradually towards so-called end-to-end approaches. Step by step, these novel end-to-end approaches replace explicit time alignment modeling and dedicated search space organization with more implicit, integrated neural-network-based representations. They also drop the separation between acoustic and language modeling, and they show promising results, especially for large training sets.
This presentation gives an overview of current approaches to ASR, including variations of both HMM-based and end-to-end modeling. The approaches will be discussed with respect to their modeling, their performance relative to the amount of available training data, their search space complexity and control, and potential modes of comparative analysis.
Ralf Schlüter (RWTH Aachen University)