Jan Chorowski: Improving sequence to sequence models for speech recognition
In: Workshop on End-to-end Learning for Speech and Audio Processing
Abstract
Sequence-to-sequence neural networks for speech recognition provide an interesting alternative to HMM-based approaches. These networks combine language and acoustic modeling and can be trained discriminatively to directly maximize the probability of a transcript conditioned on the recorded utterance. Decoding of new utterances is easily accomplished with beam search. One challenge to their successful application is the overconfidence of their predictions: the best results are often obtained with very narrow beams, and increasing the beam size can even deteriorate the results. I will show how to control the overconfidence of the networks and ensure good decoding results, even when external language models are used.