On The Diversity of ASR Hypotheses In Spoken Language Understanding
Abstract
In Conversational AI, an Automatic Speech Recognition (ASR) system transcribes the user's speech, and the output of the ASR is passed as input to a Spoken Language Understanding (SLU) system, which outputs semantic objects (such as intent and slot-act pairs). Recent work, including the state-of-the-art methods in SLU, utilizes either word lattices or N-Best hypotheses from the ASR. The intuition given for using N-Best instead of 1-Best is that the additional hypotheses provide extra information that is useful because the ASR transcriptions contain errors, i.e., the performance gain is attributed to the word-error-rate (WER) of the ASR. We empirically show that the gain from using N-Best hypotheses is only loosely related to WER and is instead related to the diversity of the hypotheses.
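To make the two quantities contrasted above concrete, the following is a minimal sketch, not the measurement code used in this work. WER is computed in the standard way, as the word-level edit distance between reference and hypothesis divided by the reference length. The diversity function is an illustrative assumption: `nbest_diversity` is a hypothetical proxy (mean pairwise normalized edit distance within an N-Best list) and may differ from the diversity measure used in the experiments.

```python
from itertools import combinations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Standard WER: word edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def nbest_diversity(hypotheses):
    """Hypothetical diversity proxy (an assumption, not the paper's metric):
    mean pairwise edit distance, each normalized by the longer hypothesis."""
    tokenized = [h.split() for h in hypotheses]
    pairs = list(combinations(tokenized, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        total += edit_distance(a, b) / longest if longest else 0.0
    return total / len(pairs)
```

Under this sketch, an N-Best list whose hypotheses are near-duplicates of the 1-Best scores close to zero on the proxy regardless of how high the WER is, which is the situation in which one would expect little gain from using N-Best over 1-Best.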