1. Technical Field
The present disclosure relates to automatic speech recognition (ASR) and more specifically to estimating correctness in ASR N-Best lists.
2. Introduction
Despite years of research, ASR technology is far from perfect and recognition errors are common, especially in uncontrolled environments. Spoken dialog systems rely on speech recognition for input from users, and recognition errors lead to misunderstandings that lengthen conversations, reduce task completion, and decrease customer satisfaction. To help identify errors, speech recognizers output confidence scores. A confidence score indicates the reliability of the top hypothesis. When the confidence score is high, the ASR system can treat the top hypothesis in the N-Best list of hypotheses as more reliable. However, confidence scores have three intrinsic problems. First, it is difficult to set a good threshold for when to accept or reject a speech recognition result because the score itself does not have a clearly defined meaning. For example, a score of 50 on one grammar might indicate high reliability, but a score of 50 on another grammar might indicate very low reliability. Setting the threshold therefore requires a trial-and-error process of carefully tuning each grammar. Second, confidence scores are typically based on a limited set of features. These features typically include various measures of how well the speech audio matches the acoustic and language models. However, current ASR systems and current ways of generating confidence scores ignore a number of other potentially useful features.
Third, ASR systems assign a confidence score only to the most likely recognition hypothesis, yet the speech recognition engine yields many recognition hypotheses, perhaps 100 or more, in a list called the N-Best list. When the top hypothesis is not correct, the N-Best list often contains the correct answer further down, yet the confidence score communicates nothing about the reliability of the other items on the N-Best list.
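The thresholding difficulty and the single-score limitation described above can be illustrated with a short sketch. The names NBestResult, THRESHOLDS, and accept_top, the grammar names, the hypotheses, and all numeric values are hypothetical and chosen only for illustration; they are not taken from any particular recognizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NBestResult:
    """Hypothetical recognizer output: ranked hypotheses plus one score."""
    hypotheses: List[str]  # N-Best list, most likely hypothesis first
    confidence: float      # describes hypotheses[0] only


# Hypothetical per-grammar thresholds, each assumed to be found by manual tuning.
THRESHOLDS = {"city_names": 40.0, "yes_no": 70.0}


def accept_top(result: NBestResult, grammar: str) -> bool:
    """Accept the top hypothesis if its score meets the grammar's threshold."""
    return result.confidence >= THRESHOLDS[grammar]


result = NBestResult(["boston", "austin", "houston"], confidence=50.0)
print(accept_top(result, "city_names"))  # score of 50 is accepted under this grammar
print(accept_top(result, "yes_no"))      # the same score is rejected under this one
```

In this sketch the same score of 50 leads to opposite decisions depending on the grammar, and nothing in the structure indicates how reliable the second or third hypothesis is.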
ASR systems can output a related measurement called a word confusion network (WCN). A WCN assigns probabilities to alternate word hypotheses based on how well the word and its audio match the language and acoustic models. WCNs can assign probabilities to each item in the N-Best list. However, WCNs have two important limitations. First, WCNs do not explicitly account for the probability that the correct answer is nowhere on the list. Second, as with a confidence score, WCNs are based on a limited set of features.
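The first limitation can be made concrete with a minimal sketch. It assumes, purely for illustration, that probabilities are obtained by normalizing hypothetical log scores over the listed hypotheses only; this is not a description of how any specific recognizer computes WCN probabilities. The function name normalize and the scores are assumptions.

```python
import math
from typing import Dict


def normalize(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the listed hypotheses only (assumed normalization)."""
    peak = max(log_scores.values())  # subtract the peak for numerical stability
    exp = {h: math.exp(s - peak) for h, s in log_scores.items()}
    total = sum(exp.values())
    return {h: v / total for h, v in exp.items()}


# Hypothetical log scores for three listed hypotheses.
log_scores = {"boston": -1.0, "austin": -1.5, "houston": -3.0}
probs = normalize(log_scores)

# Probability mass left over for "the correct answer is not on the list".
off_list = 1.0 - sum(probs.values())
print(probs)
print(abs(off_list) < 1e-9)
```

Under this assumption all of the probability mass is divided among the listed items, so no mass remains for the case in which the correct answer is absent from the list.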