An automatic speech recognition (ASR) system determines a semantic meaning of input speech. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous recognition system, variable numbers of speech frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
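The framing step described above can be sketched roughly as follows. This is a minimal illustration only: the window length, hop size, and the two toy features (log energy and zero-crossing rate) are assumptions chosen for brevity and are not taken from the text. Practical systems typically use richer features such as cepstral coefficients. The function names `frame_signal` and `frame_features` are hypothetical.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def frame_features(frame):
    """Return a toy 2-dimensional feature vector for one frame:
    [log energy, zero-crossing rate]. Illustrative features only."""
    energy = sum(s * s for s in frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zero_crossing_rate = crossings / (len(frame) - 1)
    return [log_energy, zero_crossing_rate]

# Synthetic input (assumed): a 100 Hz sine sampled at 8 kHz for 0.1 s.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]

# 25 ms windows (200 samples) advanced in 10 ms steps (80 samples).
frames = frame_signal(samples, frame_len=200, hop=80)
features = [frame_features(f) for f in frames]
```

Each entry of `features` plays the role of one speech feature frame: a fixed-dimension vector summarizing a short window of the signal.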
The ASR system compares the input speech frames to a database of statistical models to find the models that best match the speech feature characteristics and determine a corresponding representative text or semantic meaning associated with the models. Modern statistical models are state sequence models such as hidden Markov models (HMMs) that model speech sounds (usually phonemes) using mixtures of Gaussian distributions. Often these statistical models represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the statistical models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
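As a rough sketch of how a state sequence model with Gaussian mixture emissions can be matched against input frames, the following example scores a sequence of one-dimensional observations with a two-state left-to-right HMM and recovers the best state path by Viterbi decoding. All model parameters, the one-dimensional features, and the function names are assumptions made for illustration. A practical recognizer would use multi-dimensional features, many more states, and language model scores.

```python
import math

NEG_INF = float("-inf")

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, components):
    """Log density of a Gaussian mixture given as (weight, mean, var)
    triples, computed with log-sum-exp for numerical stability."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def viterbi(observations, log_init, log_trans, emissions):
    """Return (best state path, its log score) for the observations."""
    n_states = len(log_init)
    score = [log_init[s] + log_gmm(observations[0], emissions[s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[prev] + log_trans[prev][s] + log_gmm(x, emissions[s]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):  # trace back from the final state
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Hypothetical two-state left-to-right model.
log_init = [0.0, NEG_INF]                       # always start in state 0
log_trans = [[math.log(0.7), math.log(0.3)],    # state 0: stay or advance
             [NEG_INF, 0.0]]                    # state 1: stay only
emissions = [[(0.5, -0.5, 1.0), (0.5, 0.5, 1.0)],  # state 0: two-component mixture near 0
             [(1.0, 5.0, 1.0)]]                    # state 1: single component near 5

observations = [0.1, -0.2, 0.3, 4.8, 5.1]
path, best_score = viterbi(observations, log_init, log_trans, emissions)
```

With these assumed parameters the first three observations align to state 0 and the last two to state 1, so `path` is `[0, 0, 0, 1, 1]`.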
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or a list of several hypotheses, referred to as an N-best list. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
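The relationship between the recognition result and an N-best list can be illustrated with a short sketch. The hypothesis texts and scores below are invented for the example, and the `Hypothesis` type and `n_best` helper are hypothetical names, not part of any particular recognizer.

```python
import heapq
from typing import List, NamedTuple

class Hypothesis(NamedTuple):
    text: str
    log_score: float  # higher means a better match to the input speech

def n_best(hypotheses: List[Hypothesis], n: int) -> List[Hypothesis]:
    """Return the n highest-scoring hypotheses, best first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h.log_score)

# Made-up candidates for a single utterance.
candidates = [
    Hypothesis("recognize speech", -12.4),
    Hypothesis("wreck a nice beach", -13.1),
    Hypothesis("recognized speech", -14.8),
    Hypothesis("wreck an ice peach", -19.5),
]

top_three = n_best(candidates, 3)   # the N-best list with N = 3
recognition_result = top_three[0]   # the single best recognition candidate
```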
Confidence scores can be used to characterize the degree of correspondence between a given model sequence and the speech input. FIG. 1 shows a scale of confidence scores along a vertical axis ranging from a high of 1000 to a low of 0. Typically, speech recognition outputs having a confidence score above a given accept threshold are automatically accepted as probably correctly recognized, and outputs having a confidence score below a given reject threshold are automatically rejected as probably not correctly recognized. Speech recognition outputs between the two thresholds may or may not be correctly recognized and usually require some form of user confirmation.
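The three-way decision described above can be sketched as follows. The 0 to 1000 scale comes from FIG. 1, but the specific threshold values (700 and 300), the handling of scores exactly equal to a threshold, and the names `Decision` and `classify` are assumptions made for this illustration.

```python
from enum import Enum

class Decision(Enum):
    ACCEPT = "accept"    # probably correctly recognized
    CONFIRM = "confirm"  # uncertain, ask the user
    REJECT = "reject"    # probably not correctly recognized

def classify(confidence, accept_threshold=700, reject_threshold=300):
    """Map a confidence score on the 0..1000 scale to a decision.

    The default thresholds are hypothetical. A score exactly equal to
    either threshold is treated as falling between them (an assumption).
    """
    if not 0 <= confidence <= 1000:
        raise ValueError("confidence must lie in the range 0..1000")
    if reject_threshold > accept_threshold:
        raise ValueError("reject threshold must not exceed accept threshold")
    if confidence > accept_threshold:
        return Decision.ACCEPT
    if confidence < reject_threshold:
        return Decision.REJECT
    return Decision.CONFIRM

# Example scores, invented for illustration.
decisions = {score: classify(score) for score in (850, 500, 120)}
```

In practice the two thresholds would be tuned to trade off false acceptances against the number of confirmation prompts shown to the user.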
Confidence scores are widely used in automated dialog systems, but to date their use in dictation tasks has been rather limited. At present, confidence scoring in dictation applications has been confined to identifying incorrectly recognized words or to multi-pass recognition.