The present invention relates to computer speech recognition. More particularly, the present invention relates to a confidence measure system using a near-miss pattern or a plurality of possible words.
Speech recognition systems are generally known. During speech recognition, speech is provided as an input to the system in the form of an audible voice signal, such as through a microphone. The microphone converts the audible speech signal to an analog electronic signal. An analog-to-digital converter receives the analog signal and produces a sequence of digital signals. A conventional array processor performs spectral analysis on the digital signals and computes a magnitude value for each frequency band of a frequency spectrum. In one embodiment, the digital signal received from the analog-to-digital converter is divided into frames. The frames are encoded as feature vectors that reflect spectral characteristics for a plurality of frequency bands. In the case of discrete and semi-continuous hidden Markov modeling, the feature vectors are encoded into one or more code words using vector quantization techniques and a code book derived from training data. Output probability distributions are then preferably computed against hidden Markov models using the feature vector (or code words) of the particular frame being analyzed. These probability distributions are later used in executing a Viterbi or similar type of processing technique. Stored acoustic models, such as hidden Markov models, together with a lexicon and a language model, are used to determine the most likely representative word for the utterance received by the system.
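The framing and vector quantization steps described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular recognizer: the function names, the frame length and hop size, the toy code book, and the use of Euclidean distance to select a code word are all assumptions made for illustration, and the spectral analysis that would turn each frame into a feature vector is omitted.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def split_into_frames(samples, frame_len, hop):
    """Divide a digitized signal into fixed-length, possibly overlapping frames.

    frame_len and hop are hypothetical parameters; trailing samples that do
    not fill a whole frame are dropped.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def quantize(feature_vector, codebook):
    """Return the index of the code word nearest to the feature vector.

    Nearest is measured by Euclidean distance, which is an assumption here.
    """
    return min(range(len(codebook)),
               key=lambda k: dist(feature_vector, codebook[k]))


# Toy data, invented for illustration only.
samples = [0, 1, 2, 3, 4, 5, 6, 7]
frames = split_into_frames(samples, frame_len=4, hop=2)

# A real code book would be derived from training data.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.5)]
code_word = quantize((0.9, 1.2), codebook)
```

With these toy values the signal yields three overlapping frames, and the feature vector (0.9, 1.2) maps to the code word at index 1.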
While modern speech recognition systems generally produce good search results for utterances actually present in the recognition inventory, such systems have no way of discarding the incorrect search results returned for out-of-vocabulary (OOV) input utterances. In such cases, use of a confidence measure as applied to the recognition results can provide assurance as to the results obtained. Confidence measures have been used in many forms of speech recognition applications, including supervised and unsupervised adaptation, recognition error rejection, OOV word detection, and keyword spotting. A method that has been used for confidence modeling is the comparison of the score of the hypothesized word with the score of a “filler” model. One such system is described by R. C. Rose and D. B. Paul, in “A Hidden Markov Model Based Key Word Recognition System,” published in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 129-132, 1990.
It is believed by many that the confidence measure should be based on the ratio between the recognition score and the score of the “filler” model, which is usually used to model OOV words. The “filler” models are often one of the following two types: (1) a context independent (CI) phone network where every phone is connected to every other phone; or (2) a large context dependent vocabulary system where phone connections represent almost all the possible words in a particular language. While the context independent phone network approach is very efficient, its performance is mediocre at best because of the use of imprecise CI models. The context dependent approach can generate decent confidence measures, but suffers from two shortcomings. First, the approach considers only the ratio of the scores of the best recognized word and the best “filler” model word. Second, because it relies on a single ratio comparison, the approach requires building all words into the OOV network, which is not practical and also makes the system ineffective for rejecting noise sources other than OOV words.
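The single-ratio confidence measure discussed above can be sketched as follows. The sketch assumes that both scores are log-likelihoods, so the ratio becomes a difference, and that a fixed acceptance threshold is applied; the function names, the threshold, and the score values are hypothetical and are not taken from any cited system.

```python
def filler_ratio_confidence(word_log_score, filler_log_score):
    """Log of the ratio between the recognition score and the filler score.

    Assumes both inputs are log-likelihoods, so the ratio is a difference.
    """
    return word_log_score - filler_log_score


def accept(word_log_score, filler_log_score, threshold=0.0):
    """Accept the hypothesized word when the log ratio meets the threshold.

    The threshold value is an assumption and would normally be tuned on data.
    """
    return filler_ratio_confidence(word_log_score, filler_log_score) >= threshold


# Invented scores: the first hypothesis beats the filler model, the second does not.
in_vocabulary = accept(word_log_score=-10.0, filler_log_score=-12.0)
out_of_vocabulary = accept(word_log_score=-15.0, filler_log_score=-12.0)
```

Because the decision rests on one number, everything other than the best recognized word and the best filler word is ignored, which is the first shortcoming noted above.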
A method and system for performing confidence measurement in speech recognition systems includes receiving an utterance of input speech and creating a near-miss pattern or a near-miss list of possible word entries for the input utterance. Each word entry includes an associated value of probability that the utterance corresponds to the word entry. The near-miss list of possible word entries is compared with corresponding stored near-miss confidence templates. Each near-miss confidence template includes a list of word entries and each word entry in each list includes an associated value. The confidence measure for a particular hypothesis word is computed based on the comparison of the values in the near-miss list of possible word entries with the values of the corresponding near-miss confidence template.
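The comparison of a near-miss list with a stored near-miss confidence template can be illustrated with a short Python sketch. The text above does not fix a comparison metric, so the one used here (one minus half the sum of absolute differences between the two sets of values) is an assumption chosen only because it yields 1.0 for identical lists and 0.0 for lists with no words in common when each list sums to one. The dictionary representation, the function name, and the example words and values are likewise hypothetical.

```python
def near_miss_confidence(near_miss_list, template):
    """Compare a near-miss list with a near-miss confidence template.

    Both arguments map word entries to their associated values. A word that
    is missing from one side is treated as having the value 0.0 there.
    The metric is an illustrative assumption, not a prescribed formula.
    """
    words = set(near_miss_list) | set(template)
    total_difference = sum(
        abs(near_miss_list.get(w, 0.0) - template.get(w, 0.0)) for w in words
    )
    return 1.0 - total_difference / 2.0


# Hypothetical stored template for the hypothesis word "writing".
templates = {
    "writing": {"writing": 0.6, "riding": 0.3, "waiting": 0.1},
}

# Hypothetical near-miss list produced for one input utterance.
near_miss_list = {"writing": 0.5, "riding": 0.3, "lighting": 0.2}

confidence = near_miss_confidence(near_miss_list, templates["writing"])
```

For these invented values the absolute differences sum to about 0.4, giving a confidence of about 0.8. Unlike a single score ratio, the result depends on every entry in the list.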
Another aspect of the present invention is a system and method for generating word-based, near-miss confidence templates for a collection of words in a speech recognition system. Each near-miss confidence template is generated from multiple near-miss lists produced by a recognizer on multiple acoustic data for the same word. Each near-miss confidence template of the set of near-miss confidence templates includes a list of word entries having an associated probability value related to acoustic similarity.
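One way to generate a near-miss confidence template from multiple near-miss lists for the same word is sketched below. Averaging the values of each word entry across the lists is an assumption made for illustration; the text above states only that a template is generated from multiple lists. The function name and the example lists are hypothetical.

```python
from collections import defaultdict


def build_template(near_miss_lists):
    """Combine several near-miss lists for one word into a single template.

    Each list maps word entries to probability values. A word absent from a
    list contributes 0.0 for that list. Averaging is an assumed combination
    rule, not one stated in the text.
    """
    if not near_miss_lists:
        raise ValueError("at least one near-miss list is required")
    totals = defaultdict(float)
    for near_miss_list in near_miss_lists:
        for word, value in near_miss_list.items():
            totals[word] += value
    count = len(near_miss_lists)
    return {word: total / count for word, total in totals.items()}


# Two invented near-miss lists from different recordings of "writing".
lists_for_writing = [
    {"writing": 0.6, "riding": 0.4},
    {"writing": 0.8, "waiting": 0.2},
]
template = build_template(lists_for_writing)
```

With these inputs the template holds roughly 0.7 for "writing", 0.2 for "riding", and 0.1 for "waiting", so words that are acoustically similar to the target word keep a nonzero value.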