The present invention relates to digital signal processing, and more particularly to automatic speech recognition.
The last few decades have seen the rising use of hidden Markov models (HMM) in automatic speech recognition (ASR) in various applications, such as hands-free dialing for cellphones. In this case a user orally tells the cellphone to call a name, and the cellphone's ASR recognizes the utterance as a call command plus a name in its telephone directory. Of course, utterance-level verification of the recognized items is a necessary component for such real-world ASR applications. By computing an utterance-level confidence score or confidence measure (CM), an utterance verification (UV) system can measure how well the recognition hypothesis matches the observed utterance data. Utterances whose confidence scores fall below a pre-determined threshold are rejected. Ideally, misrecognized in-vocabulary (IV) and out-of-vocabulary (OOV) phrases would be rejected by the utterance verification mechanism.
Various methods have been proposed for computing confidence measures, and these methods fall into three categories. The first category is based on predictor features, such as the acoustic score per frame and HMM state duration. An ideal predictor feature should provide strong information to separate correctly recognized utterances from misrecognized utterances, with little overlap between the distributions of the two classes. However, so far, none of the predictor features is ideal in this sense.
The other two categories of confidence measures are based on statistical frameworks: a posteriori probability or hypothesis testing. Confidence measures in the category of a posteriori probability use a model to compute a score to normalize the likelihood of a recognition result, so that the normalized likelihood approximates the a posteriori probability of the recognized sequence given the observation sequence.
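As a minimal sketch of such a normalization, assume the normalizing score is approximated by summing the joint likelihoods over an N-best list of competing hypotheses. That choice, the function name, and the example log scores are illustrative assumptions and are not taken from any particular recognizer.

```python
import math

def posterior_confidence(log_joint_scores, best_index=0):
    """Approximate P(W_best | O) from log joint scores log p(O|W_i) + log P(W_i).

    Assumption: the N-best list stands in for the normalizing model.
    """
    # Log-sum-exp over all hypotheses gives the log of the normalizer,
    # computed stably by factoring out the maximum score.
    m = max(log_joint_scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in log_joint_scores))
    # Normalized likelihood of the selected hypothesis, in the range (0, 1].
    return math.exp(log_joint_scores[best_index] - log_norm)

# Hypothetical log scores for three competing hypotheses.
scores = [-120.0, -123.0, -130.0]
confidence = posterior_confidence(scores)
```

The closer the competing hypotheses score to the top hypothesis, the lower the resulting confidence, which is the behavior expected of an approximate a posteriori probability.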
Confidence measures based on hypothesis testing use a likelihood ratio test between a model and its competing model (e.g., an anti-model or a filler model). The framework of hypothesis testing is flexible enough to incorporate a variety of methods, such as discriminative training, that may progressively improve the performance of CM computation. For example, see Rahim et al., Discriminative Utterance Verification for Connected Digits Recognition, 5 IEEE Trans. Speech and Audio Processing 266 (May 1997).
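The likelihood ratio test can be sketched as follows, assuming per-frame log-likelihoods are already available from the target model and from its competing model. The per-frame averaging, the function name, and the example values are illustrative assumptions rather than the method of the cited reference.

```python
def verify_utterance(target_log_likes, anti_log_likes, threshold):
    """Return (score, accepted) for a log-likelihood ratio test.

    Assumption: the score is the log-likelihood ratio averaged over frames,
    so that utterances of different lengths share one threshold.
    """
    if not target_log_likes or len(target_log_likes) != len(anti_log_likes):
        raise ValueError("need two non-empty score lists of equal length")
    n_frames = len(target_log_likes)
    # Difference of log-likelihoods equals the log of the likelihood ratio.
    score = (sum(target_log_likes) - sum(anti_log_likes)) / n_frames
    # Utterances scoring below the threshold are rejected.
    return score, score >= threshold

# Hypothetical per-frame log-likelihoods for a three-frame utterance.
target = [-5.0, -4.5, -6.0]
anti = [-6.0, -5.5, -6.5]
score, accepted = verify_utterance(target, anti, threshold=0.5)
```

Raising the threshold rejects more misrecognized and out-of-vocabulary utterances at the cost of rejecting more correct ones, so in practice it would be tuned on held-out data.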
Most of the foregoing methods can generate accurate confidence measures. However, for mobile devices, confidence measures face some unique challenges. Mobile devices are portable. They may be used everywhere, but they usually have limited memory and computational resources. Hence, for mobile applications, confidence measures have to be robust to noise distortion, which frequently occurs in real environments. Moreover, they have to be efficiently computed because of restrictions on computational cost. These problems may be addressed in the following two aspects:
    At the recognizer level, robustness to noise can be enhanced by introducing more robust model adaptation and speech enhancement methods.
    At the verification level, inherent features that are robust to noise distortion may be employed; efficient methods to extract such features may be generated; and better model training methods may be applied.