Most speaker-dependent speech recognition systems cannot detect recognition errors triggered by out-of-vocabulary (OOV) words or by utterances that have been severely corrupted by environmental noise. As a result, avoidable recognition errors may frustrate a user and lower the perceived benefit of an otherwise reliable speech recognition system.
Error detection is a necessary component of a speech recognition system if its overall usability is to improve. An isolated-word, command-and-control recognizer can encounter three types of recognition errors. The first, a deletion error, occurs when an input utterance is either not recognized as anything or is recognized as environmental noise. In this case, the user interface should handle the error properly and re-prompt the user to repeat the utterance. The second, an insertion error, occurs when the user does not say anything but the system nevertheless recognizes a word. The third, a substitution error, occurs when an incorrect word is recognized instead of the correct utterance. This can happen when a user either says a valid vocabulary word or inadvertently inputs an OOV utterance.
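The three error types above can be illustrated with a minimal sketch. It assumes a test harness that knows what was actually spoken; the function name `classify_result` and the use of `None` for "nothing spoken" or "nothing recognized" are hypothetical conventions chosen for illustration, not part of any described system.

```python
from typing import Optional


def classify_result(spoken: Optional[str], recognized: Optional[str]) -> str:
    """Label one recognition attempt.

    spoken:     the word the user actually said, or None if the user was silent.
    recognized: the word the recognizer returned, or None if it returned
                nothing (or classified the input as noise).
    """
    if spoken is None and recognized is None:
        return "correct"          # silence correctly ignored
    if spoken is not None and recognized is None:
        return "deletion"         # utterance missed; re-prompt the user
    if spoken is None:
        return "insertion"        # a word was recognized although nothing was said
    if spoken != recognized:
        return "substitution"     # wrong word, including any OOV input
    return "correct"


print(classify_result("call", None))      # deletion
print(classify_result(None, "dial"))      # insertion
print(classify_result("redial", "dial"))  # substitution
```

An OOV utterance that yields any vocabulary word falls under the substitution branch, because no vocabulary word can equal it.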
In a speaker-dependent recognition system, out-of-vocabulary utterances usually arise when users attempt to select a word they have not enrolled, or when they have forgotten the exact utterances that were previously enrolled. A speech recognizer can also mistake extraneous background noise or background conversation for a valid input utterance. The resulting substitution and insertion errors can be the most disruptive, as the system might initiate an invalid action that the user must then abort. The ability to identify and properly handle these two types of recognition errors can significantly improve the overall performance of a speech recognition system. Furthermore, when the speech recognition system is used in a hands-busy or eyes-busy situation, such as while driving, unnecessary system demands for user attention should be avoided.
Numerous techniques based on threshold-based confidence measures for detecting recognition errors have been researched and implemented for isolated and continuous recognition systems. Confidence measures based on the results of an N-best Viterbi search have been used. While useful for identifying certain substitution errors, these techniques are not adequate for identifying OOV occurrences. In addition, the increased computational complexity of a confidence measure based on an N-best search may be a considerable drawback for low-cost DSP implementations.
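As a rough illustration of a threshold-based confidence measure, the sketch below uses one simple form: the margin between the best and second-best hypothesis scores. This particular measure is an assumption made for illustration; the text above does not specify which measure the referenced techniques use. The function name `accept_by_margin` is hypothetical, and scores are assumed to be log-likelihoods where higher is better.

```python
from typing import Sequence


def accept_by_margin(n_best_scores: Sequence[float], threshold: float) -> bool:
    """Accept the top hypothesis only if it beats the runner-up by `threshold`.

    n_best_scores: scores of the N-best hypotheses (assumed log-likelihoods,
                   higher is better), in any order.
    """
    if len(n_best_scores) < 2:
        raise ValueError("need at least two hypotheses to compute a margin")
    ranked = sorted(n_best_scores, reverse=True)
    margin = ranked[0] - ranked[1]
    return margin >= threshold


# Clear winner: accepted. Near tie: rejected as a likely substitution error.
print(accept_by_margin([-10.0, -14.0, -20.0], threshold=3.0))
print(accept_by_margin([-10.0, -11.0, -20.0], threshold=3.0))
```

A margin of this kind compares vocabulary words only against each other, which is consistent with the limitation noted above for OOV input.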
Prior art methods used in speaker-independent systems for rejecting OOV utterances are based on an explicit garbage, or filler, model trained off-line on a multi-speaker database of OOV utterances. A model can be characterized as a parametric representation of a vocabulary item in the data store of a speech recognition system. Typical representations of a model include a conventional template, as used in a dynamic time warping (DTW) recognizer; a statistical representation, as is common in a Hidden Markov Model (HMM) recognizer; or the set of weights used to characterize a multi-layer artificial neural network (ANN). With an explicit garbage model, when the input utterance corresponds to an OOV item, the best match resulting from the standard Viterbi decoder corresponds to the garbage model. This methodology is usually not adequate for a speaker-dependent system, because no database of OOV utterances is available a priori to train an off-line model for a particular user. Furthermore, it is not practical to ask a user to provide a series of input tokens that are not part of the user's regular vocabulary for the sole purpose of training an on-line garbage model.
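The rejection rule for an explicit garbage model can be sketched as follows. This is a minimal illustration that assumes the decoder has already produced one score per model (log-likelihoods, higher is better); the label `GARBAGE` and the function `decode_with_garbage` are hypothetical names, and the decoding itself is not shown.

```python
from typing import Dict, Optional

GARBAGE = "<garbage>"  # hypothetical label for the explicit filler model


def decode_with_garbage(model_scores: Dict[str, float]) -> Optional[str]:
    """Return the best vocabulary word, or None if the garbage model wins.

    model_scores maps each model name (vocabulary words plus GARBAGE) to its
    decoder score for the current utterance.
    """
    if GARBAGE not in model_scores:
        raise ValueError("scores must include the garbage model")
    best = max(model_scores, key=lambda name: model_scores[name])
    return None if best == GARBAGE else best


# In-vocabulary utterance: a vocabulary model outscores the garbage model.
print(decode_with_garbage({"call": -10.0, "dial": -12.0, GARBAGE: -15.0}))
# OOV utterance: the garbage model is the best match, so the input is rejected.
print(decode_with_garbage({"call": -20.0, "dial": -22.0, GARBAGE: -15.0}))
```

The rule is only as good as the garbage model itself, which is why the lack of OOV training data for an individual user matters.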
Yet another prior art method, used initially in keyword-spotting applications, does not require an explicit filler model. This method is based on an average local garbage score that is calculated from the N-best scores at each time frame. A frame can be defined, for example, as a time interval over which some relevant parameters are extracted from the speech signal. The frame then becomes the time unit over which the recognizer operates. Once the decoding process is complete, a total garbage score can be computed by summing the local garbage scores between the end-points of the recognized utterance. The disadvantage of such an approach, however, is that the total garbage score is computed as a post-processing step and relies on the end-points corresponding to the best vocabulary match. Such an approach also implies a perfect alignment of the garbage model with the recognized utterance, and such a forced alignment may not be as efficient or accurate as a technique relying on a separate model for handling OOV occurrences.
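The local and total garbage scores described above can be sketched as follows. The sketch assumes that per-frame scores are log-likelihoods (higher is better) and that the end-points are given as inclusive frame indices; the function names and the example values are hypothetical.

```python
from typing import List, Sequence


def local_garbage_scores(frame_scores: Sequence[Sequence[float]],
                         n_best: int) -> List[float]:
    """For each frame, average the n_best highest scores in that frame."""
    if n_best < 1:
        raise ValueError("n_best must be at least 1")
    result = []
    for scores in frame_scores:
        top = sorted(scores, reverse=True)[:n_best]
        result.append(sum(top) / len(top))
    return result


def total_garbage_score(local_scores: Sequence[float],
                        start: int, end: int) -> float:
    """Sum local garbage scores over frames start..end (inclusive).

    start and end are the end-points of the recognized utterance, so this
    can only run after decoding has produced the best vocabulary match.
    """
    return sum(local_scores[start:end + 1])


# Three frames, three candidate scores per frame, averaging the 2 best.
frames = [[-1.0, -2.0, -3.0], [-2.0, -4.0, -6.0], [-1.0, -1.0, -5.0]]
local = local_garbage_scores(frames, n_best=2)
print(local)                             # [-1.5, -3.0, -1.0]
print(total_garbage_score(local, 1, 2))  # -4.0
```

Because `total_garbage_score` takes its end-points from the recognized utterance, the sketch makes the post-processing dependency noted above explicit.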
Accordingly, there is a need for a method of calculating a standard garbage model for the detection of OOV utterances within the framework of a speaker-dependent system.