This invention relates to a method for estimating a confidence measure for a speech recognition system, especially, though not exclusively, for a variable vocabulary speech recognition system.
A fixed vocabulary speech recognition system, whether large vocabulary or limited vocabulary, is one where the words that a speaker can utter are, in general, known in advance, whereas a variable vocabulary speech recognition system is one where the word or words that the speaker can utter are not known in advance, that is, before the system is sold to the user. An example of a variable vocabulary speech recognition system is nametag voice dialing, where the names that the user will choose to store for voice dialing are completely unknown to the manufacturer.
As speech recognition systems are deployed in ever increasing numbers and situations, they need to be sufficiently flexible to cope with a wide range of user responses and behavior, including, for example, heavy accents, hesitations, pauses within words, false starts and other sounds such as “umm's” and “ah's”. Other extraneous sounds, such as lip smacks and heavy breathing sounds, must also be taken into consideration, as well as environmental noises, especially background noises, such as talking, loud music, door closings and road noise in a car environment.
Another common problem is users speaking words that do not belong to the speech recognition system's vocabulary. This is commonly called an Out-Of-Vocabulary (OOV) problem. Without a verification strategy, a speech recognition system may choose the most likely pre-trained model as the recognition result.
Therefore, the speech recognition system must be able to determine whether the word that it chooses as the most likely match for the speaker's utterance is truly “correct” or not. This “correctness” problem can be stated as a correspondence question between the output of the recognition system and the actual input utterance. The correspondence rule is specified by the requirements of the particular application. For example, the same correspondence rule would not generally be used for both isolated word recognition and information retrieval systems, since in the former the correspondence should be between words, whereas in the latter the correspondence should be between meanings, i.e. by key word spotting. It is, of course, highly desirable to maximize this correspondence, but however much effort is put into doing so, every time a recognized word sequence is considered there is inherently some degree of uncertainty regarding its “correctness”. Therefore, it is desirable to build up a confidence measure of how close the recognized word sequence is to the input utterance, so that the recognition output can be considered as “correct” or “incorrect”. The majority of incorrect recognitions by speech recognition systems are caused by the kinds of background noises mentioned above. A reliable speech recognition system should favor rejecting such an improper recognition over returning an incorrect recognition result. After the rejection, the system should advise the user of the cause of the failure and prompt the user to try again. It should be emphasized that the rejection of valid, but wrongly recognized, keywords is very useful in many applications where the cost of misrecognition far exceeds the cost of rejection.
For example, in voice tag recognition systems, dialing a wrong number should be avoided by rejecting a less confidently recognized result.
A typical confidence measure (CM) is a number between 0 and 1 which indicates the probability that the underlying word or utterance is recognized correctly. A value of CM=1 indicates that the system has perfect knowledge of which words are correct, whereas a value of CM=0 indicates that the system's recognition output is highly unreliable. An accurate determination of the confidence measure is therefore very useful in such systems to enable their outputs to be correctly interpreted as being likely to be correct or not.
Traditionally, a confidence measure has been calculated using one or a combination of garbage models and anti-keyword models. Garbage models and anti-keyword models often play an important role in the CM estimation, and they work very well in speaker-independent fixed-vocabulary speech recognition systems. The garbage models are normally trained by using a very large speech data collection which excludes the within-vocabulary words. The anti-keyword models are trained by using misrecognized speech utterances. To train the two types of models, prior knowledge of which words are included in the system's vocabulary is needed. However, in a variable-vocabulary speech recognition system, such as nametag voice dialing, such prior knowledge is unavailable. Therefore these well known types of models cannot be used to verify the recognition results.
The present invention therefore seeks to provide a method for estimating a confidence measure for a speech recognition system which improves, at least in some cases, on the methods of the prior art.
Accordingly, in a first aspect, the invention provides a method of estimating a confidence measure for a speech recognition system, the method comprising the steps of receiving an input utterance, comparing the input utterance with a plurality of predetermined models of possible utterances to provide a plurality of scores indicating a degree of similarity between the input utterance and the plurality of predetermined models, determining a variance of a predetermined number of the plurality of scores, and normalizing the variance to provide a confidence measure for the input utterance.
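As an illustration only, the steps of the first aspect can be sketched in Python. This passage does not state which scores are selected or how the variance is normalized, so the names confidence_measure, n_best and scale, the choice of the highest-scoring models, and the mapping variance / (variance + scale) are all assumptions made for this sketch and are not taken from the method itself.

```python
from statistics import pvariance


def confidence_measure(scores, n_best=3, scale=1.0):
    """Sketch: variance of the n_best highest scores, squashed into [0, 1).

    Assumptions (not from the source text): higher scores mean greater
    similarity, the "predetermined number" of scores is the n_best highest,
    and the normalization is variance / (variance + scale).
    """
    if not scores:
        raise ValueError("at least one score is required")
    if n_best < 1 or scale <= 0:
        raise ValueError("n_best must be >= 1 and scale must be positive")

    # Keep only the predetermined number of best-matching model scores.
    top_scores = sorted(scores, reverse=True)[:n_best]

    # Population variance of the selected scores.
    variance = pvariance(top_scores)

    # Hypothetical normalization: 0 when all scores are equal, approaching 1
    # as the selected scores spread further apart.
    return variance / (variance + scale)
```

Under these assumptions, a set of scores in which one model clearly stands out from the others yields a larger value than a set of nearly equal scores, for which the value is close to 0.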
According to a second aspect, the invention provides a method of determining whether an input utterance to a speech recognition system is correctly recognized by the system or whether a recognition result is incorrect, the method comprising the steps of determining a likely recognition result for an input utterance, estimating a confidence measure for the likely recognition result utilizing the method described above, determining a threshold, comparing the threshold with the confidence measure, and accepting or rejecting the recognition result according to whether the confidence measure is above or below the threshold.
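The accept-or-reject decision of the second aspect can be sketched as follows. The function name verify_recognition and the returned tuple are hypothetical, and treating a confidence measure exactly equal to the threshold as an acceptance is an assumption, since the text only distinguishes values above and below the threshold.

```python
def verify_recognition(result, confidence, threshold):
    """Sketch: accept or reject a likely recognition result.

    result      -- the likely recognition result (e.g. a recognized word)
    confidence  -- confidence measure estimated for that result
    threshold   -- decision threshold, already weighted if weighting is used

    Returns (accepted, result_or_None). Equality is treated as acceptance,
    which is an assumption of this sketch.
    """
    accepted = confidence >= threshold
    # On rejection no result is returned, so the caller can prompt the
    # user to try again instead of acting on a doubtful recognition.
    return accepted, (result if accepted else None)
```

For example, a caller in a voice dialing application might dial only when the first element of the returned tuple is true, and otherwise ask the user to repeat the name.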
Preferably, the step of determining a threshold comprises weighting the threshold depending on the noise level in an input signal containing the input utterance. The threshold is preferably weighted according to a signal to noise ratio of the input signal. In a preferred embodiment, the weighting has a first value at low noise levels, a second value at high noise levels, and varies between the first and second values at intermediate noise levels. Preferably, the first value is 1 when the signal to noise ratio of the input signal is greater than approximately 15, and the second value is 0 when the signal to noise ratio of the input signal is smaller than approximately 8. The weighting (W) is preferably given by:
W = (SNR − 8)/7
for signal to noise ratio (SNR) values between approximately 8 and 15.
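The preferred noise-dependent weighting can be sketched as a piecewise function. The function name snr_weight is hypothetical, the break points 8 and 15 are treated as exact although the text calls them approximate, and how W is then applied to the threshold is not stated in this passage, so the sketch only computes W.

```python
def snr_weight(snr):
    """Sketch: noise-dependent weighting W as a function of SNR.

    W is 1 at low noise (SNR above 15), 0 at high noise (SNR below 8),
    and rises linearly as (SNR - 8) / 7 in between. The break points are
    taken as exact here, which is a simplification.
    """
    if snr >= 15:
        return 1.0  # first value: low noise level
    if snr <= 8:
        return 0.0  # second value: high noise level
    return (snr - 8) / 7  # intermediate noise levels
```

The linear segment meets both end values, since (8 − 8)/7 = 0 and (15 − 8)/7 = 1, so W is continuous over the whole range.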
Preferably, the step of determining a threshold comprises weighting the threshold depending on the number of predetermined models that the input utterance is compared with. The weighting (W) is preferably given by
W = 0.6 + 1.08 × e^(−VS/10.0)
where the number of predetermined models (VS) is 2 or more.
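The weighting by number of models can be sketched in the same way. The function name vocabulary_size_weight is hypothetical, and the sketch assumes that the whole quotient VS/10.0 is the exponent, that is, W = 0.6 + 1.08 × exp(−VS/10.0).

```python
import math


def vocabulary_size_weight(vs):
    """Sketch: weighting W as a function of the number of models VS.

    Assumes the formula reads W = 0.6 + 1.08 * exp(-VS / 10.0), with the
    whole quotient VS / 10.0 in the exponent, and VS of 2 or more.
    """
    if vs < 2:
        raise ValueError("the number of predetermined models must be 2 or more")
    return 0.6 + 1.08 * math.exp(-vs / 10.0)
```

Under this reading, W decreases as the number of models grows and tends towards 0.6 for very large vocabularies.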