1. Field of the Invention
The invention relates to methods of assessing decoys for use in an audio recognition process, to methods of audio recognition for identifying predetermined sounds in an unknown input audio signal, using decoys, to apparatus and to software for such methods.
2. Background Art
It is known to perform pattern matching such as speech recognition, using steps of:
1) matching an unknown input against a number of models of known speech (a lexicon); and
2) classifying the results (termed tokens), e.g. determining if the closest match is likely to be correct, with or without a positive rejection step.
Classifying recognition results without rejection is usually simple: the recognizer's top choice is either correct or wrong. With rejection, things are a little more complicated. Rejection attempts to detect when the recognition result is incorrect, either because the person said something that is outside the lexicon or because the recognizer has made an error. If the person has said something that is outside the lexicon, the utterance is called a non-vocabulary utterance, referred to herein at times as an imposter utterance. For example, a typical speech recognition application could have about 10% non-vocabulary utterances, which means that 10% of the time, the person says something that is not in the recognizer's vocabulary. The result, after rejection, is classified as one of:
correct acceptance (CA): The recognizer's top choice will lead to performing the correct action, and the rejection algorithm accepts the result (note that this does not mean that the recognizer has gotten every word correct, but just that it has gotten all the important ones correct; for example, in the locality task, it has gotten the locality correct but may have the wrong prefix or suffix).
false acceptance (FA): The top choice is incorrect, either because of a recognition error, or because the token is a non-vocabulary utterance but is not rejected.
correct rejection (CR): The token is an imposter and it is rejected.
false rejection (FR): The token is not a non-vocabulary utterance, but it is rejected (note that if the top choice of the recognizer is wrong, the rejection algorithm is correct to reject the result, but it is still referred to as a false rejection because the notions of correct and false are relative to what the speaker intended, and not to the recognizer).
carrying out a test recognition process by matching known training audio signals to models representing the predetermined sounds and the decoys; and
determining for each of the decoys, from the results of the test recognition process, a score representing the effect of the respective decoy in the recognition of any of the known training audio signals.
An advantage arising from generating scores for decoys is that the chance of a poor selection of decoys can be reduced. Thus the possibility of poor recognition performance arising from poorly selected decoys can be reduced. Furthermore, the requirement for expert input into the decoy creation process, which may be time consuming, can be reduced. This can make it easier, or quicker, or less expensive to install or adapt to particular circumstances. Also, better rejection, or less false acceptance, may be achieved if some decoys are identified which are unexpectedly good.
determining whether the respective decoy is a closer match to a given one of the known training audio signals than the best matching predetermined sound; and
determining the score according to the result.
An advantage arising from this is that it helps determine objectively which are particularly good, and which are particularly bad decoys.
determining how many other decoys are a closer match to the given one of the known training audio signals than the best matching predetermined sound; and
determining the score according to the result.
An advantage arising from this is that it helps determine objectively which are critically good, or critically bad decoys.
determining how close a match the respective decoy is relative to any other decoys which are a closer match to the given one of the known training audio signals than the best matching predetermined sound; and
determining the score according to the result.
An advantage arising from this is that it helps to refine the scoring further, to distinguish between clusters of decoys.
performing the audio recognition process for identifying predetermined sounds in an unknown input audio signal by matching the unknown input audio signal to models of the predetermined sounds and the decoys, according to the scores associated with the decoys.
If the decoys are well selected, rejection of non-vocabulary utterances can be improved, and/or recognition can be improved. The latter occurs by reducing confusion between similar words in the vocabulary, by selection of decoys which lie in between the similar words, and thus can suppress incorrect recognition.
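By way of illustration only, the decoy scoring steps described above might be sketched as follows. This is a minimal sketch under stated assumptions and not the scoring actually used: the function name `score_decoys`, the use of distances (lower meaning a closer match), the sign convention, and the equal sharing of credit among decoys are all hypothetical. It uses two of the factors mentioned (whether a decoy beats the best matching vocabulary entry, and how many other decoys also do); the relative closeness among decoys is omitted.

```python
def score_decoys(training, vocabulary, decoys):
    """Assign each decoy a score from a test recognition run.

    training: list of (intended, distances) pairs, where `intended` is the
    word actually spoken (None for a non-vocabulary utterance) and
    `distances` maps every model name to a match distance (lower = closer).
    All names and the scoring rule are illustrative assumptions.
    """
    scores = {d: 0.0 for d in decoys}
    for intended, distances in training:
        # Best matching predetermined sound (vocabulary entry) for this token.
        top_word = min(vocabulary, key=lambda w: distances[w])
        best_vocab = distances[top_word]
        # Decoys that are a closer match than the best vocabulary entry.
        beating = [d for d in decoys if distances[d] < best_vocab]
        if not beating:
            continue
        # Assumption: a decoy helps when it blocks a wrong result (imposter
        # token, or wrong top word) and hurts when it blocks a correct one.
        helpful = intended is None or top_word != intended
        sign = 1.0 if helpful else -1.0
        # Credit or blame is shared equally among all decoys that beat the word.
        for d in beating:
            scores[d] += sign / len(beating)
    return scores


vocabulary = ["boston", "austin"]
decoys = ["hello", "ah"]
training = [
    (None, {"boston": 5.0, "austin": 6.0, "hello": 2.0, "ah": 4.0}),
    ("boston", {"boston": 3.0, "austin": 7.0, "hello": 8.0, "ah": 2.5}),
    ("austin", {"boston": 4.0, "austin": 1.0, "hello": 6.0, "ah": 5.0}),
]
scores = score_decoys(training, vocabulary, decoys)
print(scores)
```

In this made-up data, the decoy "hello" only ever blocks an imposter token and ends with a positive score, whereas "ah" also blocks a correctly recognized word and ends with a negative score, which would mark it as a candidate for removal.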
The other commonly used term is "forced choice accuracy," which refers to the number of times the recognizer's top choice is correct, without considering rejection. The maximum value for forced choice accuracy is 100% minus the non-vocabulary utterance rate. The forced choice accuracy is the maximum possible value for CA, which occurs when the rejection algorithm accepts all correct recognitions. Typically, however, a (hopefully) small percentage of the correct recognitions are rejected, so that CA is less than the forced choice accuracy (typically by on the order of 10%).
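The four outcome classes and the forced choice accuracy defined above can be illustrated with a short sketch. The function names `classify` and `summarize`, the token layout, and the sample data are hypothetical and serve only to make the definitions concrete; a non-vocabulary (imposter) utterance is represented by an intended word of `None`.

```python
from collections import Counter


def classify(intended, top_choice, accepted):
    """Return "CA", "FA", "CR" or "FR" for one token.

    intended: the word the speaker meant, or None for a non-vocabulary token.
    top_choice: the recognizer's top choice.
    accepted: True if the rejection algorithm accepted the result.
    """
    in_vocab = intended is not None
    if accepted:
        return "CA" if in_vocab and top_choice == intended else "FA"
    # Rejection is judged against what the speaker intended, so rejecting an
    # in-vocabulary token counts as FR even when the top choice was wrong.
    return "FR" if in_vocab else "CR"


def summarize(tokens):
    """Count outcomes and compute forced choice accuracy over all tokens."""
    counts = Counter(classify(*t) for t in tokens)
    n = len(tokens)
    # Forced choice accuracy ignores rejection: top choice correct or not.
    forced = sum(1 for intended, top, _ in tokens
                 if intended is not None and top == intended) / n
    non_vocab_rate = sum(1 for intended, _, _ in tokens if intended is None) / n
    return counts, forced, counts["CA"] / n, non_vocab_rate


# Made-up sample: (intended, top_choice, accepted)
tokens = [
    ("boston", "boston", True),
    ("austin", "austin", True),
    ("dallas", "dallas", True),
    ("denver", "denver", True),
    ("boston", "boston", False),
    ("austin", "boston", True),
    ("dallas", "denver", False),
    ("denver", "denver", True),
    (None, "boston", False),
    (None, "austin", True),
]
counts, forced, ca_rate, non_vocab_rate = summarize(tokens)
print(dict(counts), forced, ca_rate, non_vocab_rate)
```

With this sample, the CA rate is below the forced choice accuracy because one correct recognition is rejected, and the forced choice accuracy is in turn below 100% minus the non-vocabulary utterance rate.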
Classification of a token as a CR or FR is sometimes altered by the definition of a non-vocabulary utterance, because of the notion of word spotting. The goal of a true word-spotting system is to pick out the important words, regardless of what the speaker may say before, between, or after them. Technically, if a person says something with a valid core, but an invalid prefix or suffix (where invalid means it is not among the supported prefixes or suffixes), the token is a non-vocabulary utterance. In the past, such a token has been considered correctly accepted if the recognizer gets the core right, but also correctly rejected if the token is rejected. To be consistent, one definition should be used, and the trend is towards considering a token to be a non-vocabulary utterance only if it does not have a core, or the core is outside of the supported vocabulary, since the goal is towards having a true word-spotting system. More precisely, the goal is to improve the automation rate, which is achieved by having a recognizer which gets all the important words correct, and realizes when it has made an error on an important word.
Rejection using decoys is known, for example from U.S. Pat. No. 5,097,509 (Lennig). Some non-vocabulary utterances may occur much more frequently than others. For example, non-vocabulary utterance tokens could be "Hello", "Ah", or nothing but telephone noise (the person said nothing at all, but there was enough noise on the line so that the end-pointer did not detect the lack of speech). The most effective way to reject these tokens is to use decoys. A decoy is simply a model for the non-vocabulary utterance that is added to the recognizer's lexicon. If a decoy is the top choice returned by the recognizer, the token is rejected, regardless of the result of any classification algorithm.
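The decoy-based rejection rule just described can be sketched as follows. The function name `decide`, the optional `classifier` callback, and the distance values are hypothetical; the sketch assumes only that each model in the lexicon (vocabulary entries plus decoys) yields a match distance, with lower meaning closer.

```python
def decide(distances, decoys, classifier=None):
    """Return (top_choice, accepted) for one token.

    distances: maps each model name in the lexicon to a match distance.
    decoys: set of model names that represent non-vocabulary utterances.
    classifier: optional callable(top_choice, distances) -> bool, standing in
    for whatever further rejection algorithm is applied (an assumption here).
    """
    top = min(distances, key=distances.get)
    if top in decoys:
        # A decoy as top choice rejects the token regardless of the classifier.
        return top, False
    accepted = True if classifier is None else bool(classifier(top, distances))
    return top, accepted


decoys = {"hello", "ah", "<noise>"}

# Token that is really line noise: the noise decoy is the closest match.
noise_token = {"boston": 9.0, "austin": 8.5, "hello": 7.0, "ah": 6.0, "<noise>": 1.5}
# Token that is a vocabulary word: a vocabulary model is the closest match.
word_token = {"boston": 1.0, "austin": 6.0, "hello": 7.5, "ah": 8.0, "<noise>": 9.0}

print(decide(noise_token, decoys))
print(decide(word_token, decoys))
```

The classifier callback is consulted only when a vocabulary entry is the top choice, which mirrors the statement that a decoy top choice is rejected irrespective of any classification algorithm.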
However, it is possible that decoys can reduce the effectiveness or speed of the classification, if they produce close matches to utterances that are within the vocabulary. Accordingly decoys need to be carefully selected to suit the application, or the lexicon. This task requires expert input and may be time consuming, thus limiting the breadth of applicability or the ease of installation of new systems.
It is known from U.S. Pat. No. 4,802,231 (Davis) to generate error templates for a pattern matching process such as speech recognition, derived from words input to the recogniser, and erroneously recognised as matching a word in the vocabulary of the recogniser. Composite error templates may be generated by combining error templates.
It is known from U.S. Pat. No. 5,649,057 (Lee et al.) to generate statistical models of extraneous speech and background noise, for use in an HMM (Hidden Markov Model) based speech recognition system. The system involves representing a given speech input as a keyword preceded and followed by sequences of such unwanted sounds. A grammar driven continuous word recognition system is used to determine the best-matching sequence of unwanted sounds and keywords. The model or models of the unwanted noises are refined by an iterative training process, i.e. varying the parameters of the HMM until the difference in likelihoods in consecutive iterations is sufficiently small. The iterative process starts with manual input of the keywords, the most important unwanted words, and noise samples, but may be performed automatically thereafter.