Speech recognition of isolated words is used for voice-activated command and control applications. There are usually two modes of activating the recognition system, an "open microphone mode" and a "button activated" or "push-to-talk" mode. In the open microphone mode, the recognizer continuously searches for a match between the acoustic input and the vocabulary of commands which form part of the recognizer. In the button activated mode, the recognizer searches for a match only after the user pushes a button indicating that a command is expected within the next few seconds.
Many speech recognition applications have selected the button activated mode because speech recognition systems perform better on its task: "Given the utterance, which of my N known words was most likely said?" It is far harder for speech recognition systems to perform the open microphone task: "Does this utterance correspond to one of my N known words?" The difference arises because the environment and the manner of speaking vary relative to the conditions under which the known words were originally trained.
In each case, recognition scores indicating how close the utterance is to each of the known words are determined. In the "open" vocabulary task of the open microphone mode, the recognition scores are compared to an absolute threshold, and the decision is therefore significantly affected by noise and similar variability. In the "closed" vocabulary task of the button activated mode, however, the recognizer attempts to determine which word was said and thus compares the recognition scores to each other, selecting the best relative score. Since noise generally affects all of the scores in the same way, the scores generally rise and fall together and the resultant comparison is largely unaffected by this variability.
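The contrast between the two decision rules can be illustrated with a minimal Python sketch. The function names, the convention that a lower score means a closer match, the numeric values and the assumption that noise adds the same offset to every score are all hypothetical and chosen only for illustration; they are not taken from any particular recognizer.

```python
def closed_vocabulary_decision(scores):
    """Relative decision: return the word with the best (lowest) score.

    Some word is always returned, because the scores are only
    compared to each other.
    """
    return min(scores, key=scores.get)


def open_vocabulary_decision(scores, threshold):
    """Absolute decision: return the best word only if its score is
    at or below the threshold; otherwise reject by returning None."""
    best = min(scores, key=scores.get)
    return best if scores[best] <= threshold else None


# Hypothetical distance scores for the same utterance in a quiet room.
quiet = {"call": 2.0, "stop": 5.0, "redial": 6.0}

# Assumption: noise shifts every score by the same amount.
noisy = {word: score + 3.0 for word, score in quiet.items()}

# The relative comparison picks the same word in both conditions.
closed_quiet = closed_vocabulary_decision(quiet)
closed_noisy = closed_vocabulary_decision(noisy)

# The absolute threshold accepts the quiet utterance but rejects the noisy one.
open_quiet = open_vocabulary_decision(quiet, threshold=4.0)
open_noisy = open_vocabulary_decision(noisy, threshold=4.0)
```

In this sketch the common offset leaves the ranking of the words unchanged, so the closed vocabulary decision is stable, while the same offset pushes the best score past the fixed threshold and causes the open vocabulary decision to reject a valid command.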
Unfortunately, the button activated mode is not fully hands-free since the user has to push a button prior to saying the command.
A known method for improving the acceptance/rejection decision in the open microphone mode is to use background or filler templates which model background or non-relevant speech. The background or filler templates are typically produced from a large database of speech utterances which are not part of the particular vocabulary of the recognizer.
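One plausible way to use such filler templates in the acceptance/rejection decision is sketched below in Python. The sketch assumes that a lower score means a closer match and that an utterance is accepted only when the best vocabulary word scores better than the best filler template; the function name, the `margin` parameter and the numeric values are hypothetical, and the cited literature describes more elaborate scoring schemes.

```python
def accept_with_fillers(vocab_scores, filler_scores, margin=0.0):
    """Accept the best vocabulary word only if it beats the best filler.

    vocab_scores:  dict mapping vocabulary word -> distance score
    filler_scores: dict mapping filler template name -> distance score
    margin:        hypothetical safety margin the word must win by
    Returns the accepted word, or None if the utterance is rejected.
    """
    best_word = min(vocab_scores, key=vocab_scores.get)
    best_filler_score = min(filler_scores.values())
    if vocab_scores[best_word] + margin < best_filler_score:
        return best_word
    return None


# A vocabulary word was said: it matches better than any filler.
command = accept_with_fillers({"call": 2.0, "stop": 5.0},
                              {"filler_a": 4.0, "filler_b": 6.0})

# The same utterance with a common noise offset of 3.0 on every score.
noisy_command = accept_with_fillers({"call": 5.0, "stop": 8.0},
                                    {"filler_a": 7.0, "filler_b": 9.0})

# Non-relevant speech: a filler template matches better than any word.
chatter = accept_with_fillers({"call": 6.0, "stop": 7.0},
                              {"filler_a": 3.0, "filler_b": 5.0})
```

Because the vocabulary scores are compared to the filler scores rather than to a fixed number, the decision in this sketch is again a relative one and tolerates an offset that is common to all scores.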
Such a method is described in the article "Word Spotting From Continuous Speech Utterances" by R. C. Rose, Automatic Speech and Speaker Recognition--Advanced Topics, edited by C. H. Lee, F. K. Soong and K. K. Paliwal, Kluwer Academic Publishers, 1996, pp. 303-329. This method is relevant to Hidden Markov Model (HMM) type, speaker independent recognition systems which are described in "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" by L. R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989, pp. 257-286. Both articles are incorporated herein by reference.
In the open microphone mode, the standard measure for the rejection/acceptance capability of a recognition system is the rate of false alarms per vocabulary word, for a given rate of detection. In other words, for a given rate of true recognition of a vocabulary word, the measure counts how many times the system claimed that a vocabulary word was said when it had not been said. Unfortunately, the more words in the vocabulary, the more false alarms there are and the more of a nuisance the system is to the user. Designers have thus tried to reduce the number of vocabulary words in the open microphone mode.
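As a rough illustration of this measure, the following Python sketch computes a detection rate and a false alarm count per vocabulary word from a list of labeled trials. The trial format, the function name and the example data are hypothetical, and the sketch counts false alarms over the whole test rather than normalizing them per unit of time.

```python
def rejection_metrics(trials, vocabulary):
    """Compute (detection_rate, false_alarms_per_vocabulary_word).

    trials: list of (said, heard) pairs, where `said` is the word actually
            spoken (None for non-vocabulary input) and `heard` is the
            recognizer output (None when the input was rejected).
    """
    in_vocab = [(said, heard) for said, heard in trials if said in vocabulary]
    detections = sum(1 for said, heard in in_vocab if heard == said)
    # A false alarm: no vocabulary word was said, yet one was reported.
    false_alarms = sum(1 for said, heard in trials
                       if said not in vocabulary and heard is not None)
    detection_rate = detections / len(in_vocab) if in_vocab else 0.0
    return detection_rate, false_alarms / len(vocabulary)


vocabulary = {"call", "stop"}
trials = [
    ("call", "call"),   # correct detection
    ("stop", None),     # missed vocabulary word
    (None, "call"),     # false alarm
    (None, None),       # correct rejection
    (None, "stop"),     # false alarm
    ("stop", "stop"),   # correct detection
]
detection_rate, false_alarms_per_word = rejection_metrics(trials, vocabulary)
```

In practice the two quantities are traded against each other by adjusting the acceptance threshold, which is why the false alarm figure is quoted for a given rate of detection.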
One method to do so without limiting the functionality of the recognition system is to separate the recognition operation into two steps. This method is described in section 6.2 of the article by R. C. Rose and involves using one or a few keywords, which are recognized in open microphone mode, as an activation element. Once the uttered keyword has been recognized, the method operates in the closed vocabulary mode, selecting the next utterance as one of the words in the closed vocabulary. In effect, the keywords of this method replace the button of the button activated mode described hereinabove.
The above-described two step method provides hands-free operation, as in the open microphone mode, but the number of false alarms is reduced since the vocabulary in the open microphone mode is reduced. Such a mode of operation is natural for menu-type operations where the user activates one of a few functions with a keyword and only afterwards says one of the commands which are relevant to the function.
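The two step method can be sketched as a small state machine in Python. The class name, the score dictionaries, the threshold value and the convention that a lower score means a closer match are hypothetical; the sketch shows only the control flow and is not the implementation described in the cited article.

```python
class TwoStepRecognizer:
    """Keyword activation (open microphone) followed by a closed vocabulary command."""

    def __init__(self, keywords, commands, keyword_threshold):
        self.keywords = list(keywords)
        self.commands = list(commands)
        self.keyword_threshold = keyword_threshold
        self.awaiting_command = False

    def process(self, scores):
        """Handle one utterance, given a dict of word -> distance score.

        Returns the selected command, or None if no command was selected.
        """
        if not self.awaiting_command:
            # Step 1: absolute threshold on the small keyword vocabulary.
            best = min(self.keywords, key=lambda word: scores[word])
            if scores[best] <= self.keyword_threshold:
                self.awaiting_command = True
            return None
        # Step 2: relative comparison over the closed command vocabulary.
        self.awaiting_command = False
        return min(self.commands, key=lambda word: scores[word])


recognizer = TwoStepRecognizer(["phone"], ["call", "redial"], keyword_threshold=3.0)

# Background speech: the keyword score is too poor, so nothing is activated.
first = recognizer.process({"phone": 6.0, "call": 2.0, "redial": 5.0})
# The keyword is said: the recognizer now expects a command.
second = recognizer.process({"phone": 1.0, "call": 4.0, "redial": 5.0})
# The next utterance is resolved within the closed vocabulary.
third = recognizer.process({"phone": 7.0, "call": 2.0, "redial": 5.0})
```

Only the keyword step applies an absolute threshold, so the vocabulary that is exposed to false alarms is limited to the keywords, while the commands are selected by relative comparison.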