The present invention relates to speech recognition systems generally and to those which are activated by a keyword in particular.
Speech recognition of isolated words is used for voice-activated command and control applications. There are usually two modes of activating the recognition system, an “open microphone mode” and a “button activated” or “push-to-talk” mode. In the open microphone mode, the recognizer continuously searches for a match between the acoustic input and the vocabulary of commands which form part of the recognizer. In the button activated mode, the recognizer searches for a match only after the user pushes a button indicating that a command is expected within the next few seconds.
Many speech recognition applications have selected the button activated mode because speech recognition systems perform better on that mode's task: “Given the utterance, which is the most likely word, out of my N known words, that was said?” It is far harder for speech recognition systems to perform the open microphone task of “Does this utterance correspond to one of my N known words?” The reason for this difference is related to the variability in the environment and in the manner of speaking compared to the originally trained (or “known”) words.
In each case, recognition scores indicating how close the utterance is to each of the known words are determined. The “open” vocabulary of the open microphone mode compares the recognition scores to an absolute threshold and is therefore affected by significant “noises”. The “closed” vocabulary of the button activated mode, however, attempts to determine which word was said and thus compares the recognition scores to each other, selecting the best relative score. Since the noise generally affects all of the scores in the same way, the scores generally rise and fall together and the resultant comparison is not affected by this variability.
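The difference between the two decisions can be illustrated with a minimal sketch. The function names, the example words, the score values and the uniform noise offset are all assumptions made for illustration, and scores are assumed to be similarity-like (higher is better).

```python
def accept_open(scores, threshold):
    """Open-vocabulary decision: accept the best word only if its
    score clears an absolute threshold; otherwise reject (None)."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def pick_closed(scores):
    """Closed-vocabulary decision: return the word with the best
    relative score, with no absolute threshold involved."""
    return max(scores, key=scores.get)


# Hypothetical scores for one utterance in a quiet environment.
clean = {"call": 0.82, "stop": 0.40, "play": 0.35}
# Hypothetical noise that lowers every score by the same amount.
noisy = {word: score - 0.30 for word, score in clean.items()}

print(accept_open(clean, threshold=0.7))  # accepted in quiet conditions
print(accept_open(noisy, threshold=0.7))  # rejected once all scores drop
print(pick_closed(noisy))                 # relative ranking is unchanged
```

Because the assumed noise shifts every score equally, the ranking used by the closed-vocabulary decision survives while the absolute threshold test fails.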
Unfortunately, the button activated mode is not fully hands-free since the user has to push a button prior to saying the command.
A known method for improving the acceptance/rejection decision in the open microphone mode is to use background or filler templates which model background or non-relevant speech. The background or filler templates are typically produced from a large database of speech utterances which are not part of the particular vocabulary of the recognizer.
Such a method is described in the article “Word Spotting From Continuous Speech Utterances” by R. C. Rose, Automatic Speech and Speaker Recognition: Advanced Topics, edited by C. H. Lee, F. K. Soong and K. K. Paliwal, Kluwer Academic Publishers, 1996, pp. 303-329. This method is relevant to Hidden Markov Model (HMM) type, speaker independent recognition systems which are described in “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition” by L. R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-286. Both articles are incorporated herein by reference.
In the open microphone mode, the standard measure for the rejection/acceptance capability of a recognition system is the rate of false alarms per vocabulary word, for a given rate of detection. In other words, for a given rate of true recognition of a vocabulary word, the measure counts how many times the system claimed a vocabulary word was said when it had not been said. Unfortunately, the more words in the vocabulary, the more false alarms there are and the more of a nuisance the system is to the user. Designers have thus tried to reduce the number of vocabulary words in the open microphone mode.
One method to do so without limiting the functionality of the recognition system is to separate the recognition operation into two steps. This method is described in section 6.2 of the article by R. C. Rose and involves using a single or a few keywords, which are recognized in open microphone mode, as an activation element. Once the uttered keyword has been recognized, the method operates in the closed vocabulary mode, selecting the next utterance as one of the words in the closed vocabulary. In effect, the keywords of this method replace the button of the button activation mode described hereinabove.
The above-described two step method provides hands-free operation, as in the open microphone mode, but the number of false alarms is reduced since the vocabulary in the open microphone mode is reduced. Such a mode of operation is natural for menu-type operations where the user activates one of a few functions with a keyword and only afterwards says one of the commands which are relevant to the function.
The present invention utilizes two types of templates, that of a keyword (called herein a “keyword template”) and those of a closed vocabulary (called herein “vocabulary templates”).
It is an object of the present invention to provide a keyword recognition system for speaker dependent, dynamic time warping (DTW) recognition systems. The present invention uses all of the trained templates in the system (keyword and vocabulary) to determine if an utterance is a keyword utterance or not.
Initially, only the keyword template is utilized as a first acceptance criterion. If that criterion is passed, then the utterance is compared to all of the vocabulary templates and their match scores are recorded. The utterance is accepted as a keyword utterance only if its match to the keyword is better than its matches to all of the vocabulary templates. At that point, a listening window is opened and the following utterance is compared to each of the templates of the closed vocabulary. Thus, the present invention utilizes the vocabulary templates as filler templates.
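The two-stage acceptance test described above might be sketched as follows. This is an illustrative sketch only: the function names, the `min_keyword_score` parameter and the toy similarity measure are assumptions, and a real system would score feature frames with a proper pattern matcher rather than short lists of numbers.

```python
def is_keyword_utterance(utterance, keyword_template, vocabulary_templates,
                         match, min_keyword_score):
    """Return True if the utterance is accepted as the keyword.

    `match(utterance, template)` is assumed to return a score where
    higher means a better match.
    """
    # First criterion: the keyword template alone.
    keyword_score = match(utterance, keyword_template)
    if keyword_score < min_keyword_score:
        return False
    # Second criterion: the vocabulary templates act as fillers, so the
    # keyword must beat every one of them.
    return all(keyword_score > match(utterance, template)
               for template in vocabulary_templates)


def toy_match(a, b):
    """Hypothetical similarity: negated mean absolute difference of two
    equal-length sequences, so that higher is better."""
    return -sum(abs(x - y) for x, y in zip(a, b)) / len(a)


keyword = [1.0, 2.0, 3.0]
vocabulary = [[3.0, 2.0, 1.0], [1.0, 2.0, 2.0]]

# Close to the keyword and far from both vocabulary templates.
print(is_keyword_utterance([1.1, 2.0, 2.9], keyword, vocabulary,
                           toy_match, min_keyword_score=-0.5))
# Passes the first criterion but is closer to a vocabulary template.
print(is_keyword_utterance([1.0, 2.0, 2.4], keyword, vocabulary,
                           toy_match, min_keyword_score=-0.5))
```

The vocabulary templates are only scored when the first criterion passes, which mirrors the ordering given in the text.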
There is therefore provided, in accordance with a preferred embodiment of the present invention, a system and method for recognizing an utterance as a keyword. The system activates a speaker dependent recognition system on a plurality of vocabulary words and includes a pattern matcher and a criterion determiner. The pattern matcher initially matches the utterance to a keyword template and produces a corresponding keyword score indicating the quality of the match between the utterance and the keyword template. The pattern matcher also matches the utterance to a plurality of vocabulary templates, the result being a corresponding plurality of vocabulary scores each indicating the quality of the match between the utterance and one of the vocabulary templates. The criterion determiner selects the utterance as the keyword if the keyword score indicates a significant match to the keyword template and if the keyword score indicates a better match than do all of the vocabulary scores. Once the utterance is accepted as the keyword, the criterion determiner activates the speaker dependent recognition system to match at least a second utterance to the words of the closed vocabulary.
Moreover, in accordance with a preferred embodiment of the present invention, the pattern matcher performs dynamic time warping between the utterance and the relevant one of the templates.
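For readers unfamiliar with dynamic time warping, the following is a minimal textbook sketch of the technique, not the specific matcher of the invention. It assumes each frame is a single number and uses the absolute difference as the local cost; practical recognizers compare feature vectors and often add path constraints. Note that DTW yields a distance (lower is better), so a match score in the sense used above could, for example, be taken as the negated distance.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i frames of `a` with the first j frames of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local frame cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of `b` repeated
                cost[i][j - 1],      # frame of `a` repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# The same pattern spoken more slowly (one frame stretched) still
# aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

The warping step is what lets a template match an utterance spoken at a different speed, which is why it suits speaker dependent, template-based systems.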
Additionally, in accordance with a preferred embodiment of the present invention, the criterion determiner opens a listening window once the utterance is accepted as the keyword thereby to recognize the words of the closed vocabulary. The pattern matcher then matches at least the second utterance to the vocabulary templates thereby to determine which word of the closed vocabulary was spoken in the second utterance.
Further, in accordance with a preferred embodiment of the present invention, the system also includes a preprocessing operation which selects suitable vocabulary templates for use in the keyword recognition. The suitable vocabulary templates are those which are different, by a predetermined criterion, from the keyword template.
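One way such a preprocessing step could look is sketched below. The text does not specify the predetermined criterion, so the minimum-distance rule, the `min_distance` parameter, the toy distance function and the example words are all assumptions for illustration.

```python
def select_filler_templates(keyword_template, vocabulary_templates,
                            distance, min_distance):
    """Keep only the vocabulary templates that differ enough from the
    keyword template to serve as fillers during keyword recognition.

    The criterion used here (distance >= min_distance) is an assumed
    stand-in for the predetermined criterion.
    """
    return {
        word: template
        for word, template in vocabulary_templates.items()
        if distance(keyword_template, template) >= min_distance
    }


def toy_distance(a, b):
    """Hypothetical distance: mean absolute difference of two
    equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


keyword = [1.0, 2.0, 3.0]
vocabulary = {
    "dial": [1.0, 2.0, 3.3],  # nearly identical to the keyword template
    "stop": [5.0, 5.0, 5.0],  # clearly different
}

fillers = select_filler_templates(keyword, vocabulary,
                                  toy_distance, min_distance=0.5)
print(sorted(fillers))
```

Excluding a vocabulary template that closely resembles the keyword avoids a situation where a genuine keyword utterance is rejected because it scores slightly better against the similar vocabulary word.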
Still further, in accordance with a further preferred embodiment of the present invention, there can be more than one keyword template where each is associated with its own vocabulary. The present invention determines which keyword is spoken and accepts the utterance only if the keyword score is large enough and better than the score of the utterance to at least a portion of all of the vocabulary words. The present invention then activates the recognition system on the vocabulary associated with the detected keyword.
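The multi-keyword case might be sketched as follows. All names, words and score values are hypothetical, scores are assumed to be similarity-like (higher is better), and the caller is assumed to pass in scores only for whatever portion of the vocabulary words is used for the comparison.

```python
def detect_keyword(keyword_scores, vocabulary_scores, min_keyword_score):
    """Return the detected keyword, or None if the utterance is rejected.

    keyword_scores: score of the utterance against each keyword template.
    vocabulary_scores: scores against the vocabulary templates chosen
    for comparison (an assumption; possibly only a portion of them).
    """
    best = max(keyword_scores, key=keyword_scores.get)
    best_score = keyword_scores[best]
    if best_score < min_keyword_score:
        return None  # not a significant match to any keyword
    if any(score >= best_score for score in vocabulary_scores.values()):
        return None  # a vocabulary word matches at least as well
    return best


# Each keyword is associated with its own (hypothetical) vocabulary.
vocabularies = {"phone": ["dial", "redial"], "radio": ["tune", "mute"]}

keyword_scores = {"phone": 0.9, "radio": 0.4}
vocabulary_scores = {"dial": 0.5, "tune": 0.3}

detected = detect_keyword(keyword_scores, vocabulary_scores,
                          min_keyword_score=0.7)
active_vocabulary = vocabularies[detected] if detected else []
print(detected, active_vocabulary)
```

After detection, only the vocabulary tied to the detected keyword is activated, so the closed-vocabulary search that follows stays small.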