This invention relates to speech recognition and more particularly to enrollment of voice commands which can be recognized to trigger actions.
There is a growing demand for voice command recognition. It has been used for voice name dialing for telephones and for user-specific commands such as car controls, computer operations and almost anything that would otherwise use the hands to trigger an action. It is even being considered for browsing the Internet. The accuracy of the recognition is what matters, and that accuracy depends on the models generated during enrollment. The recognition of voice commands requires the construction of HMMs at enrollment, during which utterances of the command are recorded and used to build the HMM of the command. Depending on the model level, two types of HMMs can be used. The first and most common type is the word-based model, which models the whole command (which may be several words) as a single unit. The second type is the phone-based model, which uses a concatenation of phone-like sub-word units to model a command. The sub-word units can be represented using speaker-independent HMMs, as described by N. Jain, R. Cole and E. Barnard in an article entitled "Creating Speaker-Specific Phonetic Templates with Speaker-Independent Phonetic Recognizer," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 881-884, Atlanta, May 1996, or using speaker-specific HMMs. While word-based HMMs are easier to train, phone-based HMMs have many advantages, including various degrees of distribution tying and rejection based on phone durations.
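The concatenation of phone-like sub-word units into a single command model can be illustrated with a minimal sketch. It is not the applicants' implementation: the class PhoneHMM, the function concatenate, the strictly left-to-right topology and the numeric self-loop probabilities are all assumptions made for illustration, and emission distributions are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhoneHMM:
    """Hypothetical phone-like unit: a left-to-right chain of emitting states."""
    name: str
    self_loops: List[float]  # self-loop probability of each state, in order


def concatenate(phones: List[PhoneHMM]) -> Tuple[List[str], List[List[float]]]:
    """Chain phone models into one left-to-right command model.

    Returns the state labels and a transition matrix with one extra
    column for the non-emitting exit state of the whole command.
    """
    labels: List[str] = []
    loops: List[float] = []
    for phone in phones:
        for i, a in enumerate(phone.self_loops):
            labels.append(f"{phone.name}{i}")
            loops.append(a)

    n = len(loops)
    trans = [[0.0] * (n + 1) for _ in range(n)]
    for s, a in enumerate(loops):
        trans[s][s] = a            # stay in the same state
        trans[s][s + 1] = 1.0 - a  # advance to the next state (or exit)
    return labels, trans


# Illustrative command made of two assumed phone-like units.
command = [PhoneHMM("k", [0.6, 0.5]), PhoneHMM("ao", [0.8])]
state_labels, transitions = concatenate(command)
```

In this sketch the last state of one unit simply advances into the first state of the next, so the command model is the chain of its units. Because the unit boundaries stay visible in the state labels, per-phone durations can be read off an alignment, which is what makes duration-based rejection possible for phone-based models.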
In accordance with one embodiment of the present invention, applicants teach the construction of phone-based HMMs for speaker-specific command enrollment comprising the steps of: providing a set (H) of speaker-independent phone-based HMMs; providing a grammar (G) comprising a loop of phones with optional between-phone silence (BWS); providing two utterances (U1 and U2) of the command produced by the enrollment speaker, wherein the first frames of the first utterance contain only background noise; generating a sequence of phone-like unit HMMs; and generating the number of HMMs in that sequence.
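A phone-loop grammar of the kind denoted (G) can be sketched as a simple acceptor. This is only one reading of "a loop of phones with optional silence" and is not taken from the specification: the function name accepted_by_phone_loop, the symbol "BWS" as a token, the rule that at most one silence may appear between two phones, and the small phone inventory are all assumptions.

```python
from typing import Iterable, Set

BWS = "BWS"  # assumed token for the optional silence between phones


def accepted_by_phone_loop(sequence: Iterable[str], phones: Set[str]) -> bool:
    """Return True if the sequence matches: phone (BWS? phone)*.

    Any phone from the inventory may follow any other (the loop), and a
    single optional silence is allowed only between two phones.
    """
    symbols = list(sequence)
    if not symbols:
        return False

    prev_was_phone = False
    for i, sym in enumerate(symbols):
        if sym in phones:
            prev_was_phone = True
        elif sym == BWS:
            # Silence must follow a phone and must not end the sequence.
            if not prev_was_phone or i == len(symbols) - 1:
                return False
            prev_was_phone = False
        else:
            return False  # symbol outside the phone inventory
    return True


# Hypothetical phone inventory, for illustration only.
inventory = {"k", "ao", "l", "hh", "ow", "m"}
ok = accepted_by_phone_loop(["k", "ao", "l", BWS, "hh", "ow", "m"], inventory)
```

Under these assumptions the grammar places no lexical constraint on the command: any phone sequence is allowed, and the number of phone-like units in the decoded sequence is simply the count of non-silence symbols.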