1. Field
The present invention relates generally to the field of communications, and more specifically to voice recognition.
2. Background
Voice recognition (VR) (also commonly referred to as speech recognition) represents one of the most important techniques to endow a machine with simulated intelligence to recognize user or user-voiced commands and to facilitate human interface with the machine. Voice recognition devices are classified as either speaker-dependent (SD) or speaker-independent (SI) devices. Speaker-dependent devices, which are more common, are trained to recognize commands from particular users. In contrast, speaker-independent devices are capable of accepting voice commands from any user. To increase the performance of a given VR system, whether speaker-dependent or speaker-independent, a procedure called training is required to equip the system with valid parameters. In other words, the system needs to learn before it can function optimally.
A speaker-dependent VR system is called a speaker-dependent voice recognition engine (SDVR engine) and a speaker-independent VR system is called a speaker-independent voice recognition engine (SIVR engine). An SDVR engine is more useful than an SIVR engine for recognizing nametags such as names of people or organizations because SDVR engines are trained by a user to recognize nametags. A nametag is an identifier that identifies user-defined information. An SIVR engine is more useful than an SDVR engine for recognizing control words such as digits and keywords because SIVR engines do not have to be trained by a user to recognize control words. Thus, it is desirable to combine an SDVR engine with an SIVR engine to recognize both nametags and control words.
Both speaker-independent (SI) Hidden Markov Model (HMM) VR engines and speaker-independent Dynamic Time Warping (DTW) VR engines are useful for recognizing control words, but they may yield different results because they analyze an input speech signal differently. Combining these VR engines may use a greater amount of information in the input speech signal than either VR engine would alone. Consequently, a VR system that combines an SI-HMM with an SI-DTW may provide enhanced accuracy.
An SD-DTW VR engine is speaker adaptive. It adapts to a speaker's speech. An SD-DTW VR (adaptation) engine can be trained by a user to recognize user-defined and trained control words. Combining an SD-DTW VR (adaptation) engine with an SI-DTW VR engine enables a VR system to recognize user-defined control words as well as digits and predefined control words.
A system and method for combining an SDVR engine with an SIVR engine and combining SI-HMM VR engines, SI-DTW VR engines, and SD-DTW engines is described in U.S. patent application Ser. No. 09/618,177 (hereinafter '177 application) entitled “Combined Engine System and Method for Voice Recognition”, filed Jul. 18, 2000, and U.S. patent application Ser. No. 09/657,760 (hereinafter '760 application) entitled “System and Method for Automatic Voice Recognition Using Mapping,” filed Sep. 8, 2000, which are assigned to the assignee of the present invention and fully incorporated herein by reference.
It would be desirable to combine an SI-DTW VR engine with an SD-DTW VR (adaptation) engine to create a combined SI-DTW/SD-DTW(adaptation) VR engine to recognize predefined control words and user-defined control words. It would be desirable to combine an SI-HMM VR engine with the combined SI-DTW/SD-DTW(adaptation) VR engine to generate a combined SI-HMM/SI-DTW/SD-DTW(adaptation) VR engine. The combined SI-HMM/SI-DTW/SD-DTW(adaptation) VR engine would use a greater amount of information in the input speech signal than would having either the HMM VR engine or the combined SI-DTW/SD-DTW(adaptation) VR engine operate alone. It would be desirable to combine an SD-DTW VR (nametag) engine that recognizes nametags with the combined SI-HMM/SI-DTW/SD-DTW(adaptation) VR engine to generate a combined SD-DTW(nametag)/SI-HMM/SI-DTW/SD-DTW(adaptation) VR engine. The combined SD-DTW(nametag)/SI-HMM/SI-DTW/SD-DTW(adaptation) VR engine would recognize user-defined control words, predefined control words, and nametags.
Embodiments disclosed herein address the above stated needs by combining a plurality of VR engines to recognize predefined digits and control words, user-defined digits and control words, and nametags. In one aspect, a voice recognition system comprises an acoustic processor configured to extract speech parameters from a speech segment, a plurality of different voice recognition engines coupled to the acoustic processor, each voice recognition engine configured to produce a hypothesis and a corresponding score, wherein the score represents a distance from the speech segment to the hypothesis, and decision logic configured to receive the hypotheses from the plurality of different voice recognition engines and to select a hypothesis by computing a best score for each of the plurality of voice recognition engines and weighting the best scores of the plurality of voice recognition engines. In another aspect, the decision logic of the voice recognition system is configured to multiply each best score by a coefficient associated with the hypothesis corresponding to the best score to generate a plurality of weighted best scores, and to combine the plurality of weighted best scores to yield a combined score.
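By way of illustration only, the weighting and combining of best scores described above might be sketched in Python as follows. The function names, the data layout, the default coefficient of 1.0, the use of a sum to combine the weighted best scores, and the rule of selecting the hypothesis having the smallest weighted best score are all assumptions made for this sketch and are not taken from the disclosure. Because each score represents a distance, a smaller score is treated as a better score.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each engine reports (hypothesis, distance) candidates.
# A score is a distance from the speech segment, so lower is better.
EngineOutput = List[Tuple[str, float]]


def best_candidate(candidates: EngineOutput) -> Tuple[str, float]:
    """Return the (hypothesis, score) pair having the smallest distance."""
    return min(candidates, key=lambda pair: pair[1])


def combine_engines(
    outputs: Dict[str, EngineOutput],
    coefficients: Dict[Tuple[str, str], float],
) -> Tuple[str, float]:
    """Weight each engine's best score and combine the weighted best scores.

    `coefficients` maps (engine name, hypothesis) to a weight. A missing
    entry defaults to 1.0, which is an assumption of this sketch.
    Returns (selected hypothesis, combined score).
    """
    weighted: Dict[str, Tuple[str, float]] = {}
    for engine, candidates in outputs.items():
        hypothesis, score = best_candidate(candidates)
        coefficient = coefficients.get((engine, hypothesis), 1.0)
        weighted[engine] = (hypothesis, coefficient * score)

    # Assumption: the weighted best scores are combined by summation.
    combined = sum(score for _, score in weighted.values())

    # Assumption: the hypothesis with the smallest weighted best score wins.
    selected = min(weighted.values(), key=lambda pair: pair[1])[0]
    return selected, combined


# Illustrative values only; they are not measured results.
outputs = {
    "SI-HMM": [("call", 2.0), ("dial", 5.0)],
    "SI-DTW": [("call", 4.0), ("dial", 3.0)],
}
coefficients = {("SI-HMM", "call"): 0.5, ("SI-DTW", "dial"): 1.0}

selected, combined = combine_engines(outputs, coefficients)
print(selected, combined)
```

In this sketch the coefficient is keyed on both the engine and the hypothesis, so that a given engine may be weighted differently depending on which word it reports as its best match. Other combining and selection rules are equally compatible with the description above.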