This invention relates to speech recognition and verification and more particularly to speech models for automatic speech recognition and speaker verification.
Texas Instruments Incorporated is presently fielding telecommunications systems for Spoken Speed Dialing (SSD) and Speaker Verification in which a user may place calls or be verified by using voice inputs only. These types of tasks require the speech processing system to elicit phrases from the user, and create models of the unique phrases provided during a procedure termed enrollment. The enrollment task requires the user to say each phrase several times. The system must create speech models from this limited speech data. The accuracy with which the system creates the speech models ultimately determines the level of performance of the application. Hence, procedures which improve speech models will provide performance improvement.
There are two distinct problems associated with creating such speech models in realistic environments. The first problem is locating speech within utterances of the phrases. Typically, Texas Instruments Incorporated and others have examined the energy profile and other features of the speech signal to locate speech segments. In a noisy environment this is a difficult task, and speech may be missed. Often the energy-based location algorithms miss speech segments because the algorithms are tuned to ensure that noise is not mistaken for speech.
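By way of illustration only, the energy-based approach described above can be sketched as follows. This is a minimal sketch and not the algorithm of any fielded system: the function names `frame_energies_db` and `locate_speech`, the frame length, the decibel thresholds, and the synthetic test signal are all assumptions chosen for the example.

```python
import math

def frame_energies_db(samples, frame_len):
    """Return the log energy (dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies

def locate_speech(samples, frame_len=80, threshold_db=-20.0, min_frames=3):
    """Return (start_frame, end_frame) pairs, end exclusive, for runs of
    at least min_frames frames whose energy exceeds threshold_db."""
    energies = frame_energies_db(samples, frame_len)
    segments = []
    run_start = None
    for i, energy in enumerate(energies):
        if energy > threshold_db:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                segments.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the signal.
    if run_start is not None and len(energies) - run_start >= min_frames:
        segments.append((run_start, len(energies)))
    return segments

def tone(amplitude, n_frames, frame_len=80):
    """Hypothetical test signal: alternating +/- amplitude samples."""
    return [amplitude if i % 2 == 0 else -amplitude
            for i in range(n_frames * frame_len)]

# Synthetic utterance: noise, loud speech, noise, quiet speech, noise.
signal = (tone(0.01, 5) + tone(0.5, 5) + tone(0.01, 5)
          + tone(0.05, 5) + tone(0.01, 5))

conservative = locate_speech(signal, threshold_db=-20.0)
permissive = locate_speech(signal, threshold_db=-30.0)
too_low = locate_speech(signal, threshold_db=-45.0)
```

In this sketch the conservative threshold finds only the loud segment and misses the quiet one, the lower threshold finds both, and a threshold below the noise floor labels the entire signal as speech. This is the trade-off that leads a detector tuned against noise to miss speech segments.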
The second problem is variability in the way a user says a name during enrollment. If the name contains multiple words, such as “John Doe”, the user may or may not pause between the words. If the user says the words without a pause, a practical locating and model-building algorithm cannot determine that multiple words were spoken. The algorithm will proceed to create a model for a single word with no pause. Then, when the system attempts to recognize the name spoken with an intermediate pause, the system will often fail. A less severe mismatch takes place when the opposite occurs. If the user pauses between words during enrollment, then the enrollment algorithm can spot the pause. However, if the user does not insert the pause during recognition, the words are often spoken with shorter duration and coarticulation acoustic effects are present between the two words.
The present invention describes methods and apparatus developed to mitigate both of these problems.
In accordance with one preferred embodiment of the present invention, a unique garbage model restricted to meet the phonotactic constraints of a language or group of languages is provided for locating speech in the presence of other sounds including spurious inhalation, exhalation, noise sounds, and background silence. In accordance with another embodiment of the present invention, a unique method of constructing models of the located speech segments in an utterance is provided. In accordance with another embodiment of the present invention, a speech recognition system is provided to locate speech in an utterance using the unique garbage model. In accordance with a still further embodiment of the present invention, a speech enrollment method is provided using a speech recognition system that utilizes the unique garbage model.
These and other features of the invention will be apparent to those skilled in the art from the following detailed description of the invention, taken together with the accompanying drawings.