Due to recent advances in computer technology and improved speech recognition algorithms, speech recognition machines have begun to appear in the past several decades, and have become increasingly powerful and less expensive.
Most speech recognition systems are frame-based systems, that is, they represent speech as a sequence of frames, each of which represents speech sounds at one of a succession of brief time periods. One such frame-based system is that described in U.S. patent application Ser. No. 797,249, entitled "Speech Recognition Apparatus and Method", which is assigned to the assignee of the present application, and which is incorporated herein by reference. This system represents speech to be recognized as a sequence of spectral frames, in which each frame contains a plurality of spectral parameters, each of which represents the energy at one of a series of different frequency bands. Usually such systems compare the sequence of frames to be recognized against a plurality of acoustic models, each of which describes, or models, the frames associated with a given speech unit, such as a phoneme or a word.
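The frame representation described above can be sketched as follows. This is an illustrative example only, not the method of the referenced application: the frame length, sampling rate, and number of frequency bands are assumptions chosen for the example.

```python
# Illustrative sketch: representing speech as a sequence of spectral
# frames, each containing the energy in several frequency bands,
# computed here with a short-time Fourier transform.
import numpy as np

def spectral_frames(samples, rate=16000, frame_len=320, n_bands=8):
    """Split `samples` into 20 ms frames and return per-band energies."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len))) ** 2
        # Collapse the spectrum into n_bands equal-width frequency bands,
        # one spectral parameter per band.
        bands = np.array_split(spectrum, n_bands)
        frames.append([band.sum() for band in bands])
    return np.array(frames)  # shape: (number of frames, n_bands)

# A 100 ms test tone yields 5 frames of 8 spectral parameters each.
tone = np.sin(2 * np.pi * 1000 * np.arange(1600) / 16000)
print(spectral_frames(tone).shape)  # (5, 8)
```

Each row of the result is one frame: a snapshot of how the speech energy is distributed across the frequency bands during one brief time period.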
One problem with spectral-frame-based systems is that each frame gives a "snapshot" of spectral energy at one point in time. Thus the individual frames contain no information about whether the energy at various parts of the audio spectrum is rising or falling in amplitude or frequency.
The human vocal tract is capable of producing multiple resonances at one time. The frequencies of these resonances change as a speaker moves his tongue, lips, and other parts of his vocal tract to make different speech sounds. Each of these resonances is referred to as a formant, and speech scientists have found that many individual speech sounds, or phonemes, can be distinguished by the frequencies of the first three formants.
Often, however, changes in frequency are important for distinguishing speech sounds. For example, it is possible for two different frames to have similar spectral parameters and yet be associated with very different sounds, because one occurs in the context of a rising formant while the other occurs in the context of a falling formant. Thus it is important for a speech recognition system to recognize the changes in frequencies as well as the frequencies themselves.
One method by which the prior art has dealt with the changes in frequencies and other acoustic parameters in frame-based speech recognition systems is by comparing a sequence of frames to be recognized against models of speech units which are formed of a sequence of frame models. Such speech-unit models represent changes in acoustic parameters that take place over the course of the speech-unit they model by using a sequence of frame models with differing parameters. Commonly such systems use dynamic programming algorithms, such as the one described in the above mentioned application Ser. No. 797,249, to find the optimal match between the sequence of frames to be recognized and the speech-unit model's sequence of frame models.
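A minimal sketch of such dynamic-programming matching is given below, using a simple dynamic time warping recurrence. The referenced application's algorithm may differ in detail; the distance measure and transition rules here are assumptions for illustration.

```python
# Hedged sketch: dynamic-programming alignment of a sequence of frames
# against a speech-unit model's sequence of frame models. Lower score
# means a better match.
import numpy as np

def dtw_score(frames, model_frames):
    """Return the minimum cumulative distance aligning frames to the model."""
    n, m = len(frames), len(model_frames)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(frames[i - 1], dtype=float)
                               - np.asarray(model_frames[j - 1], dtype=float))
            # A frame may dwell on a model state, advance one state, or
            # skip a state, so a model with few frame models can still
            # stretch or compress in time to fit the input.
            cost[i, j] = d + min(cost[i - 1, j],      # dwell on state j
                                 cost[i - 1, j - 1],  # advance one state
                                 cost[i, j - 1])      # skip a state
    return cost[n, m]
```

In use, the frame sequence to be recognized is scored against each speech-unit model in turn, and the model with the lowest cumulative distance is chosen as the best match.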
Although the use of such sequential speech-unit models can represent and recognize changes in frequency, it is often not as accurate as desired. This is particularly true if the number of frame models in each speech-unit model is limited to reduce computation. Furthermore, such a sequential speech-unit model is not applicable when one is attempting to place a phonemic label on an individual frame by comparing it with individual frame models, each of which represents a given phoneme.
The prior art has also attempted to explicitly detect frequency changes by means of formant tracking. Formant tracking involves analyzing the spectrum of speech energy at successive points in time and determining at each such time the location of the major resonances, or formants, of the speech signal. Once the formants have been identified at successive points in time, their resulting pattern over time can be supplied to a pattern recognizer which associates certain formant patterns with certain phonemes.
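The formant-locating step can be sketched as simple spectral peak picking. This is a deliberately simplified assumption for illustration: practical formant trackers of the kind discussed here typically use linear-prediction analysis and continuity constraints across successive times, not raw peak picking.

```python
# Simplified illustration: treat the strongest local peaks of a frame's
# spectrum as formant candidates at one point in time. Two formants
# that merge into one spectral peak will be miscounted by this method.
import numpy as np

def formant_candidates(frame, rate=16000, n_formants=3):
    """Return the frequencies (Hz) of the strongest spectral peaks."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    # A candidate peak is a bin larger than both of its neighbors.
    peaks = [i for i in range(1, len(spectrum) - 1)
             if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]]
    # Keep the n_formants strongest peaks, reported low to high frequency.
    strongest = sorted(peaks, key=lambda i: spectrum[i], reverse=True)[:n_formants]
    return sorted(freqs[i] for i in strongest)
```

Repeating this at successive points in time yields the formant-frequency pattern over time that is supplied to the pattern recognizer.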
Although formant tracking has certain benefits, it also has certain problems. For one thing, it requires a great deal of computation. More importantly, even with a large amount of computation, present-day formant trackers often make errors, such as erroneously determining that a given frequency range contains one formant when it actually contains two, or determining that it contains two when it actually contains one. Such mistakes tend to cause speech recognition errors.