1. Technical Field
The present invention relates in general to the field of speech utterance analysis and in particular to the field of recognition of unknown speech utterances. Still more particularly, the present invention relates to a method and apparatus for speech analysis and recognition which utilizes the power content of a speech utterance over time.
2. Description of the Related Art
Speech analysis and speech recognition algorithms, machines and devices are becoming more and more common in the prior art. Such systems have become increasingly powerful and less expensive. Speech recognition systems are typically "trained" or "untrained." A "trained" speech recognition system is a system which may be utilized to recognize a speech utterance by an individual speaker after having been "trained" by that speaker utilizing a repetitive pronunciation of the vocabulary in question. An "untrained" speech recognition system is a system which attempts to recognize an unknown speech utterance by an unknown speaker by comparing various acoustic parameters of that utterance to a finite number of previously stored templates which are utilized to represent various known utterances.
Most speech recognition systems in the prior art are frame-based systems; that is, these systems represent speech as a sequence of temporal frames, each of which represents the acoustic parameters of a speech utterance at one of a succession of brief time periods. Such systems typically represent the speech utterance to be recognized as a sequence of spectral frames, in which each frame contains a plurality of spectral parameters, each of which represents the energy at one of a series of different frequency bands. Typically such systems compare the sequence of frames to be recognized against a plurality of acoustic models, each of which describes, or models, the frames associated with a given speech utterance, such as a phoneme, word or phrase.
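By way of illustration only, the frame-based representation described above may be sketched as follows. This is a minimal sketch and not the implementation of any particular prior art system; the function names, the frame length, the hop size, the number of bands and the use of equal-width bands are all assumptions chosen for the example.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len samples, advancing by hop samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def band_energies(frame, num_bands):
    """Sum the power spectrum into num_bands contiguous, equal-width bands.

    Bins left over after the last full band (e.g. the Nyquist bin) are dropped.
    """
    power = power_spectrum(frame)
    width = len(power) // num_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]

def spectral_frames(samples, frame_len=64, hop=32, num_bands=4):
    """Represent a signal as a sequence of spectral frames (one list per frame)."""
    return [band_energies(f, num_bands)
            for f in frame_signal(samples, frame_len, hop)]

# Hypothetical input: a 2500 Hz tone sampled at 8000 Hz (assumed values).
samples = [math.sin(2 * math.pi * 2500 * n / 8000) for n in range(256)]
frames = spectral_frames(samples)
```

Each entry of `frames` is one spectral frame, and each value within it is the energy in one frequency band during that frame's time period. A practical system would typically use a fast Fourier transform and perceptually spaced bands rather than the naive transform and equal-width bands assumed here.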
The human vocal tract is capable of producing multiple resonances simultaneously. The frequencies of these resonances change as a speaker moves his tongue, lips or other parts of his vocal tract to make different speech sounds. Each of these resonances is referred to as a formant, and speech scientists have found that many individual speech sounds, or phonemes, may be distinguished by the frequency of the first three formants. Many speech recognition systems have attempted to recognize an unknown utterance by an analysis of these formant frequencies; however, the complexity of the speech utterance makes such systems difficult to implement.
Many researchers in the speech recognition area believe that changes in frequency are important to enable a system to distinguish between similar speech sounds. For example, it is possible for two different frames to have similar spectral parameters and yet be associated with very different sounds, because one sound occurs in the context of a rising formant while the other occurs in the context of a falling formant. U.S. Pat. No. 4,805,218 discloses a system which attempts to implement a speech recognition system by making use of information about changes in the acoustic parameters of the speech energy.
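The point that identical frames may be distinguished by their context can be illustrated with a simple frame-to-frame difference. The sketch below is an assumption made for illustration and is not the method disclosed in U.S. Pat. No. 4,805,218; the function name and the toy three-band frames are hypothetical.

```python
def context_deltas(frames):
    """For each interior frame, return next_frame - previous_frame, band by band.

    A positive value in a band means energy in that band is increasing
    across the frame's neighborhood; a negative value means it is decreasing.
    """
    return [[nxt - prev for prev, nxt in zip(frames[t - 1], frames[t + 1])]
            for t in range(1, len(frames) - 1)]

# Toy three-band frames (assumed data): energy moves up or down the bands.
rising = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
falling = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

rising_delta = context_deltas(rising)
falling_delta = context_deltas(falling)
```

The middle frame of `rising` and of `falling` is the same, yet the two difference vectors have opposite signs, so a recognizer that sees both the frame and its difference can tell the two contexts apart.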
Other systems in the prior art have attempted to explicitly detect frequency changes by means of formant tracking. Formant tracking involves analyzing the spectrum of speech energy at successive points in time and determining at each such time the location of the major resonances, or formants, of the speech signal. Once the formants have been identified at successive points in time, their resulting pattern over time may be supplied to a pattern recognizer which is utilized to associate certain formant patterns with selected phonemes.
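The formant tracking procedure described above may be sketched, in greatly simplified form, as peak picking on the spectrum of each successive frame. This is a hedged illustration only: practical formant trackers commonly rely on techniques such as linear predictive analysis and on continuity constraints between frames, neither of which is shown, and the function names, the three-peak limit and the example spectra are assumptions.

```python
def pick_peaks(spectrum, max_peaks=3):
    """Return the bin indices of up to max_peaks strongest local maxima,
    sorted in ascending frequency order."""
    candidates = [i for i in range(1, len(spectrum) - 1)
                  if spectrum[i] > spectrum[i - 1]
                  and spectrum[i] >= spectrum[i + 1]]
    strongest = sorted(candidates, key=lambda i: spectrum[i], reverse=True)
    return sorted(strongest[:max_peaks])

def track_peaks(spectra, bin_hz, max_peaks=3):
    """Locate spectral peaks at each successive frame and convert them to Hz.

    spectra is a list of per-frame magnitude (or power) spectra;
    bin_hz is the frequency spacing between adjacent bins.
    """
    return [[i * bin_hz for i in pick_peaks(s, max_peaks)] for s in spectra]

# Assumed toy spectra for two successive frames, with 100 Hz between bins.
spectra = [
    [0, 1, 5, 1, 0, 3, 0, 0, 2, 0],
    [0, 0, 1, 5, 1, 0, 3, 0, 2, 0],
]
tracks = track_peaks(spectra, bin_hz=100)
```

The resulting per-frame lists of peak frequencies form the pattern over time which, in the systems described above, would be supplied to a pattern recognizer.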
The goal of all such speech recognition systems is to create a system which can provide a high degree of accuracy in detecting and understanding unknown speech utterances by a broad spectrum of speakers. Thus, it should be obvious that a need exists for a speech recognition system which may be utilized to analyze and recognize unknown speech utterances with a high degree of accuracy.