This invention relates to a speech feature extraction system for use in speech recognition, voice identification, and voice authentication systems. More specifically, this invention relates to a speech feature extraction system that can be used to create a speech recognition system or other speech processing system with a reduced error rate.
Generally, a speech recognition system is an apparatus that attempts to identify spoken words by analyzing the speaker's voice signal. Speech is converted into an electronic form from which features are extracted. The system then attempts to match a sequence of features to a previously stored sequence of models associated with known speech units. When a sequence of features matches a sequence of models in accordance with specified rules, the corresponding words are deemed to be recognized by the speech recognition system.
However, background sounds such as radios, car noise, or other nearby speakers can make it difficult to extract useful features from the speech. In addition, ambient conditions, such as the use of a different microphone or telephone handset, a different telephone line, or the speaker's distance from the microphone, can interfere with system performance. Differences between speakers, changes in speaker intonation or emphasis, and even the speaker's health can also adversely impact system performance. For a further description of some of these problems, see Richard A. Quinnell, "Speech Recognition: No Longer a Dream, But Still a Challenge," EDN Magazine, Jan. 19, 1995, pp. 41-46.
In most speech recognition systems, the speech features are extracted by cepstral analysis, which generally involves measuring the energy in specific frequency bands. The product of that analysis reflects the amplitude of the signal in those bands. Analysis of these amplitude changes over successive time periods can be modeled as an amplitude modulated signal.
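The conventional cepstral analysis described above can be sketched as follows. This is a minimal illustration, not the method of the present invention: the frame length, number of bands, equal-width band splitting, and coefficient count are all illustrative assumptions (a mel-spaced filter bank would be more typical in practice).

```python
import numpy as np

def cepstral_features(frame, n_bands=20, n_ceps=12):
    """Cepstral coefficients for one windowed speech frame (illustrative)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum
    # Measure the energy in specific frequency bands (here: equal-width
    # bands for simplicity; a mel-spaced filter bank is conventional).
    bands = np.array_split(spectrum, n_bands)
    energies = np.array([b.sum() for b in bands])
    log_e = np.log(energies + 1e-10)                  # log compression
    # DCT-II of the log band energies yields the cepstral coefficients,
    # which reflect the amplitude of the signal in those bands.
    n = np.arange(n_bands)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_bands)
    return dct @ log_e

# Example: a 440 Hz tone in a 256-sample Hamming-windowed frame at 8 kHz.
frame = np.hamming(256) * np.sin(2 * np.pi * 440 * np.arange(256) / 8000)
ceps = cepstral_features(frame)
print(ceps.shape)  # (12,)
```

Because only band energies survive this analysis, the features track amplitude changes over successive frames, which is why they can be modeled as an amplitude modulated signal.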
Whereas the human ear is sensitive to frequency modulation as well as amplitude modulation in received speech signals, this frequency modulated content is only partially reflected in systems that perform cepstral analysis.
Accordingly, it would be desirable to provide a speech feature extraction system capable of capturing the frequency modulation characteristics of speech, as well as previously known amplitude modulation characteristics.
It also would be desirable to provide speech recognition and other speech processing systems that incorporate feature extraction systems that provide information on frequency modulation characteristics of the input speech signal.
In view of the foregoing, it is an object of the present invention to provide a speech feature extraction system capable of capturing the frequency modulation characteristics of speech, as well as previously known amplitude modulation characteristics.
It also is an object of this invention to provide speech recognition and other speech processing systems that incorporate feature extraction systems that provide information on frequency modulation characteristics of the input speech signal.
The present invention provides a speech feature extraction system that reflects frequency modulation characteristics of speech as well as amplitude characteristics. This is done by a feature extraction stage that includes a plurality of complex band pass filters in adjacent frequency bands. The output of alternate complex band pass filters is multiplied by the conjugate of the output of the band pass filter in the adjacent lower frequency band, and the resulting signal is low pass filtered.
Each of the low pass filter outputs is processed to compute two components: an FM component that is substantially sensitive to the frequency of the signal passed by the adjacent band pass filters from which the low pass filter output was generated, and an AM component that is substantially sensitive to the amplitude of the signal passed by the adjacent band pass filters. The FM component reflects the difference in the phase of the outputs of the adjacent band pass filters used to generate the low pass filter output.
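The filter-pair pipeline described above can be sketched in a few lines. This is a simplified interpretation under stated assumptions: the band width, band spacing, filter implementation (mixing to baseband followed by a moving-average low pass), and the use of magnitude and phase angle of the conjugate product as the AM and FM components are illustrative choices, not the specific filter designs claimed by the invention.

```python
import numpy as np

def am_fm_features(x, sample_rate=8000, band_width=250.0, n_pairs=8):
    """AM/FM components from pairs of adjacent complex band pass filters."""
    t = np.arange(len(x)) / sample_rate
    kernel = np.ones(64) / 64              # crude moving-average low pass
    outputs = []
    for k in range(2 * n_pairs):
        fc = (k + 1) * band_width          # centre of band k (assumed spacing)
        # Complex band pass filter: mix the band down to baseband, low pass.
        baseband = x * np.exp(-2j * np.pi * fc * t)
        outputs.append(np.convolve(baseband, kernel, mode="same"))
    feats = []
    for k in range(0, 2 * n_pairs, 2):
        # Multiply the upper band output by the conjugate of the output of
        # the adjacent lower band, then low pass filter the product.
        prod = np.convolve(outputs[k + 1] * np.conj(outputs[k]),
                           kernel, mode="same")
        am = np.abs(prod)                  # sensitive to signal amplitude
        fm = np.angle(prod)                # phase difference between the two
                                           # adjacent bands: frequency sensitive
        feats.append((am.mean(), fm.mean()))
    return np.array(feats)                 # shape (n_pairs, 2)

# Example: a 500 Hz tone at 8 kHz falls between the first two assumed bands.
sig = np.sin(2 * np.pi * 500 * np.arange(2048) / 8000)
feats = am_fm_features(sig)
print(feats.shape)  # (8, 2)
```

The angle of the conjugate product equals the phase of the upper band output minus the phase of the lower band output, which is how the FM component here reflects the phase difference between adjacent band pass filters.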
The AM and FM components are then processed using known feature enhancement techniques, such as the discrete cosine transform, mel-scale translation, mean normalization, delta and acceleration analysis, linear discriminant analysis, and principal component analysis, to generate speech features suitable for statistical processing or other recognition or identification methods.
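Three of the enhancement techniques named above can be sketched together: mean normalization followed by delta and acceleration analysis. The frame count, feature dimension, and the use of a simple gradient for the time derivatives are illustrative assumptions; production systems typically use regression over a window of frames.

```python
import numpy as np

def enhance(features):
    """Mean-normalize features, then append delta and acceleration rows.

    features: array of shape (n_frames, n_features), one row per frame.
    """
    norm = features - features.mean(axis=0)   # mean normalization per feature
    delta = np.gradient(norm, axis=0)         # first time derivative (delta)
    accel = np.gradient(delta, axis=0)        # second derivative (acceleration)
    return np.hstack([norm, delta, accel])    # (n_frames, 3 * n_features)

# Example with placeholder data: 100 frames of 13 base features.
frames = np.random.randn(100, 13)
out = enhance(frames)
print(out.shape)  # (100, 39)
```

Mean normalization removes slowly varying channel effects (such as a fixed microphone response), while the delta and acceleration terms capture how the features change over successive frames.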