The present invention relates generally to speech recognition systems and more particularly to a wavelet-based system for extracting features for recognition that are optimized for different classes of sounds (e.g. fricatives, plosives, other consonants, vowels, and the like).
When analyzing a speech signal, the first step is to extract features which represent the useful information that characterizes the signal. Conventionally, this feature extraction process involves chopping the speech signal into overlapping windows of a predetermined frame size and then computing the Fast Fourier Transform (FFT) of each signal window. A finite set of cepstral coefficients is then extracted by discarding the higher order terms in the Fourier transform of the log spectrum. The resulting cepstral coefficients may then be used to construct speech models, typically Hidden Markov Models.
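The conventional procedure described above may be illustrated by the following minimal sketch. It is not taken from the specification: the function names, the Hamming window, the 1/n scaling, and the small constant added before the logarithm are assumptions made for illustration, and a direct discrete Fourier transform stands in for the FFT so that the example needs only the Python standard library.

```python
import cmath
import math


def frames(signal, frame_size, hop):
    """Split a signal into overlapping frames (hop < frame_size gives overlap)."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, hop)]


def dft(x):
    """Direct O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def cepstral_coefficients(frame, num_coeffs):
    """Return the first num_coeffs cepstral coefficients of one frame."""
    n = len(frame)
    # Assumption: a Hamming window tapers the frame edges (requires n > 1).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = dft(windowed)
    # Log magnitude spectrum; the small constant avoids log(0).
    log_mag = [math.log(abs(c) + 1e-10) for c in spectrum]
    # Fourier transform of the log spectrum. The log spectrum of a real
    # signal is real and symmetric, so the result is real up to rounding.
    cepstrum = dft(log_mag)
    # Discard the higher order terms by keeping only the first num_coeffs.
    return [c.real / n for c in cepstrum[:num_coeffs]]
```

For example, a 256-sample signal cut into 64-sample frames with a hop of 32 samples yields seven frames, each of which may be reduced to, say, twelve coefficients by calling `cepstral_coefficients(frame, 12)`.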
A significant disadvantage of conventional FFT analysis is its fixed time-frequency resolution. When analyzing speech, it would be desirable to be able to use a plurality of different time-frequency resolutions, to better spot the non-linearly distributed speech information in the time-frequency plane. In other words, it would be desirable if sharper time resolution could be provided for rapidly changing fricatives or other consonants while providing less time resolution for more slowly changing structures such as vowels. Unfortunately, current technology makes this difficult to achieve. While it is possible to construct and use in parallel a set of recognizers that are each designed for a particular speech feature, such a solution carries a heavy computational burden.
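The fixed resolution noted above follows directly from the frame size, as the short calculation below illustrates. The function name and the 16 kHz sampling rate are assumptions chosen for illustration and do not appear in the specification.

```python
def fft_resolution(sample_rate_hz, frame_size):
    """Return (time resolution in ms, frequency resolution in Hz) of one frame."""
    time_res_ms = 1000.0 * frame_size / sample_rate_hz  # duration of the frame
    freq_res_hz = sample_rate_hz / frame_size           # spacing of the FFT bins
    return time_res_ms, freq_res_hz


# Assumed sampling rate of 16 kHz with two candidate frame sizes.
print(fft_resolution(16000, 256))   # (16.0, 62.5)
print(fft_resolution(16000, 1024))  # (64.0, 15.625)
```

Whichever frame size is selected, the same pair of values applies to every sound in the utterance: a short frame that suits a plosive gives coarse frequency bins for a vowel, and a long frame does the reverse.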
The present invention employs wavelet technology to provide a single analytical technique that covers a wide assortment of different classes of sounds. Using the wavelet technology of the invention, a single recognizer can be constructed and used in which the speech models have already been optimized for different classes of sounds through a unique feature extraction process. Thus the recognizer of the invention is optimized for different classes of sounds without increasing the complexity of the recognition analysis process.
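By way of general illustration only, the following sketch shows how a wavelet decomposition yields different time resolutions in different frequency bands within one analysis. The Haar wavelet, the function names, and the number of levels are assumptions chosen for brevity; the sketch is not the feature extraction process of the invention.

```python
import math


def haar_step(x):
    """One Haar analysis level: returns (approximation, detail), each half length."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_dwt(x, levels):
    """Return [detail_1, ..., detail_levels, approx_levels].

    Assumes len(x) is divisible by 2 ** levels.
    """
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)  # highest remaining frequency band
    bands.append(approx)      # lowest frequency band
    return bands
```

Applied to a 16-sample window with three levels, the bands contain 8, 4, 2, and 2 coefficients respectively. The highest band is therefore sampled most densely in time, which suits rapidly changing sounds, while the lower bands trade time resolution for frequency resolution, which suits more slowly changing sounds.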
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.