The present invention relates to speech recognition systems and in particular to speech recognition systems that exploit vocal tract resonances in speech.
In human speech, a great deal of information is carried by the first three or four resonant frequencies of the speech signal. In particular, when a speaker is pronouncing a vowel, the frequencies (and, to a lesser extent, the bandwidths) of these resonances indicate which vowel is being spoken.
Such resonant frequencies and bandwidths are often referred to collectively as formants. During sonorant speech, which is typically voiced, formants can be found as spectral prominences in a frequency representation of the speech signal. However, during non-sonorant speech, the formants cannot be found directly as spectral prominences. Because of this, the term “formants” has sometimes been interpreted as applying only to sonorant portions of speech. To avoid confusion, some researchers use the phrase “vocal tract resonance” to refer to formants that occur during both sonorant and non-sonorant speech. In both cases, the resonance relates only to the oral tract portion of the vocal tract.
To detect formants, prior art systems analyzed the spectral content of a frame of the speech signal. Because a formant can occur at any frequency, these systems attempted to limit the search space before identifying the most likely formant values. Under some prior art systems, the search space of possible formants is reduced by identifying peaks in the spectral content of the frame. Typically, this is done using linear predictive coding (LPC), which attempts to find a polynomial that represents the spectral content of a frame of the speech signal. Each root of this polynomial represents a possible resonant frequency in the signal and thus a possible formant.
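The LPC step described above can be sketched as follows. The frame length, model order, and sampling rate used here are arbitrary assumptions for illustration, not values from the source; the sketch solves the autocorrelation normal equations for the prediction coefficients and then takes the roots of the resulting polynomial as formant candidates:

```python
import numpy as np

def lpc_formant_candidates(frame, order=8, fs=8000):
    """Estimate candidate formant frequencies and bandwidths for one speech
    frame by finding the roots of an LPC polynomial (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))            # taper the frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Solve the Toeplitz normal equations R a = r for the LPC coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Roots of A(z) = 1 - a1 z^-1 - ... - ap z^-p; each complex-conjugate
    # pole pair is a possible resonance.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]                 # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)        # candidate frequencies (Hz)
    bws = -np.log(np.abs(roots)) * fs / np.pi         # candidate bandwidths (Hz)
    idx = np.argsort(freqs)
    return freqs[idx], bws[idx]
```

On a synthetic frame containing a single damped resonance, one of the returned candidates should lie near the true resonance frequency; the remaining roots model the rest of the spectrum and must still be screened, which is exactly why the root set only *limits* the search space rather than identifying formants outright.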
One system, developed by the present inventors, identified vocal tract resonance frequencies and bandwidths by limiting the possible values of the frequencies and bandwidths to a set of quantized values. This system used a residual model that described the difference between observed feature vectors and a set of simulated feature vectors. The simulated feature vectors were constructed using a function that was a sum of a set of sub-functions. Each sub-function was a non-linear function of one of the vocal tract resonance frequencies and one of the vocal tract resonance bandwidths.
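As a sketch of such a sum-of-sub-functions mapping: one commonly used nonlinear sub-function maps a pole pair with frequency f_k and bandwidth b_k to LPC-cepstral coefficients, c_n = sum over k of (2/n) exp(-pi n b_k / fs) cos(2 pi n f_k / fs). The formula and all parameter values below are an assumed instance for illustration, not necessarily the mapping used by the system described:

```python
import numpy as np

def simulated_cepstrum(freqs, bws, n_ceps=15, fs=8000):
    """Simulated feature vector built as a sum of sub-functions, one per
    vocal tract resonance. Each sub-function is nonlinear in one resonance
    frequency f_k and one bandwidth b_k (an assumed LPC-cepstral form)."""
    n = np.arange(1, n_ceps + 1)                      # cepstral indices 1..n_ceps
    c = np.zeros(n_ceps)
    for f, b in zip(freqs, bws):
        # Contribution of one pole pair to the cepstrum.
        c += (2.0 / n) * np.exp(-np.pi * n * b / fs) * np.cos(2 * np.pi * n * f / fs)
    return c
```

Because the mapping is a plain sum over resonances, the contribution of each resonance is independent of the others, which is the structural property the residual model exploits when comparing simulated feature vectors against observed ones.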
While this system was an improvement over prior art systems, it was still not as fast as desired because training of the residual model parameters required a summation over all possible combinations of values for the vocal tract resonance frequencies and bandwidths. Under one quantization scheme, this required a summation over 20 million possible combinations. Thus, a technique is needed that allows this system to be used without requiring a summation over all possible combinations of the vocal tract resonance frequencies and bandwidths.
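To see the scale of that summation: with P resonances, each restricted to a grid of quantized (frequency, bandwidth) pairs, the joint space grows as the grid size raised to the P-th power. The grid size below is a hypothetical figure chosen only to reproduce the order of magnitude quoted above, not a value from the source:

```python
# Hypothetical figures: the source states only that one quantization scheme
# yielded roughly 20 million combinations; the grid size here is chosen to
# match that order of magnitude.
num_resonances = 4           # e.g. the first four vocal tract resonances
pairs_per_resonance = 67     # assumed quantized (frequency, bandwidth) pairs

total = pairs_per_resonance ** num_resonances   # size of the joint search space
print(total)  # 67**4 = 20151121, on the order of 20 million
```

Exponential growth in the number of resonances is what makes training the residual model expensive, since every one of these combinations must be visited in the summation.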