The present invention relates to speech recognition systems and, in particular, to speech recognition systems that exploit formants in speech.
In human speech, a great deal of information is contained in the first three resonant frequencies, or formants, of the speech signal. In particular, when a speaker is pronouncing a vowel, the frequencies and bandwidths of the formants indicate which vowel is being spoken.
To detect formants, systems of the prior art analyzed the spectral content of a frame of the speech signal. Since a formant can be at any frequency, the prior art has attempted to limit the search space before identifying a most likely formant value. Under some systems of the prior art, the search space of possible formants is reduced by identifying peaks in the spectral content of the frame. Typically, this is done by using linear predictive coding (LPC), which attempts to find a polynomial that represents the spectral content of a frame of the speech signal. Each of the roots of this polynomial represents a possible resonant frequency in the signal and thus a possible formant. Using LPC, the search space is therefore reduced to those frequencies that form roots of the LPC polynomial.
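The LPC-based reduction described above can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the function names, the use of the autocorrelation method with the Levinson-Durbin recursion, and the Durand-Kerner root finder are all assumptions chosen for illustration. The sketch fits an LPC polynomial to one frame and converts each complex root in the upper half plane into a candidate frequency and bandwidth.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation of the frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion (assumed method for this sketch).

    Returns [1, a1, ..., ap] for A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    Assumes the frame is not silent, so that r[0] > 0.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a


def polynomial_roots(coeffs, iterations=200):
    """Durand-Kerner iteration for a monic polynomial in descending powers."""
    degree = len(coeffs) - 1
    roots = [complex(0.4, 0.9) ** k for k in range(degree)]
    for _ in range(iterations):
        for i in range(degree):
            value = 0j
            for c in coeffs:                # Horner evaluation at roots[i]
                value = value * roots[i] + c
            denom = 1 + 0j
            for j in range(degree):
                if j != i:
                    denom *= roots[i] - roots[j]
            roots[i] -= value / denom
    return roots


def candidate_formants(frame, order, sample_rate):
    """Return sorted (frequency_hz, bandwidth_hz) pairs, one per root pair."""
    a = lpc_coefficients(frame, order)
    candidates = []
    for root in polynomial_roots(a):
        if root.imag > 1e-9:                # keep one root of each conjugate pair
            freq = cmath.phase(root) * sample_rate / (2 * math.pi)
            bandwidth = -math.log(abs(root)) * sample_rate / math.pi
            candidates.append((freq, bandwidth))
    return sorted(candidates)
```

Note that the candidate list is derived entirely from the frame itself, so a formant whose frequency does not coincide with a root of the fitted polynomial cannot be selected later.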
In other formant tracking systems of the prior art, the search space is reduced by comparing the spectral content of the frame to a set of spectral templates in which formants have been identified by an expert. The closest “n” templates are then selected and used to calculate the formants for the frame. Thus, these systems reduce the search space to those formants associated with the closest templates.
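The template-based reduction can be sketched in the same hedged manner. The text above does not state how the closest templates are combined, so the Euclidean distance measure and the simple averaging of the labeled formants used here are assumptions, as are the function names and the representation of a template as a (spectrum, formants) pair.

```python
import math


def spectral_distance(a, b):
    """Euclidean distance between two equal-length spectra (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def formants_from_templates(frame_spectrum, templates, n):
    """Estimate formants from the n templates closest to the frame.

    templates: list of (spectrum, formants) pairs, where formants is the
    list of expert-labeled formant frequencies for that spectrum.
    The estimate is the per-formant mean over the n closest templates
    (an assumed combination rule).
    """
    ranked = sorted(templates,
                    key=lambda t: spectral_distance(frame_spectrum, t[0]))
    closest = ranked[:n]
    count = len(closest[0][1])
    return [sum(t[1][k] for t in closest) / len(closest)
            for k in range(count)]
```

Under this sketch the result always lies within the range spanned by the selected templates' labels, which illustrates how the search space is confined to the formants associated with the closest templates.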
Although systems that reduce the search space operate efficiently, they are prone to errors because they can exclude the frequency of the actual formant when reducing the search space. In addition, because the search space is reduced based on the input signal, formants in different frames of the input signal are identified using different formant search spaces. This is undesirable because the varying search space introduces an additional source of possible errors into the search process.
Thus, a formant tracking system is needed that does not reduce the search space in such a way that the formants in different frames of the speech signal are identified using different formant search spaces.