A typical continuous speech recognizer consists of a front-end feature analysis stage followed by a statistical pattern classifier. The feature vector, which is the interface between these two stages, should ideally contain all the information in the speech signal that is relevant to subsequent classification, be insensitive to irrelevant variations due to changes in the acoustic environment, and at the same time have a low dimensionality in order to minimize the computational demands of the classifier. Several types of feature vectors have been proposed as approximations of this ideal, as described in the article by J. W. Picone, entitled "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, Vol. 81, No. 9, 1993, pp. 1215-1247. Most speech recognizers have traditionally utilized cepstral parameters derived from a linear predictive (LP) analysis, because LP analysis generates a smooth spectrum that is free of pitch harmonics and models spectral peaks reasonably well. Mel-based cepstral parameters, on the other hand, take advantage of the perceptual properties of the human auditory system by sampling the spectrum at mel-scale intervals. Combining the merits of LP analysis and mel-filter bank analysis should, in theory, produce an improved set of cepstral features.
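As an illustration of the LP-derived cepstral parameters mentioned above, the following is a minimal sketch of the standard recursion that converts LP predictor coefficients into cepstral coefficients. It is not taken from the cited articles. The function name `lpc_to_cepstrum`, the sign convention H(z) = 1 / (1 - sum of a_k z^-k), and the unit-gain assumption are choices made here for illustration only.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LP predictor coefficients to cepstral coefficients.

    Assumed (illustrative) convention: the all-pole model is
        H(z) = 1 / (1 - sum_{k=1..p} a[k-1] * z**-k)
    with unit gain, so the zeroth cepstral coefficient (the log gain)
    is zero and is omitted. Returns the list [c_1, ..., c_n_ceps].
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused (log gain, zero here)
    for n in range(1, n_ceps + 1):
        # The direct term a_n exists only while n does not exceed the LP order.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_{n-k}, with 1 <= n-k <= p.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


if __name__ == "__main__":
    # First-order model with a single pole at z = 0.5: c_n = 0.5**n / n.
    print(lpc_to_cepstrum([0.5], 4))
```

For a stable all-pole model, the cepstral coefficients equal the sum of the n-th powers of the poles divided by n, which gives a simple way to check the recursion on small examples.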
This can be performed in several ways. For example, one could compute the log magnitude spectrum of the LP parameters and then warp the frequencies to correspond to the mel-scale. Previous studies have reported encouraging speech recognition results when warping the LP spectrum by a bilinear transformation prior to computing the cepstrum, compared with using no warping, such as in M. Rahim and B. H. Juang, "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, 1996, pp. 19-30. Several other frequency warping techniques have been proposed. For example, H. W. Strube, "Linear Prediction on a Warped Frequency Scale", Journal of the Acoustical Society of America, Vol. 68, No. 4, 1980, pp. 1071-1076, proposes a mel-like spectral warping method through all-pass filtering in the time domain. Another approach is to apply mel-filter bank analysis to the signal, followed by LP analysis, to give what will be referred to as mel linear predictive cepstral (mel-lpc) features (see M. Rahim and B. H. Juang, "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, 1996, pp. 19-30). The computation of the mel-lpc features is similar in some sense to that of the perceptual linear prediction (PLP) coefficients explained by H. Hermansky in "Perceptual Linear Predictive (PLP) Analysis of Speech", Journal of the Acoustical Society of America, Vol. 87, No. 4, 1990, pp. 1738-1752. Both techniques apply a mel filter bank prior to LP analysis. However, the mel-lpc uses a higher order LP analysis with no perceptual weighting or amplitude compression. All of the above techniques are attempts to perceptually model the spectrum of the speech signal for improved speech quality, and to provide a more efficient representation of the spectrum for speech analysis, synthesis and recognition in a whole-band approach.
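To make the bilinear frequency warping mentioned above concrete, the following minimal sketch evaluates the frequency mapping produced by a first-order all-pass (bilinear) transformation. It is an illustration only and is not the implementation used in any of the cited articles. The function names `bilinear_warp` and `warp_hz`, the warping parameter value 0.4, and the 8000 Hz sampling rate are assumptions chosen for the example.

```python
import math


def bilinear_warp(omega, alpha):
    """Map a normalized frequency omega (radians/sample, 0..pi) to its
    warped value under a first-order all-pass (bilinear) transformation.

    alpha is the warping parameter, |alpha| < 1. alpha = 0 leaves the
    axis unchanged; alpha > 0 stretches the low-frequency region, which
    gives a mel-like frequency resolution.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def warp_hz(freq_hz, sample_rate_hz, alpha):
    """Same mapping expressed in Hz for a given sampling rate."""
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    return bilinear_warp(omega, alpha) * sample_rate_hz / (2.0 * math.pi)


if __name__ == "__main__":
    # Illustrative values only: alpha = 0.4 and an 8000 Hz sampling rate.
    for f in (250, 500, 1000, 2000, 4000):
        print(f, round(warp_hz(f, 8000, 0.4), 1))
```

The mapping fixes the endpoints 0 and pi and is strictly increasing for |alpha| < 1, so it only redistributes resolution along the frequency axis. A warped LP spectrum can then be resampled on a uniform grid of warped frequencies before the cepstrum is computed.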
In recent years there has been some work on subband-based feature extraction techniques, such as H. Bourlard and S. Dupont, "Subband-Based Speech Recognition", Proc. ICASSP, 1997, pp. 1251-1254; P. McCourt, S. Vaseghi and N. Harte, "Multi-Resolution Cepstral Features for Phoneme Recognition Across Speech Subbands", Proc. ICASSP, 1998, pp. 557-560; S. Okawa, E. Bocchieri and A. Potamianos, "Multi-Band Speech Recognition in Noisy Environments", Proc. ICASSP, 1998, pp. 641-644; and S. Tibrewala and H. Hermansky, "Subband Based Recognition of Noisy Speech", Proc. ICASSP, 1997, pp. 1255-1258. The article by P. McCourt, S. Vaseghi and N. Harte, "Multi-Resolution Cepstral Features for Phoneme Recognition Across Speech Subbands", Proc. ICASSP, 1998, pp. 557-560, indicates that use of multiple resolution levels yields no further advantage. Additionally, recent theoretical and empirical results have shown that auto-regressive spectral estimation from subbands is more robust and more efficient than full-band auto-regressive spectral estimation; see S. Rao and W. A. Pearlman, "Analysis of Linear Prediction, Coding and Spectral Estimation from Subbands", IEEE Transactions on Information Theory, Vol. 42, 1996, pp. 1160-1178.
As the articles cited above tend to indicate, there is still a need for advances and improvements in the art of speech recognizers.
It is an object of the present invention to provide a speech recognizer that has the advantages of both a linear predictive analysis and a subband analysis.