1. Field of the Invention
The field of the invention is speech recognition in both noisy and clean conditions.
2. Description of Related Art
Most modern speech recognition systems focus on the speech short-term spectrum for feature analysis and extraction, also referred to as the “front-end” analysis. The technique attempts to capture information on the vocal tract transfer function from the gross spectral shape of the input speech, while eliminating as much as possible the irrelevant effects of excitation signals. However, the accuracy and robustness of the speech representation may deteriorate dramatically due to the spectral distortion caused by additive background noise. Noise-robust feature extraction therefore poses a great challenge in the design of high-performance automatic speech recognition (ASR) systems. Over the last several decades, a number of speech spectral representations have been developed, among which the mel-frequency cepstral coefficients (MFCC) have become the most popular. [M. J. Hunt, “Spectral signal processing for ASR”, Proc. ASRU'99, December 1999 and S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuous spoken sentences”, IEEE Trans. Acoust., Speech, Signal Processing, pp. 357-366, vol. 28, August 1980]. The MFCCs, though adopted by most ASR systems for their superiority in clean speech recognition, do not cope well with noisy speech. The alternative perceptual linear prediction (PLP) coefficients promise improvement over MFCC in noisy conditions by incorporating perceptual features of the human auditory mechanism. Nevertheless, it is believed that the existing front ends are sub-optimal, and the discovery of new noise-immune or noise-insensitive features is needed.
Two problems plague conventional MFCC front-end analysis techniques. The first is concerned with the vocal tract transfer function whose accurate description is crucial to effective speech recognition. However, the irrelevant information of excitation signals must be removed for accurate spectral representation. In the MFCC approach, a smoothed version of the short-term speech spectrum is computed from the output energy of a bank of filters, i.e., the spectrum envelope is computed from energy averaged over each mel-scaled filter. While such a procedure is fast and efficient, it is inaccurate as the vocal tract transfer function information is known to reside in the spectral envelope which is mismatched with the smoothed spectrum, especially for voiced sounds and transitional speech. Alternative approaches based on direct spectral envelope estimation have been reported. [H. K. Kim and H. S. Lee, “Use of spectral autocorrelation in spectral envelope linear prediction for speech recognition”, IEEE Trans. Speech and Audio Processing, vol. 7, no. 5, pp. 533-541, 1999].
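The filter-bank smoothing described above can be illustrated with a minimal sketch. It is not the implementation of any cited system: the mel-scale formula (2595 log10(1 + f/700)), the triangular filter shape, and the names hz_to_mel, mel_to_hz, mel_filter_edges and filterbank_energies are all assumptions chosen for illustration, and a real front end would typically normalize filters and operate on FFT output.

```python
import math

def hz_to_mel(f_hz):
    # Assumed mel formula (a common convention, not taken from the text).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_edges(n_filters, f_min_hz, f_max_hz):
    """Return n_filters + 2 edge frequencies, equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_filters + 2)]

def filterbank_energies(power_spectrum, sample_rate_hz, n_filters):
    """Weighted energy per triangular mel filter: one smoothed value per band."""
    nyquist = sample_rate_hz / 2.0
    bin_hz = nyquist / (len(power_spectrum) - 1)
    edges = mel_filter_edges(n_filters, 0.0, nyquist)
    energies = []
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        total = 0.0
        for k, p in enumerate(power_spectrum):
            f = k * bin_hz
            if lo < f <= mid:
                total += p * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                total += p * (hi - f) / (hi - mid)   # falling slope
        energies.append(total)
    return energies

# Hypothetical input: a flat power spectrum with 257 bins at 16 kHz sampling.
flat = [1.0] * 257
energies = filterbank_energies(flat, 16000, 20)
edges = mel_filter_edges(20, 0.0, 8000.0)
```

Because each filter sums every bin under its triangle, harmonic peaks and the valleys between them contribute to the same output value, which is the smoothing effect the passage contrasts with the spectral envelope.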
Moreover, the spectrum envelope tends to have a much higher signal-to-noise ratio (SNR) than the smoothed spectrum under the same noise conditions, which leads to a more robust representation of the vocal tract transfer function. Hence, speech features derived from the spectral envelope are expected to provide better performance in noisy environments compared with traditional front ends based on the smoothed spectrum [Q. Zhu and A. Alwan, “AM-demodulation of speech spectra and its application to noise robust speech recognition”, Proc. ICSLP'2000, October 2000]. The MFCC approach, by contrast, may not work well for voiced sounds with quasi-periodic features, as the formant frequencies tend to be biased toward pitch harmonics, and the formant bandwidth may be misestimated. Experiments show that this mismatch substantially increases the feature variance within the same utterance.
Another difficulty encountered in conventional acoustic analysis (e.g., MFCC) is that of appropriate spectral amplitude transformation for higher recognition performance. The log power spectrum representation in MFCC is clearly attractive because of its gain-invariance properties and the approximate Gaussian distributions it thus provides. A cubic root representation is used in PLP for psychophysical considerations, at the cost of compromising the level-invariance properties and hence robustness. [H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech”, J. Acoust. Soc. America, pp. 1738-1752, vol. 87, no. 4, April 1990].
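The gain-invariance contrast between the two amplitude transformations can be checked numerically. The sketch below is only an illustration under assumed values: the band energies, the gain factor, and the helper names are hypothetical. Under the logarithm, a gain g adds the same constant log(g) to every band, so differences between bands are unchanged; under the cubic root, every band is multiplied by g^(1/3), so the spectral shape scales with the input level.

```python
import math

def log_compress(power):
    return [math.log(p) for p in power]

def cube_root_compress(power):
    return [p ** (1.0 / 3.0) for p in power]

def differences(values):
    # Band-to-band differences, used here as a simple proxy for spectral shape.
    return [b - a for a, b in zip(values, values[1:])]

power = [1.0, 8.0, 27.0, 64.0]        # hypothetical band energies
gain = 10.0                           # hypothetical change in input level
scaled = [gain * p for p in power]

log_shape = differences(log_compress(power))
log_shape_scaled = differences(log_compress(scaled))      # same as log_shape

root_shape = differences(cube_root_compress(power))
root_shape_scaled = differences(cube_root_compress(scaled))  # stretched by gain ** (1/3)
```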
Modern speech recognition systems retrieve information on the vocal tract transfer function from the gross spectral shape. The speech signal is generated via modulation by an excitation signal that is quasi-periodic for voiced sounds, and white noise for unvoiced sounds. A typical approach, employed in MFCC and PLP, is to compute the energy output of a bank of band-pass mel-scaled or bark-scaled filters, whose bandwidths are broad enough to remove fine harmonic structures caused by the quasi-periodic excitation of voiced speech. The efficiency and effectiveness of these spectral smoothing approaches led to their popularity. However, there are three drawbacks that significantly deteriorate their accuracy.
The first drawback is the limited ability to remove undesired harmonic structures. In order to maintain adequate spectral resolution, the standard filter bandwidth in MFCC and PLP is usually in the range of 200 Hz-300 Hz in the low frequency region. It is hence sufficiently broad for typical male speakers, but not broad enough for high-pitched female speakers (pitch up to 450 Hz). Consequently, the formant frequencies are biased towards pitch harmonics and their bandwidth is misestimated.
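The bandwidth argument above reduces to simple arithmetic on how many pitch harmonics fall inside one filter. The following sketch assumes idealized rectangular bands 250 Hz wide (a simplification of the actual mel or bark filters) and two hypothetical pitch values, 100 Hz and 450 Hz; harmonics_in_band is an illustrative helper, not part of any cited method.

```python
import math

def harmonics_in_band(f0_hz, low_hz, high_hz):
    """Count pitch harmonics k * f0_hz (k >= 1) that fall in [low_hz, high_hz)."""
    first = max(1, math.ceil(low_hz / f0_hz))
    last = math.ceil(high_hz / f0_hz) - 1
    return max(0, last - first + 1)

# Assumed rectangular bands of 250 Hz covering 0-1000 Hz.
bands = [(0, 250), (250, 500), (500, 750), (750, 1000)]

low_pitch_counts = [harmonics_in_band(100, lo, hi) for lo, hi in bands]
high_pitch_counts = [harmonics_in_band(450, lo, hi) for lo, hi in bands]
```

With a 100 Hz pitch every band holds at least two harmonics, so averaging hides the harmonic structure. With a 450 Hz pitch a band holds one harmonic or none, so the band energy depends on where the harmonics happen to land, which is consistent with the bias toward pitch harmonics described in the text.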
The second drawback concerns information extraction to characterize the vocal tract function. It is widely agreed in the speech coding community that it is the spectral envelope and not the gross spectrum that represents the shape of the vocal tract [M. Jelinek, et al., supra]. Although the smoothed spectrum is often similar to the spectral envelope of unvoiced sounds, the situation is quite different in the case of voiced and transitional sounds. Experiments show that this mismatch substantially increases the spectrum variation within the same utterance. This phenomenon is illustrated in FIG. 1 with the stationary part of the voiced sound [a]. FIG. 1 demonstrates that the upper envelope of the power spectrum sampled at pitch harmonics is nearly unchanged, while the variation of the lower envelope is considerable. The conventional smoothed spectrum representation may be roughly viewed as averaging the upper and lower envelopes. It therefore exhibits much more variation than the upper spectrum envelope alone.
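The contrast between the upper envelope and the smoothed spectrum can be reproduced on a toy example. The sketch below uses a synthetic power spectrum with fixed harmonic peaks and a valley level that changes between two frames; the values, bin layout, and function names are assumptions for illustration only and do not reproduce the data of FIG. 1.

```python
def synthetic_spectrum(n_bins, harmonic_step, peak_power, valley_power):
    """Toy power spectrum: peak_power at each harmonic bin, valley_power elsewhere."""
    return [peak_power if k > 0 and k % harmonic_step == 0 else valley_power
            for k in range(n_bins)]

def smoothed(spectrum, lo, hi):
    """Band average over bins lo..hi-1, a stand-in for filter-bank smoothing."""
    band = spectrum[lo:hi]
    return sum(band) / len(band)

def upper_envelope(spectrum, lo, hi, harmonic_step):
    """Largest value among the harmonic bins inside lo..hi-1."""
    peaks = [spectrum[k] for k in range(lo, hi) if k > 0 and k % harmonic_step == 0]
    return max(peaks)

# Two hypothetical frames of the same sound: identical peaks, different valleys.
frame_a = synthetic_spectrum(16, 4, peak_power=1.0, valley_power=0.01)
frame_b = synthetic_spectrum(16, 4, peak_power=1.0, valley_power=0.2)

smooth_a, smooth_b = smoothed(frame_a, 0, 16), smoothed(frame_b, 0, 16)
upper_a = upper_envelope(frame_a, 0, 16, 4)
upper_b = upper_envelope(frame_b, 0, 16, 4)
```

In this toy case the upper envelope is identical for both frames, while the band average changes because it mixes in the valley level.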
The third drawback is the high spectral sensitivity to background noise. The conventional smoothed spectrum representation may be roughly viewed as averaging the upper and lower envelopes. It therefore exhibits much lower SNR than the upper spectrum envelope alone in noisy conditions.
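The SNR comparison can be made concrete with a rough calculation. The sketch assumes additive noise of equal power in every bin and a toy voiced spectrum with strong harmonic peaks and weak valleys; all numeric values and names are hypothetical.

```python
import math

def snr_db(signal_power, noise_power):
    return 10.0 * math.log10(signal_power / noise_power)

peak_power = 1.0       # assumed signal power at harmonic bins
valley_power = 0.01    # assumed signal power between harmonics
noise_power = 0.05     # assumed flat noise power per bin
harmonic_step = 4      # every fourth bin is a harmonic

# Signal power in bins 1..16 of one band.
signal = [peak_power if k % harmonic_step == 0 else valley_power
          for k in range(1, 17)]

# Smoothed spectrum: band-averaged signal power against the per-bin noise power.
band_snr_db = snr_db(sum(signal) / len(signal), noise_power)

# Upper envelope: signal power at the harmonic peaks against the same noise power.
peak_snr_db = snr_db(peak_power, noise_power)
```

Under these assumptions the averaged band has a lower SNR than the harmonic peaks, because the valley bins add noise while contributing little signal energy.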
Although some of the loss caused by the imprecision of spectrum smoothing may be compensated for and masked by higher complexity statistical modeling, the recognition rate eventually reaches saturation at high model complexity. The present invention discloses that the sub-optimality of the front-end is currently a major performance bottleneck of powerful, high complexity speech recognizers. Therefore, the present invention discloses the alternative of Harmonic Cepstral Coefficients (HCC), as a more accurate spectral envelope representation.