Over the last few decades, the focus in ASR has gradually shifted from laboratory experiments performed on carefully enunciated speech received by high-fidelity equipment in quiet environments to real applications having to cope with normal speech received by low-cost equipment in noisy environments.
In the latter case, an ASR system has to be robust to at least two sources of distortion. One is additive in nature: background noise, such as a computer fan, a car engine or road noise. The other is convolutive in nature: changes in microphone type (e.g., a hand-held microphone or a hands-free microphone) or in the microphone's position relative to the speaker's mouth. In mobile applications of speech recognition, both the background noise and the microphone type and position are subject to change. Therefore, it is critical that ASR systems be able to compensate for the two distortions jointly.
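The two distortions can be written as y(t) = x(t) * h(t) + n(t), where x is clean speech, h is the channel impulse response (convolutive term) and n is background noise (additive term). The following minimal sketch simulates both effects; all names, the toy impulse response and the noise level are illustrative, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def distort(clean, channel_ir, noise_power):
    """Apply the two distortions discussed above: a convolutive channel
    (microphone/room response) and additive background noise.
    Names and values are illustrative only."""
    # Convolutive distortion: filter the clean signal with the channel
    # impulse response (time-domain convolution).
    channeled = np.convolve(clean, channel_ir, mode="same")
    # Additive distortion: background noise at the given power.
    noise = rng.normal(0.0, np.sqrt(noise_power), size=channeled.shape)
    return channeled + noise

clean = rng.normal(size=16000)    # stand-in for one second of speech at 16 kHz
ir = np.array([1.0, 0.5, 0.25])   # toy microphone/room impulse response
noisy = distort(clean, ir, noise_power=0.01)
```

Because the channel multiplies the spectrum while the noise adds to it, the two distortions require different compensation strategies, which is why the approaches surveyed below treat them jointly.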
Various approaches have been taken to address this problem. The first involves pursuing features that are inherently robust to distortion. Techniques taking this approach include relative spectral technique-perceptual linear prediction, or RASTA-PLP, analysis (see, e.g., Hermansky, et al., "Rasta-PLP Speech Analysis Technique," in ICASSP, 1992, pp. 121-124), cepstral normalization such as cepstrum mean normalization, or CMN, analysis (see, e.g., Rahim, et al., "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition," IEEE Trans. on Speech and Audio Processing, vol. 4, no. 1, pp. 19-30, January 1996), and histogram normalization (see, e.g., Hilger, et al., "Quantile Based Histogram Equalization for Noise Robust Speech Recognition," in EUROSPEECH, 2001, pp. 1135-1138). The second approach is called "feature compensation," and seeks to reduce the distortion of features caused by environmental interference.
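Of the techniques above, CMN is the simplest to illustrate: a stationary convolutive channel appears as a constant additive offset in the cepstral (log-spectral) domain, so subtracting each utterance's mean cepstrum removes much of the channel effect. A minimal sketch, with illustrative names and a toy check that a constant channel bias is cancelled:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Cepstrum mean normalization (CMN): subtract the per-utterance mean
    from each cepstral coefficient. `cepstra` is a (frames, coefficients)
    array; the layout is an assumption for this sketch."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Toy check: a constant channel bias added to every frame is removed.
frames = np.random.default_rng(1).normal(size=(100, 13))
bias = np.full(13, 2.5)  # stand-in for a stationary channel offset
normalized_clean = cepstral_mean_normalization(frames)
normalized_biased = cepstral_mean_normalization(frames + bias)
# The two results agree because CMN cancels the constant bias.
```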
Spectral subtraction (see, e.g., Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. on ASSP, vol. 27, pp. 113-120, 1979) is widely used to mitigate additive noise. More recently, the European Telecommunications Standards Institute (ETSI) proposed an advanced front-end (see, e.g., Macho, et al., "Evaluation of a Noise-Robust DSR Front-End on Aurora Databases," in ICSLP, 2002, pp. 17-20) that combines Wiener filtering with CMN.
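The core of spectral subtraction is to subtract an estimate of the noise power spectrum from each frame's power spectrum, flooring the result so that no bin goes negative. A minimal sketch in the spirit of Boll (1979); the floor value, frame counts, and the assumption that the leading frames are speech-free are all illustrative:

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power_est, floor=0.01):
    """Subtract an estimated noise power spectrum from each frame's
    power spectrum. Parameter names and the floor value are
    illustrative, not from the source."""
    cleaned = noisy_power - noise_power_est
    # The "spectral floor" prevents negative power estimates, which
    # would have no physical meaning.
    return np.maximum(cleaned, floor * noisy_power)

# Toy usage: noise power estimated from leading, assumed speech-free frames.
rng = np.random.default_rng(2)
spectrogram = rng.uniform(0.5, 2.0, size=(50, 257))  # (frames, bins) power spectra
noise_est = spectrogram[:5].mean(axis=0)             # assume first 5 frames are noise only
enhanced = spectral_subtraction(spectrogram, noise_est)
```

The flooring step is the main design choice: too low a floor leaves "musical noise" artifacts, too high a floor leaves residual background noise.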
Using stereo data for training and testing, compensation vectors may be estimated via codeword-dependent cepstral normalization, or CDCN, analysis (see, e.g., Acero, et al., "Environmental Robustness in Automatic Speech Recognition," in ICASSP, 1990, pp. 849-852) and SPLICE (see, e.g., Deng, et al., "High-Performance Robust Speech Recognition Using Stereo Training Data," in ICASSP, 2001, pp. 301-304). Unfortunately, stereo data is rarely available in mobile applications.
Another approach involves vector Taylor series, or VTS, analysis (see, e.g., Moreno, et al., “A Vector Taylor Series Approach for Environment-Independent Speech Recognition,” in ICASSP, 1996, vol. 2, pp. 733-736), which uses a model of environmental effects to recover unobserved clean speech features.
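The environment model underlying VTS-style compensation can be stated compactly: in the log-spectral domain, noisy speech y relates to clean speech x and noise n (all log power spectra, channel omitted for simplicity) by y = x + log(1 + exp(n - x)), i.e., log(exp(x) + exp(n)). VTS linearizes this nonlinearity with a first-order Taylor expansion around current estimates of x and n. A minimal sketch of the mismatch function and its derivative, with illustrative variable names:

```python
import numpy as np

def mismatch(x_log, n_log):
    """Log-domain mismatch function: y = x + log(1 + exp(n - x)),
    equivalently log(exp(x) + exp(n)). log1p is used for numerical
    stability when n << x."""
    return x_log + np.log1p(np.exp(n_log - x_log))

def mismatch_jacobian(x_log, n_log):
    """dy/dx for the model above, used as the VTS linearization
    coefficient: 1 / (1 + exp(n - x)). Near 1 when speech dominates,
    near 0 when noise dominates."""
    return 1.0 / (1.0 + np.exp(n_log - x_log))
```

When the noise is far below the speech (n << x), the mismatch reduces to y ≈ x and the Jacobian approaches 1, matching the intuition that a high-SNR bin needs little compensation.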
The third approach is called "model compensation." Probably the best-known model compensation techniques are multi-condition training and single-pass retraining. Unfortunately, these techniques require a large database covering a variety of environments, which renders them unsuitable for mobile or other applications where computing resources are limited.
Other model compensation techniques make use of maximum likelihood linear regression (MLLR) (see, e.g., Woodland, et al., “Improving Environmental Robustness in Large Vocabulary Speech Recognition,” in ICASSP, 1996, pp. 65-68, and Sankar, et al., “A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 3, pp. 190-201, 1996) or maximum a posteriori probability estimation (see, e.g., Chou, et al. “Maximum A Posterior Linear Regression based Variance Adaptation on Continuous Density HMMs” technical report ALR-2002-045, Avaya Labs Research, 2002) to estimate transformation matrices from a smaller set of adaptation data. However, such estimation still requires a relatively large amount of adaptation data, which may not be available in mobile applications.
Using an explicit model of environmental effects, the method of parallel model combination, or PMC (see, e.g., Gales, et al., "Robust Continuous Speech Recognition Using Parallel Model Combination," IEEE Trans. on Speech and Audio Processing, vol. 4, no. 5, pp. 352-359, 1996), and its extensions, such as sequential compensation (see, e.g., Yao, et al., "Noise Adaptive Speech Recognition Based on Sequential Noise Parameter Estimation," Speech Communication, vol. 42, no. 1, pp. 5-23, 2004), can adapt model parameters with fewer frames of noisy speech. However, for mobile applications with limited computing resources, direct use of model compensation methods such as those of Gales, et al., and Yao, et al., both supra, almost always proves impractical.
What is needed in the art is a superior system and method for model compensation that functions well in a variety of background noise and microphone environments, particularly noisy environments, and is suitable for applications where computing resources are limited, e.g., digital signal processors (DSPs), especially those in mobile applications.