Over the last few decades, the focus in automatic speech recognition (ASR) has gradually shifted from laboratory experiments performed on carefully enunciated speech received by high-fidelity equipment in quiet environments to real applications having to cope with normal speech received by low-cost equipment in noisy environments. In noisy environments, an ASR system may often be required to work with mismatched conditions between pre-trained speaker-independent acoustic models and a speaker-dependent voice signal.
Mismatches are often caused by environmental distortions. These environmental distortions may be additive in nature, arising from background noise such as a computer fan, a car engine, wind noise, or road noise (see, e.g., Gong, “A Method of Joint Compensation of Additive and Convolutive Distortions for Speaker-Independent Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 13, no. 5, pp. 975-983, 2005), or convolutive in nature, arising from changes in microphone type (e.g., a hand-held microphone or a hands-free microphone) or position relative to the speaker's mouth. Speaker-dependent characteristics, such as variations in vocal tract geometry, also introduce mismatches. These mismatches tend to degrade the performance of an ASR system dramatically. In mobile ASR applications, these distortions occur routinely. Therefore, a practical ASR system needs to be able to operate successfully despite these distortions.
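The two kinds of distortion described above are commonly written in the time domain as y[t] = (h * x)[t] + n[t], where x is the clean speech, h is a channel (e.g., microphone) impulse response, and n is additive background noise. The following is a minimal sketch of that standard model, not a method taken from the cited references; the function names, the two-tap channel, and the noise level are hypothetical values chosen only for illustration.

```python
import math
import random


def convolve(signal, impulse_response):
    """Causal FIR convolution, truncated to the length of `signal`."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, h_k in enumerate(impulse_response):
            if t - k >= 0:
                acc += h_k * signal[t - k]
        out.append(acc)
    return out


def distort(clean, impulse_response, noise):
    """Apply y[t] = (h * x)[t] + n[t]: convolutive, then additive distortion."""
    filtered = convolve(clean, impulse_response)
    return [f + n for f, n in zip(filtered, noise)]


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from average sample power."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)


# Hypothetical example data: a sine wave stands in for clean speech.
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 5.0 * t / 200.0) for t in range(200)]
channel = [1.0, 0.5]  # assumed two-tap microphone response
noise = [rng.gauss(0.0, 0.1) for _ in clean]
noisy = distort(clean, channel, noise)
```

Calling snr_db(convolve(clean, channel), noise) gives the SNR of the simulated observation; rescaling the noise from one simulated utterance to the next reproduces the kind of condition change that acoustic models trained on clean speech do not anticipate.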
Hidden Markov models (HMMs) are widely used in current ASR systems. The above distortions may affect HMMs by, for example, shifting the mean vectors or adding biases to the pre-trained mean vectors. Many techniques have been developed in an attempt to compensate for these distortions. Generally, the techniques may be classified into two approaches: front-end techniques that recover clean speech from a noisy observation (see, e.g., Macho, et al., “Evaluation of a Noise-Robust DSR Front-End on Aurora Databases,” in ICSLP, 2002, vol. 1, pp. 17-20; Deng, et al., “Recursive Estimation of Nonstationary Noise Using Iterative Stochastic Approximation for Robust Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 568-580, 2003; Moreno, et al., “A Vector Taylor Series Approach for Environment-Independent Speech Recognition,” in ICASSP, 1996, vol. 2, pp. 733-736; Hermansky, et al., “Rasta-PLP Speech Analysis Technique,” in ICASSP, 1992, pp. 121-124; Rahim, et al., “Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 1, pp. 19-30, January 1996; and Hilger, et al., “Quantile Based Histogram Equalization for Noise Robust Speech Recognition,” in EUROSPEECH, 2001, pp. 1135-1138) and back-end techniques that adjust model parameters to better match the distribution of a noisy speech signal (see, e.g., Gales, et al., “Robust Speech Recognition in Additive and Convolutional Noise Using Parallel Model Combination,” Computer Speech and Language, vol. 9, pp. 289-307, 1995; Sankar, et al., “A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 3, pp. 190-201, 1996; Zhao, “Maximum Likelihood Joint Estimation of Channel and Noise for Robust Speech Recognition,” in ICASSP, 2000, vol. 2, pp. 1109-1113; Woodland, et al., “Improving Environmental Robustness in Large Vocabulary Speech Recognition,” in ICASSP, 1996, pp. 65-68; and Chou, “Maximum a Posterior Linear Regression based Variance Adaptation of Continuous Density HMMs,” Technical Report ALR-2002-045, Avaya Labs Research, 2002).
Many of these approaches are not suitable for current mobile devices due to their memory usage and/or power consumption requirements. Further, while some approaches may achieve good performance when the signal-to-noise ratio (SNR) is homogeneous from utterance to utterance, their performance degrades when there is a dramatic environmental change across a sequence of utterances, e.g., a significant SNR variation from the previous utterance to the current utterance. Accordingly, improvements in automatic speech recognition that make ASR systems in mobile devices more robust to channel and noise distortion are desirable.