A speech recognizer that is trained on clean speech data and then operated in a different environment performs worse because of at least two distortion sources: background noise, and microphone or channel changes. Handling the two simultaneously is critical to the performance of the recognizer.
Many front-end solutions have been developed and have been shown to give promising results for connected digit recognition applications in very noisy environments. See references 2 and 3. For instance, methods such as the ETSI advanced DSR front-end handle both channel distortion and background noise. See D. Macho, L. Mauuary, B. Noe, Y. M. Cheng, D. Ealey, D. Jouvet, H. Kelleher, D. Pearce, and F. Saadoun, “Evaluation of a Noise-robust DSR front-end on AURORA databases,” Proc. Int. Conf. on Spoken Language Processing, Colorado, USA, September 2002, pp 17–20. These techniques do not require any noise training data. To be effective in noise reduction, however, they typically require an accurate instantaneous estimate of the noise spectrum.
Alternative solutions consist, instead, of modifying the back-end of the recognizer to compensate for the mismatch between the training and recognition environments. More specifically, in the acoustic model space, a convolutive (e.g. channel) component and an additive (e.g. background noise) component can be introduced to model the two distortion sources. See the following references: M. Afify, Y. Gong, and J. P. Haton, “A general joint additive and convolutive bias compensation approach applied to noisy Lombard speech recognition,” IEEE Trans. on Speech and Audio Processing, vol. 6, no. 6, pp 524–538, November 1998; J. L. Gauvain, L. Lamel, M. Adda-Decker, and D. Matrouf, “Developments in continuous speech dictation using the ARPA NAB news task,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Detroit, 1996, pp 73–76; Y. Minami and S. Furui, “A maximum likelihood procedure for a universal adaptation method based on HMM composition,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Detroit, 1995, pp 129–132; M. J. F. Gales, Model-Based Techniques for Noise Robust Speech Recognition, Ph.D. thesis, Cambridge University, U.K., 1995; Y. Gong, “A Robust continuous speech recognition system for mobile information devices (invited paper),” in Proc. of International Workshop on Hands-Free Speech Communication, Kyoto, Japan, April 2001; and Y. Gong, “Model-space compensation of microphone and noise for speaker-independent speech recognition,” in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Hong Kong, April 2003. In the log spectral domain, the two distortions introduce non-linear parameter changes, which can be approximated by linear equations. See S. Sagayama, Y. Yamaguchi, and S. Takahashi, “Jacobian adaptation of noisy speech models,” in Proc. of IEEE Automatic Speech Recognition Workshop, Santa Barbara, Calif., USA, December 1997, pp 396–403, IEEE Signal Processing Society; and N. S. Kim, “Statistical linear approximation for environmental compensation,” IEEE Signal Processing Letters, vol. 5, no. 1, pp 8–11, January 1998.
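As an illustration only, the non-linear log-spectral relationship and its linear approximation can be sketched for a single frequency bin. The sketch assumes the commonly used mismatch model y = log(exp(x + h) + exp(n)), where x is the clean log-spectrum, h the channel (convolutive) term, and n the background noise (additive) term. This model, the function names `distort` and `linearize`, and the scalar per-bin treatment are assumptions made for the example and are not taken from the cited references.

```python
import math


def distort(x, h, n):
    """Assumed mismatch model: y = log(exp(x + h) + exp(n)), one log-spectral bin.

    x: clean speech log-spectrum, h: channel term, n: noise log-spectrum.
    Computed in a numerically stable way by factoring out the larger exponent.
    """
    a, b = x + h, n
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def linearize(x0, h, n):
    """First-order expansion of distort() around the clean value x0.

    Returns (y0, slope) so that y is approximately y0 + slope * (x - x0).
    The slope dy/dx = exp(x0 + h) / (exp(x0 + h) + exp(n)) lies in (0, 1).
    """
    y0 = distort(x0, h, n)
    slope = math.exp(x0 + h - y0)
    return y0, slope


# Hypothetical values: speech-plus-channel energy equal to noise energy.
y0, slope = linearize(2.0, 0.5, 2.5)
approx = y0 + slope * (2.1 - 2.0)   # linear prediction at x = 2.1
exact = distort(2.1, 0.5, 2.5)      # non-linear model at x = 2.1
```

Under this assumed model, the distorted value approaches x + h at high SNR and approaches n at low SNR, and the linear approximation holds only near the expansion point.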
It is desirable to provide a more robust utilization of a framework recently developed by Texas Instruments Incorporated known as JAC (Joint compensation of Additive and Convolutive distortions). This framework is described in patent application Ser. No. 10/251,734, filed Sep. 20, 2002, of Yifan Gong, entitled “Method of Speech Recognition Resistant to Convolutive Distortion and Additive Distortion.” This application is incorporated herein by reference. JAC simultaneously handles both background noise and channel distortions for speaker-independent speech recognition. However, joint additive acoustic noise and convolutive channel noise compensating algorithms for improved speech recognition in noise are not able to operate at very low signal-to-noise ratios (SNR). The reason is that when the compensation mechanism of the recognizer is suddenly exposed to a new type of channel noise or to a very low SNR signal, inaccurate channel estimates or insufficient background noise compensation degrade the quality of the subsequent channel estimate, which in turn degrades the recognition accuracy and the channel estimate for the next sentence presented to the recognizer.