This invention relates to speech recognition and, more particularly, to feature extraction for noisy speech.
A speech recognizer operates with a suitable front-end, which typically provides a feature vector periodically, every 10-20 milliseconds. The feature vector typically comprises mel-frequency cepstral coefficients (MFCC). See, for example, S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(4): 357-366, August 1980.
When the recognizer operates in an acoustic environment with a noisy background, feature vectors such as MFCC become noisy and dramatically degrade speech recognition. See, for example, applicant's article (Y. Gong) entitled "Speech Recognition in Noisy Environments: A Survey," Speech Communication, 16(3): 261-291, April 1995.
It is highly desirable to provide a method or apparatus to reduce the error rate based on a noisy feature vector. Earlier work in this direction includes vector smoothing. See H. Hattori and S. Sagayama, "Vector Field Smoothing Principle for Speaker Adaptation," International Conference on Spoken Language Processing, Vol. 1, pp. 381-384, Banff, Alberta, Canada, October 1992. Another earlier work in the same direction is statistical mapping, as described by Y. M. Cheng, D. O'Shaughnessy and P. Mermelstein in "Statistical Signal Mapping: A General Tool for Speech Processing," in Proc. of 6th IEEE Workshop on Statistical Signal and Array Processing, pp. 436-439, 1992. Another work in the same direction is base transformation, described by W. C. Treurniet and Y. Gong in "Noise Independent Speech Recognition for a Variety of Noise Types," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 437-440, Adelaide, Australia, April 1994. Another work in the same direction is code-dependent cepstral normalization, described by A. Acero in "Acoustical and Environmental Robustness in Automatic Speech Recognition," Kluwer Academic Publishers, 1993.
All of these prior art works require a noisy speech database to train the relationships, and they assume that the parameters of the noise encountered in the future will be a replica of those of the noise in the training data.
In practical situations such as speech recognition in automobiles, the noise statistics change from utterance to utterance, and the chance that a noisy speech database adequately represents the future environment is low. Therefore, the applicability of any method based on training on noisy speech data is limited.
In accordance with one embodiment of the present invention, a method to obtain an estimate of clean speech is provided wherein a first Gaussian mixture is trained on clean speech and a second Gaussian mixture is derived from the first Gaussian mixture using some noise samples.
The present method maps a noisy observation back to its clean counterpart by exploiting relationships between noise feature space and clean feature space.
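By way of illustration only, the two-mixture idea summarized above may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the specification: it assumes that features are log-spectral vectors, that the second mixture is obtained by a log-add combination of each clean mean with the mean of the noise samples, that variances and weights are reused unchanged, and that the clean estimate is the posterior-weighted average of the clean means. The function names derive_noisy_means, posteriors and estimate_clean are hypothetical.

```python
import numpy as np

def derive_noisy_means(clean_means, noise_samples):
    """Derive means of the second (noisy) mixture from the clean means.

    Assumption: log-spectral features, so speech and noise add in the
    linear domain and the combination is log(exp(mu) + exp(noise_mean)).
    """
    noise_mean = noise_samples.mean(axis=0)        # shape (D,)
    return np.logaddexp(clean_means, noise_mean)   # shape (K, D)

def posteriors(y, weights, means, variances):
    """Posterior probability of each diagonal-covariance component given y."""
    diff = y - means                               # shape (K, D)
    log_lik = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2.0 * np.pi * variances), axis=1)
    log_post = np.log(weights) + log_lik
    log_post -= np.logaddexp.reduce(log_post)      # normalize in log domain
    return np.exp(log_post)

def estimate_clean(y, weights, clean_means, noisy_means, variances):
    """Map a noisy observation to a clean estimate.

    Component posteriors are computed under the noisy mixture and then
    used to average the corresponding clean means (an assumed estimator).
    """
    post = posteriors(y, weights, noisy_means, variances)
    return post @ clean_means

# Toy two-component example with made-up numbers.
weights = np.array([0.5, 0.5])
clean_means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
noise_samples = np.array([[1.0, 1.0], [1.0, 1.0]])

noisy_means = derive_noisy_means(clean_means, noise_samples)
y = noisy_means[0]                                 # observation at a noisy mean
x_hat = estimate_clean(y, weights, clean_means, noisy_means, variances)
```

In this sketch no noisy speech database is needed: only the clean mixture and a few noise samples are used, so the second mixture can be re-derived whenever the noise statistics change.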