This invention relates to modeling speech for speech recognition and, more particularly, to providing speech recognition in a noisy environment.
Automatic speech recognizers exhibit rapid degradation in performance when there is a mismatch between training and testing conditions. This mismatch can be caused by speaker variability, additive acoustic environmental noise, and convolutive distortions due to the use of different telephone channels. All of these sources of variability are also present in an automobile environment, where they degrade the performance of speech recognizers.
Several techniques have been proposed to improve the robustness of speech recognizers under mismatch conditions (Y. Gong, "Speech Recognition in Noisy Environments: A Survey," Speech Communication, 16(3): 261-291, April 1995). These techniques fall under the following two main categories:
feature pre-processing techniques, such as spectral subtraction and cepstral mean normalization (CMN), which aim to modify the corrupted features so that the resulting features are closer to those of clean speech.
model adaptation techniques, such as maximum likelihood linear regression (MLLR) (C. J. Leggetter and P. C. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density HMMs," Computer Speech and Language, 9(2): 171-185, 1995), maximum a posteriori (MAP) estimation (J. L. Gauvain and C. H. Lee, "Maximum A Posteriori Estimation for Multivariate Gaussian Observations of Markov Chains," IEEE Trans. on Speech and Audio Processing, 2(2): 291-298, April 1994), and parallel model combination (PMC) (M. J. F. Gales, "'NICE' Model-Based Compensation Schemes for Robust Speech Recognition," Proc. ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, pages 55-64, April 1997), in which the parameters of the corrupted-speech model are estimated to account for the mismatch.
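As an illustration of the feature pre-processing category, the following is a minimal sketch of cepstral mean normalization, not an implementation taken from this disclosure. It assumes that a stationary convolutive channel appears approximately as a constant additive offset on every cepstral frame, so subtracting the per-utterance mean removes it. The function name `cepstral_mean_normalize`, the two-coefficient frames, and the offset values are hypothetical and chosen only for the example.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across all frames of the utterance.
    means = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


# Hypothetical two-coefficient cepstral frames standing in for clean speech.
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]

# Assumption: a fixed channel adds the same offset to every cepstral frame.
channel_offset = [0.5, -1.5]
distorted = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_distorted = cepstral_mean_normalize(distorted)
```

Under the stated assumption, `normalized_clean` and `normalized_distorted` agree up to floating-point rounding, which is the sense in which the corrupted features are brought closer to those of clean speech. Additive noise is not a constant cepstral offset and is not removed by this step.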
In accordance with one embodiment of the present invention, a two-stage model adaptation scheme is provided, wherein the first stage adapts a speaker-independent hidden Markov model (HMM) seed model set to a speaker- and microphone-dependent model set, and the second stage adapts the speaker- and microphone-dependent model set to a speaker- and noise-dependent model set.
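The chaining of the two stages can be sketched as follows. This is only an illustrative sketch: the passage above does not state which adaptation method either stage uses, so a simple MAP-style update of Gaussian means is assumed here purely for concreteness. The names `map_adapt_mean` and `adapt_model_set`, the prior weight `tau`, the reduction of each HMM state to a single mean vector, the hard assignment of frames to states, and all data values are hypothetical.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP-style mean update: (tau * prior + sum of frames) / (tau + n)."""
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def adapt_model_set(model_set, adaptation_data, tau=10.0):
    """Adapt each state mean that has adaptation frames; copy the others unchanged."""
    adapted = {}
    for state, mean in model_set.items():
        frames = adaptation_data.get(state)
        adapted[state] = map_adapt_mean(mean, frames, tau) if frames else list(mean)
    return adapted


# Hypothetical speaker-independent seed model set: one mean vector per state.
seed_models = {"s1": [0.0, 0.0], "s2": [1.0, 1.0]}

# Hypothetical stage-1 data: the target speaker recorded through the target microphone.
speaker_mic_data = {"s1": [[1.0, 1.0]] * 10}

# Hypothetical stage-2 data: the same speaker recorded in the noisy environment.
speaker_noise_data = {"s1": [[2.0, 0.0]] * 10, "s2": [[3.0, 3.0]] * 10}

# Stage 1: seed set -> speaker- and microphone-dependent set.
speaker_mic_models = adapt_model_set(seed_models, speaker_mic_data)

# Stage 2: speaker- and microphone-dependent set -> speaker- and noise-dependent set.
speaker_noise_models = adapt_model_set(speaker_mic_models, speaker_noise_data)
```

The point of the sketch is the data flow: the output of the first stage serves as the prior for the second stage, so the second stage starts from models that already reflect the speaker and microphone rather than from the speaker-independent seed.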