This invention relates to speaker-independent speech recognition and more particularly to speaker-independent speech recognition in a noisy environment.
Speech recognition under matched conditions has achieved low recognition error rates. Matched conditions are those in which training and testing are performed in the same acoustic conditions. A word error rate (WER) of 1% has been reported for connected digits over a telephone network. Results such as this are achieved using a large amount of training data collected under conditions as close as possible to the testing conditions.
It is highly desirable to provide speech recognition in a noisy environment. One such environment is hands-free speech recognition in a car. The microphone is often placed somewhere remote from the user, such as in the corner of the windshield. The road noise, the wind noise, and the speaker's remoteness from the microphone cause severe mismatch conditions for recognition. For such recognition tasks, a collection of large databases is required to train speaker-independent Hidden Markov Models (HMMs). This is very expensive. If HMMs are used in cross-condition recognition, such as using a close-talking microphone in a quiet office for training and then testing on hands-free recognition in a car, the mismatch will degrade recognition performance substantially.
In terms of power spectral density, the mismatch can be characterized by a linear filter and an additive noise: $|Y(\omega)| = |H(\omega)|^2 \cdot |X(\omega)| + |N(\omega)|$, where $Y(\omega)$ represents the speech to be recognized, $H(\omega)$ the linear filter, $X(\omega)$ the training speech, and $N(\omega)$ the noise. In the log spectral domain, this equation can be written as:
$$\log|Y(\omega)| = \log|X(\omega)| + \psi(N(\omega), X(\omega), H(\omega)) \tag{1}$$
with
$$\psi(N(\omega), X(\omega), H(\omega)) \triangleq \log|H(\omega)|^2 + \log\left(1 + \frac{|N(\omega)|}{|X(\omega)| \cdot |H(\omega)|^2}\right) \tag{2}$$
$\psi$ can be used to characterize the mismatch, which depends on the linear filter, the noise source, and the signal itself.
To overcome the mismatch, several types of solutions have been reported. For example, Cepstral Mean Normalization (CMN) is known for its ability to remove the first term of $\psi$ (i.e., the stationary bias) in cepstra. See, for example, S. Furui, “Cepstral Analysis Technique for Automatic Speaker Verification,” IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-29(2):254-272, April 1981. It has been shown that, using CMN, telephone-quality speech models can be trained with high-quality speech. See L. G. Neumeyer, V. V. Digalakis, and M. Weintraub, “Training Issues and Channel Equalization Techniques for The Construction of Telephone Acoustic Models Using A High-Quality Speech Corpus,” IEEE Trans. on Speech and Audio Processing, 2(4):590-597, October 1994. However, CMN is not effective for the second term, which is caused by additive noise and cannot be assumed constant within the utterance. Two-level CMN alleviates this problem by introducing a speech mean vector and a background mean vector. See, for example, S. K. Gupta, F. Soong, and R. Haimi-Cohen, “High-Accuracy Connected Digit Recognition for Mobile Applications,” in Proc. of IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, pages 57-60, Atlanta, May 1996. Other, more detailed models of the mismatch include joint additive and convolutive bias compensation (see M. Afify, Y. Gong, and J.-P. Haton, “A Unified Maximum Likelihood Approach to Acoustic Mismatch Compensation: Application to Noisy Lombard Speech Recognition,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Germany, 1997) and channel and noise estimation (see D. Matrouf and J. L. Gauvain, “Model Compensation for Noises in Training And Test Data,” in Proc. of IEEE Internat. Conf. on Acoustics, Speech and Signal Processing, Germany, 1997).
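To make the stationary-bias argument concrete, the following is a minimal Python sketch of plain, single-level CMN, assuming cepstral frames are stored as lists of floats. It illustrates only the basic mean-subtraction idea, not two-level CMN or the other cited methods; the helper names and the frame values are hypothetical.

```python
import math


def cepstral_mean_normalize(frames):
    """Subtract the utterance-level mean from every cepstral frame."""
    if not frames:
        return []
    dim = len(frames[0])
    mean = [sum(frame[k] for frame in frames) / len(frames) for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in frames]


def add_stationary_bias(frames, bias):
    """Add the same bias vector to every frame (models a fixed channel)."""
    return [[c + b for c, b in zip(frame, bias)] for frame in frames]


def frames_close(a, b, tol=1e-9):
    """Element-wise comparison of two equally shaped lists of frames."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=tol)
        for fa, fb in zip(a, b)
        for x, y in zip(fa, fb)
    )


# Hypothetical 2-dimensional cepstral frames for a 3-frame utterance.
clean = [[1.0, -0.5], [2.0, 0.5], [3.0, 1.5]]
channel_bias = [0.7, -1.2]  # illustrative stationary bias

biased = add_stationary_bias(clean, channel_bias)
# After CMN, the biased utterance matches the normalized clean utterance.
print(frames_close(cepstral_mean_normalize(clean),
                   cepstral_mean_normalize(biased)))
```

Because the mean is computed once per utterance, an offset that varies from frame to frame, as an additive-noise term does, is not cancelled by this subtraction.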
In accordance with one embodiment of the present invention, an improved transformation method comprises providing an initial set of HMMs trained on a large amount of speech recorded in one condition, which provides rich information on co-articulation and speaker variation, and a much smaller speech database collected in the target environment, which provides information on the test condition, including channel, microphone, background noise, and reverberation.