The present invention relates generally to speech recognition. More particularly, the invention relates to speaker adaptation in noisy environments.
Speech recognition systems may be classified into two groups: speaker independent and speaker dependent. Typically, the speaker independent system is constructed based on a corpus of training data from a plurality of speakers and the speaker dependent system is constructed using a process called speaker adaptation, whereby the speech models of a speaker independent system are adapted to work better for a particular new speaker. Speaker adaptation often involves the problem of how to estimate reliable models from small amounts of adaptation data from the new speaker. When adapting a speaker independent system to a speaker dependent one, the enrolling user provides an initial quantity of enrollment speech (adaptation speech) from which the adapted models are constructed. Because providing enrollment speech takes time, users prefer systems that will adapt with minimal training or that are capable of adapting on the fly as the system is being used.
There are numerous different speaker adaptation techniques in popular use today. They include maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) estimation. Generally, adaptation techniques such as these are successful when applied under low noise conditions. However, the techniques begin to fail as the background noise level increases.
We believe that one reason adaptation systems fail is that the speaker adaptation processes ignore information about the environment model. Thus, when enrollment speech is provided in the presence of background noise, the adaptation system will attempt to compensate for both the enrolling speaker's speech and the background noise. Because the background noise may vary unpredictably, the resulting adapted models are likely to work very poorly in practice.
The present invention solves this problem by applying a special linear approximation of the background noise after feature extraction and prior to speaker adaptation, so that the speaker adaptation system can adapt the speech models to the enrolling user without distortion from the background noise. Notably, the technique works in the extracted feature domain. That is, the linear approximation of the background noise is applied in the feature domain (e.g., in the cepstral domain, or other statistical domain) rather than in the time domain associated with the input enrollment utterance. The presently preferred embodiment uses a Jacobian matrix to implement the linear approximation of the background noise. Other linear approximations may be used in the alternative.
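By way of illustration only, the following minimal sketch shows one common way a Jacobian matrix can serve as a linear approximation of background noise in the cepstral domain. It assumes the widely used formulation in which the cepstrum is a discrete cosine transform (DCT) of the log spectrum, and in which speech and noise are additive in the linear spectral domain. That formulation, the helper names (`dct_matrix`, `jacobian`, `compensate`), and the toy spectra are assumptions introduced for this example; they are not taken from the present description and do not define how the preferred embodiment computes or applies its matrix.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix F (n x n). Its inverse equals its transpose."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                     for i in range(n)])
    return rows

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def cepstrum(spectrum, F):
    """Cepstral features: DCT of the log spectrum (assumed feature definition)."""
    return matvec(F, [math.log(x) for x in spectrum])

def jacobian(speech_spec, noise_spec, F):
    """Jacobian of the noisy-speech cepstrum with respect to the noise cepstrum.

    Under the additive-noise assumption, J = F * diag(N / (S + N)) * F^-1,
    and F^-1 = F^T because F is orthonormal.
    """
    n = len(F)
    w = [nz / (s + nz) for s, nz in zip(speech_spec, noise_spec)]
    return [[sum(F[i][k] * w[k] * F[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def compensate(noisy_cep, J, ref_noise_cep, new_noise_cep):
    """First-order update: C(S+N') ~= C(S+N) + J * (C(N') - C(N))."""
    delta = [b - a for a, b in zip(ref_noise_cep, new_noise_cep)]
    shift = matvec(J, delta)
    return [c + s for c, s in zip(noisy_cep, shift)]

# Toy example with hypothetical 4-bin spectra (illustrative values only).
F = dct_matrix(4)
speech = [4.0, 2.0, 1.0, 0.5]
noise_ref = [1.0, 1.0, 1.0, 1.0]    # noise present when J was computed
noise_new = [1.1, 0.9, 1.05, 1.0]   # noise present at a later time

J = jacobian(speech, noise_ref, F)
noisy_ref_cep = cepstrum([s + n for s, n in zip(speech, noise_ref)], F)
approx = compensate(noisy_ref_cep, J, cepstrum(noise_ref, F), cepstrum(noise_new, F))
exact = cepstrum([s + n for s, n in zip(speech, noise_new)], F)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
```

In this sketch the entire noise correction is a single matrix-vector product carried out on extracted features, with no return to the time-domain signal. Because the approximation is first order, `max_error` is expected to grow as the new noise departs further from the reference noise.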
For a more complete understanding of the invention, its objects and advantages, refer to the following written description and to the accompanying drawings.