In the present application, the term “environment” refers to the speaker, the handset or microphone, the transmission channel, the background noise conditions, or any combination of these. A speech signal can only be measured in a particular environment. Speech recognizers suffer from environment variability for two reasons: trained model distributions may be biased with respect to testing signal distributions because of environment mismatch, and trained model distributions are flat because they are averaged over different environments.
The first problem, environment mismatch, can be reduced through model adaptation based on some utterances collected in the testing environment. To solve the second problem, the environmental factors should be removed from the speech signal during the training procedure, mainly by source normalization.
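The flattening effect and the intent of source normalization can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the method of the present application: the feature is one-dimensional, each environment contributes only an additive offset, and the environment label of every sample is known. The environment names, offsets, and function names are hypothetical.

```python
import random
import statistics


def simulate(env_offsets, n_per_env=2000, within_std=1.0, seed=0):
    """Draw 1-D features for each environment (hypothetical toy data).

    Every environment shares the same within-environment spread but
    shifts the feature by its own additive offset.
    """
    rng = random.Random(seed)
    return {
        env: [rng.gauss(offset, within_std) for _ in range(n_per_env)]
        for env, offset in env_offsets.items()
    }


def pooled_variance(data):
    """Variance of all samples pooled together, ignoring environment labels."""
    pooled = [x for samples in data.values() for x in samples]
    return statistics.pvariance(pooled)


def normalize_per_environment(data):
    """Remove the environment factor by subtracting each environment's mean.

    This assumes the environment label of every sample is known.
    """
    normalized = {}
    for env, samples in data.items():
        env_mean = statistics.fmean(samples)
        normalized[env] = [x - env_mean for x in samples]
    return normalized


# Three hypothetical environments with different additive offsets.
data = simulate({"handset_a": -3.0, "handset_b": 0.0, "handset_c": 3.0})

var_before = pooled_variance(data)
var_after = pooled_variance(normalize_per_environment(data))

print(f"pooled variance before normalization: {var_before:.2f}")
print(f"pooled variance after normalization:  {var_after:.2f}")
```

In this sketch the variance of the pooled data is inflated by the spread of the environment offsets, which corresponds to a flat trained distribution. Subtracting the per-environment mean brings the pooled variance back close to the within-environment variance.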
In the direction of source normalization, speaker adaptive training uses linear regression (LR) solutions to decrease inter-speaker variability. See, for example, T. Anastasakos, et al., “A compact model for speaker-adaptive training,” International Conference on Spoken Language Processing, Vol. 2, October 1996. Another technique models mean-vectors as the sum of a speaker-independent bias and a speaker-dependent vector. This is found in A. Acero, et al., “Speaker and Gender Normalization for Continuous-Density Hidden Markov Models,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 342–345, Atlanta, 1996. Both of these techniques require explicit labels of the classes, for example, the speaker or gender of each utterance, during training. Therefore, they cannot be used to train clusters of classes that represent acoustically close speakers, handsets or microphones, or background noises. This inability to discover clusters may be a disadvantage in applications.
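The two prior techniques can be written schematically as follows. The notation is an assumption introduced here for illustration and is not taken from the cited papers: s is a labeled class (for example, a speaker), k indexes a Gaussian mean, A_s and b_s are a regression matrix and offset for class s, and the second line simply restates the sum described above without specifying how its terms are indexed or tied.

```latex
% Speaker adaptive training: class-specific linear regression applied to a shared mean
\mu_{k}^{(s)} = A_{s}\,\mu_{k} + b_{s}

% Additive model: speaker-independent bias plus speaker-dependent vector
\mu^{(s)} = b + v_{s}
```

In both forms the class-dependent parameters carry the index s, so each training utterance must come with its class label before these parameters can be estimated.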