Contemporary speech recognizers typically model speech using a large number of Gaussian distributions or the like. When audio corresponding to an utterance is input, the recognizer finds the distributions, learned from training data, that best match the audio, and uses those distributions to determine the words of the utterance.
As is well known, speech recognition systems tend to perform poorly in a noisy environment. One reason that speech recognizers fail in noisy environments is that the environmental conditions present in deployment differ from those seen in the training data.
Various compensation techniques have been attempted to reduce the mismatch between training and testing conditions and thus improve recognition accuracy. Generally there are two types of techniques, namely feature compensation techniques and model compensation techniques.
In feature compensation, the captured signal or the features extracted from the signal are processed prior to recognition to mitigate the effect of noise. These techniques are computationally efficient and do not require changes to the recognizer itself. However, they produce only point estimates of the enhanced speech features, and errors in these estimates can increase the mismatch with the recognizer's acoustic models, further degrading performance.
Model compensation techniques avoid this problem by directly adapting the distributions inside the recognizer to better match the current environmental conditions. Such techniques may operate in a data-driven fashion, although faster performance is typically achieved by methods that exploit the known relationship between clean speech, noise, and the resulting noisy speech.
However, model compensation is a challenging problem because the features that characterize these three quantities (clean speech, noise, and noisy speech) are related nonlinearly. One option is to digitally mix the noise with the clean speech to produce noisy speech, and to retrain the recognizer from scratch on that noisy speech. This improves accuracy but is computationally slow, and thus approximations that are much faster to compute have been attempted.
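The digital mixing step mentioned above can be illustrated with a minimal Python sketch. The function names `power` and `mix_at_snr`, the use of a target signal-to-noise ratio (SNR) in decibels, and the toy tone and white-noise signals are all assumptions made for illustration; the text does not specify how the noise is scaled before mixing.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaling the noise to reach a target SNR (dB).

    Hypothetical helper: the scaling rule is an assumption, not taken
    from the text.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    # Choose gain g so that power(clean) / power(g * noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


# Toy stand-ins for real recordings: a 440 Hz tone and white Gaussian noise,
# both one second long at an assumed 16 kHz sampling rate.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Retraining would then extract features from `noisy` rather than `clean`. Repeating this for every noise condition and rerunning training each time is what makes the approach slow.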
Several different approximation methods for handling this nonlinearity have been proposed. For example, Monte Carlo sampling has been used to generate samples from the constituent speech and noise distributions, which are then used to estimate the parameters of the resulting distribution of noisy speech. In Vector Taylor Series (VTS) adaptation, the nonlinear function that describes noisy speech features as a function of the clean speech and noise features is linearized around expansion points defined by the speech and noise models. Other model compensation techniques have also been used; however, regardless of which technique is used, there is still room for improvement.
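The two approximations described above can be compared on a toy problem. The following Python sketch makes several assumptions that are not stated in the text: a single scalar log-spectral feature, independent Gaussian clean speech and noise, and the commonly used additive-noise relation y = log(exp(x) + exp(n)) with channel and phase effects ignored. The function names are hypothetical, and the sketch is illustrative rather than a description of any particular recognizer.

```python
import math
import random


def mismatch(x, n):
    """Assumed log-spectral relation: y = log(exp(x) + exp(n))."""
    m = max(x, n)  # factor out the larger term for numerical stability
    return m + math.log(math.exp(x - m) + math.exp(n - m))


def monte_carlo(mu_x, var_x, mu_n, var_n, num_samples=20000, seed=0):
    """Estimate the noisy-speech mean and variance by sampling."""
    rng = random.Random(seed)
    sd_x, sd_n = math.sqrt(var_x), math.sqrt(var_n)
    ys = [mismatch(rng.gauss(mu_x, sd_x), rng.gauss(mu_n, sd_n))
          for _ in range(num_samples)]
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    return mean, var


def vts_first_order(mu_x, var_x, mu_n, var_n):
    """Linearize the mismatch function around the expansion point (mu_x, mu_n)."""
    g = 1.0 / (1.0 + math.exp(mu_n - mu_x))  # dy/dx at the expansion point
    mean = mismatch(mu_x, mu_n)
    # dy/dn = 1 - g; x and n are assumed independent, so variances add.
    var = g * g * var_x + (1.0 - g) ** 2 * var_n
    return mean, var


mc_mean, mc_var = monte_carlo(2.0, 0.01, 0.0, 0.01)
vts_mean, vts_var = vts_first_order(2.0, 0.01, 0.0, 0.01)
```

Under these assumptions the linearized estimate needs only a handful of arithmetic operations per Gaussian, whereas the sampling estimate needs many function evaluations. The first-order expansion drops higher-order terms, so the two estimates can be expected to diverge as the speech and noise variances grow.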