1. Field
The following description relates to speech recognition, and more specifically, to a technology for improving speech recognition performance in noisy environments.
2. Description of Related Art
Speech recognition techniques based on statistical patterns are in wide use. However, the performance of such techniques degrades due to multiple factors, a main factor being a mismatch in acoustic features between the speech signal used in acoustic model training and the actual speech signal that is input in a real environment. For example, during speech recognition, various background noises (e.g., car noise or music) of the real environment may be mixed with the input speech signal, whereby the input speech signal has different acoustic features from the speech signal used in model training. To reduce such discrepancies in acoustic features, speech enhancement, feature compensation, and model adaptation are used.
Speech recognition based on feature compensation, which is classified into data-driven compensation and model-based compensation, may be inferior in performance to speech recognition based on model adaptation; however, feature compensation requires only a small amount of computation and can be flexibly applied to new speech recognition environments.
Typical model-based speech feature compensation represents the distribution of speech features as a Gaussian mixture model (GMM). This method, however, cannot utilize the temporal dynamics of adjacent speech frames, which are among the most critical features that distinguish a speech signal from a noise signal. This may degrade speech recognition performance in an environment where there is background noise, such as babble noise or TV noise. The extended Kalman filter, used in noise feature estimation, exhibits superior performance in estimating non-stationary noise features that change gradually over time. However, said filter uses features of a current frame, and hence an assumption of no correlation may prove inaccurate, or observation model errors may occur. Accordingly, noise feature estimation may be inaccurate, and in particular, incorrect noise feature estimation in a speech interval may lead to poor speech recognition performance.
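By way of illustration only, the GMM limitation noted above may be sketched as follows. The sketch assumes a diagonal-covariance GMM with arbitrary, randomly chosen parameters and synthetic feature frames; the function name gmm_frame_loglik and all values are hypothetical and are not taken from any particular recognizer. Because the GMM scores each frame independently, the total log-likelihood of a sequence is unchanged when the frames are shuffled, which is one way to see that temporal dynamics between adjacent frames are not used.

```python
import numpy as np


def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    log_gauss = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
    )                                                         # (T, K)
    log_joint = log_gauss + np.log(weights)                   # (T, K)
    # Log-sum-exp over mixture components for numerical stability.
    peak = log_joint.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_joint - peak).sum(axis=1))


rng = np.random.default_rng(0)
T, D, K = 50, 3, 4  # frames, feature dimension, mixture components (assumed)

# Synthetic, temporally correlated "features": a random walk over time.
frames = np.cumsum(rng.normal(size=(T, D)), axis=0)

# Arbitrary GMM parameters chosen only for this illustration.
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))

ordered_total = gmm_frame_loglik(frames, weights, means, variances).sum()

# Destroy the temporal order; the frame-wise GMM score cannot tell.
shuffled = frames[rng.permutation(T)]
shuffled_total = gmm_frame_loglik(shuffled, weights, means, variances).sum()

print(np.isclose(ordered_total, shuffled_total))
```

A model that conditions each frame on its neighbors would, in general, assign different scores to the ordered and shuffled sequences.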