The present invention generally pertains to pattern recognition systems. More particularly, the present invention pertains to methods for reducing the adverse impact of noise on signals utilized within speech recognition systems.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal and identify an incorporated pattern. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
To decode the incoming test signal, most recognition systems utilize one or more models that describe the likelihood that a portion of the test signal represents a particular pattern. Examples of such models include Neural Nets, Dynamic Time Warping, segment models, and Hidden Markov Models.
Before a model is used to decode an incoming signal, it is trained. Training is typically accomplished by measuring input training signals generated from a known training pattern. For example, in speech recognition, it is common for speakers reading from a known text to generate a collection of speech signals. These speech signals are then used to train the models.
In order for the models to work optimally, the signals used to train the model should be similar to the eventual test signals that are decoded. In particular, the training signals should have the same amount and type of noise as the test signals that are decoded.
Typically, the training signal is collected under “clean” conditions and is considered to be relatively noise free. To achieve this same low level of noise in the test signal, many prior art systems apply noise reduction techniques to the testing data. Automatic speech recognition systems without explicit provisions for noise robustness have proven to degrade quickly in the presence of additive noise.
Thus, how best to add noise robustness to speech recognition systems is an area of active research. There are many examples of model-based feature enhancement systems. Many such systems include a model for speech, and often a model for noise as well, within an enhancement algorithm. Most techniques incorporate either Gaussian mixture models or hidden Markov models.
When the clean speech model is a Gaussian mixture model (GMM), each frame of data is enhanced independently. Without post-processing, this can result in artifacts, such as sharp single frame transitions, that were not part of the original clean speech signal.
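By way of illustration only, the frame-independent character of GMM-based enhancement may be sketched as follows. The sketch assumes one-dimensional features and additive, zero-mean Gaussian noise of known variance in the feature domain, which is a simplification not stated above. The function name gmm_mmse_enhance and all of its parameters are hypothetical and do not describe any particular prior art system.

```python
import numpy as np

def gmm_mmse_enhance(noisy, weights, means, variances, noise_var):
    """Per-frame MMSE estimate of clean features under a GMM prior.

    Assumed model (illustrative): y = x + n, with x drawn from a GMM and
    n ~ N(0, noise_var). Each frame is treated in isolation.
    """
    y = np.asarray(noisy, dtype=float)[:, None]           # shape (T, 1)
    weights = np.asarray(weights, dtype=float)            # shape (K,)
    means = np.asarray(means, dtype=float)                # shape (K,)
    variances = np.asarray(variances, dtype=float)        # shape (K,)

    # Likelihood of each noisy frame under each component: N(y; mu_k, var_k + noise_var)
    total_var = variances + noise_var
    log_lik = -0.5 * (np.log(2.0 * np.pi * total_var) + (y - means) ** 2 / total_var)

    # Posterior over components for each frame, normalized in the log domain
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)               # shape (T, K)

    # Conditional mean of the clean feature given the component and the frame
    gain = variances / total_var
    cond_mean = means + gain * (y - means)                # shape (T, K)

    # Posterior-weighted combination; no information flows between frames
    return (post * cond_mean).sum(axis=1)
```

Because the estimate for each frame depends only on that frame, reordering the input frames merely reorders the output. Nothing in the computation discourages an isolated frame from being assigned to a different mixture component than its neighbors, which is the source of the single frame transitions noted above.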
Choosing a hidden Markov model (HMM) for the clean speech model introduces some time dependencies in the enhancement process. Although, for any given state sequence, the enhancement process is the same as for a GMM, the state transition probabilities of the HMM tend to eliminate single frame errors in the output. State transitions, however, can still produce edge artifacts, so post-processing is still generally necessary.
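By way of illustration only, the effect of HMM state transition probabilities on single frame errors may be sketched with a two-state example. The frame likelihoods, the transition matrix, and the function name viterbi are hypothetical values chosen for the sketch. The sketch shows only state decoding, not a complete enhancement process.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """Most likely state sequence given per-frame state log-likelihoods."""
    T, S = log_lik.shape
    score = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in state i at t-1, then moving to state j
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_lik[t]
    # Trace the best path backward from the best final state
    path = np.zeros(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical likelihoods: the middle frame weakly favors state 1
lik = np.array([[0.8, 0.2],
                [0.8, 0.2],
                [0.4, 0.6],
                [0.8, 0.2],
                [0.8, 0.2]])
# Hypothetical "sticky" transitions: each state tends to persist
trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
init = np.array([0.5, 0.5])

framewise = lik.argmax(axis=1)                                   # frame-independent choice
smoothed = viterbi(np.log(lik), np.log(trans), np.log(init))     # HMM decoding
```

With these values, the frame-independent choice yields the state sequence 0, 0, 1, 0, 0, whereas the decoded HMM sequence is 0, 0, 0, 0, 0, because the cost of two state transitions outweighs the weak evidence in the middle frame. Where the evidence does support a genuine change of state, the decoded sequence still switches abruptly at a frame boundary, consistent with the edge artifacts noted above.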