The present invention relates to noise reduction. In particular, the present invention relates to removing noise from signals used in speech recognition.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal (often referred to as a test signal) is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
To decode the incoming test signal, most recognition systems utilize one or more models that describe the likelihood that a portion of the test signal represents a particular pattern. Examples of such models include Neural Nets, Dynamic Time Warping, segment models, and Hidden Markov Models.
Before a model can be used to decode an incoming signal, it must be trained. This is typically done by measuring input training signals generated from a known training pattern. For example, in speech recognition, a collection of speech signals is generated by speakers reading from a known text. These speech signals are then used to train the models.
In order for the models to work optimally, the signals used to train the models should be similar to the eventual test signals that are decoded. In particular, the training signals should have the same amount and type of noise as the test signals that are decoded.
Typically, the training signal is collected under “clean” conditions and is considered to be relatively noise free. To achieve this same low level of noise in the test signal, many prior art systems apply noise reduction techniques to the testing data.
One technique for removing noise attempts to model the noise using a set of training signals collected under various noisy conditions. However, such systems are only effective if the noise conditions of the training signals match the noise conditions of the test signals. Because of the large number of possible noises and the seemingly infinite combinations of noises, it is very difficult to build noise models that can handle every test condition.
Another technique for removing noise is to estimate the noise and then subtract it from the noisy speech signal. Typically, such systems estimate the noise from a previous frame of the input signal. As such, if the noise is changing over time, the estimate of the noise for the current frame will be inaccurate.
One system of the prior art for estimating the noise in a speech signal uses the harmonics of human speech. The harmonics of human speech produce peaks in the frequency spectrum. By identifying nulls between these peaks, these systems identify the spectrum of the noise. This spectrum is then subtracted from the spectrum of the noisy speech signal to provide a clean speech signal.
The harmonics of speech have also been used in speech coding to reduce the amount of data that must be sent when encoding speech for transmission across a digital communication path. Such systems attempt to separate the speech signal into a harmonic component and a random component. Each component is then encoded separately for transmission. One system in particular used a harmonic+noise model in which a sum-of-sinusoids model is fit to the speech signal to perform the decomposition.
In speech coding, the decomposition is done to find a parameterization of the speech signal that accurately represents the input noisy speech signal. The decomposition has no noise-reduction capability and as such has not been useful in speech recognition.