The present invention relates to noise reduction. In particular, the present invention relates to removing noise from speech signals.
A common problem in speech recognition and speech transmission is the corruption of the speech signal by additive noise. In particular, corruption due to the speech of another speaker has proven to be difficult to detect and/or correct.
One technique for removing noise attempts to model the noise using a set of noisy training signals collected under various conditions. These training signals are received before a test signal that is to be decoded or transmitted and are used for training purposes only. Although such systems attempt to build models that take noise into consideration, they are only effective if the noise conditions of the training signals match the noise conditions of the test signals. Because of the large number of possible noises and the seemingly infinite combinations of noises, it is very difficult to build noise models from training signals that can handle every test condition.
Another technique for removing noise is to estimate the noise in the test signal and then subtract it from the noisy speech signal. Typically, such systems estimate the noise from previous frames of the test signal. As a result, if the noise changes over time, the estimate of the noise for the current frame will be inaccurate.
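By way of illustration only, the estimate-and-subtract approach can be sketched as follows. The function names, the use of magnitude spectra, the simple averaging of earlier frames, and the numeric values are all assumptions made for this sketch and are not taken from any particular prior art system.

```python
def estimate_noise(previous_frames):
    """Average earlier magnitude spectra bin by bin to form a noise estimate."""
    n_frames = len(previous_frames)
    n_bins = len(previous_frames[0])
    return [sum(frame[k] for frame in previous_frames) / n_frames
            for k in range(n_bins)]


def spectral_subtract(noisy_frame, noise_estimate, floor=0.0):
    """Subtract the noise estimate from each bin, clamping the result at `floor`."""
    return [max(mag - noise, floor)
            for mag, noise in zip(noisy_frame, noise_estimate)]


# Hypothetical 4-bin magnitude spectra (illustrative values only).
previous_frames = [[1.0, 2.0, 1.0, 0.5], [1.0, 2.0, 1.0, 0.5]]
noise_estimate = estimate_noise(previous_frames)

speech = [0.0, 3.0, 0.0, 1.0]

# Case 1: the noise in the current frame matches the earlier frames.
stationary_frame = [s + n for s, n in zip(speech, [1.0, 2.0, 1.0, 0.5])]
stationary_result = spectral_subtract(stationary_frame, noise_estimate)

# Case 2: the noise in bin 0 has grown since the earlier frames.
changed_frame = [s + n for s, n in zip(speech, [3.0, 2.0, 1.0, 0.5])]
changed_result = spectral_subtract(changed_frame, noise_estimate)

print(stationary_result)  # [0.0, 3.0, 0.0, 1.0]
print(changed_result)     # [2.0, 3.0, 0.0, 1.0]
```

In the first case the subtraction recovers the speech spectrum. In the second case, residual noise of 2.0 remains in bin 0 because the estimate was formed from frames in which the noise was lower.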
One system of the prior art for estimating the noise in a speech signal uses the harmonics of human speech. The harmonics of human speech produce peaks in the frequency spectrum. By identifying nulls between these peaks, these systems identify the spectrum of the noise. This spectrum is then subtracted from the spectrum of the noisy speech signal to provide a clean speech signal.
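A minimal sketch of this null-based estimate is given below for illustration. The peak-picking rule (strict local maxima), the choice of the lowest bin between adjacent peaks as the null, the linear interpolation of the noise spectrum between nulls, and the example spectrum are all assumptions of this sketch; actual systems may differ in each respect.

```python
def find_peaks(spectrum):
    """Return indices of strict local maxima, taken here as harmonic peaks."""
    return [k for k in range(1, len(spectrum) - 1)
            if spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]]


def find_nulls(spectrum, peaks):
    """Return the index of the lowest bin between each pair of adjacent peaks."""
    return [min(range(left + 1, right), key=lambda k: spectrum[k])
            for left, right in zip(peaks, peaks[1:])]


def interpolate_noise(spectrum, nulls):
    """Estimate noise in every bin by linear interpolation between null values.

    Bins outside the first and last null take the value of the nearest null.
    At least one null is required.
    """
    noise = []
    for k in range(len(spectrum)):
        if k <= nulls[0]:
            noise.append(spectrum[nulls[0]])
        elif k >= nulls[-1]:
            noise.append(spectrum[nulls[-1]])
        else:
            for a, b in zip(nulls, nulls[1:]):
                if a <= k <= b:
                    t = (k - a) / (b - a)
                    noise.append((1 - t) * spectrum[a] + t * spectrum[b])
                    break
    return noise


# Hypothetical magnitude spectrum: harmonic peaks at bins 2, 6 and 10
# sitting on a flat noise floor of 1.0 (illustrative values only).
spectrum = [1.0, 1.0, 5.0, 1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0]

peaks = find_peaks(spectrum)
nulls = find_nulls(spectrum, peaks)
noise = interpolate_noise(spectrum, nulls)
cleaned = [max(s - n, 0.0) for s, n in zip(spectrum, noise)]

print(peaks)    # [2, 6, 10]
print(nulls)    # [3, 7]
print(cleaned)
```

Only the harmonic peaks remain in `cleaned`, each reduced by the estimated noise floor.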
The harmonics of speech have also been used in speech coding to reduce the amount of data that must be sent when encoding speech for transmission across a digital communication path. Such systems attempt to separate the speech signal into a harmonic component and a random component. Each component is then encoded separately for transmission. One system in particular used a harmonic+noise model in which a sum-of-sinusoids model is fit to the speech signal to perform the decomposition.
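For illustration, one common way to write such a decomposition is shown below. The notation is an assumption of this sketch rather than that of any particular coding system: s[t] is the speech signal, s_h[t] the harmonic component, s_r[t] the random (residual) component, \omega_0 the fundamental frequency, K the number of harmonics, and a_k and b_k the amplitudes obtained by fitting the sinusoids to the signal.

```latex
s[t] = s_h[t] + s_r[t], \qquad
s_h[t] = \sum_{k=1}^{K} \bigl( a_k \cos(k \omega_0 t) + b_k \sin(k \omega_0 t) \bigr)
```

Under this notation, the random component is whatever remains after the fitted harmonic component is removed, s_r[t] = s[t] - s_h[t].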
In speech coding, the decomposition is performed to find a parameterization of the speech signal that accurately represents the input noisy speech signal. Consequently, the decomposition has no noise-reduction capability.
Recently, a system has been developed that attempts to remove noise by using a combination of an alternative sensor, such as a bone conduction microphone, and an air conduction microphone. This system is trained using three training channels: a noisy alternative sensor training signal, a noisy air conduction microphone training signal, and a clean air conduction microphone training signal. Each of the signals is converted into a feature domain. The features for the noisy alternative sensor signal and the noisy air conduction microphone signal are combined into a single vector representing a noisy signal. The features for the clean air conduction microphone signal form a single clean vector. These vectors are then used to train a mapping between the noisy vectors and the clean vectors. Once trained, the mapping is applied to a noisy vector formed from a combination of a noisy alternative sensor test signal and a noisy air conduction microphone test signal. This mapping produces a clean signal vector.
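The training and application steps described above can be sketched as follows, for illustration only. The form of the mapping is not specified here, so this sketch assumes a deliberately simple one: for each feature dimension, two weights (one for the alternative sensor feature, one for the air conduction feature) fit by least squares. The function names and all numeric values are hypothetical.

```python
def combine(alt_features, air_features):
    """Concatenate alternative sensor and air conduction features into one noisy vector."""
    return list(alt_features) + list(air_features)


def train_mapping(noisy_vectors, clean_vectors):
    """Fit, per dimension d, weights so that clean[d] ~ w_alt*alt[d] + w_air*air[d]."""
    dim = len(clean_vectors[0])
    weights = []
    for d in range(dim):
        saa = sab = sbb = sac = sbc = 0.0
        for noisy, clean in zip(noisy_vectors, clean_vectors):
            a, b, c = noisy[d], noisy[dim + d], clean[d]
            saa += a * a
            sab += a * b
            sbb += b * b
            sac += a * c
            sbc += b * c
        det = saa * sbb - sab * sab
        if det == 0.0:
            raise ValueError("training data do not determine the weights")
        # Solve the 2x2 normal equations by Cramer's rule.
        weights.append(((sac * sbb - sab * sbc) / det,
                        (saa * sbc - sab * sac) / det))
    return weights


def apply_mapping(weights, noisy_vector):
    """Map a combined noisy vector to an estimated clean vector."""
    dim = len(weights)
    return [w_alt * noisy_vector[d] + w_air * noisy_vector[dim + d]
            for d, (w_alt, w_air) in enumerate(weights)]


# Hypothetical two-dimensional training features (illustrative values only).
alt_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
air_train = [[3.0, 0.0], [0.0, 1.0], [1.0, 5.0]]
clean_train = [[1.5, 1.5], [1.5, 1.0], [2.5, 3.5]]

noisy_train = [combine(alt, air) for alt, air in zip(alt_train, air_train)]
weights = train_mapping(noisy_train, clean_train)

# Apply the trained mapping to a noisy test vector.
estimate = apply_mapping(weights, combine([2.0, 2.0], [4.0, 4.0]))
print(weights)
print(estimate)
```

The weights depend entirely on the training vectors, so a mapping fit in this way reflects the noise conditions present in the training signals.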
This system is less than optimal when the noise conditions of the test signals do not match the noise conditions of the training signals, because the mapping is designed for the noise conditions of the training signals.