The present invention relates to speech processing. In particular, the present invention relates to speech enhancement.
In speech recognition, it is common to enhance the speech signal by removing noise before recognition is performed. Under some systems, this is done by estimating the noise in the speech signal and subtracting that estimate from the noisy speech signal. This technique is typically referred to as spectral subtraction because it is performed in the spectral domain.
Since it is impossible to estimate the noise in a speech signal perfectly, any estimate that is used in spectral subtraction will have some amount of error. Because of this error, it is possible that the estimate of the noise in the noisy speech signal will be larger than the noisy speech signal itself for some frames of the signal. Subtracting such an estimate would produce a negative value for the “clean” speech in those frames, which is physically impossible.
To avoid this, spectral subtraction systems rely on a set of parameters that are set by hand to allow for maximum noise reduction while ensuring a stable system. Relying on such parameters is undesirable since they are typically noise-source dependent and thus must be hand-tuned for each type of noise-source.
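As an illustration of the kind of hand-set parameters described above, the following is a minimal sketch of spectral subtraction on a single frame of power-spectrum values. The names `alpha` (a scaling applied to the noise estimate) and `beta` (a floor expressed as a fraction of the noisy power) are assumptions chosen for this example and are not taken from the text; they stand in for whatever parameters a particular system tunes by hand.

```python
def spectral_subtract(noisy_power, noise_estimate, alpha=1.0, beta=0.01):
    """Subtract a noise power estimate from a noisy power spectrum, bin by bin.

    alpha and beta are hypothetical hand-tuned parameters (assumptions for
    this sketch): alpha scales the noise estimate, and beta sets a floor as
    a fraction of the noisy power so the result is never negative.
    """
    cleaned = []
    for y, n in zip(noisy_power, noise_estimate):
        difference = y - alpha * n   # may be negative if the noise is overestimated
        floor = beta * y             # smallest value allowed for this bin
        cleaned.append(max(difference, floor))
    return cleaned


# One frame with three frequency bins; the noise is overestimated in the last bin.
noisy = [4.0, 2.0, 1.0]
noise = [1.0, 1.0, 1.5]
print(spectral_subtract(noisy, noise))  # [3.0, 1.0, 0.01]
```

In the last bin the raw difference is negative, so the floor takes over. Suitable values of such parameters generally differ from one noise source to another, which is the tuning burden noted above.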
Other systems attempt to enhance the speech signal using a Wiener filter to filter out the noise in the speech signal. In such systems, the gain of the Wiener filter is generally based on a signal-to-noise ratio. To arrive at the proper gain value, the level of the noise in the signal must be determined.
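The dependence of the gain on the noise level can be sketched as follows. This example assumes the classical per-bin Wiener gain, SNR / (1 + SNR), with the signal-to-noise ratio expressed as a linear power ratio; the helper `gain_from_levels` and its way of deriving the ratio from a noisy power and a noise power are assumptions made for illustration, not a description of any particular system.

```python
def wiener_gain(snr):
    """Classical Wiener gain for a linear (not dB) signal-to-noise power ratio."""
    return snr / (1.0 + snr)


def gain_from_levels(noisy_power, noise_power):
    """Derive a gain from a noisy power and a noise power estimate.

    Assumption for this sketch: the clean power is approximated as
    (noisy_power - noise_power), clamped at zero, so that
    SNR = max(noisy_power / noise_power - 1, 0).
    """
    if noise_power <= 0.0:
        raise ValueError("noise_power must be positive")
    snr = max(noisy_power / noise_power - 1.0, 0.0)
    return wiener_gain(snr)


print(gain_from_levels(4.0, 1.0))  # 0.75
print(gain_from_levels(1.0, 2.0))  # 0.0
```

The gain approaches 1 when the signal dominates and falls toward 0 when the noise dominates, so an error in the noise level translates directly into an error in the gain.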
One common technique for determining the level of noise is to estimate the noise during non-speech segments in the speech signal. This technique is less than desirable because it requires not only a correct estimate of the noise during the non-speech segments, but also that the non-speech segments be properly identified as not containing speech. In addition, this technique depends on the noise being stationary (non-changing). If the noise changes over time, the estimate of the noise will be wrong and the filter will not perform properly.
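A minimal sketch of this approach, assuming each frame is summarized by a single power value and that a separate detector has already labeled each frame as speech or non-speech, is shown below. The function name `estimate_noise` and the per-frame labels are hypothetical and serve only to illustrate the two dependencies described above.

```python
def estimate_noise(frame_powers, is_speech):
    """Average the power of the frames labeled as non-speech.

    Assumes the noise is stationary, so that the average over non-speech
    frames also describes the noise present during speech frames.
    """
    noise_frames = [p for p, speech in zip(frame_powers, is_speech) if not speech]
    if not noise_frames:
        raise ValueError("no non-speech frames to estimate noise from")
    return sum(noise_frames) / len(noise_frames)


frame_powers = [1.0, 1.0, 5.0, 6.0, 1.0]

# Correct labels: the two loud frames are marked as speech.
print(estimate_noise(frame_powers, [False, False, True, True, False]))   # 1.0

# One speech frame mislabeled as non-speech inflates the estimate.
print(estimate_noise(frame_powers, [False, False, False, True, False]))  # 2.0
```

A single mislabeled frame doubles the estimate in this toy case, and a noise level that drifts after the last non-speech segment would not be reflected in the estimate at all.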
Another system for enhancing speech attempts to identify a clean speech signal using a probabilistic framework that provides a Minimum Mean Square Error (MMSE) estimate of the clean signal given a noisy speech signal. Unfortunately, such systems can provide poor estimates of the clean speech signal at times, especially when the signal-to-noise ratio is low. As a result, using the clean speech estimates directly in speech recognition can result in poor recognition accuracy.
Thus, a system is needed that does not require as much hand-tuning of parameters as in spectral subtraction while avoiding the poor estimates that sometimes occur in MMSE estimation.