The present invention relates to speech processing. In particular, the present invention relates to speech enhancement.
In speech recognition, it is common to condition the speech signal to remove noise and portions of the speech signal that are not helpful in decoding the speech into text. For example, it is common to apply a frequency-based transform to the speech signal to reduce certain frequencies in the signal that do not aid in decoding the speech signal. One common frequency-based transform is the Mel-scale transform, which reduces pitch harmonics in the speech signal. Mel-scale transforms are used because the pitch at which someone speaks does not affect a listener's ability to discern what is being said. Because these harmonics are removed, smaller speech models can be constructed, since the models do not have to be trained to decode speech at different pitches. Instead, the Mel-scale transform creates pitch-independent models that can be used to decode speech of any pitch.
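As an illustration only, a Mel-scale transform of the kind described above can be sketched as a bank of triangular filters applied to the power spectrum of one frame. The filter count (24), FFT size (512), sample rate (16 kHz), the Hz-to-mel formula, and all function names below are assumptions chosen for the sketch and are not taken from any particular system. Each filter averages neighboring frequency bins, which is what smooths fine harmonic detail in the bands where the filters are wide.

```python
import numpy as np

def hz_to_mel(hz):
    # A commonly used Hz-to-mel mapping (assumed here, other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # Filter edges are equally spaced on the mel scale, then mapped back to Hz.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    )
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Triangle: the smaller of the two slopes, floored at zero.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

# Hypothetical frame of 512 samples (random data stands in for speech).
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
power = np.abs(np.fft.rfft(frame * np.hanning(512))) ** 2  # 257 frequency bins

fb = mel_filterbank()
log_mel = np.log(fb @ power + 1e-10)  # 24 log band energies
print(power.shape, log_mel.shape)
```

In this sketch the 257 spectral values of a frame are reduced to 24 band energies, which is the sort of compact feature vector a recognizer would be trained on.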
Speech systems also attempt to enhance the speech signal by removing noise before performing speech recognition. Under some systems, this is done in the time domain by applying a noise filter to the speech signal. In other systems, this enhancement is performed using a two-stage process in which the pitch of the speech is first tracked using a pitch tracker and then the pitch is used to separate the speech signal from the noise. For various reasons, such two-stage processing is undesirable.
A third system for removing noise from a speech signal attempted to identify a clean speech signal in a noisy signal using a probabilistic framework that provided a Minimum Mean Square Error (MMSE) estimate of the clean signal given a noisy signal. This system was designed for speech recognition and as such relied on feature vectors that were appropriate for speech recognition. In particular, this probabilistic system used speech vectors that were produced using the Mel-scale transform.
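For readers unfamiliar with MMSE estimation, the idea can be sketched in its simplest form. The example below assumes a scalar clean value with a Gaussian prior and independent additive Gaussian noise. These assumptions, and the name `mmse_gaussian`, are made purely for illustration and do not describe the model used by the system discussed above, which operated on Mel-scale feature vectors. Under these assumptions the MMSE estimate is the conditional mean of the clean value given the noisy observation.

```python
import numpy as np

def mmse_gaussian(y, mu_x, var_x, var_n):
    """Conditional mean E[x | y] for y = x + n, with x ~ N(mu_x, var_x)
    and independent noise n ~ N(0, var_n) (illustrative assumptions)."""
    gain = var_x / (var_x + var_n)  # between 0 and 1
    return mu_x + gain * (y - mu_x)

# Simulate clean values, add noise, and compare squared errors.
rng = np.random.default_rng(0)
mu_x, var_x, var_n = 1.0, 4.0, 1.0
x = rng.normal(mu_x, np.sqrt(var_x), size=100_000)
y = x + rng.normal(0.0, np.sqrt(var_n), size=x.shape)

x_hat = mmse_gaussian(y, mu_x, var_x, var_n)
mse_noisy = np.mean((y - x) ** 2)     # error of using the noisy value directly
mse_mmse = np.mean((x_hat - x) ** 2)  # error of the MMSE estimate
print(mse_noisy, mse_mmse)
```

The estimate pulls each noisy observation toward the prior mean in proportion to how noisy the observation is, which lowers the average squared error relative to using the noisy value unchanged.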
Although this probabilistic system did not require two-stage processing, it was less than ideal for speech enhancement because the Mel-scale transform removed information from the signal. Because of this loss of information, it is extremely difficult, if not impossible, to reconstruct from the “cleaned” signal a speech signal that humans can easily understand.
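The loss of information can be made concrete under a simplifying assumption: suppose the Mel-scale transform is modeled as a linear filterbank $M$ that maps an $N$-bin spectrum $x$ to $K$ band energies, with $K < N$. The symbols $M$, $x$, $n$, $N$, $K$, and $\epsilon$ are introduced here only for illustration.

```latex
\[
M \in \mathbb{R}^{K \times N},\; K < N
\;\Longrightarrow\;
\exists\, n \neq 0 \ \text{such that}\ M n = 0,
\qquad \text{and therefore} \qquad
M(x + \epsilon n) = M x \quad \text{for every } \epsilon .
\]
```

Under this model, distinct spectra $x$ and $x + \epsilon n$ produce identical Mel-scale features, so the features alone cannot determine which spectrum was present. When every entry of $x$ is strictly positive, a sufficiently small $\epsilon$ keeps $x + \epsilon n$ a valid nonnegative spectrum.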
Thus, current systems for enhancing speech are less than ideal because they either require a two-stage process or make it impossible to reconstruct a clean, intelligible speech signal.