1. Field of the Invention
The present invention relates generally to a method and an apparatus for enhancing noise-corrupted speech through noise suppression. More particularly, the invention is directed to improving the speech quality of a noise suppression system employing a spectral subtraction technique.
2. Description of the Related Art
With the advent of digital cellular telephones, it has become increasingly important to suppress noise in solving speech processing problems, such as speech coding and speech recognition. This increased importance results not only from customer expectation of high performance even in high car noise situations, but also from the need to move progressively to lower data rate speech coding algorithms to accommodate the ever-increasing number of cellular telephone customers.
The speech quality from these low-rate coding algorithms tends to degrade drastically in high noise environments. Although noise suppression is important, it should not introduce undesirable artifacts, speech distortions, or significant loss of speech intelligibility. Many researchers and developers have attempted to achieve these performance goals for noise suppression for many years, but these goals have now come to the forefront in the digital cellular telephone application.
In the literature, a variety of speech enhancement methods potentially involving noise suppression have been proposed. Spectral subtraction is one of the traditional methods that has been studied extensively. See, e.g., Lim, "Evaluations of Correlation Subtraction Method for Enhancing Speech Degraded by Additive White Noise," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 26, No. 5, pp. 471-472 (1978); and Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 27, No. 2, pp. 113-120 (April, 1979). Spectral subtraction is popular because it can suppress noise effectively and is relatively straightforward to implement.
In spectral subtraction, an input signal (e.g., speech) in the time domain is converted initially to individual components in the frequency domain, using a bank of band-pass filters, typically implemented as a Fast Fourier Transform (FFT). Then, the spectral components are attenuated according to their noise energy.
The filter used in spectral subtraction for noise suppression utilizes an estimate of the power spectral density of the background noise, thereby generating a signal-to-noise ratio (SNR) for the speech in each frequency component. Here, the SNR means the ratio of the magnitude of the speech signal contained in the input signal to the magnitude of the noise signal in the input signal. A gain factor for each frequency component is determined from the SNR in that component. Undesirable frequency components then are attenuated based on the determined gain factors. An inverse FFT recombines the filtered frequency components with the corresponding phase components, thereby generating the noise-suppressed output signal in the time domain. Usually, there is no change in the phase components of the signal because the human ear is not sensitive to such phase changes.
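The per-component gain computation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `snr_based_gains`, the use of power (rather than magnitude) ratios, and the mapping gain = sqrt(SNR / (1 + SNR)) are choices made here for concreteness and are not taken from the cited references.

```python
def snr_based_gains(noisy_power, noise_power):
    """Return one gain factor per frequency component.

    noisy_power: power of the noisy input in each frequency component.
    noise_power: estimated background-noise power in each component.
    Assumption: SNR is estimated as (noisy - noise) / noise, floored at 0,
    and mapped to a gain with sqrt(SNR / (1 + SNR)).
    """
    gains = []
    for x, n in zip(noisy_power, noise_power):
        if n <= 0.0:
            # No noise estimated in this component: pass it through.
            gains.append(1.0)
            continue
        snr = max(x - n, 0.0) / n
        gains.append((snr / (1.0 + snr)) ** 0.5)
    return gains


if __name__ == "__main__":
    # Components dominated by noise receive a gain near zero.
    print(snr_based_gains([4.0, 1.0, 0.5], [1.0, 1.0, 1.0]))
```

Each gain would then multiply the magnitude of the corresponding frequency component, with the phase left unchanged, before the inverse transform.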
This spectral subtraction method can cause so-called "musical noise." The musical noise is composed of tones at random frequencies, and has an increased variance, resulting in a perceptually annoying noise because of its unnatural characteristics. The noise-suppressed signal can be even more annoying than the original noise-corrupted signal.
Thus, there is a strong need for techniques for reducing musical noise. Various researchers have proposed changes to the basic spectral subtraction algorithm for this purpose. For example, Berouti et al., "Enhancement of Speech Corrupted by Acoustic Noise," Proc. IEEE ICASSP, pp. 208-211 (April, 1979) relates to clamping the gain values at each frequency so that the values do not fall below a minimum value. In addition, Berouti et al. propose increasing the noise power spectral estimate artificially, by a small margin. This is often referred to as "oversubtraction."
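Clamping and oversubtraction can be illustrated with a short sketch. The function name and the parameter values `alpha` (oversubtraction factor) and `floor` (minimum gain) are hypothetical and chosen only for illustration; they are not the values proposed by Berouti et al.

```python
def clamped_oversubtraction_gains(noisy_power, noise_power, alpha=1.5, floor=0.1):
    """Per-component gains with oversubtraction and a gain floor.

    alpha > 1 inflates the noise estimate (oversubtraction).
    floor is the minimum gain allowed in any component (clamping).
    Both defaults are illustrative assumptions.
    """
    gains = []
    for x, n in zip(noisy_power, noise_power):
        # Fraction of power kept after subtracting the inflated noise estimate.
        kept = 1.0 - alpha * n / x if x > 0.0 else 0.0
        gain = max(kept, 0.0) ** 0.5
        # Clamp so that the gain never falls below the floor.
        gains.append(max(gain, floor))
    return gains


if __name__ == "__main__":
    print(clamped_oversubtraction_gains([4.0, 1.0], [1.0, 1.0], alpha=2.0))
```

With the floor in place, components that plain subtraction would zero out are attenuated by a fixed amount instead, which limits frame-to-frame swings in the gain.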
Both clamping and oversubtraction are directed to reducing the time-varying nature associated with the computed gain modification values. Arslan et al., "New Methods for Adaptive Noise Suppression," Proc. IEEE ICASSP, pp. 812-815 (May, 1995), relates to using smoothed versions of the FFT-derived estimates of the noisy speech spectrum, and the noise spectrum, instead of using the FFT coefficient values directly. Tsoukalas et al., "Speech Enhancement Using Psychoacoustic Criteria," Proc. IEEE ICASSP, pp. 359-362 (April, 1993), and Azirani et al., "Optimizing Speech Enhancement by Exploiting Masking Properties of the Human Ear," Proc. IEEE ICASSP, pp. 800-803 (May, 1995), relate to psychoacoustic models of the human ear.
Clamping and oversubtraction significantly reduce musical noise, but at the cost of degraded intelligibility of speech. Therefore, a large degree of noise reduction has tended to result in low intelligibility. The attenuation characteristics of spectral subtraction typically lead to a de-emphasis of unvoiced speech and high frequency formants, thereby making the speech sound muffled.
There have been attempts in the past to provide spectral subtraction techniques without the musical noise, but such attempts have met with limited success. See, e.g., Lim et al., "All-Pole Modeling of Degraded Speech," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 26, pp. 197-210 (June, 1978); Ephraim et al., "Speech Enhancement Using a Minimum Mean Square Error Short-Time Spectral Amplitude Estimator," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 32, pp. 1109-1120 (1984); and McAulay et al., "Speech Enhancement Using a Soft-Decision Noise Suppression Filter," IEEE Trans. Acoustics, Speech and Signal Processing, Vol. 28, pp. 137-145 (April, 1980).
In spectral subtraction techniques, the gain factors are adjusted by SNR estimates. The SNR estimates are determined by the speech energy in each frequency component, and the current background noise energy estimate in each frequency component. Therefore, the performance of the entire noise suppression system depends on the accuracy of the background noise estimate. The background noise is estimated when only background noise is present, such as during pauses in human speech. Accordingly, spectral subtraction with high precision requires an accurate and robust speech/noise discrimination, or voice activity detection, in order to determine when only noise exists in the signal.
Existing voice activity detectors utilize combinations of energy estimation, zero crossing rate, correlation functions, LPC coefficients, and signal power change ratios. See, e.g., Yatsuzuka, "Highly Sensitive Speech Detector and High-Speed Voiceband Data Discriminator in DSI-ADPCM Systems," IEEE Trans. Communications, Vol. 30, No. 4 (April, 1982); Freeman et al., "The Voice Activity Detector for the Pan-European Digital Cellular Mobile Telephone Service," IEEE Proc. ICASSP, pp. 369-372 (February, 1989); and Sun et al., "Speech Enhancement Using a Ternary-Decision Based Filter," IEEE Proc. ICASSP, pp. 820-823 (May, 1995).
However, in very noisy environments, speech detectors based on the above-mentioned approaches may suffer serious performance degradation. In addition, hybrid or acoustic echo, which enters the system at significantly lower levels, may corrupt the noise spectral density estimates if the speech detectors are not robust to echo conditions.
Furthermore, spectral subtraction assumes the noise source to be statistically stationary. However, speech may be contaminated by colored, non-stationary noise, such as the noise inside the compartment of a running car. The main sources of the noise are the engine and the fan at low car speeds, or the road and wind at higher speeds, as well as passing cars. These non-stationary noise sources degrade the performance of speech enhancement systems using spectral subtraction. This is because the non-stationary noise corrupts the current noise model, and causes the amount of musical noise artifacts to increase. Recent attempts to solve this problem using Kalman filtering have reduced, but not eliminated, the problems. See Lockwood et al., "Noise Reduction for Speech Enhancement in Cars: Non-Linear Spectral Subtraction/Kalman Filtering," EUROSPEECH91, pp. 83-86 (September, 1991).
Therefore, a strong need exists for an improved acoustic noise suppression system that solves problems such as musical noise, background noise fluctuations, echo noise sources, and robust noise classification.
These and other problems are overcome by the present invention, which has an object of providing a method and apparatus for enhancing noise-corrupted speech.
A system for enhancing noise-corrupted speech according to the present invention includes a framer for dividing the input audio signal into a plurality of frames of signals, and a pre-filter for removing the DC component of the signal as well as altering the minimum-phase aspect of speech signals.
A multiplier multiplies a combined frame of signals to produce a filtered frame of signals, wherein the combined frame of signals includes all signals in one filtered frame of signals combined with some signals in the filtered frame of signals immediately preceding in time the one filtered frame of signals. A transformer obtains frequency spectrum components from the windowed frame of signals. A background noise estimator uses the frequency spectrum components to produce a noise estimate of an amount of noise in the frequency spectrum components.
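The combining and multiplying step described above can be sketched as follows. This is a minimal illustration, assuming the multiplier applies a window function; the Hann window, the function name, and the `overlap` parameter are assumptions made here and are not specified in the text.

```python
import math


def windowed_combined_frame(previous_frame, current_frame, overlap):
    """Combine the tail of the previous frame with the current frame, then window.

    overlap: number of trailing samples of previous_frame to prepend.
    Assumption: a Hann window is used as the multiplying function.
    The combined frame must contain at least two samples.
    """
    tail = list(previous_frame[len(previous_frame) - overlap:])
    combined = tail + list(current_frame)
    size = len(combined)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (size - 1))
              for i in range(size)]
    return [sample * weight for sample, weight in zip(combined, window)]


if __name__ == "__main__":
    frame = windowed_combined_frame([1.0] * 4, [1.0] * 4, overlap=2)
    print(len(frame), frame)
```

The windowed frame is what a transformer such as an FFT would then convert into frequency spectrum components.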
A noise suppression spectral modifier produces gain multiplicative factors based on the noise spectral estimate and the frequency spectrum components. A controlled attenuator attenuates the frequency spectrum components based on the gain multiplication factors to produce noise-reduced frequency components, and an inverse transformer converts the noise-reduced frequency components to the time-domain. The time domain signal is further gain modified to alter the signal level such that the peaks of the signal are at the desired output level.
More specifically, the first aspect of the present invention employs a voice activity detector (VAD) to perform the speech/noise classification for the background noise update decision using a state machine approach. In the state machine, the input signal is classified into four states: Silence state, Speech state, Primary Detection state, and Hangover state. Two types of flags are provided for representing the state transitions of the VAD. Short term energy measurements from the current frame and from noise frames are used to compute voice metrics.
A voice metric is a measurement of the overall voice-like characteristics of the signal energy. The values of these voice metrics determine the values of the flags, which in turn determine the state of the VAD. Updates to the noise spectral estimate are made only when the VAD is in the Silence state.
Furthermore, when the present invention is placed in a telephone network, the reverse link speech may introduce echo if there is a 2/4-wire hybrid in the speech path. In addition, end devices such as speakerphones could also introduce acoustic echoes. Many times the echo source is of sufficiently low level as not to be detected by the forward link VAD. As a result, the noise model is corrupted by the non-stationary speech signal causing artifacts in the processed speech. To prevent this from happening, the VAD information on the reverse link is also used to control when updates to the noise spectral estimates are made. Thus, the noise spectral estimate is only updated when there is silence on both sides of the conversation.
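The update rule described above, in which the noise spectral estimate changes only when both directions of the conversation are silent, can be sketched as follows. The four state names follow the text; the function name, the `rate` parameter, and the exponential-averaging update are assumptions made for illustration only.

```python
from enum import Enum


class VadState(Enum):
    SILENCE = "silence"
    PRIMARY_DETECTION = "primary_detection"
    SPEECH = "speech"
    HANGOVER = "hangover"


def update_noise_estimate(noise_power, frame_power, forward_state,
                          reverse_state, rate=0.1):
    """Return the new per-component noise estimate.

    The estimate is updated only when the forward-link and reverse-link
    VADs are both in the Silence state; otherwise it is returned unchanged.
    Assumption: the update is a first-order exponential average with
    the illustrative step size `rate`.
    """
    both_silent = (forward_state is VadState.SILENCE
                   and reverse_state is VadState.SILENCE)
    if not both_silent:
        return list(noise_power)
    return [(1.0 - rate) * n + rate * x
            for n, x in zip(noise_power, frame_power)]


if __name__ == "__main__":
    noise = [1.0, 2.0]
    frame = [2.0, 4.0]
    print(update_noise_estimate(noise, frame, VadState.SILENCE, VadState.SILENCE))
    print(update_noise_estimate(noise, frame, VadState.SILENCE, VadState.SPEECH))
```

Gating on the reverse link keeps low-level echo of far-end speech from being absorbed into the noise model.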
The second aspect of the present invention pertains to providing a method of determining the power spectral estimates based upon the existence or non-existence of speech in the current frame. The frequency spectrum components are altered differently depending on the state of the VAD. If the VAD is in the Silence state, then the frequency spectrum components are filtered using a broad smoothing filter. This helps reduce the peaks in the noise spectrum caused by the random nature of the noise. On the other hand, if the VAD is in the Speech state, then one does not wish to smooth the peaks in the spectrum because these represent voice characteristics and not random fluctuations. In this case, the frequency spectrum components are filtered using a narrow smoothing filter.
One implementation of the present invention includes utilizing different types of smoothing or filtering for different signal characteristics (i.e., speech and noise) when using an FFT-based estimation of the power spectrum of the signal. Specifically, the present invention utilizes at least two windows having different sizes for a Wiener filter based on the likelihood of the existence of speech in the current frame of the noise-corrupted signal. The Wiener filter uses a wider window having a larger size (e.g., 45) when a voice activity detector (VAD) decides that speech does not exist in the current frame of the input speech signal. This reduces the peaks in the noise spectrum caused by the random nature of the noise. On the other hand, the Wiener filter uses a narrower window having a smaller size (e.g., 9) when the VAD decides that speech exists in the current frame. This retains the necessary speech information (i.e., peaks in the original speech spectrum) unchanged, thereby enhancing the intelligibility.
This implementation of the present invention reduces variance of the noise-corrupted signal when only noise exists, thereby reducing the noise level, while it keeps variance of the noise-corrupted signal when speech exists, thereby avoiding muffling of the speech.
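The effect of the two window sizes can be illustrated with a short sketch. The window sizes 45 and 9 are the example values given above; the rectangular moving average across frequency components, the edge handling, and the function name are assumptions made here for illustration.

```python
def smooth_spectrum(power, speech_present, wide=45, narrow=9):
    """Smooth a power spectrum across frequency components.

    A wide window is used when no speech is detected, and a narrow
    window when speech is detected. Assumption: the smoothing is a
    rectangular moving average, truncated at the spectrum edges.
    """
    size = narrow if speech_present else wide
    half = size // 2
    smoothed = []
    for k in range(len(power)):
        lo = max(0, k - half)
        hi = min(len(power), k + half + 1)
        smoothed.append(sum(power[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    spectrum = [0.0] * 64
    spectrum[32] = 45.0  # a single isolated peak
    print(smooth_spectrum(spectrum, speech_present=False)[32])
    print(smooth_spectrum(spectrum, speech_present=True)[32])
```

In this example the wide window flattens the isolated peak far more than the narrow one, which is the behavior wanted for random noise peaks but not for speech peaks.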
Another implementation of the present invention includes smoothing coefficients used for the Wiener filter before the filter performs filtering. Smoothing coefficients are applicable to any form of digital filters, such as a Wiener filter. This second implementation keeps the processed speech clear and natural, and also avoids the musical noise.
These two implementations of the invention contribute to removing noise from speech signals without causing annoying artifacts such as "musical noise," while keeping the fidelity of the original speech high.
The third aspect of the present invention provides a method of processing the gain modification values so as to reduce musical noise effects at much higher levels of noise suppression. Random time-varying spikes and nulls in the computed gain modification values cause musical noise. To remove these unwanted artifacts, a smoothing filter also filters the gain modification values.
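One way to picture the removal of spikes and nulls from the gain modification values is sketched below. The text does not specify the smoothing filter; a running median across neighboring frequency components is swapped in here purely as an illustration, and the function name and `width` parameter are hypothetical.

```python
from statistics import median


def smooth_gains(gains, width=3):
    """Suppress isolated spikes and nulls in per-component gain values.

    Assumption: a running median over `width` neighboring components
    stands in for the unspecified smoothing filter. The window is
    truncated at the edges of the gain vector.
    """
    half = width // 2
    smoothed = []
    for k in range(len(gains)):
        lo = max(0, k - half)
        hi = min(len(gains), k + half + 1)
        smoothed.append(median(gains[lo:hi]))
    return smoothed


if __name__ == "__main__":
    # One null (0.0) and one spike (1.0) in otherwise steady gains.
    print(smooth_gains([0.5, 0.5, 0.0, 0.5, 1.0, 0.5, 0.5]))
```

A linear smoother, such as a short moving average, would be an equally plausible reading of the text; the median is used here only because it removes isolated outliers without spreading them to neighboring components.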
The fourth aspect of the present invention provides a method of processing the gain modification values to adapt quickly to non-stationary narrow-band noise such as that found inside the compartment of a car. As other cars pass, the assumption of a stationary noise source breaks down and the passing car noise causes annoying artifacts in the processed signal. To prevent these artifacts from occurring, the computed gain modification values are altered when noises such as passing cars are detected.