We live in a noisy world. Environmental noise is everywhere, arising from natural sources as well as human activities. During voice communication, environmental noise is transmitted together with the intended speech signal, adversely affecting the quality of the received signal. This problem is mitigated by speech enhancement techniques that remove such unwanted noise components, thereby producing a cleaner and more intelligible signal.
Most speech enhancement systems rely on some form of adaptive filtering operation. Such systems attenuate the time/frequency (T/F) regions of the noisy speech signal that have low signal-to-noise ratios (SNRs) while preserving those with high SNRs. The essential components of speech are thus preserved while the noise component is greatly reduced. Usually, such a filtering operation is performed in the digital domain by a computational device such as a Digital Signal Processing (DSP) chip.
Subband domain processing is one of the preferred ways in which such an adaptive filtering operation is implemented. Briefly, the unaltered speech signal in the time domain is transformed to various subbands by using a filterbank, such as the Discrete Fourier Transform (DFT). The signals within each subband are subsequently suppressed by a desired amount according to known statistical properties of speech and noise. Finally, the noise-suppressed signals in the subband domain are transformed back to the time domain by using an inverse filterbank to produce an enhanced speech signal, the quality of which is highly dependent on the details of the suppression procedure.
An example of a prior art speech enhancer is shown in FIG. 1. The input is generated by digitizing an analog speech signal that contains both clean speech and noise. This unaltered audio signal $y(n)$, where $n = 0, 1, \ldots, \infty$ is the time index, is then sent to an analysis filterbank device or function (“Analysis Filterbank”) 2, producing multiple subband signals $Y_k(m)$, $k = 1, \ldots, K$, $m = 0, 1, \ldots, \infty$, where $k$ is the subband number and $m$ is the time index of each subband signal. The subband signals may have lower sampling rates than $y(n)$ due to the down-sampling operation in Analysis Filterbank 2. The noise level of each subband is then estimated by using a noise variance estimator device or function (“Noise Variance Estimator”) 4 with the subband signal as input. The Noise Variance Estimator 4 of the present invention differs from those known in the prior art and is described below, in particular with respect to FIGS. 2a and 2b. Based on the estimated noise level, appropriate suppression gains $g_k$ are determined in a suppression rule device or function (“Suppression Rule”) 6, and applied to the subband signals as follows:
$$\tilde{Y}_k(m) = g_k \, Y_k(m), \quad k = 1, \ldots, K. \qquad (1)$$
Such application of the suppression gain to a subband signal is shown symbolically by a multiplier symbol 8. Finally, the signals $\tilde{Y}_k(m)$ are sent to a synthesis filterbank device or function (“Synthesis Filterbank”) 10 to produce an enhanced speech signal $\tilde{y}(n)$. For clarity in presentation, FIG. 1 shows the details of generating and applying a suppression gain to only one of the multiple subband signals ($k$).
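The signal path of FIG. 1 can be illustrated with a minimal sketch. It is not the filterbank of any particular enhancer: it assumes non-overlapping rectangular frames and a naive K-point DFT, whereas practical filterbanks typically use overlapping, windowed frames. The function names are hypothetical, the gains are simply passed in rather than derived from a suppression rule, and subbands are indexed from 0 instead of 1.

```python
import cmath


def analysis_filterbank(y, K):
    """Split y into non-overlapping frames of K samples and take a K-point DFT.

    Returns Y[k][m] for subband k (0-based here) and frame index m. Each subband
    holds one sample per K input samples, i.e. it is down-sampled by K.
    Assumption: rectangular, non-overlapping frames; trailing samples are dropped.
    """
    num_frames = len(y) // K
    Y = [[0j] * num_frames for _ in range(K)]
    for m in range(num_frames):
        frame = y[m * K:(m + 1) * K]
        for k in range(K):
            Y[k][m] = sum(
                frame[n] * cmath.exp(-2j * cmath.pi * k * n / K) for n in range(K)
            )
    return Y


def apply_gains(Y, g):
    """Eq. (1): scale every sample of subband k by its suppression gain g[k]."""
    return [[g[k] * value for value in Y[k]] for k in range(len(Y))]


def synthesis_filterbank(Y):
    """Inverse K-point DFT of each frame, concatenated back into a time signal."""
    K = len(Y)
    num_frames = len(Y[0])
    out = []
    for m in range(num_frames):
        for n in range(K):
            sample = sum(
                Y[k][m] * cmath.exp(2j * cmath.pi * k * n / K) for k in range(K)
            ) / K
            out.append(sample.real)  # imaginary part is numerical residue for real input
    return out


# Illustrative input and gains (made-up values).
y = [0.5, -1.0, 2.0, 3.0, 1.0, 0.0, -2.0, 4.0]
Y = analysis_filterbank(y, K=4)
enhanced = synthesis_filterbank(apply_gains(Y, [0.5, 0.5, 0.5, 0.5]))
```

With all gains equal to 1 the sketch reconstructs its input, and with a common gain of 0.5 it halves it. A real suppression rule would instead choose a different gain for each subband from the estimated noise level.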
The appropriate amount of suppression for each subband is strongly correlated with its noise level. This, in turn, is determined by the variance of the noise signal, defined as the mean square value of the noise signal with respect to a zero-mean Gaussian probability distribution. Clearly, accurate noise variance estimation is crucial to the performance of the system.
Normally, the noise variance is not available a priori and must be estimated from the unaltered audio signal. It is well known that the variance of a “clean” noise signal can be estimated by time-averaging the squared noise amplitudes over a large time block. However, because the unaltered audio signal contains both clean speech and noise, such a method is not directly applicable.
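The limitation described above can be seen in a small numerical sketch. The signals are synthetic assumptions chosen only for illustration: zero-mean Gaussian noise with a variance of 0.25, and a unit-amplitude sinusoid standing in for speech. The function name is hypothetical.

```python
import math
import random


def time_averaged_power(samples):
    """Mean of the squared amplitudes over the whole block."""
    return sum(x * x for x in samples) / len(samples)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_variance = 0.25     # assumed noise variance for this illustration
num_samples = 20000

noise = [rng.gauss(0.0, math.sqrt(true_variance)) for _ in range(num_samples)]
# Stand-in for speech (an assumption): a sinusoid with average power 0.5.
speech = [math.sin(2 * math.pi * 0.01 * n) for n in range(num_samples)]
noisy = [s + d for s, d in zip(speech, noise)]

noise_only_estimate = time_averaged_power(noise)  # close to the true variance
mixture_estimate = time_averaged_power(noisy)     # inflated by the speech power
```

Averaging the noise-only block gives a value near the true variance, whereas averaging the mixture gives roughly the sum of the speech power and the noise variance, so it overestimates the noise.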
Many noise variance estimation strategies have been previously proposed to solve this problem. The simplest solution is to estimate the noise variance at the initialization stage of the speech enhancement system, when the speech signal is not present (reference [1]). This method, however, works well only when the noise signal as well as the noise variance is relatively stationary.
For an accurate treatment of non-stationary noise, more sophisticated methods have been proposed. For example, Voice Activity Detection (VAD) estimators make use of a standalone detector to determine the presence of a speech signal. The noise variance is updated only while no speech is detected (reference [2]). This method has two shortcomings. First, it is very difficult to obtain reliable VAD results when the audio signal is noisy, which in turn affects the reliability of the noise variance estimate. Second, this method precludes updating the noise variance estimate while the speech signal is present. The latter concern leads to inefficiency because the noise variance estimate can still be reliably updated during times when the speech level is weak.
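A VAD-gated update can be sketched as follows. This is a generic illustration rather than the method of reference [2]: the recursive (exponential) average, the smoothing factor `alpha`, and the function name are assumptions, and the speech flags are taken as given from an external detector.

```python
def vad_gated_noise_variance(powers, speech_flags, alpha=0.9, initial=0.0):
    """Track the noise variance of one subband from per-frame power values.

    powers       -- squared magnitude of the subband signal for each frame
    speech_flags -- True where the (external, hypothetical) VAD reports speech
    alpha        -- smoothing factor of the recursive average (assumed value)

    The estimate is updated only on frames flagged as non-speech and is held
    constant otherwise. Returns the estimate after every frame.
    """
    estimate = initial
    history = []
    for power, is_speech in zip(powers, speech_flags):
        if not is_speech:
            estimate = alpha * estimate + (1.0 - alpha) * power
        history.append(estimate)
    return history


# The power rises from 1.0 to 4.0 while the VAD reports speech,
# so the estimate stays frozen at its previous value.
frozen = vad_gated_noise_variance(
    [1.0, 1.0, 4.0, 4.0], [False, False, True, True], alpha=0.5, initial=1.0
)
```

The example shows the second shortcoming: once the detector reports speech, the estimate cannot follow any change in the noise level until the next detected pause. A wrong VAD decision would likewise feed speech power into the estimate.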
Another widely cited solution to this problem is the minimum statistics method (reference [3]). In principle, the method keeps a record of the signal level of historical samples for each subband, and estimates the noise variance based on the minimum recorded value. The rationale behind this approach is that the speech signal is generally an on/off process that naturally has pauses. In addition, the signal level is usually much higher when the speech signal is present. Therefore, if the record covers a sufficiently long time, the minimum recorded signal level most likely comes from a speech pause, yielding a reliable estimate of the noise level. Nevertheless, the minimum statistics method has a high memory demand and is not applicable to devices with limited available memory.
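The idea of tracking a minimum over a history of signal levels can be sketched as below. This is a simplified illustration, not the algorithm of reference [3]: the first-order smoothing, the fixed-length window, the parameter values, and the function name are all assumptions.

```python
from collections import deque


def minimum_tracking_noise_estimate(powers, window, alpha=0.8):
    """Estimate the noise level of one subband as a running minimum.

    powers -- per-frame power values of the subband signal
    window -- number of past smoothed values kept in the record
    alpha  -- smoothing factor applied before the minimum search (assumed)

    Returns the minimum of the stored record after every frame.
    """
    smoothed = None
    record = deque(maxlen=window)  # the per-subband history that costs memory
    estimates = []
    for power in powers:
        if smoothed is None:
            smoothed = power
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * power
        record.append(smoothed)
        estimates.append(min(record))
    return estimates


# Noise power 1.0 with a two-frame burst of power 9.0 standing in for speech.
long_record = minimum_tracking_noise_estimate([1.0, 9.0, 9.0, 1.0], window=3, alpha=0.0)
short_record = minimum_tracking_noise_estimate([1.0, 9.0, 9.0, 1.0], window=2, alpha=0.0)
```

With a window of three frames the record always contains a pause, so the estimate stays at 1.0. With a window of two frames the burst fills the whole record and the estimate jumps to 9.0. The record must therefore outlast the longest speech burst, and one such record is kept for each of the K subbands, which is where the memory demand comes from.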