We live in a noisy world. Environmental noise is everywhere, arising from natural sources as well as human activities. During voice communication, environmental noises are transmitted simultaneously with the intended speech signal, adversely affecting reception quality. This problem is mitigated by speech enhancement techniques that remove such unwanted noise components, thereby producing a cleaner and more intelligible signal.
Most speech enhancement systems rely on some form of adaptive filtering operation. Such systems attenuate the time/frequency (T/F) regions of the noisy speech signal that have low signal-to-noise ratios (SNRs) while preserving those with high SNRs. The essential components of speech are thus preserved while the noise component is greatly reduced. Usually, such a filtering operation is performed in the digital domain by a computational device such as a Digital Signal Processing (DSP) chip.
Subband domain processing is one of the preferred ways in which such adaptive filtering operations are implemented. Briefly, the unaltered speech signal in the time domain is transformed to various subbands by using a filterbank, such as the Discrete Fourier Transform (DFT). The signals within each subband are subsequently suppressed to a desirable amount according to known statistical properties of speech and noise. Finally, the noise suppressed signals in the subband domain are transformed to the time domain by using the inverse filterbank to produce an enhanced speech signal, the quality of which is highly dependent on the details of the suppression procedure.
An example of a typical prior art speech enhancement arrangement is shown in FIG. 1. The input is generated by digitizing the analog speech signal and contains both clean speech and noise. This unaltered audio signal $y(n)$, where $n = 0, 1, \ldots, \infty$ is the time index, is then sent to an analysis filterbank or filterbank function (“Analysis Filterbank”) 12, producing multiple subband signals $Y_k(m)$, $k = 1, \ldots, K$, $m = 0, 1, \ldots, \infty$, where $k$ is the subband number and $m$ is the time index of each subband signal. The subband signals may have lower sampling rates than $y(n)$ because of the down-sampling operation in Analysis Filterbank 12. In a suppression rule device or function (“Suppression Rule”) 14, the noise level of each subband is then estimated by using a noise variance estimator. Based on the estimated noise level, appropriate suppression gains $g_k$ are determined and applied to the subband signals as follows:

$$\tilde{Y}_k(m) = g_k Y_k(m), \quad k = 1, \ldots, K. \qquad (1)$$

The application of the suppression gains is shown symbolically by multiplier symbol 16. Finally, the subband signals $\tilde{Y}_k(m)$ are sent to a synthesis filterbank or filterbank function (“Synthesis Filterbank”) 18 to produce an enhanced speech signal $\tilde{y}(n)$. For clarity in presentation, FIG. 1 shows the details of generating and applying a suppression gain to only one of the multiple subband signals ($k$).
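The signal path of FIG. 1 can be illustrated with a minimal sketch, given here only as an aid to understanding. It assumes a windowed DFT as the filterbank, a square-root Hann window, a frame length of 256 samples, and a hop of 128 samples; none of these choices comes from the arrangement described above. The function names (`analysis_filterbank`, `synthesis_filterbank`, `enhance`) are hypothetical, and the gains are passed in as a fixed vector, whereas in FIG. 1 they would be derived from the estimated noise level of each subband.

```python
import numpy as np

FRAME_LEN = 256  # assumed frame length (samples)
HOP = 128        # assumed hop size: 50% overlap, i.e. down-sampling by 128


def _window(frame_len):
    # Periodic square-root Hann window (an assumption). Used for both
    # analysis and synthesis, it overlap-adds to one at 50% overlap.
    return np.sqrt(np.hanning(frame_len + 1)[:-1])


def analysis_filterbank(y, frame_len=FRAME_LEN, hop=HOP):
    """Transform y(n) into subband signals Y_k(m); returns shape (K, M)."""
    window = _window(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack(
        [y[m * hop : m * hop + frame_len] * window for m in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def synthesis_filterbank(Y, frame_len=FRAME_LEN, hop=HOP):
    """Transform subband signals back to the time domain by overlap-add."""
    window = _window(frame_len)
    frames = np.fft.irfft(Y.T, n=frame_len, axis=1) * window
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    for m, frame in enumerate(frames):
        out[m * hop : m * hop + frame_len] += frame
    return out


def enhance(y, gains, frame_len=FRAME_LEN, hop=HOP):
    """Apply one suppression gain g_k per subband, as in equation (1)."""
    Y = analysis_filterbank(y, frame_len, hop)
    Y_tilde = np.asarray(gains)[:, None] * Y  # Y~_k(m) = g_k * Y_k(m)
    return synthesis_filterbank(Y_tilde, frame_len, hop)
```

With these assumed parameters there are K = 129 subbands (the non-redundant DFT bins of a real signal). With all gains set to one, the output matches the input except in the first and last 128 samples, which are covered by only one frame.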
Clearly, the quality of the speech enhancement system is highly dependent on its suppression method. Spectral subtraction (reference [1]), the Wiener filter (reference [2]), the MMSE-STSA (reference [3]), and the MMSE-LSA (reference [4]) are examples of such previously proposed methods. Suppression rules are designed so that the output is as close as possible to the speech component in terms of certain distortion criteria such as the Mean Square Error (MSE). As a result, the level of the noise component is reduced, and the speech component dominates. However, it is very difficult to separate either the speech component or the noise component from the original audio signal, so such minimization methods rely on a reasonable statistical model. Consequently, the final enhanced speech signal is only as good as its underlying statistical model and the suppression rules that derive therefrom.
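By way of illustration only, two of the suppression rules named above can be sketched in their commonly cited textbook forms; the exact formulations in references [1] and [2] may differ. The sketch assumes that the subband power |Y_k(m)|² and a noise variance estimate are already available, and it estimates the SNR as max(power / noise variance − 1, 0), which is an assumption made for this example. The function names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero (an implementation detail)


def spectral_subtraction_gain(power, noise_var):
    """Power spectral subtraction: g = sqrt(max(1 - noise_var / power, 0))."""
    ratio = 1.0 - noise_var / np.maximum(power, EPS)
    return np.sqrt(np.maximum(ratio, 0.0))


def wiener_gain(power, noise_var):
    """Wiener-type gain g = snr / (1 + snr), with an assumed SNR estimate."""
    snr = np.maximum(power / np.maximum(noise_var, EPS) - 1.0, 0.0)
    return snr / (1.0 + snr)
```

Both rules return gains between 0 and 1 that approach 1 in subbands where the signal power is well above the noise estimate and fall to 0 where it is at or below that estimate, which is the behavior of attenuating low-SNR regions while preserving high-SNR regions.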
Nevertheless, it is virtually impossible to reproduce noise-free output. Perceptible residual noise exists because it is extremely difficult for any suppression method to track and suppress the noise component perfectly. Moreover, the suppression operation itself alters the final speech signal, adversely affecting its quality and intelligibility. In general, a suppression rule with strong attenuation leads to less noisy output, but the resultant speech signal is more distorted. Conversely, a suppression rule with more moderate attenuation produces less distorted speech but at the expense of adequate noise reduction. To balance these opposing concerns optimally, careful trade-offs must be made. Prior art suppression rules have not approached the problem in this manner, and an optimal balance has not as yet been attained.
Another problem common to many speech enhancement systems is that of “musical noise” (reference [1]). This processing artifact is a byproduct of the subband domain filtering operation. Residual noise components can exhibit strong fluctuations in amplitude and, if not sufficiently suppressed, are transformed into short, bursty musical tones with random frequencies.