In wireless communications, background noise and static can be annoying in speaker to speaker conversation and a hindrance in speaker to machine recognition. As a result, noise suppression is an important part of the enhancement of speech signals recorded over wireless channels in mobile environments.
In that regard, a variety of noise suppression techniques have been developed. Such techniques typically operate on single microphone, output-based speech samples which originate in a variety of noisy environments, where it is assumed that the noise component of the signal is additive with unknown coloration and variance.
One such technique is Least Mean-Squared (LMS) Predictive Noise Cancelling. In this technique it is assumed that the additive noise is not predictable, whereas the speech component is predictable. LMS weights are adapted to the time series of the signal to produce a time-varying matched filter for the predictable speech component such that the mean-squared error (MSE) is minimized. The estimated clean speech signal is then the filtered version of the time series.
However, the structure of speech in the time domain is neither coherent nor stationary enough for this technique to be effective. A trade-off is therefore required between fast settling time, good tracking ability and the ability to track everything (including noise). This technique also has difficulty with relatively unstructured non-voiced segments of speech.
Another noise suppression technique is Signal Subspace (SSP) filtering (which here includes Spectral Subtraction (SS)). SSP is essentially a weighted subspace fitting applied to speech signals, or a set of bandpass filters whose outputs are linearly weighted and combined. SS involves estimating the (additive) noise magnitude spectrum, typically done during non-speech segments of data, and subtracting this spectrum from the noisy speech magnitude spectrum to obtain an estimate of the clean speech spectrum. If the resulting spectral estimate is negative, it is rectified to a small positive value. This estimated magnitude spectrum is then combined with the phase information from the noisy signal and used to construct an estimate of the clean speech signal.
SSP assumes the speech signal is well-approximated by a sum of sinusoids. However, speech signals are rarely simply sums of undamped sinusoids and can, in many common cases, exhibit stochastic qualities (e.g., unvoiced fricatives). SSP relies on the concept of bias-variance trade-off. For channels having a Signal-to-Noise Ratio (SNR) less than 0 dB, some bias is permitted to give up a larger dosage of variance and obtain a lower overall MSE. In the speech case, the channel bias is the clean speech component, and the channel variance is the noise component. However, SSP does not deal well with channels having SNR greater than zero.
In addition, SS is undesirable unless the SNR of the associated channel is less than 0 dB (i.e., unless the noise component is larger than the signal component). For this reason, the ability of SS to improve speech quality is restricted to speech masked by narrowband noise. SS is best viewed as an adaptive notch filter which is not well applicable to wideband noise.
Still another noise suppression technique is Wiener filtering, which can take many forms including a statistics-based channel equalizer. In this context, the time domain signal is filtered in an attempt to compensate for non-uniform frequency response in the voice channel. Typically, this filter is designed using a set of noisy speech signals and the corresponding clean signals. Taps are adjusted to optimally predict the clean sequence from the noisy one according to some error measure. Once again, however, the structure of speech in the time domain is neither coherent nor stationary enough for this technique to be effective.
Yet another noise suppression technique is Relative Spectral speech processing (RASTA). In this technique, multiple filters are designed or trained for filtering spectral subbands. First, the signal is decomposed into N spectral subbands (currently, Discrete Fourier Transform vectors are used to define the subband filters). The magnitude spectrum is then filtered with N/2+1 linear or non-linear neural-net subband filters.
However, the characteristics of the complex transformed signal (spectrum) have been elusive. As a result, RASTA subband filtering has been performed on the magnitude spectrum only, using the noisy phase for reconstruction. However, an accurate estimate of phase information gives little, if any, noticeable improvement in speech quality.
The dynamic nature of noise sources and the non-stationery nature of speech ideally call for adaptive techniques to improve the quality of speech. Most of the existing noise suppression techniques discussed above, however, are not adaptive. Such adaptation can be performed in various dimensions and at various levels. One type of adaptation where importance is given to noise characteristics and level is based on level of noise and level of distortion in a speech signal. However, for a given noise level, adaptation can also be done based on speech characteristics. The best solution being adaptation based simultaneously on noise characteristics as well as speech characteristics While some recently proposed techniques are designed to adapt to the noise level or SNR, none take into account the non-stationary nature of speech and try to adapt to different sound categories.
An article by Harris Ducker entitled "Speech Processing in a high ambient noise environment", IEEE Trans. Audio and Electroacoustics, Vol. 16, No. 2, June, 1968, pp. 165-168, discusses the effect of noise on different speech sounds and the resulting confusion among sound categories. While a high-pass filter is employed in an effort to resolve this confusion, such a filter is only used for some sound categories. Moreover, the classification of sound in this technique is only done manually by experiment.
Thus, there exists a need for a noise suppression technique which would automatically classify sounds and apply an appropriate filter for each class. Moreover, such a technique would use filtering that adapts to speech sounds.