1. Field of the Invention
The present invention relates generally to noise suppression systems, and, more particularly, to a novel technique for estimating the background noise power spectrum for a spectral subtraction noise suppression system.
2. Description of the Prior Art
Acoustic noise suppression has been implemented in a wide variety of speech communications applications, ranging from basic hearing aid applications to highly sophisticated military aircraft communications systems. The common objective in all such noise suppression systems is that of enhancing the quality of speech in an environment having a relatively high level of ambient background noise. The acoustic noise suppression system must augment the quality characteristics of the speech signal by reducing the background noise level without significantly degrading the voice intelligibility.
A possible solution to this problem is to incorporate an acoustic noise suppression prefilter, which effectively subtracts an estimate of the background noise signal from the noisy speech waveform, to perform the noise cancellation function. One technique for obtaining the estimate of the background noise is to employ a second microphone, located at a distance away from the user's first microphone, such that it picks up only background noise. This technique has been shown to provide a significant improvement in signal-to-noise ratio (SNR). However, it is very difficult to achieve the required isolation of the second microphone from the speech source while at the same time attempting to pick up the same background noise environment as the first microphone.
Another method for obtaining the background noise estimate is to estimate statistics of the background noise during the time when only background noise is present, such as during the pauses in human speech. This method is based on the assumption that the background noise is predominantly stationary, which is a valid assumption for many types of noise environments. Therefore, some mechanism for discriminating between background noise and speech is required.
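The noise-only updating described above can be illustrated by the following minimal sketch. It is not taken from any referenced system: the function name, the per-channel power representation, the smoothing constant `alpha`, and the externally supplied `is_speech` decision are all assumptions made for illustration.

```python
def update_noise_estimate(noise_est, frame_power, is_speech, alpha=0.9):
    """Return an updated per-channel background noise power estimate.

    noise_est   -- current noise power estimate, one value per channel
    frame_power -- power of the current frame, one value per channel
    is_speech   -- decision from some speech/noise discriminator (assumed given)
    alpha       -- hypothetical smoothing constant, 0 < alpha < 1
    """
    if is_speech:
        # Speech is present: hold the previous estimate unchanged.
        return list(noise_est)
    # Noise only: exponentially average the new frame into the estimate,
    # relying on the assumption that the noise is predominantly stationary.
    return [alpha * n + (1.0 - alpha) * p
            for n, p in zip(noise_est, frame_power)]


# Usage: the estimate moves toward the frame power only during pauses.
estimate = [1.0, 1.0]
estimate = update_noise_estimate(estimate, [3.0, 5.0], is_speech=True)
estimate = update_noise_estimate(estimate, [3.0, 5.0], is_speech=False, alpha=0.5)
```

The sketch makes the dependency explicit: the quality of the estimate is only as good as the `is_speech` decision supplied to it.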
Several approaches to the problem of distinguishing between speech and noise are known in the art. A summary of some of these techniques is found in P. De Souza, "A Statistical Approach to the Design of an Adaptive Self-Normalizing Silence Detector," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, no. 3, (June 1983), pp. 678-684, and the references contained therein. These prior art techniques implement various combinations of: (a) frame-to-frame energy; (b) zero-crossing rate; and (c) autocorrelation function or LPC coefficients.
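Two of the parameters listed above, frame energy and zero-crossing rate, may be computed as in the following minimal sketch. The function names and the normalization chosen are assumptions for illustration and are not drawn from the cited references; the autocorrelation and LPC parameters are omitted for brevity.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumes the frame holds at least two samples; a sample equal to
    zero is treated as non-negative.
    """
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Usage: an alternating frame crosses zero between every pair of samples,
# while a monotonically rising positive frame never does.
alternating = [1.0, -1.0, 1.0, -1.0]
rising = [1.0, 2.0, 3.0, 4.0]
energy = frame_energy(alternating)
zcr_alternating = zero_crossing_rate(alternating)
zcr_rising = zero_crossing_rate(rising)
```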
In abnormally high noise environments, such as a moving vehicle, many of these known and referenced prior art techniques break down. For example, it has been widely documented that many types of noise do not lend themselves to an all-pole model, thereby not permitting an LPC fit. Furthermore, discrimination between speech and noise in a high background noise environment on the basis of zero crossings has also been shown to be ineffective due to the similar zero-crossing characteristics of speech and noise.
The frame energy parameter has been found to be the most effective parameter for discriminating between noise and speech. Consequently, the majority of speech recognition systems and communications systems which are designed for use in high ambient noise environments make use of some variation of this technique.
Unfortunately, speech/noise classification on the basis of frame energy measurements has been effective only for voiced sounds, due to the similar energy characteristics of unvoiced sounds and background noise. It is widely known that the energy histogram technique for distinguishing between speech and noise performs sufficiently well in normal ambient noise environments. Since energy histograms of acoustic signals exhibit a bimodal distribution, in which the two modes correspond to noise and speech, an appropriate threshold can be set between the two modes to provide the speech/noise classification. (See, e.g., W. J. Hess, "A Pitch-Synchronous Digital Feature Extraction System for Phonemic Recognition of Speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, no. 1 (February 1976), pp. 14-25.) The disadvantage of this approach is that the distinction between background noise energy and unvoiced speech energy in relatively high noise environments is unclear. Consequently, the task of accurately finding the two modes of the energy histogram and setting the appropriate threshold between them is extremely difficult.
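The energy histogram technique may be illustrated by the following minimal sketch, which is not the procedure of the cited reference. The function name, the bin count `num_bins`, the minimum mode separation `min_separation`, and the rule of placing the threshold at the emptiest bin between the two modes are all assumptions made for illustration.

```python
def histogram_threshold(energies_db, num_bins=20, min_separation=3):
    """Return a speech/noise energy threshold, or None if no two modes are found.

    energies_db    -- frame energies, e.g. in dB (assumed already computed)
    num_bins       -- hypothetical number of histogram bins
    min_separation -- hypothetical minimum distance, in bins, between modes
    """
    lo, hi = min(energies_db), max(energies_db)
    if hi == lo:
        return None  # All frames have the same energy: no histogram spread.
    width = (hi - lo) / num_bins

    # Build the energy histogram.
    counts = [0] * num_bins
    for e in energies_db:
        idx = min(int((e - lo) / width), num_bins - 1)
        counts[idx] += 1

    # First mode: the most populated bin.
    first = max(range(num_bins), key=lambda i: counts[i])
    # Second mode: the most populated bin sufficiently far from the first.
    candidates = [i for i in range(num_bins)
                  if abs(i - first) >= min_separation and counts[i] > 0]
    if not candidates:
        return None  # Modes have merged, as may occur in high noise.
    second = max(candidates, key=lambda i: counts[i])

    # Valley: the least populated bin between the modes; ties are broken
    # in favor of the bin closest to the midpoint between the two modes.
    left, right = sorted((first, second))
    middle = (left + right) / 2
    valley = min(range(left, right + 1),
                 key=lambda i: (counts[i], abs(i - middle)))
    return lo + (valley + 0.5) * width


# Usage: well-separated noise (about 30 dB) and speech (about 60 dB) frames.
energies = [30.0] * 50 + [31.0] * 10 + [60.0] * 40 + [59.0] * 10
threshold = histogram_threshold(energies)
```

In this sketch, a return value of None corresponds to the difficulty noted above: when unvoiced speech energy and background noise energy overlap, a second distinct mode cannot be located and no threshold can be set.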