A noise suppressor in a digital audio communication system aims to take an audio stream in the presence of background noise and reduce the noise level without degrading signal characteristics or quality. Generally, a noise suppressor may be used with a wide variety of audio inputs, such as speech or music, and a variety of noise inputs, such as babble noise and/or noise generated by a car, fan, train, or airplane.
To estimate background noise, a spectrum analysis of a time domain audio stream is carried out to give its frequency composition. For an audio stream comprising speech, stationary states associated with speech are generally characterized by durations of about 10 milliseconds. By contrast, background noise in conventional noise suppressors is assumed to be long-term stationary, having a characteristic duration of at least about 0.5 seconds. If spectra recorded over this latter time scale are analyzed, the long-term stationary parts as a function of frequency may be taken as an estimate of the noise.
In the prior art, a variety of noise estimation and noise subtraction algorithms have been developed. Generally, an audio stream is sampled and segmented into consecutive time frames, each optionally having a same duration and comprising a plurality of sequential samples of the audio stream acquired for the period of the time frame. Time frames are labeled by m, where m=0 denotes a current time frame, m=−1 denotes an immediately preceding time frame, and so forth. The samples in each frame define a function of time that represents the audio stream for the period of the time frame.
The samples in the current frame are processed using a Fourier transform to define a frequency spectrum for the audio stream for the period of time of the frame. A frequency range of the spectrum for all frames is divided into a same plurality of frequency bands, and for each frequency band in a given frame, an average value of audio energy spectral density is determined. Optionally, 16 frequency bands of unequal widths are constructed.
The average audio energy associated with each band is hereinafter referred to as “audio spectral energy” or “audio energy” for the band. The audio energies for all the bands for a given frame are referred to as an “audio spectrum” and the audio spectrum for a current frame (m=0) is referred to as the “current audio spectrum”.
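The framing and band-averaging steps described above can be sketched as follows. This is a minimal illustration and not the implementation of any cited reference: the function name `band_energies`, the use of squared DFT magnitude as the energy measure, the 32-sample frame, and the three band edges are all assumptions chosen to keep the example small.

```python
import cmath
import math

def band_energies(frame, band_edges):
    """Return the average spectral energy in each band for one frame of samples."""
    n = len(frame)
    # Naive DFT over the non-negative frequency bins (0 .. n/2).
    # A fast Fourier transform would normally be used; this keeps the sketch dependency-free.
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]
    # Assumed energy measure: squared magnitude of each DFT bin.
    energy = [abs(c) ** 2 for c in spectrum]
    # Average the bin energies inside each half-open band [lo, hi).
    return [sum(energy[lo:hi]) / (hi - lo)
            for lo, hi in zip(band_edges, band_edges[1:])]

# Hypothetical frame: 32 samples of a sinusoid centred on DFT bin 4.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
# Three bands of unequal width (illustrative only, not the 16 bands mentioned above).
edges = [0, 2, 6, 17]
audio_spectrum = band_energies(frame, edges)
```

Here `audio_spectrum` plays the role of the audio spectrum for one frame: almost all of the energy falls in the middle band, which contains bin 4.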
For a current frame, a value for noise energy spectral density that contributes to the audio spectral energy in a frequency band is determined responsive to the audio spectral energy for the band during a period of time T that includes the current frame and a plurality of previous frames. For convenience of presentation, noise energy spectral density for a given frequency band is referred to as “noise energy” for the band and noise energy in the given frequency band for the time T is referred to as “current noise energy” for the band. The noise energies for all the bands for a given frame are referred to as the “noise spectrum”, and the noise spectrum for the current frame is referred to as the “current noise spectrum”.
M. Recchione, “The Enhanced Variable Rate Coder; Toll Quality Speech for CDMA”, Int. Journ. Speech Tech. 2 (1999) 305-315, and S. Rangachari, P. C. Loizou, “A Noise-Estimation Algorithm for highly non-stationary environments”, Speech Communication 48 (2006) 220-231, describe an Enhanced Variable Rate Coder (EVRC) standardized by the Telecommunications Industry Association as IS-127. EVRC noise suppression comprises the methods described above, including formation of audio spectra in a total of 16 bands. U.S. Pat. No. 4,811,404, incorporated herein by reference, describes a noise suppression method that also comprises formation of audio spectra in a total of 16 bands.
The current noise spectrum is used to filter out background noise from a current audio spectrum. Some prior art methods estimate current noise energy for each band (and thereby the current noise spectrum) with the help of speech presence detectors that distinguish noise from speech. Some noise suppressors select minimum audio energies as a function of frequency during time T to represent noise energies. The estimated noise spectrum is used to calculate gain (attenuation) factors for a filter that reduces noise in the current audio spectrum. The filter comprises gain factors calculated separately for each band. A lower limit is set for the gain factors to prevent over-reduction of audio energies for frequency bands having a very low signal to noise ratio (SNR). A filtered frequency domain audio spectrum is formed by multiplying the audio energy in each band of the current audio spectrum by the gain factor for that band. The filtered spectrum is then transformed back from the frequency domain to the time domain to yield a noise-filtered audio stream having enhanced overall perceived quality.
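The per-band filtering described above can be illustrated with a short sketch. The text does not give a gain formula, so the rule used here (one minus the ratio of noise energy to audio energy) is an assumption, as are the function names `band_gains` and `apply_gains`, the floor value of 0.1, and the example energies.

```python
def band_gains(audio_spectrum, noise_spectrum, gain_floor=0.1):
    """Compute one gain factor per band, clamped to a lower limit."""
    gains = []
    for audio, noise in zip(audio_spectrum, noise_spectrum):
        # Assumed rule: the fraction of band energy not attributed to noise.
        raw = 1.0 - noise / audio if audio > 0.0 else 0.0
        # The lower limit prevents over-reduction in bands with very low SNR.
        gains.append(max(gain_floor, raw))
    return gains

def apply_gains(audio_spectrum, gains):
    """Multiply the audio energy in each band by that band's gain factor."""
    return [g * a for g, a in zip(gains, audio_spectrum)]

# Hypothetical three-band spectra (arbitrary energy units).
audio = [8.0, 4.0, 1.0]
noise = [2.0, 3.0, 1.5]
gains = band_gains(audio, noise)
filtered = apply_gains(audio, gains)
```

In the third band the noise estimate exceeds the audio energy, so the unclamped gain would be negative; the lower limit holds it at 0.1 instead.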
However, speech quality from prior art noise suppressors generally tends to degrade in relatively high noise environments. Some noise suppressors cause noise flutter, so-called “musical noise”, composed of tones at random frequencies that are perceptually unpleasant because of their instability. U.S. Pat. Nos. 5,943,429, 7,058,572 B1, 6,766,292 B1, and 6,415,253 B1, incorporated herein by reference, describe modified spectral subtraction algorithms intended to reduce “musical noise”. Berouti et al., in a publication entitled “Enhancement of Speech Corrupted by Acoustic Noise,” Proc. IEEE ICASSP, pp. 208-211 (April 1979), clamp gain factors so that the gain factors have a predetermined lower limit. In addition, Berouti et al. propose increasing the noise power spectral estimate by a small margin, a compensation method referred to as “oversubtraction.” Although clamping and oversubtraction reduce musical noise, they may do so at a cost of degraded speech intelligibility.
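The combined effect of oversubtraction and clamping can be sketched as below. This is an illustration in the spirit of the description above, not a reproduction of the published algorithm: the function name `oversubtracted_gain`, the gain rule, and the values of the oversubtraction factor `alpha` and of `gain_floor` are assumptions.

```python
def oversubtracted_gain(audio, noise, alpha=1.5, gain_floor=0.1):
    """Gain for one band with the noise estimate inflated by alpha and a clamped lower limit."""
    if audio <= 0.0:
        return gain_floor
    # alpha > 1 inflates the noise estimate ("oversubtraction");
    # the floor keeps the gain from dropping below a predetermined lower limit.
    return max(gain_floor, 1.0 - alpha * noise / audio)

# Hypothetical band energies (arbitrary units).
plain = oversubtracted_gain(4.0, 1.0, alpha=1.0)    # no oversubtraction
over = oversubtracted_gain(4.0, 1.0, alpha=2.0)     # noise estimate doubled
clamped = oversubtracted_gain(4.0, 3.0, alpha=2.0)  # raw gain is negative, so the floor applies
```

Raising `alpha` attenuates the band more strongly, which suppresses residual noise peaks but also removes more of the speech energy. That is the intelligibility trade-off noted above.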
Hirsch and Ehrlicher, in a publication entitled “Noise Estimation Techniques for Robust Speech Recognition” (Proc. IEEE Int. Conf. on Acoustics Speech Signal Processing, 1995, pp 153-156), incorporated herein by reference, estimate noise spectra in an audio stream based on an estimate of minimum audio energy during a time period T (about 0.5 seconds) that includes the current frame and a plurality of previous frames. Ris and Dupont, in a publication entitled “Assessing local noise level estimation methods: Application to noise robust ASR” (Speech Communication 34 (2001) pp. 141-158), incorporated herein by reference, review methods of estimating noise spectra in an audio stream. They describe an “envelope follower” method based on energy evolution within frequency bands and in temporal segments covering several hundred milliseconds.
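A noise estimate based on the minimum audio energy during a period T can be sketched with a sliding window of recent frames. This is a simplified illustration, not the method of either cited publication: the class name `MinimumNoiseTracker`, the window length of three frames, and the two-band example spectra are hypothetical.

```python
from collections import deque

class MinimumNoiseTracker:
    """Track the per-band minimum audio energy over the most recent frames."""

    def __init__(self, window_frames):
        # Frames older than the window (period T) are discarded automatically.
        self.history = deque(maxlen=window_frames)

    def update(self, audio_spectrum):
        """Add the current audio spectrum and return the current noise spectrum estimate."""
        self.history.append(list(audio_spectrum))
        # For each band, take the minimum energy across the retained frames.
        return [min(band) for band in zip(*self.history)]

# Hypothetical two-band audio spectra for four consecutive frames.
tracker = MinimumNoiseTracker(window_frames=3)
frames = [[5.0, 2.0], [1.0, 6.0], [4.0, 3.0], [7.0, 9.0]]
estimates = [tracker.update(f) for f in frames]
```

In practice the window length would be chosen so that it spans T. For example, with T of about 0.5 seconds and an assumed frame duration of 20 milliseconds, the window would hold 25 frames.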
U.S. Pat. No. 6,766,292 B1, incorporated herein by reference, describes a method of detecting speech versus noise, and thereby estimating a noise spectrum. The method uses a probabilistic speech presence measure. In some of the prior art, the estimates of noise spectra are carried out adaptively, in response to a continuous update of noise energy estimates. The noise spectrum estimate of U.S. Pat. No. 6,766,292 B1 is made adaptively, responsive to updated estimates of signal to noise ratio (SNR). U.S. Pat. No. 6,445,801, incorporated herein by reference, uses frequency filtering comprising adaptive over-subtraction to suppress noise in an audio stream. U.S. Pat. No. 6,643,619 B1, incorporated herein by reference, uses a noise suppressor having an adaptive filter.