The present invention is directed to wireless and landline based telephone communications and, more particularly, to reducing acoustic noise, such as background noise and system induced noise, present in wireless and landline based communication.
The perceived quality and intelligibility of speech transmitted over wireless or landline based telephone lines is often degraded by the presence of background noise, coding noise, transmission and switching noise, etc. or by the presence of other interfering speakers and sounds. As an example, the quality of speech transmitted during a cellular telephone call may be affected by noises such as car engines, wind and traffic as well as by the condition of the transmission channel used.
Wireless telephone communication is also prone to providing lower perceived sound quality than wire based telephone communication because the speech coding process used during wireless communication removes a portion of the sound. Further, when the signal itself is noisy, the noise is encoded with the signal and further degrades the perceived sound quality because the speech coders used by these systems depend on encoding models intended for clean signals rather than for noisy signals. Wireless service providers, however, such as personal communication service (PCS) providers, attempt to deliver the same service and sound quality as landline telephony providers to attain greater consumer acceptance, and therefore the PCS providers require improved end-to-end voice quality.
Additionally, transmitted noise degrades the capability of speech recognition systems used by various telephone services. The speech recognition systems are typically trained to recognize words or sounds under high transmission quality conditions and may fail to recognize words when noise is present.
In older wireline networks, such as are found in developing countries, system induced noise is often present because of poor wire shielding or the presence of cross talk which degrades sound quality. System induced noise is also present in more modern telephone communication systems because of the presence of channel static or quantization noise.
It is therefore desirable to provide wireless and landline telephone communication in which both the background noise and the system induced noise are reduced.
When noise reduction is carried out prior to encoding the transmitted signal, a significant portion of the additive noise is removed which results in better end-to-end perceived voice quality and robust speech coding. However, noise reduction is not always possible prior to encoding and therefore must be carried out after the signals have been received and/or decoded, such as at a base station or a switching center.
Existing commercial systems typically reduce encoded noise using spectral decomposition and spectral scaling. Known methods include estimating the noise level, computing the filter coefficients, smoothing the signal to noise ratio (SNR), and/or splitting the signal into respective bands. These methods, however, have the shortcoming of producing speech distortions as well as artifacts known as musical noise.
Typically, the known noise reduction methods are based on generating an optimized filter using such methods as Wiener filtering, spectral subtraction and maximum likelihood estimation. However, these methods are based on assumed idealized conditions that are rarely present during actual transmission. Additionally, these methods are not optimized for transmitting human speech or for human perception of speech, and therefore the methods must be altered for transmitting speech signals. Further, the conventional methods assume that the speech and noise spectra or the sub-band signal to noise ratio (SNR) are known beforehand, whereas the actual speech and noise spectra change over time and with transmission conditions. As a result, the band SNR is often incorrectly estimated, which results in the presence of musical noise. Additionally, when Wiener filtering is used, the filtering is based on minimum mean square error (MMSE) optimized conditions that are not always appropriate for transmitting speech signals or for human perception of the speech signals.
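By way of illustration only, the sensitivity of such filters to SNR estimation error can be sketched using the textbook forms of the Wiener gain and the power spectral subtraction gain. These formulas are the standard ones and are not asserted to be those used in any particular cited system; the function names and the 3 dB error figure are assumptions chosen for the example.

```python
import math


def db_to_linear(db: float) -> float:
    """Convert a power ratio in dB to a linear power ratio."""
    return 10.0 ** (db / 10.0)


def wiener_gain(snr_prior: float) -> float:
    """Textbook Wiener gain for one band.

    snr_prior is the (assumed known) ratio of clean speech power to
    noise power in the band, as a linear value.
    """
    return snr_prior / (1.0 + snr_prior)


def spectral_subtraction_gain(snr_post: float, floor: float = 0.0) -> float:
    """Textbook power spectral subtraction gain for one band.

    snr_post is the ratio of noisy signal power to estimated noise
    power (linear). The gain is clamped so that it never drops below
    `floor` and never becomes imaginary.
    """
    if snr_post <= 0.0:
        return floor
    return max(math.sqrt(max(1.0 - 1.0 / snr_post, 0.0)), floor)


if __name__ == "__main__":
    # A band whose true SNR is 0 dB, estimated 3 dB too low and too high.
    for estimate_db in (-3.0, 0.0, 3.0):
        gain = wiener_gain(db_to_linear(estimate_db))
        print(f"estimated SNR {estimate_db:+.0f} dB -> Wiener gain {gain:.3f}")
```

Because the gain changes appreciably for a modest error in the estimated SNR, frame-to-frame estimation errors in a low-SNR band translate into fluctuating band gains, which is one commonly given explanation for musical noise.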
FIG. 1 illustrates a known method of spectral subtraction and scaling to filter noisy speech. A noisy speech signal is first buffered and windowed, as shown at step 102, and then undergoes a fast Fourier transform (FFT) into L frequency bins or bands, as shown at step 104. The energy of each of the bands is computed, as step 106 shows, and the noise level of each of the bands is estimated, as shown at step 110. The SNR is then estimated based on the computed energy and the estimated noise, as shown at step 108, and then a value of the filter gain is determined based on the estimated SNR, as shown at step 112. The calculated value of the gain is used as a multiplier value, as shown in step 114, and then the adjusted L frequency bins or bands undergo an inverse FFT or are passed through a synthesis filter bank, as step 116 shows, to generate an enhanced speech signal ybt.
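The steps of FIG. 1 may be sketched for a single frame as follows. This is a minimal illustration under stated assumptions rather than the implementation of any cited system: each frequency bin is treated as its own band, the noise estimate of step 110 is supplied by the caller, the gain rule of step 112 is the textbook power spectral subtraction rule with an assumed floor `min_gain`, a direct (slow) DFT stands in for the FFT, and overlap-add between frames is omitted.

```python
import cmath
import math


def dft(x):
    """Direct DFT; stands in for the FFT of step 104 to stay dependency-free."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Direct inverse DFT (step 116); only the real part is returned."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def enhance_frame(frame, noise_energy, min_gain=0.1):
    """Apply spectral scaling to one frame.

    frame        -- list of time-domain samples
    noise_energy -- per-bin noise energy estimate (step 110, assumed given)
    min_gain     -- assumed lower bound on the per-bin gain
    Returns (enhanced_samples, gains).
    """
    windowed = [s * w for s, w in zip(frame, hann(len(frame)))]   # step 102
    spectrum = dft(windowed)                                      # step 104
    energy = [abs(c) ** 2 for c in spectrum]                      # step 106
    gains = []
    for e, n in zip(energy, noise_energy):
        snr = e / n if n > 0 else float("inf")                    # step 108
        if snr > 0:
            gain = math.sqrt(max(1.0 - 1.0 / snr, 0.0))           # step 112
        else:
            gain = 0.0
        gains.append(max(gain, min_gain))
    scaled = [g * c for g, c in zip(gains, spectrum)]             # step 114
    return idft(scaled), gains                                    # step 116
```

For example, calling `enhance_frame(frame, [1.0] * len(frame))` with a frame containing a strong sinusoid leaves the bins around the sinusoid nearly unchanged while scaling the remaining bins down to `min_gain`.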
Various methods of carrying out the respective steps shown in FIG. 1 are known in the art:
As an example, U.S. Pat. No. 4,811,404, titled “Noise Suppression System” to R. Vimur et al. which issued on Mar. 7, 1989, describes spectral scaling with sub-banding. The spectral scaling is applied in the frequency domain using an FFT and an IFFT operating on 128 speech samples or data points. The FFT bins are mapped into 16 non-homogeneous bands roughly following the known Bark scale.
When the filter gains are computed for each sub-band, the amount of attenuation for each band is based on a non-linear function of the estimated SNR for that band. Bands having an SNR value less than 0 dB are assigned the minimum gain value of 0.17. Transient noise is detected based on the number of bands that are below or above the threshold value of 0 dB.
Noise energy values are estimated and updated during silent intervals, also known as stationary frames. The silent intervals are determined by first quantizing the SNR values according to a roughly exponential mapping and by then comparing the sum of the SNR values in 16 of the bands, known as a voice metric, to a threshold value. Alternatively, the noise energy value is updated using first-order recursive averaging of the channel energy wherein an integration constant is based on whether the energy of a frame is higher than or similar to the most recently estimated energy value.
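The two noise update strategies described above may be sketched as follows. The quantization table, the voice metric threshold, the ratio used to decide whether a frame energy is "similar to" the current estimate, and the two integration constants are all hypothetical values chosen for illustration; the referenced patent's actual values are not reproduced here.

```python
def quantize_snr(snr_db: float) -> int:
    """Hypothetical stand-in for the 'roughly exponential' SNR mapping.

    The SNR is clipped to 0..30 dB and mapped so that the output
    roughly doubles for every 6 dB of SNR.
    """
    clipped = min(max(snr_db, 0.0), 30.0)
    return int(round(2.0 ** (clipped / 6.0))) - 1


def voice_metric(band_snr_db):
    """Sum of the quantized SNR values over the bands."""
    return sum(quantize_snr(s) for s in band_snr_db)


def is_silent(band_snr_db, threshold=16):
    """Treat the frame as a silent interval if the voice metric is low.

    `threshold` is an assumed value.
    """
    return voice_metric(band_snr_db) < threshold


def update_noise(noise, energy, similar_ratio=2.0,
                 alpha_similar=0.9, alpha_higher=0.99):
    """First-order recursive averaging of the per-band channel energy.

    The integration constant is selected per band: a band whose energy
    is well above the current estimate (assumed: more than
    `similar_ratio` times) is averaged in slowly, otherwise quickly.
    """
    updated = []
    for n, e in zip(noise, energy):
        alpha = alpha_higher if e > similar_ratio * n else alpha_similar
        updated.append(alpha * n + (1.0 - alpha) * e)
    return updated
```

Selecting a larger integration constant when the frame energy is well above the estimate keeps a loud speech frame from pulling the noise estimate upward too quickly.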
Artifacts are removed by detecting very weak frames and then scaling these frames according to the minimum gain value, 0.17. Sudden noise bursts in respective frames are detected by counting the number of bands in the frame whose SNR exceeds a predetermined threshold value. It is assumed that speech frames have a large number of bands that have a high SNR and that a sudden noise burst is characterized by frames in which only a small number of bands have a high SNR.
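The band-counting logic described above may be sketched as follows. The minimum gain of 0.17 is taken from the description; the weak-frame energy threshold, the SNR threshold, the number of bands required for a frame to count as speech, and the label names are assumptions made for the example.

```python
MIN_GAIN = 0.17  # minimum gain value given in the description


def classify_frame(band_snr_db, frame_energy, snr_threshold_db=0.0,
                   min_speech_bands=5, weak_energy=1e-4):
    """Label a frame as 'weak', 'noise', 'burst' or 'speech'.

    A frame with very low energy is 'weak'. Otherwise the number of
    bands whose SNR exceeds the threshold is counted: none means
    'noise', only a few means a sudden noise 'burst', and many means
    'speech'. All thresholds are assumed values.
    """
    if frame_energy < weak_energy:
        return "weak"
    high_bands = sum(1 for s in band_snr_db if s > snr_threshold_db)
    if high_bands == 0:
        return "noise"
    if high_bands < min_speech_bands:
        return "burst"
    return "speech"


def apply_weak_frame_gain(band_gains, label):
    """Scale a very weak frame by the minimum gain; pass others through."""
    if label == "weak":
        return [MIN_GAIN] * len(band_gains)
    return list(band_gains)
```

For instance, a 16-band frame in which only two bands exceed the SNR threshold is labeled as a burst, whereas a frame in which all 16 bands exceed it is labeled as speech.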
Another example, European Patent No. EP 0,588,526 A1, titled “A Method Of And A System For Noise Suppression” to Nokia Mobile Phones Ltd. which issued on Mar. 23, 1994, describes using an FFT for spectral analysis. Formant locations are estimated whereby speech within the formant locations is attenuated less than at other locations.
Noise is estimated only during speech intervals. Each of the filter passbands is split into two sub-bands using a special filter. The filter passbands are arranged such that one of the two sub-bands includes a speech harmonic and the other includes noise or other information and is located between two consecutive harmonic peaks.
Additionally, a random flutter effect is avoided by not updating the filter coefficients during speech intervals. As a result, the filter gains converge poorly during changing noise and speech conditions.
A further example, U.S. Pat. No. 5,485,522, titled “System For Adaptively Reducing Noise In Speech Signals” to T. Solve et al. which issued on Jan. 16, 1996, is directed to attenuation applied in the time domain on the entire frame without sub-banding. The attenuation function is a logarithmic function of the noise level, rather than of the SNR, relative to a predefined threshold. When the noise level is less than the threshold, no attenuation is necessary. The attenuation function, however, is different when speech is detected in a frame than when the frame is purely noise.
A still further example, U.S. Pat. No. 5,432,859, titled “Noise Reduction System” to J. Yang et al. which issued on Jul. 11, 1995, describes using a sliding discrete Fourier transform (DFT). Analysis is carried out on samples, rather than on frames, to avoid random fluctuation of flutter noise. An iterative expression is used to determine the DFT, and no inverse DFT is required. The filter gains of the higher frequency bins, namely those greater than 1 kHz, are set equal to the highest determined gain. The filter gains for the lower frequency bins are calculated based on a known MMSE-based function of the SNR. When the SNR is less than −6 dB, the gains are set to a predetermined small value.
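A sample-by-sample iterative DFT of the kind referred to above may be sketched using the standard sliding DFT recurrence, in which each tracked bin is updated from its previous value, the newest sample and the sample leaving the window. This is the textbook recurrence and is not asserted to be the exact expression of the referenced patent; the class and method names are hypothetical.

```python
import cmath
import math
from collections import deque


class SlidingDFT:
    """Tracks selected DFT bins over the most recent `size` samples."""

    def __init__(self, size, bins):
        self.size = size
        # Per-bin rotation factor exp(j*2*pi*k/size).
        self.twiddle = {k: cmath.exp(2j * math.pi * k / size) for k in bins}
        self.values = {k: 0j for k in bins}
        # The window starts filled with zeros, consistent with zero bin values.
        self.history = deque([0.0] * size, maxlen=size)

    def push(self, sample):
        """Advance the window by one sample and update every tracked bin.

        Recurrence: X_k(n) = (X_k(n-1) - x(n-size) + x(n)) * exp(j*2*pi*k/size)
        """
        oldest = self.history[0]
        self.history.append(sample)  # maxlen drops the oldest sample
        for k, w in self.twiddle.items():
            self.values[k] = (self.values[k] - oldest + sample) * w
        return self.values
```

Each update costs one complex multiplication per tracked bin, independent of the window length, which is what makes per-sample analysis practical.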
It is desirable to provide noise reduction that avoids the weaknesses of the known spectral subtraction and spectral scaling methods.