This invention is in the field of signal processing, and is more specifically directed to noise suppression in the telecommunication of human speech.
Recent advances in telecommunications technology have resulted in widespread use of telephonic equipment in relatively noisy environments. For example, portable cellular telephones are now often used in automobiles, out of doors, or in other environments having significant background acoustic noise. The level of acoustic noise is exacerbated in hands-free cellular telephones, particularly when used in automobiles. High levels of noise are not limited to wireless telephones, as speakerphones are now commonly used in many homes and offices. As a result, techniques for the suppression of noise (or, conversely, the enhancement of signal) are of particular importance in the field of telecommunications.
So-called "active" noise suppression techniques have been developed for use in some telephonic applications. Active noise suppression relies on the presence of multiple microphones, such as may be present in advanced teleconferencing systems; analysis and combination of the signals received by the multiple microphones is then used to identify and suppress noise components in the received signal. However, cost considerations have resulted in the widespread prevalence of single-microphone telephonic equipment, particularly in the wireless telephone market, for which active noise suppression techniques are not an option.
"Passive" noise suppression techniques refer to the class of approaches in which the amplitude of noise in a transmitted signal is reduced through processing of a signal from an individual source. A major class of passive noise suppression techniques is referred to in the art as spectral subtraction. Spectral subtraction, in general, considers the transmitted noisy signal as the sum of the desired speech with a noise component. The spectrum of the noise component is estimated, generally during time windows that are determined to be "non-speech". The estimated noise spectrum is then subtracted, in the frequency domain, from the transmitted noisy signal to yield the remaining desired speech signal.
A typical spectral subtraction routine, as implemented in conventional digital wireless telephone equipment, is based on the Fast Fourier Transform (FFT), as is readily performable by digital signal processors (DSPs) such as those available from Texas Instruments Incorporated. Examples of spectral subtraction approaches are described in Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 2 (April, 1979), pp. 113-120, and in Berouti, et al., "Enhancement of Speech Corrupted by Acoustic Noise", Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (IEEE, April 1979), pp. 208-211. In this conventional approach, an FFT is performed to transform the noisy speech signal into the frequency domain. Spectral subtraction utilizes a frequency-domain filter operator $G(\omega)$ that is derived from an estimate $P_n(\omega)$ of the power spectrum of the noise in the signal and the power spectrum $P_x(\omega)$ of the noisy speech signal $X(\omega)$. Typically, the estimate of the noise power spectrum is based on the assumption that noise is constant over both speech and non-speech time intervals of the signal; the noise power spectrum estimate $P_n(\omega)$ is thus simply set equal to the power spectrum $P_x(\omega)$ of the input signal $X(\omega)$ during non-speech intervals. The conventional frequency-domain filter operator $G(\omega)$ is derived as: ##EQU1## This frequency-domain filter operator $G(\omega)$ is applied to the noisy speech spectrum $X(\omega)$ to produce an estimate $S(\omega)$ of the spectrum of the speech component as follows:

$$S(\omega) = G(\omega)\,X(\omega)$$

Inverse FFT of the estimate $S(\omega)$ will then render a filtered time-domain speech signal.
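By way of illustration only, the conventional sequence described above (transform, apply a gain in each frequency bin, inverse transform) may be sketched as follows. The gain expression used here, the square root of max(1 - Pn/Px, 0), is one common form from the spectral subtraction literature and is an assumption of this sketch; it is not asserted to be the filter operator of the equation referenced above. The function names, the frame length, the noise level, and the use of a sine tone in place of speech are likewise hypothetical, and a naive DFT stands in for the FFT so that the sketch needs only the Python standard library.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) DFT; a real implementation would use an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def power_spectrum(x):
    """Squared magnitude of each DFT bin."""
    return [abs(c) ** 2 for c in dft(x)]


def spectral_subtract(frame, noise_power):
    """Return the filtered time-domain frame, applying S = G * X per bin.

    ASSUMPTION: G = sqrt(max(1 - Pn/Px, 0)), a common textbook gain.
    """
    spectrum = dft(frame)
    filtered = []
    for x_k, p_n in zip(spectrum, noise_power):
        p_x = abs(x_k) ** 2
        gain = math.sqrt(max(1.0 - p_n / p_x, 0.0)) if p_x > 0.0 else 0.0
        filtered.append(gain * x_k)
    return idft(filtered)


random.seed(0)
N = 64          # hypothetical frame length
SIGMA = 0.3     # hypothetical noise standard deviation

# Noise estimate: power spectrum of a frame taken to be "non-speech".
noise_only = [random.gauss(0.0, SIGMA) for _ in range(N)]
noise_power = power_spectrum(noise_only)

# Noisy "speech" frame: a sine tone stands in for the speech component.
tone = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
noisy = [s + random.gauss(0.0, SIGMA) for s in tone]

enhanced = spectral_subtract(noisy, noise_power)
```

Because the noise estimate is taken from a single frame, the gain in each bin depends on one noisy measurement; practical systems typically average the estimate over several non-speech frames.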
The quality of a noise suppression technique depends, of course, upon its ability to eliminate acoustic noise without distorting the speech signal, and without itself introducing noise into the signal. While spectral subtraction does reduce the level of noise in the signal, other undesirable effects have been observed. One such effect is the introduction of "musical noise" into the signal which appears during non-speech intervals in the signal. Musical noise is due to measurement error in the estimate of the noise power spectrum, which causes the filter operator $G(\omega)$ to randomly vary across frequency and over time, producing fluctuating tonal noise that some observers have found to be more annoying than the original background acoustic noise. In addition, inaccuracies in distinguishing between speech and non-speech intervals, as necessary in estimating the noise spectrum, have been observed to clip the desired speech signal (when falsely detecting a non-speech interval) and to be insensitive to changes in the background noise (in effect, falsely detecting a speech interval).
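The origin of musical noise described above may be illustrated with a small numerical sketch. Two frames containing only noise are filtered against the same averaged noise estimate; ideally every gain would be zero, but the frame-to-frame fluctuation of the measured power spectrum typically leaves a different, scattered set of bins with nonzero gain in each frame. The gain expression, the frame length, the number of averaged frames, and all names are assumptions of this sketch.

```python
import cmath
import math
import random


def power_spectrum(x):
    """Squared magnitude of each bin of a naive DFT."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]


def gain(p_x, p_n):
    """ASSUMED subtraction-style gain: sqrt(max(1 - Pn/Px, 0))."""
    return math.sqrt(max(1.0 - p_n / p_x, 0.0)) if p_x > 0.0 else 0.0


random.seed(1)
N = 32  # hypothetical frame length


def noise_frame():
    """One frame of white Gaussian noise (no speech present)."""
    return [random.gauss(0.0, 1.0) for _ in range(N)]


# Noise estimate averaged over several frames known to be non-speech.
training = [power_spectrum(noise_frame()) for _ in range(8)]
noise_est = [sum(p[k] for p in training) / len(training) for k in range(N)]

# Gains computed for two further noise-only frames.
gains_a = [gain(p_x, p_n)
           for p_x, p_n in zip(power_spectrum(noise_frame()), noise_est)]
gains_b = [gain(p_x, p_n)
           for p_x, p_n in zip(power_spectrum(noise_frame()), noise_est)]

# Bins that are not fully suppressed; the sets generally differ per frame,
# which is heard as short tones appearing and vanishing at random.
open_a = [k for k, g in enumerate(gains_a) if g > 0.0]
open_b = [k for k, g in enumerate(gains_b) if g > 0.0]
```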
By way of further background, division of noisy speech signals into multiple sub-bands for noise suppression processing is known in the art, for example as described in Yang, "Frequency Domain Noise Suppression Approaches in Mobile Telephone Systems", Proceedings of the ICASSP-93, Vol. II (1993), pp. 363-366, relative to spectral subtraction techniques. Sub-band division of the noisy speech signal is also known in connection with the noise suppression technique of all-pole based Wiener filtering, as described in Yoo, "Selective All-Pole Modeling of Degraded Speech Using M-Band Decomposition", Proceedings of the ICASSP-96 (1996), pp. 641-644. Each of these approaches divides the input signal into substantially equally spaced frequency bands.
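A division into substantially equally spaced bands may be sketched, for illustration only, by grouping the bins of a power spectrum into bands of equal width and computing one quantity per band. The per-band gain shown is an assumed subtraction-style expression and does not reproduce the processing of either cited reference; the function names and the sample spectrum are hypothetical.

```python
import math


def uniform_band_edges(n_bins, n_bands):
    """Bin indices delimiting n_bands bands of (nearly) equal width."""
    return [i * n_bins // n_bands for i in range(n_bands + 1)]


def band_powers(power, n_bands):
    """Total power falling in each uniform band."""
    edges = uniform_band_edges(len(power), n_bands)
    return [sum(power[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def band_gains(signal_power, noise_power, n_bands):
    """One ASSUMED gain per band: sqrt(max(1 - Pn/Px, 0))."""
    p_x = band_powers(signal_power, n_bands)
    p_n = band_powers(noise_power, n_bands)
    return [math.sqrt(max(1.0 - n / x, 0.0)) if x > 0.0 else 0.0
            for x, n in zip(p_x, p_n)]


# Hypothetical one-sided power spectrum with 32 bins, split into 4 bands.
power = [float(k % 5) for k in range(32)]
edges = uniform_band_edges(len(power), 4)
bands = band_powers(power, 4)
```

Computing one gain per band rather than one per bin averages the measurement over several bins, which reduces the variance of each gain at the cost of frequency resolution.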
By way of further background, another type of noise suppression utilizes the simultaneous masking effect of the human ear. It has been observed that the human ear ignores, or at least tolerates, additive noise so long as its amplitude remains below a masking threshold in each of multiple critical frequency bands within the human ear; as is well known in the art, a critical band is a band of frequencies that are equally perceived by the human ear. Virag, "Speech Enhancement Based on Masking Properties of the Auditory System", Proceedings of the ICASSP-95 (1995), pp. 796-799, describes a technique in which masking thresholds are defined for each critical band, and are used in optimizing spectral subtraction to account for the extent to which noise is masked during speech intervals. Azirani, et al., "Optimizing Speech Enhancement by Exploiting Masking Properties of the Human Ear", Proceedings of the ICASSP-95 (1995), pp. 800-803, use sub-band masking thresholds to determine, for each time interval, whether noise is masked. Optimal estimators are then derived for the masked and unmasked states to reduce both musical noise and speech distortion in the noisy speech signal. Each of the Virag and Azirani et al. approaches utilizes an FFT "front-end", with the critical band analysis used in calculation of gain factors only.
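The non-uniform spacing of critical bands may be illustrated with the following sketch, which maps the bins of an FFT "front-end" to critical-band indices using the Zwicker-Terhardt approximation of the Bark scale, and then applies a per-band comparison against masking thresholds. The choice of that approximation, the 4 kHz Nyquist frequency, and all function names are assumptions of this sketch; the thresholds are taken as given inputs, since the threshold calculations of the cited references are not reproduced here.

```python
import math


def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of critical-band rate (in Bark)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


def critical_band_of_bins(n_bins, nyquist_hz):
    """Critical-band index (integer part of the Bark value) for each bin."""
    return [int(hz_to_bark(k * nyquist_hz / n_bins)) for k in range(n_bins)]


def bins_per_band(band_index):
    """Number of FFT bins that fall into each critical band."""
    counts = {}
    for b in band_index:
        counts[b] = counts.get(b, 0) + 1
    return counts


def masked_bands(noise_power_per_band, threshold_per_band):
    """True where the noise stays below the (given) masking threshold."""
    return [n < t for n, t in zip(noise_power_per_band, threshold_per_band)]


# 256 bins spanning 0 to 4 kHz (hypothetical 8 kHz sampling rate).
band_index = critical_band_of_bins(256, 4000.0)
counts = bins_per_band(band_index)
```

Under these assumptions the lowest critical band spans only a few bins while the bands near 3 kHz span several times as many, in contrast to the equally spaced bands of conventional sub-band division.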
By way of still further background, signal processing transforms known as the extended lapped transform (ELT) and hierarchical lapped transform (HLT) are known in the art. These transforms are described as providing an intermediate solution between the efficient technique of transform coding, which is not particularly suitable for the implementation of bandpass filter banks, and the perfect reconstruction provided by sub-band coding, which comes at the expense of computational complexity. Examples of the HLT and ELT signal processing techniques are described in H. S. Malvar, "Lapped Transforms for Efficient Transform/Sub-band Coding," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 6 (June 1990) pp. 969-978; H. S. Malvar, "Extended Lapped Transforms: Properties, Applications, and Fast Algorithms," IEEE Transactions on Signal Processing, Vol. 40, No. 11 (November 1992) pp. 2703-2714; and H. S. Malvar, "Efficient Signal Coding with Hierarchical Lapped Transforms," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-90) (April 1990) pp. 1519-1522.
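The general principle of a lapped transform may be illustrated with the following sketch of a modulated lapped transform with a sine window, which is a simpler relative of the ELT and is not the ELT or HLT of the cited references. Each block of 2M samples yields M coefficients, adjacent blocks overlap by M samples, and overlap-adding the inverse-transformed blocks reconstructs the interior of the signal. The direct O(M^2) evaluation, the block size, the test signal, and all function names are assumptions of this sketch; the cited papers describe fast algorithms.

```python
import math


def mlt_window(m):
    """Sine window of length 2M; w[n]^2 + w[n + M]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]


def mlt_basis(n, k, m):
    """Cosine modulation shared by the forward and inverse transforms."""
    return math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))


def mlt_forward(block, m):
    """M coefficients from a windowed block of 2M samples."""
    w = mlt_window(m)
    return [sum(w[n] * block[n] * mlt_basis(n, k, m) for n in range(2 * m))
            for k in range(m)]


def mlt_inverse(coeffs, m):
    """Windowed, time-aliased block of 2M samples from M coefficients."""
    w = mlt_window(m)
    return [2.0 / m * w[n] * sum(coeffs[k] * mlt_basis(n, k, m)
                                 for k in range(m))
            for n in range(2 * m)]


def analyze_synthesize(signal, m):
    """Transform overlapping blocks (hop M) and overlap-add the inverses."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * m + 1, m):
        coeffs = mlt_forward(signal[start:start + 2 * m], m)
        # Sub-band processing (e.g. per-coefficient gains) would go here.
        for n, v in enumerate(mlt_inverse(coeffs, m)):
            out[start + n] += v
    return out


M = 8  # hypothetical number of sub-bands per block
signal = [math.sin(0.3 * i) + 0.1 * i for i in range(8 * M)]
restored = analyze_synthesize(signal, M)
```

The first and last M samples are covered by only one block each and are therefore not reconstructed in this sketch; every interior sample is covered by two overlapping blocks whose aliasing terms cancel.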