Not applicable.
The present invention relates to suppressing noise in telecommunications systems. In particular, the present invention relates to suppressing noise in single channel systems or single channels in multiple channel systems.
Speech quality enhancement is an important feature in speech communication systems. Cellular telephones, for example, are often operated in the presence of high levels of environmental background noise present in moving vehicles. Background noise causes significant degradation of the speech quality at the far end receiver, making the speech barely intelligible. In such circumstances, speech enhancement techniques may be employed to improve the quality of the received speech, thereby increasing customer satisfaction and encouraging longer talk times.
Past noise suppression systems typically utilized some variation of spectral subtraction. FIG. 1 shows an example of a noise suppression system 100 that uses spectral subtraction. A spectral decomposition of the input noisy speech-containing signal 102 is first performed using the filter bank 104. The filter bank 104 may be a bank of bandpass filters such as, for example, the bandpass filters disclosed in R. J. McAulay and M. L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, no. 2, (April 1980), pp. 137-145. In this context, noise refers to any undesirable signal present in the speech signal including: 1) environmental background noise; 2) echo such as due to acoustic reflections or electrical reflections in hybrids; 3) mechanical and/or electrical noise added due to specific hardware such as tape hiss in a speech playback system; and 4) non-linearities due to, for example, signal clipping or quantization by speech compression.
The filter bank 104 decomposes the signal into separate frequency bands. For each band, power measurements are performed and continuously updated over time in the noisy signal power and noise power estimator 106. These power measures are used to determine the signal-to-noise ratio (SNR) in each band. The voice activity detector 108 is used to distinguish periods of speech activity from periods of silence. The noise power in each frequency band is updated only during silence while the noisy signal power is tracked at all times. For each frequency band, a gain (attenuation) factor is computed in the gain computer 110 based on the SNR of the band to attenuate the signal in the gain multiplier 112. Thus, each frequency band of the noisy input speech signal is attenuated based on its SNR. In this context, speech signal refers to an audio signal that may contain speech, music or other information bearing audio signals (e.g., DTMF tones, silent pauses, and noise).
A more sophisticated approach may also use an overall SNR level in addition to the individual SNR values to compute the gain factors for each band. The overall SNR is estimated in the overall SNR estimator 114. The gain factor computations for each band are performed in the gain computer 110. The attenuation of the signals in different bands is accomplished by multiplying the signal in each band by the corresponding gain factor in the gain multiplier. Low SNR bands are attenuated more than the high SNR bands. The amount of attenuation is also greater if the overall SNR is low. The possible dynamic range of the SNR of the input signal is large. As such, the speech enhancement system must be capable of handling both very clean speech signals from wireline telephones as well as very noisy speech from cellular telephones. After the attenuation process, the signals in the different bands are recombined into a single, clean output signal 116. The resulting output signal 116 will have an improved overall perceived quality.
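The per-band power tracking and gain computation described above may be illustrated by the following minimal sketch. It is not taken from any of the cited systems: the names BandState, update_band and band_gain, the exponential averaging constant, the gain floor, the over-subtraction factor and the overall SNR threshold are all assumptions chosen for illustration, and the ratio of noisy signal power to noise power is used as a stand-in for the band SNR.

```python
from dataclasses import dataclass


@dataclass
class BandState:
    """Power estimates for one frequency band (hypothetical structure)."""
    noisy_power: float = 1e-6  # tracked at all times
    noise_power: float = 1e-6  # updated only during silence


def update_band(state, sample, speech_active, alpha=0.99):
    """Update the power estimates with one band-limited sample.

    alpha is an assumed exponential-averaging constant.
    """
    inst = sample * sample
    state.noisy_power = alpha * state.noisy_power + (1.0 - alpha) * inst
    if not speech_active:
        # Noise power is adapted only when the voice activity detector
        # reports silence.
        state.noise_power = alpha * state.noise_power + (1.0 - alpha) * inst
    return state


def band_gain(state, overall_snr, floor=0.1, low_snr_threshold=4.0):
    """Return an attenuation factor in [floor, 1] for one band.

    Assumed rule: power spectral subtraction, with the subtraction
    doubled when the overall SNR is low, so that low-SNR bands and
    low overall SNR both lead to more attenuation.
    """
    factor = 2.0 if overall_snr < low_snr_threshold else 1.0
    gain = 1.0 - factor * state.noise_power / state.noisy_power
    return min(1.0, max(floor, gain))
```

In this sketch a band whose noisy power is ten times its noise power receives a gain near 0.9 when the overall SNR is high, while a band whose noisy power barely exceeds its noise power is held at the floor.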
In this context, speech enhancement system refers to an apparatus or device that enhances the quality of a speech signal in terms of human perception or in terms of another criterion such as accuracy of recognition by a speech recognition device, by suppressing, masking, canceling or removing noise or otherwise reducing the adverse effects of noise. Speech enhancement systems include apparatuses or devices that modify an input signal in ways such as, for example: 1) generating a wider bandwidth speech signal from a narrow bandwidth speech signal; 2) separating an input signal into several output signals based on certain criteria, e.g., separation of speech from different speakers where a signal contains a combination of the speakers' speech signals; and 3) processing (for example by scaling) different "portions" of an input signal separately and/or differently, where a "portion" may be a portion of the input signal in time (e.g., in speaker phone systems) or may include particular frequency bands (e.g., in audio systems that boost the bass), or both.
The decomposition of the input noisy speech-containing signal can also be performed using Fourier transform techniques or wavelet transform techniques. FIG. 2 shows the use of discrete Fourier transform techniques (shown as the Windowing and FFT block 202). Here a block of input samples is transformed to the frequency domain. The magnitudes of the complex frequency domain elements are attenuated at the attenuation unit 208 based on the spectral subtraction principles described above. The phases of the complex frequency domain elements are left unchanged. The complex frequency domain elements are then transformed back to the time domain via an inverse discrete Fourier transform in the IFFT block 204, producing the output signal 206. Instead of Fourier transform techniques, wavelet transform techniques may be used to decompose the input signal.
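The frequency domain processing of FIG. 2 may be sketched as follows. This is an illustration only: a naive discrete Fourier transform is used in place of an FFT, windowing and block overlap are omitted, and the function names dft, idft and attenuate_magnitudes are hypothetical. The sketch shows only that each complex element is scaled in magnitude while its phase is preserved.

```python
import cmath
import math


def dft(block):
    """Naive discrete Fourier transform of a block of real samples."""
    n = len(block)
    return [
        sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)).real / n
        for t in range(n)
    ]


def attenuate_magnitudes(spectrum, gains):
    """Scale the magnitude of each element, leaving its phase unchanged."""
    out = []
    for element, gain in zip(spectrum, gains):
        magnitude, phase = cmath.polar(element)
        out.append(cmath.rect(gain * magnitude, phase))
    return out
```

For a real input block, the gains would normally be chosen symmetrically about the middle of the spectrum so that the inverse transform remains real; the sketch simply discards any residual imaginary part. Note also that a full block of samples must be accumulated before dft can be called, which is the source of the processing delay associated with this approach.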
A voice activity detector may be used with noise suppression systems. Such a voice activity detector is presented in, for example, U.S. Pat. No. 4,351,983 to Crouse et al. In such detectors, the power of the input signal is compared to a variable threshold level. Whenever the threshold is exceeded, the system assumes speech is present. Otherwise, the signal is assumed to contain only background noise.
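A power-threshold detector of the general kind described above may be sketched as follows. The sketch is not the detector of the cited patent: the frame length, the margin, the averaging constant, and the rule by which the threshold follows a running noise-floor estimate are all assumptions made for illustration.

```python
def detect_voice_activity(samples, frame_len=80, margin=4.0, alpha=0.9,
                          initial_floor=1e-4, min_floor=1e-8):
    """Return one True/False speech decision per frame.

    The frame power is compared to a variable threshold equal to
    margin times a running noise-floor estimate (assumed rule). The
    floor is adapted only on frames judged to contain no speech.
    """
    decisions = []
    floor = initial_floor
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        active = power > margin * floor
        if not active:
            # Track the background noise level during silence.
            floor = max(min_floor, alpha * floor + (1.0 - alpha) * power)
        decisions.append(active)
    return decisions
```

Because a decision is produced only once per frame and the floor adapts gradually, a detector of this kind reacts to the onset of a signal with some delay.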
For most implementations of speech enhancement, it is desirable to minimize processing delay. As such, the use of Fourier or wavelet transform techniques for spectral decomposition is undesirable because these techniques introduce large delays when accumulating a block of samples for processing.
Low computational complexity is also desirable as the network noise suppression system may process multiple independent voice channels simultaneously. Furthermore, limiting the types of computations to addition, subtraction and multiplication is preferred to facilitate a direct digital hardware implementation as well as to minimize processing in a fixed-point digital signal processor-based implementation. Division is computationally intensive in digital signal processors and is also cumbersome for direct digital hardware implementation. Finally, the memory storage requirements for each channel should be minimized due to the need to process multiple independent voice channels simultaneously.
Speech enhancement techniques must also address information tones such as DTMF (dual-tone multi-frequency) tones. DTMF tones are typically generated by push-button/tone-dial telephones when any of the buttons are pressed. The extended touch-tone telephone keypad has 16 keys: (1,2,3,4,5,6,7,8,9,0,*,#,A,B,C,D). The keys are arranged in a four by four array. Pressing one of the keys causes an electronic circuit to generate two tones. As shown in Table 1, there is a low frequency tone for each row and a high frequency tone for each column. Thus, the row frequencies are referred to as the Low Group and the column frequencies, the High Group. In this way, sixteen unique combinations of tones can be generated using only eight unique tones. Table 1 shows the keys and the corresponding nominal frequencies. (Although discussed with respect to DTMF tones, the principles discussed with respect to the present invention are applicable to all inband signals. In this context, an inband signal refers to any kind of tonal signal within the bandwidth normally used for voice transmission such as, for example, facsimile tones, dial tones, busy signal tones, and DTMF tones).
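The row and column arrangement described above may be sketched as a simple lookup. The frequencies listed are the standard nominal DTMF values; the names LOW_GROUP_HZ, HIGH_GROUP_HZ, KEYPAD, dtmf_frequencies and dtmf_tone are hypothetical, and the 8000 Hz sampling rate and the amplitude are assumptions made for illustration.

```python
import math

LOW_GROUP_HZ = (697, 770, 852, 941)        # one tone per keypad row
HIGH_GROUP_HZ = (1209, 1336, 1477, 1633)   # one tone per keypad column
KEYPAD = ("123A", "456B", "789C", "*0#D")  # four by four key array


def dtmf_frequencies(key):
    """Return the (low, high) nominal frequencies in Hz for one key."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return LOW_GROUP_HZ[row], HIGH_GROUP_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_tone(key, duration_ms=45, sample_rate=8000, amplitude=0.5):
    """Generate the sum of the two tones for a key as a list of samples."""
    low, high = dtmf_frequencies(key)
    count = sample_rate * duration_ms // 1000
    return [
        amplitude * (math.sin(2.0 * math.pi * low * n / sample_rate)
                     + math.sin(2.0 * math.pi * high * n / sample_rate))
        for n in range(count)
    ]
```

For example, dtmf_frequencies("1") selects the first row and first column, giving 697 Hz and 1209 Hz, and a 45 ms tone at the assumed sampling rate contains 360 samples.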
DTMF tones are typically less than 100 milliseconds (ms) in duration and can be as short as 45 ms. These tones may be transmitted during telephone calls to automated answering systems of various kinds. These tones are generated by a separate DTMF circuit whose output is added to the processed speech signal before transmission.
In general, DTMF signals may be transmitted at a maximum rate of ten digits/second. At this maximum rate, for each 100 ms timeslot, the dual tone generator must generate touch-tone signals of duration at least 45 ms and not more than 55 ms, and then remain quiet during the remainder of the timeslot. When not transmitted at the maximum rate, a tone pair may last any length of time, but each tone pair must be separated from the next pair by at least 40 ms.
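The timing constraints stated above reduce to simple arithmetic, sketched below. The function names and the 8000 Hz sampling rate used for converting milliseconds to samples are assumptions made for illustration; the limits themselves are the values given in the text.

```python
SLOT_MS = 100      # timeslot at the maximum rate of ten digits/second
MIN_TONE_MS = 45   # shortest permitted tone at the maximum rate
MAX_TONE_MS = 55   # longest permitted tone at the maximum rate
MIN_GAP_MS = 40    # minimum separation between tone pairs otherwise


def valid_at_maximum_rate(tone_ms):
    """True if a tone of this duration fits the maximum-rate timeslot."""
    return MIN_TONE_MS <= tone_ms <= MAX_TONE_MS


def silence_in_slot_ms(tone_ms):
    """Quiet time remaining in a timeslot after a valid tone."""
    if not valid_at_maximum_rate(tone_ms):
        raise ValueError("tone duration outside 45-55 ms")
    return SLOT_MS - tone_ms


def valid_gap(gap_ms):
    """True if two tone pairs are separated by at least the minimum gap."""
    return gap_ms >= MIN_GAP_MS


def ms_to_samples(duration_ms, sample_rate=8000):
    """Convert a duration to a sample count at the assumed sampling rate."""
    return sample_rate * duration_ms // 1000
```

A 45 ms tone thus leaves 55 ms of silence in its timeslot, and any processing that removes even a few milliseconds from such a tone takes it below the 45 ms lower limit.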
In past speech enhancement systems, however, DTMF tones were often partially suppressed. Suppression of DTMF tones occurred because voice activity detectors and/or DTMF tone detectors required some delay before they were able to determine the presence of a signal. Once the presence of a signal was detected, there was still a lag time before the gain factors for the appropriate frequency bands reached their correct (high) values. This reaction time often caused the initial part of the tones to be heavily suppressed. Hence, short-duration DTMF tones may be shortened even further by the speech enhancement system. FIG. 7 shows an input signal 702 containing a 697 Hz tone 704 of duration 45 ms (360 samples). The output signal 706 is heavily suppressed initially, until the voice activity detector detects the signal presence. Then, the gain factor 708 gradually increases to prevent attenuation. Thus, the output is a shortened version of the input tone, which, in this example, does not meet general minimum duration requirements for DTMF tones.
As a result of the shortening of the DTMF tones, the receiver may not detect the DTMF tones correctly due to the tones failing to meet the minimum duration requirements. As can be seen in FIG. 7, the gain factor 708 never reaches its maximum value of unity because it is dependent on the SNR of the band. This causes the output signal 706 to be always attenuated slightly, which may be sufficient to prevent the signal power from meeting the threshold of the receiver's DTMF detector. Furthermore, the gain factors for different frequency bands may be sufficiently different so as to increase the difference in the amplitudes of the dual tones. This further increases the likelihood that the receiver will not correctly detect the DTMF tones.
The shortcomings discussed above were present in past noise suppression systems. The system disclosed, for example, in U.S. Pat. Nos. 4,628,529, 4,630,304, and 4,630,305 to Borth et al. was designed to operate in high background noise environments. However, operation under a wide range of SNR conditions is preferable. Furthermore, software division is used in Borth's methods. Computationally intensive division operations are also used in U.S. Pat. No. 4,454,609 to Kates. The use of minimum mean-square error log-spectral amplitude estimators such as that disclosed in U.S. Pat. No. 5,012,519 to Adlersberg et al. is also computationally intensive. Furthermore, the system disclosed in Adlersberg uses Fourier transforms for spectral decomposition that introduce undesirable delay. Moreover, although a DTMF tone generator is presented in Texas Instruments Application Report, "DTMF Tone Generation and Detection: An Implementation Using the TMS320C54x," 1997, pp. 5-12, 20, A-1, A-2, B-1, B-2, there are no systems that extend and/or regenerate suppressed DTMF tones.
A need has long existed in the industry for a noise suppression system having low computational complexity. Moreover, a need has long existed in the industry for a noise suppression system capable of extending and/or regenerating partially suppressed DTMF tones.
An apparatus embodiment of the invention is useful in a communications system for processing a communication signal comprising speech and noise components derived from speech and noise. In such an environment, the quality of the communication signal can be enhanced by providing a processor arranged to:
divide the communication signal into a plurality of frequency band signals including speech and noise components due to said speech and noise;
generate first power signals for the frequency band signals, each first power signal being based on estimating over a first time period the power of one of said frequency band signals;
generate second power signals for the frequency band signals, each second power signal being based on estimating over a second time period less than the first time period the power of one of said frequency band signals;
generate condition signals representing conditions of the frequency band signals in response to predetermined relationships between at least the first power signals and second power signals;
adjust the gain of the frequency band signals in response to the condition signals to generate adjusted frequency band signals; and
combine the adjusted frequency band signals to generate an adjusted communication signal.
A method embodiment of the invention is useful in a communications system for processing a communication signal comprising speech and noise components derived from speech and noise. In such an environment, the quality of the communication signal is enhanced by a method comprising:
dividing the communication signal into a plurality of frequency band signals including speech and noise components due to said speech and noise;
generating first power signals for the frequency band signals, each first power signal being based on estimating over a first time period the power of one of said frequency band signals;
generating second power signals for the frequency band signals, each second power signal being based on estimating over a second time period less than the first time period the power of one of said frequency band signals;
generating condition signals representing conditions of the frequency band signals in response to predetermined relationships between at least the first power signals and second power signals;
adjusting the gain of the frequency band signals in response to the condition signals to generate adjusted frequency band signals; and
combining the adjusted frequency band signals to generate an adjusted communication signal.
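The steps recited above may be illustrated for already-divided frequency band signals by the following minimal sketch. It is an illustration under stated assumptions rather than the disclosed implementation: the filter bank is omitted, the first and second power signals are formed by exponential averaging with a long and a short time constant, the "predetermined relationship" is assumed to be the short-term power exceeding a fixed multiple of the long-term power, and the two gain values are arbitrary. The names process_band and combine_bands are hypothetical.

```python
def process_band(samples, long_alpha=0.999, short_alpha=0.9,
                 ratio=4.0, low_gain=0.5):
    """Adjust the gain of one frequency band signal.

    Returns (adjusted_samples, condition_signals). The comparison uses
    multiplication only, so no division is required.
    """
    adjusted, conditions = [], []
    long_power = short_power = None
    for sample in samples:
        inst = sample * sample
        if long_power is None:
            # Assumed initialization: start both estimates at the
            # power of the first sample.
            long_power = short_power = inst
        # First power signal: estimated over the longer time period.
        long_power = long_alpha * long_power + (1.0 - long_alpha) * inst
        # Second power signal: estimated over the shorter time period.
        short_power = short_alpha * short_power + (1.0 - short_alpha) * inst
        # Condition signal from the assumed relationship between the two.
        condition = short_power > ratio * long_power
        gain = 1.0 if condition else low_gain
        adjusted.append(gain * sample)
        conditions.append(condition)
    return adjusted, conditions


def combine_bands(adjusted_bands):
    """Sum the adjusted band signals sample by sample."""
    return [sum(values) for values in zip(*adjusted_bands)]
```

In this sketch the short-term estimate responds within a few samples to a sudden rise in band power, whereas the long-term estimate changes slowly, so the condition signal changes state almost immediately at the onset of a strong signal in the band.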
The aforementioned method of adapting the noise-to-signal ratio (NSR) values during speech is different from that used in the presence of DTMF tones. For DTMF tones, the quick adjustment of the NSR values for the appropriate frequency bands containing the DTMF tones maximizes the amount of the DTMF tones that are passed through transparently. In the case of speech, the NSR values are preferably adapted more slowly to correspond to the nature of speech signals.
In an alternative embodiment of the present invention, a method for suppressing noise is presented.
An alternative embodiment of the present invention includes a method and apparatus for extending DTMF tones. Yet another embodiment of the present invention includes regenerating DTMF tones.