This invention relates to communication system noise cancellation techniques, and more particularly relates to detection of signals in such systems derived from speech.
The need for speech quality enhancement in single-channel speech communication systems has increased in importance, especially due to the tremendous growth in cellular telephony. Cellular telephones are often operated in the presence of high levels of environmental background noise, such as in moving vehicles. Such high levels of noise cause significant degradation of the speech quality at the far end receiver. In such circumstances, speech enhancement techniques may be employed to improve the quality of the received speech so as to increase customer satisfaction and encourage longer talk times.
Most noise suppression systems utilize some variation of spectral subtraction. FIG. 1A shows an example of a typical prior noise suppression system that uses spectral subtraction. A spectral decomposition of the input noisy speech-containing signal is first performed using the Filter Bank. The Filter Bank may be a bank of bandpass filters (such as in reference [1], which is identified at the end of the description of the preferred embodiments). The Filter Bank decomposes the signal into separate frequency bands. For each band, power measurements are performed and continuously updated over time in the Noisy Signal Power and Noise Power Estimation block. These power measures are used to determine the signal-to-noise ratio (SNR) in each band. The Voice Activity Detector is used to distinguish periods of speech activity from periods of silence. The noise power in each band is updated primarily during silence, while the noisy signal power is tracked at all times. For each frequency band, a gain (attenuation) factor is computed based on the SNR of the band and is used to attenuate the signal in the band. Thus, each frequency band of the noisy input speech signal is attenuated based on its SNR.
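The per-band processing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any cited reference: the function names, the exponential smoothing constant alpha, the gain floor, and the specific gain rule (square root of the estimated clean-speech fraction of the band power) are all hypothetical choices. The sketch also simplifies "primarily during silence" to "only during silence".

```python
def update_powers(noisy_power, noise_power, band_samples, speech_active, alpha=0.9):
    """Update per-band power estimates in place from one sample per band.

    alpha is an assumed exponential smoothing constant (hypothetical value).
    """
    for k, x in enumerate(band_samples):
        inst = x * x  # instantaneous power of the band signal
        # The noisy signal power is tracked at all times.
        noisy_power[k] = alpha * noisy_power[k] + (1.0 - alpha) * inst
        # Simplification: the noise power is updated only during silence.
        if not speech_active:
            noise_power[k] = alpha * noise_power[k] + (1.0 - alpha) * inst


def band_gains(noisy_power, noise_power, gain_floor=0.1):
    """Return one attenuation factor per band (assumed gain rule)."""
    gains = []
    for p_noisy, p_noise in zip(noisy_power, noise_power):
        if p_noisy <= 0.0:
            gains.append(gain_floor)
            continue
        # Estimated fraction of the band power that is not noise.
        fraction = max(0.0, 1.0 - p_noise / p_noisy)
        # Amplitude gain, limited from below by the floor.
        gains.append(max(gain_floor, fraction ** 0.5))
    return gains
```

With this rule, a band whose noisy power barely exceeds its noise power (low SNR) receives a gain near the floor, while a band dominated by speech receives a gain near one.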
FIG. 1B illustrates another, more sophisticated prior approach that uses an overall SNR level in addition to the individual SNR values to compute the gain factors for each band. (See also reference [2].) The overall SNR is estimated in the Overall SNR Estimation block. The gain factor computations for each band are performed in the Gain Computation block. The attenuation of the signals in different bands is accomplished by multiplying the signal in each band by the corresponding gain factor in the Gain Multiplication block. Low SNR bands are attenuated more than the high SNR bands. The amount of attenuation is also greater if the overall SNR is low. After the attenuation process, the signals in the different bands are recombined into a single, clean output signal. The resulting output signal will have an improved overall perceived quality.
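The two qualitative properties stated above (more attenuation for low SNR bands, and more attenuation overall when the overall SNR is low) can be illustrated with a small Python sketch. The gain rule below is a hypothetical one chosen only because it has both properties; it is not taken from reference [2]. SNR values are assumed to be linear (not in dB) and non-negative, and the names and the minimum gain are assumptions.

```python
def gains_with_overall_snr(band_snr, overall_snr, min_gain=0.05):
    """Per-band gains that depend on both band SNR and overall SNR.

    Hypothetical rule: a Wiener-like factor snr / (1 + snr) raised to an
    exponent that grows from 1 toward 2 as the overall SNR falls to zero.
    """
    exponent = 1.0 + 1.0 / (1.0 + overall_snr)
    gains = []
    for snr in band_snr:
        base = snr / (1.0 + snr)  # in [0, 1), larger for higher band SNR
        gains.append(max(min_gain, base ** exponent))
    return gains


def recombine(band_samples, gains):
    """Multiply each band sample by its gain and sum the bands."""
    return sum(g * x for g, x in zip(gains, band_samples))
```

Because the base factor is below one, a larger exponent always yields a smaller gain, so the same band is attenuated more when the overall SNR is low.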
The decomposition of the input noisy speech-containing signal can also be performed using Fourier transform techniques or wavelet transform techniques. FIG. 2 shows the use of discrete Fourier transform techniques (shown as the Windowing and FFT block). Here a block of input samples is transformed to the frequency domain. The magnitudes of the complex frequency domain elements are attenuated based on the spectral subtraction principles described earlier. The phases of the complex frequency domain elements are left unchanged. The complex frequency domain elements are then transformed back to the time domain via an inverse discrete Fourier transform in the IFFT block, producing the output signal. Instead of Fourier transform techniques, wavelet transform techniques may be used for decomposing the input signal.
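The transform-domain variant can be sketched as follows. This is a minimal illustration, assuming a naive discrete Fourier transform in place of an FFT, no windowing or overlap between blocks, and a hypothetical magnitude-subtraction rule with a floor; the function names and the noise_mag input (an estimated noise magnitude per frequency element) are assumptions. For a real-valued output, noise_mag should be symmetric, that is, noise_mag[k] equal to noise_mag[n - k].

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def suppress_block(block, noise_mag, gain_floor=0.1):
    """Attenuate magnitudes of one block while leaving phases unchanged."""
    attenuated = []
    for element, n_mag in zip(dft(block), noise_mag):
        mag = abs(element)
        phase = cmath.phase(element)
        # Assumed rule: subtract the noise magnitude, but never go below
        # a fixed fraction of the original magnitude.
        new_mag = max(mag - n_mag, gain_floor * mag)
        attenuated.append(cmath.rect(new_mag, phase))
    # The imaginary parts are numerically negligible for symmetric noise_mag.
    return [v.real for v in idft(attenuated)]
```

With a zero noise estimate the block passes through unchanged; with a very large noise estimate every element is scaled to the floor, so the output is the input scaled by gain_floor.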
A Voice Activity Detector is part of many noise suppression systems. Generally, the power of the input signal is compared to a variable threshold level. Whenever the threshold is exceeded, speech is assumed to be present. Otherwise, the signal is assumed to contain only background noise. Such two-state voice activity detectors do not perform robustly under adverse conditions such as in cellular telephony environments. An example of a voice activity detector is described in reference [5].
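A conventional two-state detector of the kind described above can be sketched as follows. The text does not say how the variable threshold is derived, so this sketch assumes one common arrangement: the threshold is a fixed multiple of a noise floor that is tracked during frames classified as noise. The margin, smoothing constant, and initial noise estimate are hypothetical values.

```python
def two_state_vad(frame_powers, margin=3.0, alpha=0.9, initial_noise=1e-3):
    """Return a speech/no-speech decision for each frame power."""
    noise = initial_noise
    decisions = []
    for power in frame_powers:
        threshold = margin * noise  # variable threshold (assumed form)
        is_speech = power > threshold
        decisions.append(is_speech)
        if not is_speech:
            # Track the noise floor only while no speech is detected.
            noise = alpha * noise + (1.0 - alpha) * power
    return decisions
```

The output is strictly binary for every frame, which is the limitation noted above: a frame whose power sits just above or just below the threshold is treated the same as one far from it.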
Various implementations of noise suppression systems utilizing spectral subtraction differ mainly in the methods used for power estimation, gain factor determination, spectral decomposition of the input signal and voice activity detection. A broad overview of spectral subtraction techniques can be found in reference [3]. Several other approaches to speech enhancement, as well as spectral subtraction, are overviewed in reference [4].
The commonly used two-state voice activity detection schemes have limited the performance of prior adaptive noise cancellation systems. This invention addresses and provides one solution for such problems.
The preferred embodiment of the present invention is useful in a communication system for processing a communication signal derived from speech and noise. In such an environment, the preferred embodiment is capable of determining the likelihood that the communication signal results from at least some speech. In order to achieve this result, a first power signal representing the power of at least a portion of the communication signal estimated over a first time period is calculated, and a second power signal representing the power of at least a portion of the communication signal estimated over a second time period longer than the first time period also is calculated. A comparison signal having a value related to the likelihood that the portion of the communication signal results from at least some speech is generated by comparing a first expression involving the first power signal with a second expression involving the second power signal. One or more speech likelihood signals are generated having a first value representing a first likelihood that the communication signal results from at least some speech in the event that the comparison signal value falls within a first range, having a second value representing a second likelihood that the communication signal results from at least some speech in the event that the comparison signal value falls within a second range, and having a third value representing a third likelihood that the communication signal results from at least some speech in the event that the comparison signal value falls within a third range. The first, second and third likelihoods differ in value.
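The structure of this summary (two power estimates over different time periods, a comparison signal, and a likelihood value selected by range) can be illustrated with a short Python sketch. The summary does not specify the expressions or the ranges, so everything concrete here is a hypothetical choice: exponential averaging for the power estimates, a simple ratio of short-term to long-term power as the comparison signal, and the two thresholds that divide the ratio into three ranges.

```python
def smoothed_power(samples, alpha):
    """Exponentially averaged power; a larger alpha gives a longer time period."""
    power = 0.0
    history = []
    for x in samples:
        power = alpha * power + (1.0 - alpha) * x * x
        history.append(power)
    return history


def speech_likelihood(p_short, p_long, low=1.5, high=4.0, eps=1e-12):
    """Map the comparison of two power estimates to one of three levels.

    Returns 0, 1 or 2 for the lowest, middle and highest likelihood that
    the signal results from at least some speech. The ratio and the
    thresholds low and high are assumptions made for illustration.
    """
    comparison = p_short / (p_long + eps)
    if comparison < low:
        return 0
    if comparison < high:
        return 1
    return 2
```

In use, the first power signal would be computed with a smaller alpha than the second, for example smoothed_power(samples, 0.9) and smoothed_power(samples, 0.999), and the latest value of each would be passed to speech_likelihood.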
According to the preferred embodiment, the preceding calculations and signal generation are performed by a calculator, for example, a digital signal processor.
By using the foregoing techniques, the likelihood that a communication signal results from speech can be determined with a degree of ease and accuracy unattained by the known prior techniques.