Speech communication in wireless communication networks involves the transmission of a near-end speech signal to a far-end user. The problem is to estimate a clean speech signal from a captured noisy speech signal.
A mobile phone can be equipped with a single microphone or with multiple microphones to capture the speech signal. Single-microphone solutions show room for improvement at low signal-to-noise ratio (SNR) with respect to speech intelligibility, which is most likely due to the low-frequency content of background noise. Dual-microphone solutions, in which two distinct sensors simultaneously capture the sound field, make it possible to use spatial information and characteristics of the sound sources, such as the spatial coherence of the captured signals. These characteristics are related to the relative placement of the two microphones on the mobile phone unit as well as to the design and usage of the mobile phone.
One way of implementing a dual-microphone solution is to combine a reference microphone signal with low SNR with a primary microphone signal that contains the desired speech signal as well as the noise, in order to achieve adaptive noise cancellation. In other words, a far-mouth microphone, referred to as a reference microphone, is used in conjunction with a near-mouth microphone, referred to as a primary microphone. The signal captured by the reference microphone is used by an adaptive filter to estimate the noise signal at the primary microphone. A subtractor produces an error signal from the difference between the primary microphone signal and the estimated noise signal. The error signal and the reference signal are used to optimize the suppression of the correlated noise at the microphones.
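The structure described above (adaptive filter, subtractor, error-driven update) can be illustrated with a minimal sketch. The text does not name an adaptation rule, so the normalized LMS (NLMS) update used here is an assumption, as are the function names, the tap count, the step size and the synthetic two-tap noise path used in the demonstration.

```python
import random


def nlms_noise_canceller(primary, reference, num_taps=8, step_size=0.5, eps=1e-8):
    """Return the error signal e[n] = primary[n] - estimated_noise[n].

    NLMS is an assumed choice of adaptation rule; the text only says that
    the error and reference signals drive the adaptation.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    error = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        # Adaptive filter: estimate the noise present at the primary microphone.
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        # Subtractor: primary signal minus estimated noise.
        e = d - noise_estimate
        # Normalized update driven by the error and the reference signal.
        power = sum(h * h for h in history) + eps
        weights = [w + step_size * e * h / power for w, h in zip(weights, history)]
        error.append(e)
    return error


def demo(num_samples=3000, seed=0):
    """Noise-only demonstration with a hypothetical two-tap acoustic path."""
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    # Noise at the primary microphone: a filtered copy of the reference noise.
    primary = [0.6 * reference[n] + (0.3 * reference[n - 1] if n > 0 else 0.0)
               for n in range(num_samples)]
    error = nlms_noise_canceller(primary, reference)
    tail = 500  # measure powers after the filter has had time to converge
    primary_power = sum(v * v for v in primary[-tail:]) / tail
    residual_power = sum(v * v for v in error[-tail:]) / tail
    return primary_power, residual_power
```

In this noise-only case the residual power should fall far below the primary power once the filter has converged. If speech were added to the primary signal, it would appear in the error signal and perturb the weight update.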
Many background noise environments, such as a car cabin and an office, can be characterized by a diffuse noise field. A perfectly diffuse noise field is typically generated in an unbounded medium by distant, uncorrelated sources of random noise evenly distributed over all directions. Diffuse noise exhibits high spatial coherence at low frequencies and low coherence at high frequencies. Hence, the standard noise canceller can achieve high noise reduction at low frequencies for far-field noise. However, the performance depends on the location of the microphones. Since the desired speech signal may also be captured by the reference microphone, although at relatively low power, the desired speech will be correlated at the two microphones and may be partially cancelled by such a method. Additionally, the captured speech will be present in the error signal used to adjust the speed of convergence of the adaptive filter, resulting in greater filter variations. When speech is present in the captured sound field, the adaptation of the filter weights should therefore be stalled.
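The frequency dependence of the coherence described above can be made concrete with the standard model for a spherically isotropic diffuse field observed by two omnidirectional microphones, in which the coherence is sin(2πfd/c)/(2πfd/c). The text does not state this formula. The model, the 0.1 m spacing and the speed of sound in the sketch below are assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def diffuse_coherence(freq_hz, spacing_m, c=SPEED_OF_SOUND):
    """Coherence of an ideal diffuse field between two omnidirectional
    microphones separated by spacing_m (assumed sinc model)."""
    arg = 2.0 * math.pi * freq_hz * spacing_m / c
    if arg == 0.0:
        return 1.0  # limit of sin(x)/x as x approaches 0
    return math.sin(arg) / arg


if __name__ == "__main__":
    spacing = 0.1  # hypothetical microphone spacing in metres
    for f in (100, 500, 1000, 2000, 4000):
        print(f, "Hz:", round(diffuse_coherence(f, spacing), 3))
```

Under this model the first zero of the coherence lies at f = c/(2d), so a smaller microphone spacing extends the frequency range over which diffuse noise remains correlated between the two microphones.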
Methods have previously been suggested to adjust the step size controlling the convergence speed of the adaptive filter based on the detection of near-end speech. For instance, in U.S. Pat. No. 5,953,380 the step size is adjusted based on an estimate of the SNR. The SNR estimation is performed using a secondary adaptive filter which uses the reference-microphone signal as an input to estimate the captured noise signal. The estimated noise signal is used to calculate the noise power and is also subtracted from the primary microphone signal to generate an estimate of the speech signal. The estimated speech signal is in turn used to update the secondary filter weights. An SNR estimate of the captured sound field is subsequently calculated based on the power estimates of the speech and the noise.
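The general idea of steering the step size from an SNR estimate can be sketched as follows. This is not the method of the cited patent: the recursive power estimator, the mapping from SNR to step size and all names and constants are hypothetical, chosen only to show a step size that shrinks as the estimated speech power grows relative to the noise power.

```python
def smoothed_power(samples, alpha=0.9):
    """Recursive power estimate p = alpha * p + (1 - alpha) * x^2 (assumed form)."""
    power = 0.0
    for x in samples:
        power = alpha * power + (1.0 - alpha) * x * x
    return power


def snr_based_step_size(speech_power, noise_power, mu_max=0.5, eps=1e-12):
    """Hypothetical mapping: full step size at zero SNR, approaching zero
    as the linear SNR estimate grows."""
    snr = speech_power / (noise_power + eps)
    return mu_max / (1.0 + snr)


if __name__ == "__main__":
    noise_power = smoothed_power([0.1] * 200)
    for speech_amplitude in (0.0, 0.1, 1.0):
        speech_power = smoothed_power([speech_amplitude] * 200)
        print(speech_amplitude, snr_based_step_size(speech_power, noise_power))
```

A smooth mapping of this kind slows the adaptation during speech instead of switching it off abruptly. Other mappings, including a hard threshold, would fit the same description.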
Another implementation of a noise canceller was suggested in U.S. Pat. No. 6,963,649, where the adaptation of the primary adaptive filter is performed for each frequency bin individually, based on a comparison of the subband signal power of the noise canceller output with a threshold that differs for each band. In addition, a one-tap adaptive filter acts as a gain that optimizes the suppression of the noise ahead of the multi-tap subband adaptive filter.
The solution suggested in U.S. Pat. No. 5,953,380 does not take into consideration the presence of speech at the reference microphone input when the microphones are positioned close to each other, as in a mobile phone unit, and this speech affects the SNR estimation.
The comparison of the filter's output signal with a threshold in the frequency domain, as suggested in U.S. Pat. No. 6,963,649, is not a robust solution, since the noise can also have high subband power, especially at low frequencies, and may thus not be cancelled at those frequencies.
Also, in both U.S. Pat. No. 5,953,380 and U.S. Pat. No. 6,963,649, the adaptation is stalled, either across the full band or in individual subbands, when speech presence is detected, which means that the algorithm needs to re-converge each time the speech is interrupted.