A common problem in a hands-free communication system is acoustic echo, and the problem can be formulated as follows. A digital input audio signal x(n) is received by a communication interface from a far-end site over a network such as the Internet or PSTN and played on a loudspeaker. A microphone generates a digital audio signal y(n) composed of an echo signal u(n) and a near-end sound signal v(n) such as speech from a near-end talker and background noise. The echo signal is composed of the direct signal and reflected versions (reflected by walls/ceilings etc.) of the loudspeaker signal. The microphone signal y(n) may be expressed as follows:
y(n)=u(n)+v(n).  (1)
If the microphone signal y(n) were transmitted back to the far-end unmodified, the participants at the far-end site would hear an echo of themselves, and if a similar system were present at the far-end site, even howling/feedback might occur.
One way to attenuate the echo signal is illustrated in FIG. 1, and is commonly referred to as acoustic echo cancellation (AEC). Here the room impulse response from the loudspeaker to the microphone (including the response of the loudspeaker and the microphone, and digital-to-analog and analog-to-digital converters which are not shown for simplicity) is modeled with an adaptive finite impulse response (FIR) filter with L coefficients given in the vector ĥ(n)=[ĥ0(n), ĥ1(n), . . . , ĥL-1(n)]T. An adaptive algorithm such as normalized least mean squares (NLMS) or recursive least squares (RLS) is used to continuously update the filter coefficients with the goal of approximating the room impulse response as accurately as possible. The closer the estimated filter is to the room impulse response, the better the estimated echo is, and the less echo is sent back to the far-end. However, due to the changing nature of the room impulse response as well as the near-end sound appearing on the microphone, there will always be some residual echo left after subtracting the estimated echo signal. Therefore, it is common to use a nonlinear processing (NLP) block to further suppress remaining echo.
In the full-band acoustic echo cancellation scheme of FIG. 1, the adaptive filter 1203 generates an estimate û(n) of the echo signal u(n). This estimated echo signal û(n) is subtracted from the microphone signal y(n) at node 1201 to generate the echo cancelled output signal e(n), according to Equation (2) as follows:
e(n)=y(n)−û(n).  (2)
For wideband audio and typical rooms, the echo canceller in FIG. 1 requires a large number of filter coefficients in order to work satisfactorily. This renders the echo canceller very computationally complex, even for simple adaptive algorithms such as NLMS. Moreover, even if computational complexity is of little concern, many of the most commonly used adaptive algorithms would suffer from slow convergence speed due to the high auto-correlation present in the signal x(n).
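The full-band scheme of FIG. 1 and Equation (2) can be illustrated with a minimal NLMS sketch. It is not taken from the source: the function name `nlms_echo_canceller`, the step size `step`, the regularization constant `eps`, and the three-tap toy room response `h_true` are all assumptions chosen for illustration, and the near-end signal v(n) is set to zero so that convergence is easy to observe.

```python
import random

def nlms_echo_canceller(x, y, num_taps, step=0.5, eps=1e-8):
    """Full-band NLMS sketch: returns (e, h_hat).

    x: loudspeaker samples x(n); y: microphone samples y(n).
    step and eps are illustrative values, not taken from the source.
    """
    h_hat = [0.0] * num_taps      # estimated room impulse response
    x_buf = [0.0] * num_taps      # most recent loudspeaker sample first
    e = []
    for x_n, y_n in zip(x, y):
        x_buf = [x_n] + x_buf[:-1]
        u_hat = sum(h * xb for h, xb in zip(h_hat, x_buf))  # echo estimate
        e_n = y_n - u_hat                                   # Equation (2)
        norm = sum(xb * xb for xb in x_buf) + eps
        g = step * e_n / norm
        h_hat = [h + g * xb for h, xb in zip(h_hat, x_buf)]  # NLMS update
        e.append(e_n)
    return e, h_hat

# Toy setup (hypothetical): white-noise far-end signal, 3-tap "room", no near-end sound.
rng = random.Random(0)
h_true = [0.6, -0.3, 0.1]
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
u = [sum(h_true[i] * x[n - i] for i in range(len(h_true)) if n - i >= 0)
     for n in range(len(x))]
y = u  # Equation (1) with v(n) = 0
e, h_hat = nlms_echo_canceller(x, y, num_taps=3)
```

In this idealized case the estimate `h_hat` approaches `h_true` and the residual `e` decays toward zero. With a real room response of thousands of taps and a speech input, the per-sample cost and the slow convergence described above become apparent.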
These problems are greatly reduced in the subband acoustic echo canceller illustrated in FIG. 2. In FIG. 2 the digital input signal x(n) received from the far-end, and passed to the loudspeaker, is divided into a predetermined number K of subbands X1(m), . . . , XK(m) using the analysis filterbank 3301, where m represents a time index. The microphone signal y(n) is also divided into K subbands Y1(m), . . . , YK(m) using a similar analysis filterbank 3302. For each subband, e.g. subband k, a subband reference signal Xk(m) is filtered through a subband FIR filter Hk(m) 3204 that calculates a subband echo estimate Ûk(m). The subband echo estimate Ûk(m) is subtracted from the corresponding subband microphone signal Yk(m) at node 2110 to create a subband echo cancelled microphone signal Ek(m). The echo cancelled microphone subband signal Ek(m) is used for adapting the FIR filter 3204, shown as the subband FIR filter update loop 3208. The echo cancelled microphone sub-band signals E1(m), . . . , EK(m) from all subbands are merged together to form a full band echo cancelled microphone signal by the synthesis filterbank 3303.
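The per-subband processing described above can be sketched as a complex-valued NLMS filter applied to one subband k. The sketch assumes the subband signals Xk(m) and Yk(m) have already been produced by analysis filterbanks, which are not implemented here. The name `subband_nlms`, the parameters `step` and `eps`, and the two-tap toy subband response `H_true` are hypothetical.

```python
import random

def subband_nlms(X, Y, num_taps, step=0.5, eps=1e-8):
    """Complex NLMS sketch for a single subband k: returns (E, H).

    X: subband reference samples Xk(m); Y: subband microphone samples Yk(m).
    """
    H = [0j] * num_taps           # subband FIR filter Hk(m)
    buf = [0j] * num_taps         # most recent reference sample first
    E = []
    for X_m, Y_m in zip(X, Y):
        buf = [X_m] + buf[:-1]
        U_hat = sum(h * xb for h, xb in zip(H, buf))   # subband echo estimate
        E_m = Y_m - U_hat                              # echo cancelled subband signal
        norm = sum(abs(xb) ** 2 for xb in buf) + eps
        # Update loop: the error signal drives the filter adaptation.
        H = [h + step * E_m * xb.conjugate() / norm for h, xb in zip(H, buf)]
        E.append(E_m)
    return E, H

# Toy setup (hypothetical): complex white reference, 2-tap subband echo path.
rng = random.Random(1)
H_true = [0.5 + 0.2j, -0.1 + 0.3j]
X = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
Y = [sum(H_true[i] * X[m - i] for i in range(len(H_true)) if m - i >= 0)
     for m in range(len(X))]
E, H = subband_nlms(X, Y, num_taps=2)
```

A complete system would run one such filter per subband and pass the outputs E1(m), . . . , EK(m) to the synthesis filterbank.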
The narrow bandwidth of the frequency subbands allows for downsampling in the analysis filterbank. After downsampling, all subband processing runs at a lower rate, the number of coefficients of the adaptive filter in each subband is greatly reduced, and the loudspeaker subband signals Xk(m) have a lower auto-correlation compared to the fullband signal x(n). Compared to the system in FIG. 1, the system in FIG. 2 has lower computational complexity and faster convergence speed for many of the most commonly used adaptive algorithms. However, the acoustic echo cancellation systems in FIG. 1 and FIG. 2 do not work well during rapid changes in the phase response of the room impulse response. Such changes frequently occur on personal computers due to imperfect synchronization between the loudspeaker signal x(n) and the microphone signal y(n).
Modern acoustic echo suppression was proposed as a robust alternative to AEC in Carlos Avendano, Acoustic Echo Suppression in the STFT Domain, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2001, pages W2001-4, the subject matter of which is incorporated herein by reference. The reference will hereafter be referred to as [Avendano, 2001].
FIG. 3 illustrates the approach. As with subband echo cancellation, analysis filterbanks and a synthesis filterbank are used, so that processing can be done independently and efficiently on each subband. In the following, we consider only subband number k, while keeping in mind that the same processing is done for all the other subbands. Unlike in subband echo cancellation, where the complex subband echo signal Ûk(m) is estimated, only the magnitude |Ûk(m)| of the subband echo signal is needed in the acoustic echo suppression approach proposed in [Avendano, 2001]. The echo magnitude in subband k is formed by taking the magnitude of the complex echo estimate. The estimated echo magnitude is used to compute a time-varying subband gain defined as:
$$G_k(m)=\left(\frac{|Y_k(m)|^{\alpha}-\beta\,|\hat{U}_k(m)|^{\alpha}}{|Y_k(m)|^{\alpha}}\right)^{1/\alpha},\qquad(3)$$
where the
parameters α and β are used to control the amount of echo reduction versus signal distortion. The output Zk(m) in subband k is formed by multiplying Yk(m), which is the microphone signal in subband k, with the gain Gk(m). Often it is necessary, especially if the magnitude estimator is poor, to smooth the gains Gk(m) over either frequency or time. For an example of gain smoothing see [Faller and Chen, 2005]. Note that in (3) the phase of the echo estimate Ûk(m) is not used. This is an important feature for phase robustness. However, full robustness against phase variation is only achieved if the spectral magnitude estimator is robust. It is easy to see that the estimator in [Avendano, 2001] is not robust against phase changes. Consider for example what happens after a delay is introduced in the room impulse response. Then all the adaptive filter coefficients will be misaligned due to the changed phase and the adaptive filter must re-adapt.
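The gain in (3) and its application to the microphone subband signal can be sketched as follows. The function names, the default values of `alpha` and `beta`, and the clamping of a negative numerator to zero are assumptions added for illustration. Equation (3) as stated does not specify what happens when the scaled echo estimate exceeds the microphone magnitude.

```python
def suppression_gain(y_mag, u_hat_mag, alpha=2.0, beta=1.0):
    """Subband gain Gk(m) following Equation (3).

    y_mag: |Yk(m)|; u_hat_mag: |Uk_hat(m)|.
    Assumption: a negative numerator is clamped to zero so the gain stays real.
    """
    if y_mag <= 0.0:
        return 0.0  # assumption: no output when the microphone subband is silent
    ratio = (y_mag ** alpha - beta * u_hat_mag ** alpha) / y_mag ** alpha
    return max(ratio, 0.0) ** (1.0 / alpha)

def suppress(Y_k, u_hat_mag, alpha=2.0, beta=1.0):
    """Output Zk(m) = Gk(m) * Yk(m); only the echo magnitude is used, never its phase."""
    return suppression_gain(abs(Y_k), u_hat_mag, alpha, beta) * Y_k

# With alpha = 2 and beta = 1 this reduces to power spectral subtraction:
# |Y| = 2 and |U_hat| = 1 give G = sqrt((4 - 1) / 4).
G = suppression_gain(2.0, 1.0)
Z = suppress(2.0 + 0.0j, 1.0)
```

Because the gain is real and non-negative, the output Zk(m) keeps the phase of Yk(m). Larger values of `beta` remove more echo at the cost of more distortion of the near-end signal.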
A phase-robust acoustic echo suppressor was presented in Christof Faller and J. Chen, (2005), Suppressing acoustic echo in a spectral envelope space, IEEE Trans. Speech and Audio Processing, Vol. 13, No. 5: pages 1048-1062, hereafter referred to as [Faller and Chen, 2005], the subject matter of which is incorporated herein by reference. Unlike the approach in [Avendano, 2001], where the echo magnitude in each subband is estimated from a sequence of complex subband samples, the approach in [Faller and Chen, 2005] aims at estimating the spectral envelope of the echo signal from the spectral envelope of the loudspeaker signal. In their work, the spectral envelope is taken to be the instantaneous power spectrum or magnitude spectrum smoothed over frequency. However, although this approach yields a fully phase-robust echo suppressor, the accuracy of the estimator is poor, even for a high number of adaptive filter coefficients. U.S. Pat. No. 7,062,040 to Faller also describes suppression of an echo signal, the entire contents of which are hereby incorporated by reference.
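The notion of a spectral envelope as a power spectrum smoothed over frequency can be illustrated with a short sketch. The moving-average window, its width `half_width`, and the function name `spectral_envelope` are assumptions. The smoothing actually used in [Faller and Chen, 2005] is not specified in this text and may differ. The example also applies a per-subband phase rotation, used here as a simplified stand-in for a small delay, to show that the envelope depends only on magnitudes.

```python
import cmath

def spectral_envelope(X, half_width=1):
    """Instantaneous power spectrum |Xk|^2 averaged over neighbouring subbands.

    X: complex subband samples X1(m), ..., XK(m) for one time index m.
    half_width: number of neighbours on each side (illustrative choice).
    """
    power = [abs(X_k) ** 2 for X_k in X]
    K = len(power)
    env = []
    for k in range(K):
        lo = max(0, k - half_width)
        hi = min(K, k + half_width + 1)
        env.append(sum(power[lo:hi]) / (hi - lo))
    return env

# Hypothetical subband samples and a phase-rotated copy of them.
X = [1 + 0j, 0 + 2j, -1 + 1j, 0.5 - 0.5j]
X_rotated = [X_k * cmath.exp(-1j * 0.7 * k) for k, X_k in enumerate(X)]

env = spectral_envelope(X)
env_rotated = spectral_envelope(X_rotated)
```

The two envelopes agree up to rounding error, which is the sense in which an envelope-based estimator is insensitive to phase. The smoothing also discards spectral detail, which is consistent with the limited estimator accuracy noted above.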