In the following, commonly used techniques for AEC, AES, and NS are described.
Acoustic Echo Canceler (AEC)
Traditionally, echo cancellation is accomplished by adaptively identifying the echo path impulse response and subtracting an estimate of the echo signal from the microphone signal. The far-end talker signal x(n) (loudspeaker signal) goes through the echo path, whose impulse response is modeled as an FIR filter, and adds to the microphone signal y(n) together with the near-end talker signal v(n) and the ambient noise w(n):
$$y(n) = \mathbf{h}^T \mathbf{x}(n) + v(n) + w(n), \tag{1}$$
where
$$\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-J+1)]^T,$$
$$\mathbf{h} = [h_0, h_1, \ldots, h_{J-1}]^T,$$
J is the length of the echo path impulse response, and T denotes the transpose of a vector or a matrix. To cancel the echo in the microphone signal, an echo estimate ŷ(n) is needed, which is generated by passing the far-end talker signal through an FIR filter
$$\hat{\mathbf{h}} = [\hat{h}_0, \hat{h}_1, \ldots, \hat{h}_{K-1}]^T \tag{2}$$
of length K (generally less than J),
$$\hat{y}(n) = [\hat{\mathbf{h}}^T, \mathbf{0}]\,\mathbf{x}(n). \tag{3}$$
The FIR filter coefficients are estimated adaptively in time. Subtracting ŷ(n) from the microphone signal y(n) yields the error signal
$$e(n) = y(n) - \hat{y}(n). \tag{4}$$
The mean square error (MSE) can be expressed as
$$E\{e^2(n)\} = E\{(y(n) - \hat{y}(n))^2\}, \tag{5}$$
where $E\{\cdot\}$ denotes mathematical expectation. The objective of the AEC is to estimate an ĥ that minimizes $E\{e^2(n)\}$.
There is a vast literature addressing how to search for the optimum ĥ using adaptive techniques. Commonly used algorithms include normalized least-mean-square (NLMS), recursive least-squares (RLS), proportionate NLMS (PNLMS), affine projection algorithm (APA), etc.
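As an illustration of one of the algorithms named above, the following is a minimal NLMS sketch in plain Python. It is not taken from any particular reference implementation. The function name `nlms_echo_canceller`, the step size `mu`, the regularization constant `eps`, and the toy echo path `h_true` are assumptions made for this example. The example also assumes K = J and a noise-free microphone signal (v(n) = w(n) = 0), so the filter can converge to the true echo path.

```python
import random

def nlms_echo_canceller(x, y, K, mu=0.5, eps=1e-8):
    """Adapt a K-tap FIR estimate h_hat with NLMS.

    x: far-end (loudspeaker) samples, y: microphone samples.
    Returns (e, h_hat): the error signal e(n) and the final filter estimate.
    """
    h_hat = [0.0] * K
    x_buf = [0.0] * K                      # [x(n), x(n-1), ..., x(n-K+1)]
    e = []
    for x_n, y_n in zip(x, y):
        x_buf = [x_n] + x_buf[:-1]         # shift in the newest far-end sample
        y_hat = sum(h * xb for h, xb in zip(h_hat, x_buf))   # echo estimate, Eq. (3)
        e_n = y_n - y_hat                                     # error signal, Eq. (4)
        norm = sum(xb * xb for xb in x_buf) + eps             # input energy (regularized)
        h_hat = [h + mu * e_n * xb / norm for h, xb in zip(h_hat, x_buf)]
        e.append(e_n)
    return e, h_hat

# Toy scenario (hypothetical values): a 4-tap echo path and white far-end noise.
random.seed(0)
h_true = [0.5, -0.3, 0.2, 0.1]
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
y = [sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
     for n in range(len(x))]

e, h_hat = nlms_echo_canceller(x, y, K=len(h_true))
```

In this idealized setting the residual e(n) decays toward zero as `h_hat` approaches `h_true`. With near-end speech or noise present, the residual would instead contain v(n) + w(n) plus any remaining echo.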
During doubletalk (when the talker at the near-end is talking, i.e. v(n)≠0), the adaptive filter coefficients are usually “frozen” to prevent the near-end signal v(n) from degrading the adaptive filter's estimate of the acoustic echo path. For this purpose, a doubletalk detector is used.
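The text does not say which doubletalk detector is meant. As one possible illustration, the sketch below combines an NLMS update with the classic Geigel detector, which declares doubletalk when the microphone magnitude exceeds a fraction of the largest recent far-end magnitude. The function name `nlms_with_freeze`, the threshold of 0.5, the hold time, and the toy signals are all assumptions made for this example.

```python
import random

def nlms_with_freeze(x, y, K, mu=0.5, eps=1e-8, threshold=0.5, hold=20):
    """NLMS whose adaptation is frozen while a Geigel detector flags doubletalk.

    Returns (e, h_hat, frozen), where frozen[n] is True if the update was skipped.
    """
    h_hat = [0.0] * K
    x_buf = [0.0] * K
    hold_left = 0
    e, frozen = [], []
    for x_n, y_n in zip(x, y):
        x_buf = [x_n] + x_buf[:-1]
        e_n = y_n - sum(h * xb for h, xb in zip(h_hat, x_buf))
        # Geigel test: microphone level too high to be explained by echo alone.
        detected = abs(y_n) > threshold * max(abs(xb) for xb in x_buf)
        if detected:
            hold_left = hold               # keep the filter frozen a little longer
        is_frozen = detected or hold_left > 0
        if is_frozen:
            if not detected:
                hold_left -= 1
        else:
            norm = sum(xb * xb for xb in x_buf) + eps
            h_hat = [h + mu * e_n * xb / norm for h, xb in zip(h_hat, x_buf)]
        e.append(e_n)
        frozen.append(is_frozen)
    return e, h_hat, frozen

# Toy scenario (hypothetical values): a weak 2-tap echo path and a loud
# near-end burst v(n) between samples 1000 and 1199.
random.seed(1)
h_true = [0.3, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(3000)]
v = [5.0 * (-1) ** n if 1000 <= n < 1200 else 0.0 for n in range(3000)]
y = [sum(h_true[k] * x[n - k] for k in range(2) if n - k >= 0) + v[n]
     for n in range(3000)]

e, h_hat, frozen = nlms_with_freeze(x, y, K=2)
```

The Geigel threshold presumes that the echo path attenuates the loudspeaker signal (here by more than 6 dB). With a louder echo path the detector would raise false alarms and a different detector or threshold would be needed.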
Another solution for suppressing the echo in an audio signal is described in the document US 2004/0057574. It computes the spectral envelopes of the loudspeaker and microphone signals and determines the spectral envelope of the echo signal using adaptive filters. This approach attempts to continuously estimate the time-varying spectral envelope of the echo signal. The problem with this technique is that the adaptive filters predicting the spectral envelopes need to re-converge every time the properties of the loudspeaker signal change. This is because the echo signal spectral envelope depends not only on the loudspeaker signal spectral envelope but also on the loudspeaker signal itself. As a result of this signal dependence of the adaptive filters, the echo signal spectral envelope is often not estimated precisely enough and the echo is not sufficiently removed.
This technique also addresses the problem of acoustic echo removal in the microphone signal. It uses a stereo sampling unit to sample both the microphone and loudspeaker signals. The transfer function between the loudspeaker and microphone signals is computed. Given the microphone signal, the loudspeaker signal, and the estimated transfer function, an ideally interference-free (echo-free) microphone signal is generated, either in the time domain or in the frequency domain. In the frequency domain, the loudspeaker spectrum is multiplied by the transfer function and then subtracted from the microphone signal to remove the echo. In the time domain, equivalently, the loudspeaker signal is convolved with the filter (the time-domain version of the transfer function) and subtracted from the microphone signal.
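The equivalence of the two subtraction variants can be checked with a small sketch. It uses a naive DFT so that only the Python standard library is needed. The helper names `dft` and `convolve` and all signal values are assumptions made for this example, and the estimated transfer function is taken to be exactly the true one, which is an idealization. Zero-padding to at least len(x) + len(h) − 1 samples makes the product of spectra match the linear convolution.

```python
import cmath

def dft(seq, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse includes the 1/N factor)."""
    N = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
               for n in range(N)) for k in range(N)]
    return [v / N for v in out] if inverse else out

def convolve(x, h):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

# Hypothetical toy signals: loudspeaker block, echo-path filter, near-end signal.
x = [1.0, -0.5, 0.25, 0.0, 0.75, -1.0]
h = [0.6, 0.3, -0.1]
near_end = [0.05, 0.0, -0.02, 0.01, 0.0, 0.03, 0.0, -0.01]
N = len(x) + len(h) - 1                     # 8 samples after zero-padding

echo = convolve(x, h)
mic = [ec + s for ec, s in zip(echo, near_end)]

# Time domain: convolve the loudspeaker signal with the filter, then subtract.
out_time = [m - ec for m, ec in zip(mic, convolve(x, h))]

# Frequency domain: multiply the loudspeaker spectrum by the transfer function,
# subtract from the microphone spectrum, and transform back.
X = dft(x + [0.0] * (N - len(x)))
H = dft(h + [0.0] * (N - len(h)))
Y = dft(mic)
E = [Yk - Hk * Xk for Yk, Hk, Xk in zip(Y, H, X)]
out_freq = [v.real for v in dft(E, inverse=True)]
```

Both outputs recover the near-end signal here only because the transfer function is known exactly. Any mismatch between the estimated and the true echo path leaves residual echo in either domain.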
In the document US2003/0156273, a cancellation approach is proposed in which the system must first determine the echo path accurately in order to subtract the echo signal from the microphone signal.
While this approach appears to consider only part of the true echo path, since it uses a single FFT spectrum of the loudspeaker and microphone signals, it relies, like AEC, on CANCELLATION of the echo and not SUPPRESSION, as indicated by the subtraction operation of the filter unit in either the frequency or the time domain. This makes the approach, like AEC, very sensitive to echo path changes. We address this issue by not estimating a transfer function that corresponds directly to the echo path, but merely estimating real-valued gain factors (denoted coloration effect values), which model only the energy transfer from loudspeaker to microphone and ignore phase information. Further, our approach increases robustness and decreases computational complexity by using a lower frequency resolution that mimics the frequency resolution of the auditory system. With our approach, CANCELLATION is not possible, because ignoring the phase precludes a precise echo estimate. Instead, we apply only spectral magnitude modification to suppress the echo (thus, unlike the filter unit in the mentioned approach, we do not subtract an echo estimate to remove the echo).
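To make the contrast with subtraction concrete, the following sketch scales each microphone spectral coefficient by a real-valued gain derived from magnitudes only. It is an illustration under stated assumptions, not the method of this document: the gain rule (a magnitude-subtraction-style gain with a lower bound), the function name `suppress_echo`, the parameter `floor`, and the example values are all hypothetical. The reduced, auditory-motivated frequency resolution mentioned above is not modeled here.

```python
def suppress_echo(Y, X, coloration, floor=0.0):
    """Suppress echo by modifying spectral magnitudes only.

    Y, X: complex microphone and loudspeaker spectra (same length).
    coloration: real-valued gain factors modeling the loudspeaker-to-microphone
    energy transfer per coefficient (assumed given here).
    """
    out = []
    for Yk, Xk, g in zip(Y, X, coloration):
        mic_mag = abs(Yk)
        echo_mag = g * abs(Xk)              # magnitude-only echo estimate, no phase
        if mic_mag == 0.0:
            out.append(0j)
            continue
        # Real-valued gain in [floor, 1]; the microphone phase is left untouched.
        gain = max((mic_mag - echo_mag) / mic_mag, floor)
        out.append(gain * Yk)
    return out

# Hypothetical spectra with three coefficients.
Y = [2 + 0j, 1j, 3 + 4j]
X = [1 + 0j, 2 + 0j, 0j]
coloration = [0.5, 1.0, 0.7]
out = suppress_echo(Y, X, coloration)
```

Because only a real gain is applied, an error in the phase of the echo path does not affect the result, which is the robustness argument made above. The price is that echo overlapping with near-end speech in the same coefficient is attenuated together with that speech.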