Acoustic Echo Cancellation (AEC) is a digital signal processing technology used to remove the acoustic echo from a speakerphone in two-way or multi-way communication systems, such as traditional telephone systems or modern internet audio conversation applications.
FIG. 1 illustrates an example of one end 105 of a typical two-way communication system, which includes a capture stream path and a render stream path for the audio data in the two directions. The other end is exactly the same. In the capture stream path in the figure, an analog to digital (A/D) converter 122 continuously converts the analog sound captured by microphone 110 to digital audio samples at a sampling rate (fsmic). The digital audio samples are saved in capture buffer 130 sample by sample. The samples are retrieved from the capture buffer in frame increments (herein denoted as “mic[n]”). A frame here means a number (n) of digital audio samples. Finally, the samples in mic[n] are processed and sent to the other end.
In the render stream path, the system receives audio samples from the other end, and places them into a render buffer 140 in periodic frame increments (labeled “spk[n]” in the figure). Then the digital to analog (D/A) converter 150 reads audio samples from the render buffer sample by sample and continuously converts them to an analog signal at a sampling rate, fsspk. Finally, the analog signal is played by speaker 160.
In systems such as that depicted by FIG. 1, the near end user's voice is captured by the microphone 110 and sent to the other end. At the same time, the far end user's voice is transmitted through the network to the near end, and played through the speaker 160 or headphone. In this way, both users can hear each other and two-way communication is established. A problem occurs, however, if a speaker is used instead of a headphone to play the other end's voice. For example, if the near end user uses a speaker as shown in FIG. 1, his or her microphone captures not only his or her voice but also an echo of the sound played from the speaker (labeled as “echo(t)”). In this case, the mic[n] signal that is sent to the far end user includes an echo of the far end user's voice. As a result, the far end user would hear a delayed echo of his or her voice, which is likely to cause annoyance and provide a poor user experience to that user.
Practically, the echo echo(t) can be represented by speaker signal spk(t) convolved by a linear response g(t) (assuming the room can be approximately modeled as a finite duration linear plant) as per the following equation:

$$\mathrm{echo}(t) = \mathrm{spk}(t) * g(t) = \int_{0}^{T_e} g(\tau)\,\mathrm{spk}(t-\tau)\,d\tau \qquad (1)$$

where * means convolution, and T_e is the echo length or filter length of the room response. The room response g(t) is often called the “echo path.”
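As an illustration only, equation (1) can be approximated in discrete time by a finite sum over sampled values of the echo path. The sketch below assumes a short, made-up echo path g and a made-up speaker signal; the function name simulate_echo and all sample values are hypothetical and do not come from the figures.

```python
def simulate_echo(spk, g):
    """Discrete-time analogue of equation (1): echo[n] = sum_k g[k] * spk[n - k]."""
    echo = []
    for n in range(len(spk)):
        acc = 0.0
        for k, gk in enumerate(g):
            if n - k >= 0:  # samples before the start of the signal are treated as zero
                acc += gk * spk[n - k]
        echo.append(acc)
    return echo


# Hypothetical echo path: a direct path plus one attenuated reflection two samples later.
g = [0.5, 0.0, 0.25]
spk = [1.0, 0.0, 0.0, 2.0]
echo = simulate_echo(spk, g)
print(echo)  # [0.5, 0.0, 0.25, 1.0]
```

The length of g plays the role of the echo length T_e: a longer room response means more terms per output sample.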
In order to remove the echo for the remote user, AEC 250 is added to the end 100 of the system shown in FIG. 2. When a frame of samples in the mic[n] signal is retrieved from the capture buffer 130, the samples are sent to the AEC 250. At the same time, when a frame of samples in the spk[n] signal is sent to the render buffer 140, the samples are also sent to the AEC 250. The AEC 250 uses the spk[n] signal from the far end to predict the echo in the captured mic[n] signal. Then, the AEC 250 subtracts the predicted echo from the mic[n] signal. This difference or residual is the clear voice signal (voice[n]), which is theoretically echo free and very close to the near end user's voice (voice(t)).
FIG. 3 depicts an implementation of the AEC 250 based on an adaptive filter 310. The AEC 250 takes two inputs: the microphone signal mic[n], which contains the echo and the near-end voice, and the spk[n] signal, which is received from the far end. The spk[n] signal is used to predict the echo signal. The prediction residual signal e[n] is used to adaptively update the cancellation filter h[n] when there is no near-end voice present. The prediction residual signal e[n] is also output by the adaptive filter. When a near-end voice is present, e[n] contains the echo-free, clear near-end voice, which is sent to the far end. Adaptive filter 310 is also referred to as an adaptive echo canceller.
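The predict, subtract, and update loop described above can be sketched as follows. This is a minimal illustration, not the implementation of adaptive filter 310: it assumes a normalized LMS (NLMS) update rule, which the description of FIG. 3 does not specify, and it adapts on every sample, so it omits any detection of near-end voice. The function name nlms_echo_canceller, the parameters num_taps, mu, and eps, and the test echo path are all hypothetical.

```python
import random


def nlms_echo_canceller(mic, spk, num_taps=8, mu=0.5, eps=1e-8):
    """Return (residual e[n], final filter h) for one pass over the signals."""
    h = [0.0] * num_taps  # cancellation filter h[n], initially all zero
    x = [0.0] * num_taps  # most recent spk samples, newest first
    residual = []
    for m, s in zip(mic, spk):
        x = [s] + x[:-1]                                  # shift in the new far-end sample
        predicted = sum(hk * xk for hk, xk in zip(h, x))  # predicted echo
        e = m - predicted                                 # prediction residual e[n]
        norm = sum(xk * xk for xk in x) + eps             # input energy (assumed NLMS normalization)
        step = mu * e / norm
        h = [hk + step * xk for hk, xk in zip(h, x)]      # adapt the filter toward the echo path
        residual.append(e)
    return residual, h


# Far-end signal: white noise. Microphone: echo only, with no near-end voice.
random.seed(0)
true_path = [0.6, -0.3, 0.1]  # made-up echo path for the demonstration
spk = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [sum(true_path[k] * spk[n - k] for k in range(len(true_path)) if n - k >= 0)
       for n in range(len(spk))]

residual, h = nlms_echo_canceller(mic, spk)
print([round(v, 3) for v in h[:3]])              # learned taps, to compare with true_path
print(max(abs(e) for e in residual[-100:]))      # remaining echo once the filter has adapted
```

In a complete canceller the update step would be suspended while near-end voice is present, as stated above, so that the near-end voice does not disturb the filter estimate.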
The actual room response (that is represented as g(t) in the above convolution equation) usually varies with time, for example due to a change in position of the microphone 110 or speaker 160, body movement of the near end user, and even room temperature. The room response therefore cannot be pre-determined, and must be calculated adaptively at run time. The AEC 250 commonly is based on adaptive filters such as Least Mean Square (LMS) adaptive filters 310, which can adaptively model the varying room response. The LMS algorithm is a least square stochastic gradient step method which, as it is both efficient and robust, is widely used in real-time applications. The LMS algorithm and its well known variations (e.g., the Normalized LMS, or NLMS algorithm) do have certain drawbacks, however. For example, the LMS and other known algorithms can sometimes be slow to converge (i.e., approach the target filtering characteristic, such as the acoustic echo path in a hands-free telephony application), particularly when the algorithm is adapted, or trained, based on a non-white, or colored, input signal such as a human speech signal. Moreover, the order of the adaptive filter (i.e., the number of filter taps) can be quite high in the context of acoustic echo cancellation, and implementation of the adaptive filtering algorithm can therefore be computationally complex.
Consequently, recent work has focused on performing the adaptive filtering in sub-bands. In other words, filter banks are used to divide both the microphone signal and the loudspeaker signal into a number of frequency sub-bands. Each sub-band signal is then decimated, or down-sampled, and adaptive filtering is performed in each sub-band to provide a number of echo-canceled sub-band output signals. The resulting sub-band output signals are then interpolated, or up-sampled, and combined to reconstruct the overall echo-canceled microphone signal for transmission to the far-end user. Advantageously, the sub-sampling results in greater computational efficiency as compared to the full-band processing approach and, since variations in the spectral content of the input signals are less severe within each sub-band, overall convergence speed is also improved.
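The split, decimate, process, interpolate, and combine structure described above can be sketched with a toy two-band filter bank. The Haar filter pair used here is an assumption chosen only because it is short and reconstructs the input exactly; practical sub-band echo cancellers use more bands and longer filters. The function names analyze and synthesize are hypothetical, and the per-band adaptive filtering step is left as a placeholder comment.

```python
import math


def analyze(x):
    """Split x (even length) into low and high sub-bands, each decimated by 2."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high


def synthesize(low, high):
    """Up-sample and combine the two sub-bands back into a full-rate signal."""
    s = math.sqrt(2.0)
    out = []
    for lo, hi in zip(low, high):
        out.append((lo + hi) / s)
        out.append((lo - hi) / s)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
low, high = analyze(x)
# In a sub-band canceller, an adaptive filter would process each band here,
# running at half the original sample rate.
y = synthesize(low, high)
print(len(low), len(high))                         # each band holds half as many samples
print(max(abs(a - b) for a, b in zip(x, y)))       # reconstruction error
```

Each band carries half as many samples as the input, which is the source of the computational saving: every per-band adaptive filter runs at the reduced rate.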
However, known sub-band adaptive filtering systems suffer from certain disadvantages as well. For example, signal aliasing between sub-bands can result in slow overall convergence and/or errors in the reconstructed microphone signal. In addition, non-causal coefficient effects arising from the sub-band filters' impulse response can reduce the quality of the cancellation process in the individual sub-bands. Consequently, there is a need for improved methods and apparatus for performing sub-band adaptive filtering in echo suppression systems.