Acoustic Echo Cancellation (AEC) is a digital signal processing technology used to remove the acoustic echo from a speakerphone in two-way or multi-way communication systems, such as traditional telephone or modern internet audio conversation applications.
FIG. 1 illustrates an example of one end 100 of a typical two-way communication system, which includes a capture stream path and a render stream path for the audio data in the two directions. The other end is exactly the same. In the capture stream path in the figure, an analog to digital (A/D) converter 120 continuously converts the analog sound captured by microphone 110 to digital audio samples at a sampling rate (fsmic). The digital audio samples are saved in capture buffer 130 sample by sample. The samples are retrieved from the capture buffer 130 in frame increments (herein denoted as “mic[n]”). A frame here means a number (n) of digital audio samples. Finally, the samples in mic[n] are processed and sent to the other end.
In the render stream path, the system receives audio samples from the other end and places them into a render buffer 140 in periodic frame increments (labeled “spk[n]” in the figure). The digital to analog (D/A) converter 150 then reads audio samples from the render buffer sample by sample and continuously converts them to an analog signal at a sampling rate, fsspk. Finally, the analog signal is played by speaker 160.
As already mentioned, the system includes two buffers: the capture buffer 130 and the render buffer 140. They are necessary because in most communication systems samples are written to and read from the buffers at different paces. For example, the A/D converter 120 outputs audio samples to the capture buffer continuously, sample by sample, but the system retrieves audio samples from the capture buffer frame by frame. This buffering introduces delay. For example, a sample generated by the A/D converter stays in the capture buffer for a short period of time before it is read out. The same happens in the render stream. In the special case where samples are read and written at the same pace, the buffers are not needed, but practical systems always need them.
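The delay introduced by frame-based reads can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: time is counted in sample periods, a frame is read the instant the buffer holds `frame_size` samples, and the function name `simulate_capture_buffer` is hypothetical rather than part of any described system.

```python
from collections import deque


def simulate_capture_buffer(num_samples, frame_size):
    """Return how many sample periods each sample waited in the buffer."""
    buffer = deque()
    waits = []
    for t in range(num_samples):
        # The A/D converter writes one sample per period; we store its arrival time.
        buffer.append(t)
        # The system reads a whole frame once enough samples have accumulated.
        if len(buffer) == frame_size:
            frame = [buffer.popleft() for _ in range(frame_size)]
            waits.extend(t - written_at for written_at in frame)
    return waits


waits = simulate_capture_buffer(num_samples=8, frame_size=4)
print(waits)  # [3, 2, 1, 0, 3, 2, 1, 0]
```

Under these assumptions the oldest sample in each frame waits `frame_size - 1` sample periods, so larger frames mean a longer worst-case buffering delay.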
In systems such as that depicted by FIG. 1, the near end user's voice is captured by the microphone 110 and sent to the other end. At the same time, the far end user's voice is transmitted through the network to the near end and played through the speaker 160 or a headphone. In this way, both users can hear each other and two-way communication is established. A problem occurs, however, if a speaker is used instead of a headphone to play the other end's voice. For example, if the near end user uses a speaker as shown in FIG. 1, his microphone captures not only his voice but also an echo of the sound played from the speaker (labeled as “echo(t)”). In this case, the mic[n] signal that is sent to the far end user includes an echo of the far end user's voice. As a result, the far end user hears a delayed echo of his or her own voice, which is likely to be annoying and gives that user a poor experience.
In practice, the echo echo(t) can be represented by the speaker signal spk(t) convolved with a linear response g(t) (assuming the room can be approximately modeled as a finite duration linear plant), as per the following equation:
\[
\mathrm{echo}(t) = \mathrm{spk}(t) * g(t) = \int_{0}^{T_e} g(\tau) \cdot \mathrm{spk}(t - \tau) \, d\tau \tag{1}
\]
where * means convolution and T_e is the echo length or filter length of the room response.
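In a sampled system, equation (1) is usually approximated by a finite sum over filter taps, echo[n] = sum over k of g[k]·spk[n − k]. The sketch below assumes NumPy and uses a made-up three-tap room response purely for illustration; the function name `simulate_echo` is hypothetical.

```python
import numpy as np


def simulate_echo(spk, g):
    """Discrete-time counterpart of equation (1): echo[n] = sum_k g[k] * spk[n - k]."""
    # np.convolve returns len(spk) + len(g) - 1 samples; keep the first len(spk).
    return np.convolve(spk, g)[: len(spk)]


spk = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # a single impulse played by the speaker
g = np.array([0.0, 0.5, 0.25])             # assumed toy room response
echo = simulate_echo(spk, g)
print(echo.tolist())  # [0.0, 0.5, 0.25, 0.0, 0.0]
```

Feeding an impulse through the model returns the room response itself, which is why g is also called the impulse response of the room.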
In order to remove the echo for the remote user, AEC 210 is added in the system as shown in FIG. 2. When a frame of samples in the mic[n] signal is retrieved from the capture buffer 130, it is sent to the AEC 210. At the same time, when a frame of samples in the spk[n] signal is sent to the render buffer 140, it is also sent to the AEC 210. The AEC 210 uses the spk[n] signal from the far end to predict the echo in the captured mic[n] signal. Then, the AEC 210 subtracts the predicted echo from the mic[n] signal. This difference or residual is the clear voice signal (voice[n]), which is theoretically echo free and very close to the near end user's voice (voice(t)).
FIG. 3 depicts an implementation of the AEC 210 based on an adaptive filter 310. The AEC 210 takes two inputs, the mic[n] and spk[n] signals. It uses the spk[n] signal to predict the mic[n] signal. The prediction residual (difference of the actual mic[n] signal from the prediction based on spk[n]) is the voice[n] signal, which will be output as echo free voice and sent to the far end.
The actual room response (represented as g(t) in the above convolution equation) usually varies with time, for example due to a change in position of the microphone 110 or speaker 160, body movement of the near end user, or even room temperature. The room response therefore cannot be pre-determined, and must be calculated adaptively at run time. The AEC 210 is commonly based on adaptive filters such as Least Mean Square (LMS) adaptive filters 310, which can adaptively model the varying room response.
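The predict-and-subtract loop with an LMS update can be sketched as follows. This is a minimal textbook LMS illustration, not the specific implementation of AEC 210: the function name `lms_echo_canceller`, the tap count, the step size, and the synthetic white-noise far-end signal are all assumptions, and the example contains echo only (no near end voice, no buffering delay, no double-talk handling).

```python
import numpy as np


def lms_echo_canceller(mic, spk, num_taps, step_size):
    """Predict the echo in mic from spk with an LMS filter; return residual and taps."""
    w = np.zeros(num_taps)        # running estimate of the room response
    history = np.zeros(num_taps)  # most recent spk samples, newest first
    voice = np.zeros(len(mic))
    for n in range(len(mic)):
        history = np.roll(history, 1)
        history[0] = spk[n]
        predicted_echo = w @ history
        voice[n] = mic[n] - predicted_echo    # prediction residual
        w += step_size * voice[n] * history   # LMS update driven by the residual
    return voice, w


rng = np.random.default_rng(0)
spk = rng.standard_normal(5000)          # synthetic far-end signal
g = np.array([0.0, 0.5, 0.25, -0.1])     # assumed toy room response
mic = np.convolve(spk, g)[: len(spk)]    # microphone picks up echo only
voice, w = lms_echo_canceller(mic, spk, num_taps=4, step_size=0.01)
print(np.round(w, 3))
```

In this idealized setting the filter taps settle close to the assumed room response and the residual power drops far below the microphone power. A real canceller would also have to cope with near end speech, noise, and a response that keeps changing.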