Acoustic Echo Cancellation (AEC) is a digital signal processing technology that removes the acoustic echo picked up from a speakerphone in two-way (full duplex) or multi-way communication systems, such as traditional telephones or modern internet audio conversation applications.
1. Overview of AEC Processing
FIG. 1 illustrates an example of one end 100 of a typical two-way communication system, which includes a capture stream path and a render stream path for the audio data in the two directions. The other end is exactly the same. In the capture stream path in the figure, an analog to digital (A/D) converter 120 continuously converts the analog sound captured by microphone 110 to digital audio samples at a sampling rate (fs_mic). The digital audio samples are saved in capture buffer 130 sample by sample. The samples are retrieved from the capture buffer in frame increments (herein denoted as “mic[n]”). A frame here means a number (n) of digital audio samples. Finally, the samples in mic[n] are processed and sent to the other end.
In the render stream path, the system receives audio samples from the other end and places them into a render buffer 140 in periodic frame increments (labeled “spk[n]” in the figure). Then the digital to analog (D/A) converter 150 reads audio samples from the render buffer sample by sample and continuously converts them to an analog signal at a sampling rate (fs_spk). Finally, the analog signal is played by speaker 160.
In systems such as that depicted by FIG. 1, the near end user's voice is captured by the microphone 110 and sent to the other end. At the same time, the far end user's voice is transmitted through the network to the near end, and played through the speaker 160 or a headphone. In this way, both users can hear each other and two-way communication is established. A problem occurs, however, if a speaker is used instead of a headphone to play the other end's voice. For example, if the near end user uses a speaker as shown in FIG. 1, his or her microphone captures not only the user's voice but also an echo of the sound played from the speaker (labeled as “echo(t)”). In this case, the mic[n] signal that is sent to the far end user includes an echo of the far end user's voice. As a result, the far end user hears a delayed echo of his or her own voice, which is likely to cause annoyance and provide a poor user experience to that user.
Practically, the echo echo(t) can be represented by the speaker signal spk(t) convolved with a linear response g(t) (assuming the room can be approximately modeled as a finite duration linear plant) as per the following equation:

$$\mathrm{echo}(t) = \mathrm{spk}(t) * g(t) = \int_{0}^{T_e} g(\tau)\,\mathrm{spk}(t-\tau)\,d\tau$$

where * means convolution and $T_e$ is the echo length or filter length of the room response.
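The convolution model above can be illustrated in discrete time. The following is a minimal sketch, not part of the described system: the function name simulate_echo and the tap values in g are hypothetical, and the integral is approximated by a finite sum over sampled values.

```python
def simulate_echo(spk, g):
    """Discrete-time analogue of echo(t) = spk(t) * g(t).

    spk: far-end samples played by the speaker.
    g:   sampled room response (hypothetical values); len(g) plays the
         role of the echo length Te, measured in samples.
    """
    echo = []
    for n in range(len(spk)):
        acc = 0.0
        # Sum g[k] * spk[n - k] over the finite response length.
        for k in range(len(g)):
            if n - k >= 0:
                acc += g[k] * spk[n - k]
        echo.append(acc)
    return echo


spk = [1.0, 0.0, 0.0, 0.0, 0.0]   # a single click from the speaker
g = [0.0, 0.5, 0.25]              # assumed response: two decaying reflections
echo = simulate_echo(spk, g)      # the click returns delayed and attenuated
```

Because the response is assumed finite, only the most recent len(g) speaker samples contribute to each echo sample.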
In order to remove the echo for the remote user, AEC 210 is added to the system as shown in FIG. 2. When a frame of samples in the mic[n] signal is retrieved from the capture buffer 130, it is sent to the AEC 210. At the same time, when a frame of samples in the spk[n] signal is sent to the render buffer 140, it is also sent to the AEC 210. The AEC 210 uses the spk[n] signal from the far end to predict the echo in the captured mic[n] signal. Then, the AEC 210 subtracts the predicted echo from the mic[n] signal. This difference or residual is the clear voice signal (voice[n]), which is theoretically echo free and very close to the near end user's voice (voice(t)).
FIG. 3 depicts an implementation of the AEC 210 based on an adaptive filter 310. The AEC 210 takes two inputs, the mic[n] and spk[n] signals. It uses the spk[n] signal to predict the mic[n] signal. The prediction residual (difference of the actual mic[n] signal from the prediction based on spk[n]) is the voice[n] signal, which will be output as echo free voice and sent to the far end.
The actual room response (represented as g(t) in the above convolution equation) usually varies with time, for example due to changes in the position of the microphone 110 or speaker 160, body movement of the near end user, and even room temperature. The room response therefore cannot be pre-determined, and must be calculated adaptively at run time. The AEC 210 is commonly based on adaptive filters such as Least Mean Square (LMS) adaptive filters 310, which can adaptively model the varying room response.
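The adaptive prediction described above can be sketched with a textbook time-domain LMS filter. This is an illustrative sketch under stated assumptions, not the implementation of AEC 210: the function name lms_echo_canceller, the tap count, the step size mu, and the simulated room response are all hypothetical, and the microphone signal here contains echo only (no near end speech).

```python
import random


def lms_echo_canceller(mic, spk, num_taps, mu):
    """Sample-by-sample LMS echo canceller; returns (voice, weights)."""
    w = [0.0] * num_taps            # running estimate of the room response
    history = [0.0] * num_taps      # most recent speaker samples, newest first
    voice = []
    for m, s in zip(mic, spk):
        history = [s] + history[:-1]
        predicted_echo = sum(wi * xi for wi, xi in zip(w, history))
        residual = m - predicted_echo   # voice[n]: mic minus predicted echo
        # LMS update: nudge each tap in proportion to residual times input.
        w = [wi + mu * residual * xi for wi, xi in zip(w, history)]
        voice.append(residual)
    return voice, w


random.seed(0)
room = [0.0, 0.5, 0.25]                                  # assumed room response
spk = [random.uniform(-1.0, 1.0) for _ in range(5000)]   # far end signal
# The microphone hears only the echo, so the residual should decay toward zero.
mic = [sum(room[k] * spk[n - k] for k in range(len(room)) if n - k >= 0)
       for n in range(len(spk))]
voice, w = lms_echo_canceller(mic, spk, num_taps=3, mu=0.05)
```

In this setup the weights w drift toward the assumed room response, which is the sense in which the filter models the room adaptively. A change in room would simply move the target that the weights track.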
In addition to AEC, these voice communications systems (e.g., alternative system 400 shown in FIG. 4) may also provide center clipping (CC) 410 processing of the audio signal captured by the microphone. Center clipping further reduces echo (after acoustic echo cancellation 210) by setting the magnitude of the processed microphone signal equal to the magnitude of the background noise floor during periods when echo is detected, but no near-end speech is present.
The full-duplex communication experience can be further improved by two additional processes for processing the near-end speech signal. These include a residual echo suppression (RES) process that further suppresses the acoustic echo from the speakers, and a microphone array process that improves the signal to noise ratio of the speech captured from multiple microphones. One subcomponent of the microphone array process is a sound source localization (SSL) process used to estimate the direction of arrival (DOA) of the near-end speech signal.
2. Overview of RES Processing
FIG. 5 illustrates a conventional residual echo suppression (RES) process 500. The representative RES process is applied to a microphone channel c following the AEC process.
Upon starting, the RES process is initialized (510) to the following initial state of weight (wc), complex, frequency domain AEC residual (Xc(f,t)), and far-end signal power (Pc(f,t)):

$$w_c(0) = 0$$

$$X_c(f,t) = 0 \quad \text{for } t \le 0$$

$$P_c(f,0) = \|X_c(f,1)\|^2$$

where f is the individual frequency band and t is the frame index. Here, the weight is a factor applied in the RES process to predict the residual signal magnitude. The complex AEC residual is the output produced by the previous AEC process in the microphone channel. The far-end signal power is the power of the far-end signal calculated in the RES process.
As indicated at 520, 590 in FIG. 5, the RES process 500 repeats a processing loop (actions 530-580) for each frame t=1, . . . , ∞ of the microphone channel. In the processing loop, the RES process 500 first predicts (action 530) the residual signal magnitude estimate by the equation,

$$\hat{R}_c(f,t) = \sum_{i=0}^{L-1} w_{c,i}(t)\,\left|X_c(f,t-i)\right|$$
At action 540, the RES process 500 computes the error signal as a function of the magnitude of the AEC residual Mc, the residual signal magnitude estimate, and the noise floor NFc(f,t), via the equation:

$$E_c(f,t) = \max\left(\left|M_c(f,t)\right| - \hat{R}_c(f,t),\; NF_c(f,t)\right) \tag{1}$$
At action 550, the RES process 500 computes the smoothed far-end signal power using the calculation,

$$P_c(f,t) = \alpha\,P_c(f,t-1) + (1-\alpha)\,\|X_c(f,t)\|^2$$
At action 560, the RES process 500 computes the normalized gradient

$$\nabla_c(t) = \frac{-2\,E_c(f,t)\,\left|X_c(f,t)\right|}{P_c(f,t)}$$
At action 570, the RES process 500 updates the weight with the following equation,

$$w_c(t+1) = w_c(t) - \frac{\mu}{2}\,\nabla_c(t)$$
At action 580, the RES process 500 applies the gain to the AEC residual phase to produce the RES process output (Bc(f,t)) using the following calculation, where φc(f,t) is the phase of the complex AEC residual, Xc(f,t):

$$B_c(f,t) = E_c(f,t)\,e^{j\varphi_c(f,t)} \tag{2}$$
The RES process 500 then continues to repeat this processing loop (actions 530-580) for a subsequent frame as indicated at 590. With this processing, the RES process is intended to predict and remove residual echo remaining from the preceding acoustic echo cancellation applied on the microphone channel. However, the RES process 500 includes a non-linear operation, i.e., the “max” operator in equation (1). The presence of this non-linear operation in the RES process 500 may introduce non-linear phase effects in the microphone channel that can adversely affect the performance of subsequent processes that depend upon phase and delay of the microphone channel, including the SSL/MA process.
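The processing loop of actions 530-580 can be sketched for a single frequency band and a single microphone channel as follows. This is a hedged illustration, not the process of FIG. 5 itself. Several details are assumptions made only for the sketch: the function name res_single_band, the parameter values, a constant noise floor, a small eps guard against division by zero, and the choice to update tap i with the correspondingly delayed magnitude. Xc and Mc are kept as separate inputs because the text uses both symbols.

```python
import cmath


def res_single_band(x_frames, m_frames, noise_floor,
                    num_taps=2, mu=0.1, alpha=0.9, eps=1e-12):
    """One band of the RES loop; returns (output frames B, final weights)."""
    w = [0.0] * num_taps                  # w_c(0) = 0
    if not x_frames:
        return [], w
    mags = [0.0] * num_taps               # |X_c(f, t - i)|, zero for t <= 0
    p = abs(x_frames[0]) ** 2             # P_c(f, 0) = ||X_c(f, 1)||^2
    out = []
    for x, m in zip(x_frames, m_frames):
        mags = [abs(x)] + mags[:-1]
        # Action 530: predicted residual magnitude.
        r_hat = sum(wi * xi for wi, xi in zip(w, mags))
        # Action 540, equation (1): error floored at the noise floor.
        e = max(abs(m) - r_hat, noise_floor)
        # Action 550: smoothed power.
        p = alpha * p + (1.0 - alpha) * abs(x) ** 2
        # Action 560: normalized gradient (per tap, an assumption; eps is a guard).
        grad = [-2.0 * e * xi / (p + eps) for xi in mags]
        # Action 570: weight update.
        w = [wi - 0.5 * mu * gi for wi, gi in zip(w, grad)]
        # Action 580, equation (2): magnitude e with the phase of X_c.
        out.append(e * cmath.exp(1j * cmath.phase(x)))
    return out, w


x_frames = [1 + 1j, 0.5 - 0.5j, -0.2 + 0.1j, 0.8 + 0j]   # made-up band values
m_frames = [0.3, 0.2, 0.1, 0.25]                         # made-up magnitudes
b_frames, weights = res_single_band(x_frames, m_frames, noise_floor=0.01)
```

The max in the error computation is the non-linear step discussed above: whenever the prediction exceeds the measured magnitude, the output magnitude is pinned to the noise floor rather than following the input.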
3. Overview of Center Clipping
FIG. 6 shows an overview of a center clipping (CC) process 600 that is known in the art. The illustrated CC process performs center clipping for a single frequency band. The CC process 600 is run separately on each of the frequency bands on which center clipping is to be performed.
The CC process 600 operates as follows. A multiplier block 610 multiplies an estimate of the peak speaker signal (“SpkPeak”) by an estimate of the speaker to microphone gain (“SpkToMicGain”), producing a leak through estimate (“leak through”) of the peak speaker echo in the microphone signal. Next, the leak through estimate is filtered across the neighboring frequency bands in block 620 to produce a filtered leak through estimate. Reverberance block 630 scales the filtered leak through estimate by another parameter to account for the amount of reverberance in the particular band, yielding the value labeled as A.
In parallel, filter blocks 640, 650 separately filter the instantaneous microphone power and residual power across the neighboring frequency bands. Block 660 selects the minimum of the two filtered power results to produce the value labeled as B. As indicated at block 670, if A>B, then a flag is set to 1. Otherwise, the flag is set to 0. If the flag is set to 0, then the AEC residual for band f is not changed, and block 680 outputs the AEC residual as the CC process output. However, if the flag is set to 1, then block 680 instead sets the AEC residual in band f to a complex value with the magnitude equal to the background noise floor, and the phase is set to a random value between 0 and 360 degrees produced at block 690.
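The decision logic of blocks 610-690 can be sketched for one band as follows. This is an illustrative sketch only. The text does not specify the filters used across neighboring bands, so a plain average over adjacent bands is assumed here; the function and parameter names are hypothetical, and the per-band lists for SpkPeak and SpkToMicGain are an assumption about how those estimates are stored.

```python
import cmath
import math
import random


def smooth_across_bands(values, band, radius=1):
    """Average a per-band quantity over neighboring bands (assumed filter)."""
    lo = max(0, band - radius)
    hi = min(len(values), band + radius + 1)
    return sum(values[lo:hi]) / (hi - lo)


def center_clip_band(band, aec_residual, spk_peak, spk_to_mic_gain,
                     mic_power, residual_power, reverb_scale, noise_floor,
                     rng=random):
    """Return (output value for the band, flag)."""
    # Blocks 610-630: leak through estimate, filtered and scaled, gives A.
    leak_through = [p * g for p, g in zip(spk_peak, spk_to_mic_gain)]
    a = reverb_scale * smooth_across_bands(leak_through, band)
    # Blocks 640-660: minimum of the two filtered powers gives B.
    b = min(smooth_across_bands(mic_power, band),
            smooth_across_bands(residual_power, band))
    # Block 670: the flag is 1 only when A > B.
    flag = 1 if a > b else 0
    if flag == 0:
        return aec_residual[band], flag      # block 680: pass through unchanged
    # Blocks 680-690: noise floor magnitude with a random phase.
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return cmath.rect(noise_floor, phase), flag


residual = [0.2 + 0.1j, 0.05 - 0.02j, 0.3 + 0.0j]   # made-up AEC residual per band
loud, loud_flag = center_clip_band(
    1, residual, spk_peak=[1.0, 1.0, 1.0], spk_to_mic_gain=[0.5, 0.5, 0.5],
    mic_power=[0.01, 0.01, 0.01], residual_power=[0.004, 0.004, 0.004],
    reverb_scale=1.0, noise_floor=0.001)
quiet, quiet_flag = center_clip_band(
    1, residual, spk_peak=[0.0, 0.0, 0.0], spk_to_mic_gain=[0.5, 0.5, 0.5],
    mic_power=[0.01, 0.01, 0.01], residual_power=[0.004, 0.004, 0.004],
    reverb_scale=1.0, noise_floor=0.001)
```

With a loud speaker the band is replaced by a noise floor value whose phase is random, while with a silent speaker the AEC residual passes through untouched. The random phase is what discards the phase information of the original band.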
This conventional center clipping process is designed to operate on a single microphone channel. This conventional process does not allow for the multiple microphone channels in a microphone array, due to the non-linear process in block 680. In addition, a microphone array has separate speaker to microphone gain for each microphone channel, and separate instantaneous microphone power.