In voice communication, acoustic echo mitigation is a major challenge. Acoustic echo arises when the microphone re-captures the audio signal played out by the loudspeaker (the loudspeaker signal or reference signal), so that the talker on the other side (the far end) hears his or her own voice together with the near-end input.
Conventionally, there are two fundamental techniques for mitigating acoustic echo: acoustic echo cancellation (AEC) and acoustic echo suppression (AES). AEC is generally used to cancel most of the acoustic echo from the microphone signal, and AES is generally used to further suppress the residual echo in the error signal obtained after the AEC processing. AES may be used alone when low complexity or robustness to minor echo path changes is desired (Christof Faller, Jingdong Chen: Suppressing Acoustic Echo in a Spectral Envelope Space. IEEE Transactions on Speech and Audio Processing 13(5-2): 1048-1062 (2005), the entirety of which is incorporated herein by reference).
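As an illustration of the cancellation stage described above, the following is a minimal sketch of an adaptive echo canceller that produces the error signal later passed to AES. It assumes a normalized least-mean-squares (NLMS) filter, which is one common choice and is not taken from the cited reference. The function names, the tap count, the step size, and the three-tap echo path are all hypothetical values chosen for illustration.

```python
import random


def nlms_echo_canceller(reference, microphone, taps=8, step=0.5, eps=1e-8):
    """Return the error signal e[n] = microphone[n] - estimated echo[n].

    NLMS is an assumption made for this sketch; any adaptive filter that
    models the echo path could take its place.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    error = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate
        # Normalize the update by the reference energy in the filter window.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
        error.append(e)
    return error


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


# Synthetic far-end signal and a made-up 3-tap echo path (illustrative only).
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, -0.3, 0.1]
microphone = [
    sum(echo_path[k] * reference[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(reference))
]
error = nlms_echo_canceller(reference, microphone)
```

In this noise-free, echo-only setting the error power falls far below the microphone power once the filter has converged. In practice some echo remains in the error signal, and that residual is what AES is used to suppress.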
The proper operation of AES depends on appropriate gains derived from the residual echo power estimated from the error signal output by the AEC. However, it is challenging to estimate the residual echo power both robustly and swiftly, because the power of the error signal may change due to various factors, such as noise, double talk (or near-end talk), and changes in the properties of the echo path (LEM, Loudspeaker-Enclosure-Microphone), for example switching between a headset and a loudspeaker.
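The dependence of the AES gain on the residual echo power estimate may be sketched as follows. The example assumes a Wiener-style gain computed per frequency band with a lower gain limit; the function name `suppression_gain`, the `min_gain` floor, and the numeric powers are hypothetical and serve only to show how an inaccurate estimate affects the gain.

```python
def suppression_gain(error_power, residual_echo_power, min_gain=0.1):
    """Wiener-style suppression gain for one frequency band.

    error_power: power of the AEC error signal in the band.
    residual_echo_power: estimated residual echo power in the band.
    min_gain: assumed floor that limits the maximum attenuation.
    """
    if error_power <= 0.0:
        return min_gain
    gain = 1.0 - residual_echo_power / error_power
    return max(min_gain, min(1.0, gain))


# Accurate estimate: the error signal is mostly near-end speech, so little is removed.
gain_accurate = suppression_gain(error_power=1.0, residual_echo_power=0.25)

# Overestimate (for example, near-end talk mistaken for echo): the band is
# attenuated down to the floor and near-end speech is suppressed erroneously.
gain_overestimated = suppression_gain(error_power=1.0, residual_echo_power=2.0)
```

Under these assumptions the gain tracks the ratio of estimated residual echo power to error power, so any power change in the error signal that is wrongly attributed to echo translates directly into excess suppression.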
One solution is to employ a simple hard-decision voice activity detector to produce a double talk flag; the AES may then be adjusted depending on the flag so that near-end talk is not regarded as residual echo and suppressed erroneously. An example may be found in Makoto Shozakai et al., U.S. Pat. No. 7,440,891, patented on Oct. 21, 2008 and originally assigned to Asahi Kasei Kabushiki Kaisha, titled “Speech Processing Method and Apparatus for Improving Speech Quality and Speech Recognition Performance”, the entirety of which is incorporated herein by reference. However, in such a solution, the hard-decision double talk flag depends on the empirical selection of a threshold, which usually cannot meet the requirements of all scenarios. Furthermore, such a solution tends to confuse double talk with other changes, such as an echo path change or a noise level change, which also result in a power change in the error signal.
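The limitation of a hard-decision flag may be illustrated with the minimal sketch below. It is not the method of the cited patent; it assumes a simple power-ratio test, and the names `double_talk_flag` and `adjusted_gain`, the threshold of 4.0, and the power values are hypothetical.

```python
def double_talk_flag(error_power, expected_residual_power, threshold=4.0):
    """Hard decision: raise the flag when the error power exceeds the
    expected residual echo power by more than an empirically chosen factor."""
    return error_power > threshold * expected_residual_power


def adjusted_gain(aes_gain, flag):
    """Bypass suppression (gain 1.0) while the double talk flag is raised."""
    return 1.0 if flag else aes_gain


expected_residual = 0.01

# Echo only: the error power stays close to the expected residual.
echo_only = double_talk_flag(0.012, expected_residual)

# Near-end talk raises the error power, so the flag is raised as intended.
near_end_talk = double_talk_flag(0.25, expected_residual)

# An echo path change raises the error power by the same amount, so the
# detector cannot tell it apart from near-end talk.
echo_path_change = double_talk_flag(0.25, expected_residual)
```

In this sketch the last two cases yield the same flag, so residual echo caused by an echo path change would pass through unsuppressed, and the outcome in borderline cases depends entirely on the chosen threshold.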