1. Technical Field
The invention is related to residual echo suppression in a microphone signal which has been previously processed by an acoustic echo canceller (AEC), and more particularly to a regression-based residual echo suppression (RES) system and process for suppressing the portion of the microphone signal corresponding to a playback of a speaker audio signal that was not suppressed by the AEC.
2. Background Art
In teleconferencing or speech recognition applications, a microphone picks up sound that is being played through the speakers. In teleconferencing this leads to perceived echoes, and in speech recognition it reduces recognition performance. Acoustic Echo Cancellers (AECs) are used to alleviate this problem.
However, the echo reduction provided by AEC is often not sufficient for applications that require a high level of speech quality, such as speech recognition. The insufficient echo reduction is caused by, among other things, adaptive filter lengths in AEC that are much shorter than the room response. Short AEC filters are used to make AEC computationally feasible and to achieve reasonably fast convergence. Various methods have been employed to suppress the residual echo. For example, techniques such as coring (also referred to as center clipping) have been used. However, coring can lead to near-end speech distortion.
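By way of illustration only, coring can be sketched as follows. The function name `center_clip`, the threshold value, and the sample values are hypothetical and are not taken from any cited reference; the sketch merely shows why low-level near-end speech can be distorted along with the residual echo.

```python
def center_clip(samples, threshold):
    """Zero every sample whose magnitude is below the threshold.

    Samples at or above the threshold pass through unchanged. The
    threshold is an illustrative assumption, not a recommended value.
    """
    return [s if abs(s) >= threshold else 0.0 for s in samples]


# Hypothetical AEC output: small values stand in for residual echo,
# larger values for near-end speech.
residual = [0.01, -0.02, 0.5, -0.4, 0.03]
clipped = center_clip(residual, threshold=0.05)
print(clipped)
```

Because the clipping decision depends only on amplitude, any near-end speech component that falls below the threshold is removed together with the echo, which is one way the distortion noted above can arise.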
Other methods attempt to remove the residual echo by estimating its power spectral density (PSD) and then removing it using Wiener filtering [1,2] or spectral subtraction [3]. However, most of those methods either need prior information about the room, or make unreasonable assumptions about signal properties. For example, some methods estimate the PSD based on long-term reverberation models of the room [3]. Parameters of the model are dependent on the room configuration and need to be calculated in advance based on the behavior of the room impulse response.
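For a single frequency bin, the two PSD-based approaches can be sketched as below. This is a minimal textbook-style sketch, not the specific formulation of references [1], [2] or [3]: it assumes the residual echo PSD has already been estimated, and the names `wiener_gain`, `spectral_subtraction_gain`, `observed_psd`, `echo_psd` and `floor` are hypothetical.

```python
import math


def wiener_gain(observed_psd, echo_psd, floor=0.0):
    """Wiener-style gain for one bin: 1 - echo_psd / observed_psd.

    The result is limited from below by `floor` so that an
    over-estimated echo PSD cannot produce a negative gain.
    """
    if observed_psd <= 0.0:
        return floor
    return max(1.0 - echo_psd / observed_psd, floor)


def spectral_subtraction_gain(observed_psd, echo_psd, floor=0.0):
    """Power spectral subtraction expressed as a magnitude gain.

    The echo PSD is subtracted from the observed PSD, negative
    results are clamped to zero, and the square root converts the
    power ratio into a gain applied to the bin magnitude.
    """
    if observed_psd <= 0.0:
        return floor
    ratio = max(1.0 - echo_psd / observed_psd, 0.0)
    return max(math.sqrt(ratio), floor)


# Hypothetical values: the bin holds 4.0 units of power, of which an
# estimated 1.0 unit is residual echo.
g_wiener = wiener_gain(4.0, 1.0)
g_subtract = spectral_subtraction_gain(4.0, 1.0)
print(g_wiener, g_subtract)
```

In both cases the gain is only as good as `echo_psd`, which is why the accuracy of the residual echo PSD estimate governs the performance of these methods.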
There are some techniques that estimate the residual echo PSD via a so-called “coherence analysis” which is based on the cross-correlation between the speaker signal (sometimes referred to as the far-end signal in teleconferencing applications) and the residual signal. In a sub-band system, only the discrete Fourier transforms (DFTs) of the windowed signals are available, so the cross-correlations can be calculated only approximately [1]. In [2], the coherence function is computed based on a block of a few frames of data; in [1] it is based on multiple blocks. The latter assumes that the frames of the speaker signal are uncorrelated, which is almost never true. The performance of these algorithms is dictated by the accuracy of the PSD estimate and their ability to track it accurately from one frame to another. The accuracy decreases when near-end speech is present or when the echo path changes.
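One common way to express a coherence-style estimate for a single frequency bin is sketched below. It is an assumption made for illustration and is not asserted to be the exact computation of [1] or [2]: the cross-spectrum between the speaker signal and the residual signal is averaged over a block of frames, and the residual echo PSD is taken as the squared magnitude of that average divided by the speaker PSD. The names `residual_echo_psd`, `speaker_bins` and `residual_bins` are hypothetical.

```python
def residual_echo_psd(speaker_bins, residual_bins):
    """Estimate the residual echo PSD in one frequency bin.

    speaker_bins and residual_bins hold the complex DFT coefficient
    of that bin for each frame in the block. The estimate is
    |S_xe|^2 / S_xx, where S_xe is the frame-averaged cross-spectrum
    and S_xx the frame-averaged speaker PSD.
    """
    n = len(speaker_bins)
    if n == 0 or n != len(residual_bins):
        raise ValueError("need two non-empty blocks of equal length")
    s_xx = sum(abs(x) ** 2 for x in speaker_bins) / n
    s_xe = sum(x.conjugate() * e
               for x, e in zip(speaker_bins, residual_bins)) / n
    if s_xx == 0.0:
        return 0.0
    return abs(s_xe) ** 2 / s_xx


# Hypothetical block of four frames in which the residual is exactly
# half the speaker signal, i.e. pure echo with no near-end speech.
speaker = [1 + 1j, 2 - 1j, -1 + 0.5j, 0.5 + 2j]
residual = [0.5 * x for x in speaker]
echo_psd = residual_echo_psd(speaker, residual)
print(echo_psd)
```

With only a few frames in the block, near-end speech in the residual does not average out of the cross-spectrum, which is consistent with the loss of accuracy noted above when near-end speech is present.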
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of references including the publications corresponding to each designator can be found at the end of the Detailed Description section.