The present invention relates generally to the field of acoustic echo cancellation and more particularly to an improved method for detecting double-talk in acoustic echo cancellation systems.
With the increasingly commonplace use of speakerphones and teleconferencing, acoustic echo cancellation has recently become a topic of critical importance. In particular, an acoustic echo canceller (AEC) ideally removes the undesired echo signal that invariably feeds back from the loudspeaker to the microphone in full-duplex hands-free telecommunications systems. Echo cancellation is performed by modeling the echo path impulse response with an adaptive finite impulse response (FIR) filter, fully familiar to those of ordinary skill in the art, and subtracting the computed echo estimate from the microphone output signal (i.e., the return signal). FIG. 1 shows a diagram of an illustrative single-channel AEC. (In many cases, stereo echo cancellers are used, but in the context of the instant problem and the present invention, the use of a single-channel teleconferencing system will be adequate for purposes of understanding the invention.) The contents and operation of FIG. 1 will be described in detail below.
More specifically, an acoustic echo canceller mitigates the echo effect by adjusting the transfer function (i.e., the impulse response characteristic) of the adaptive filter in order to generate an estimate of the unwanted return signal. That is, the filter is adapted to mimic the effective transfer function of the acoustic path from the loudspeaker to the microphone. As such, when the incoming signal (i.e., the signal coming from the far end, shown as x(n) in FIG. 1) is filtered, the output of the filter estimates the unwanted return signal which comprises the echo (shown as y(n) in FIG. 1). This estimate is then subtracted from the outgoing signal (i.e., the return signal) to produce an error signal (shown as e(n) in FIG. 1). By adapting the filter impulse response characteristic such that the error signal approaches zero, the echo is advantageously reduced or eliminated. That is, the filter coefficients, and hence the estimate of the unwanted echo, are updated in response to continuously received samples of the error signal so as to cancel the echo as completely as possible.
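The adaptation loop described above may be sketched as follows. This is a minimal illustration only, assuming a normalized least-mean-square (NLMS) update, which is one common choice and is not specified in the text above; the function name `nlms_echo_canceller`, the step size `mu`, the regularization constant `eps`, and the synthetic echo path `h_true` are all hypothetical.

```python
import numpy as np

def nlms_echo_canceller(x, y, L=8, mu=0.5, eps=1e-8):
    """Adapt an L-tap FIR model of the echo path; return (e, h_hat).

    x: far-end signal x(n); y: return (microphone) signal y(n).
    NLMS is an assumed adaptation rule, chosen here for illustration.
    """
    h_hat = np.zeros(L)                 # estimated filter coefficients
    e = np.zeros(len(x))                # error signal e(n)
    x_buf = np.zeros(L)                 # [x(n), x(n-1), ..., x(n-L+1)]
    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1)       # shift older samples down by one
        x_buf[0] = x[n]
        y_hat = h_hat @ x_buf           # echo estimate from the filter
        e[n] = y[n] - y_hat             # subtract estimate from return signal
        # Update coefficients so that the error is driven toward zero
        h_hat += mu * e[n] * x_buf / (x_buf @ x_buf + eps)
    return e, h_hat

# Synthetic check: white far-end signal, no near-end speech, no noise
rng = np.random.default_rng(0)
h_true = np.array([0.5, -0.3, 0.2, 0.1])     # hypothetical echo path
x = rng.standard_normal(4000)
y = np.convolve(x, h_true)[:len(x)]          # echo only
e, h_hat = nlms_echo_canceller(x, y, L=4)
```

In this echo-only setting the coefficients converge to the echo path and the error decays toward zero; with near-end speech present in y(n), the same update would be driven by the near-end speech as well, which is the situation a double-talk detector is meant to guard against.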
Additionally, double-talk detectors (DTD) are generally used in echo cancellers in order to disable the filter adaptation during double-talk conditions. That is, when both the near-end party and the far-end party to a conversation taking place across a telecommunications line speak simultaneously, it would be clearly undesirable to attempt to minimize the entire "error signal," since that signal now also includes the "double-talk" (i.e., the speech of the near-end speaker, shown as v(n) in FIG. 1). More specifically, the function of a double-talk detector is to recognize that double-talk is occurring, and to stop the filter from further adaptation until the double-talk situation ceases.
The basic double-talk detection scheme typically comprises the computation of a "detection statistic" and the comparison of that statistic with a predetermined threshold. Various prior art methods have been employed to form the detection statistic, each typically using the far-end speech signal, x(n), and the return signal, y(n), as the basis for computing the statistic. (Some approaches use the error signal, e(n), rather than the return signal, y(n), which provides essentially the same information.) Obviously, if there were no echo (i.e., the signal from the loudspeaker to the microphone remained totally undisturbed, or equivalently, the effective transfer function, h(n), of the receiving room were unity), and if furthermore there were no background noise, w(n), in the receiving room, then signals x(n) and y(n) would be identical if and only if there were no double-talk (i.e., x(n)=y(n) if and only if v(n)=0). Since this is not the case, however, the computation of a useful detection statistic must take the presence of the echo, as well as the possible presence of background noise, into account.
More specifically, the generalized procedure for handling double-talk may be described by the following four steps:
1. A detection statistic, ξ, is formed using the available signals (e.g., x(n), y(n), e(n), etc., and the estimated filter coefficients ĥ);
2. The detection statistic, ξ, is compared to a predetermined threshold, T, and double-talk is declared if, for example, ξ < T;
3. Once double-talk is detected, it is declared to exist for a minimum period of time, T_hold, during which the filter adaptation is disabled; and
4. If, for example, ξ ≥ T continuously for the interval T_hold, the filter then resumes adaptation, the comparison of ξ to T continues, and double-talk is declared to exist again when, for example, ξ < T.
Note that the use of a hold time T_hold in steps 3 and 4 above is advantageously employed in order to suppress detection dropouts due to the potentially noisy behavior of the detection statistic. Although there are some possible variations, most DTD algorithms have this basic form and differ only in their specific formation of the detection statistic (and the corresponding choice of the threshold, T).
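The threshold comparison and hold logic of steps 2 through 4 may be sketched as follows, independently of how the detection statistic itself is formed. The function name `double_talk_flags` is hypothetical, and the hold time is assumed here to be expressed as a number of samples, `hold`, with the sample on which detection occurs counted as the first sample of the hold interval.

```python
def double_talk_flags(xi, T, hold):
    """Return one boolean per sample: True while adaptation is disabled.

    xi: sequence of detection-statistic values; T: threshold;
    hold: hold time in samples (an assumed unit for this sketch).
    """
    flags = []
    remaining = 0                  # samples left in the current hold interval
    for value in xi:
        if value < T:
            remaining = hold       # each detection (re)starts the hold interval
        flags.append(remaining > 0)
        if remaining > 0:
            remaining -= 1
    return flags

# A single dip below the threshold keeps adaptation disabled for 3 samples
flags = double_talk_flags([0.9, 0.9, 0.4, 0.9, 0.9, 0.9, 0.9], T=0.5, hold=3)
```

Because every detection restarts the interval, a statistic that flickers around the threshold does not cause adaptation to be switched on and off from one sample to the next.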
One particular prior art approach to the formation of the detection statistic, fully familiar to those skilled in the art, is due to A. A. Geigel. (See, e.g., D. L. Duttweiler, "A Twelve-Channel Digital Echo Canceller," IEEE Trans. Commun., vol. 26, no. 5, pp. 647-653, May 1978.) Although the Geigel technique has proven successful when used in network echo cancellers, it has often provided less than reliable performance when used in an acoustic echo cancellation application. Specifically, the Geigel DTD declares the presence of near-end speech whenever

$$\xi^{(g)} = \frac{\max\{|x(n)|, \ldots, |x(n-L_g+1)|\}}{|y(n)|} < T, \qquad (1)$$

where L_g and T (the threshold) are suitably chosen constants. This detection scheme is based on a waveform level comparison between the return signal y(n) and the far-end speech x(n), assuming that the near-end speech v(n) in the microphone signal will typically be at the same level as, or stronger than, the echo y'(n). The maximum of the L_g most recent samples of x(n) is taken for the comparison because of the unknown delay in the echo path. The predetermined threshold T compensates for the gain of the echo path response h, and is often set to 2 for network echo cancellers because the hybrid (the echo path) loss is typically about 6 dB or more. For an AEC, however, it is not easy to set a universal threshold that works reliably in all the various situations, because the loss through the acoustic echo path can vary greatly depending on many factors. For L_g, one easy choice is to set it equal to the adaptive filter length L, since we can assume that the echo path is covered by this length.
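The Geigel comparison of relation (1) may be sketched as follows. The function name `geigel_dtd` and the toy signals are hypothetical. The sketch tests max|x| < T·|y(n)| rather than forming the ratio, which is equivalent whenever y(n) is nonzero and avoids a division by zero otherwise.

```python
import numpy as np

def geigel_dtd(x, y, L_g, T=2.0):
    """Return a boolean array: True where near-end speech is declared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for n in range(len(x)):
        start = max(0, n - L_g + 1)
        # Largest magnitude among the L_g most recent far-end samples
        x_max = np.max(np.abs(x[start:n + 1]))
        flags[n] = x_max < T * abs(y[n])
    return flags

# Far-end impulse; echo path with one sample of delay and 6 dB of loss
x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 0.5, 0.0, 0.0])    # echo only
y[2] += 0.8                           # near-end speech arrives at n = 2
flags = geigel_dtd(x, y, L_g=3, T=2.0)
```

With T = 2, the echo sample at n = 1 sits exactly at the threshold and is not flagged, while the stronger near-end sample at n = 2 is. An echo path with less than 6 dB of loss would cause the echo itself to be flagged, which illustrates why a single threshold is hard to choose for an AEC.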
Another prior art technique is to form the detection statistic based on the cross-correlation coefficient vector between the signals x(n) and e(n). (See, e.g., H. Ye et al., "A New Double-Talk Detection Algorithm Based on the Orthogonality Theorem," IEEE Trans. Commun., vol. 39, pp. 1542-1545, November 1991.) In fact, using the cross-correlation coefficient vector between x(n) and y(n), rather than between x(n) and e(n), has actually proven more robust and reliable. Specifically, the cross-correlation coefficient vector between x(n) and y(n) is defined as:

$$\mathbf{c}_{xy}^{(1)} = \frac{E\{\mathbf{x}(n)\,y(n)\}}{\sqrt{E\{x^2(n)\}\,E\{y^2(n)\}}} = \frac{\mathbf{r}_{xy}}{\sigma_x \sigma_y} = \left[ c_{xy,0}^{(1)} \;\; c_{xy,1}^{(1)} \;\; \cdots \;\; c_{xy,L-1}^{(1)} \right]^T \qquad (2)$$

where E{·} denotes mathematical expectation, the bold x(n) in the numerator denotes the vector of the L most recent far-end samples x(n), ..., x(n−L+1), and c_xy,i^(1) is the cross-correlation coefficient between x(n−i) and y(n).
Specifically, the idea here is to compare

$$\xi^{(1)} = \left\| \mathbf{c}_{xy}^{(1)} \right\|_\infty = \max_i \left| c_{xy,i}^{(1)} \right|, \quad i = 0, 1, \ldots, L-1 \qquad (3)$$

to a threshold level, T. The decision rule is simply as follows: if ξ^(1) ≥ T, then double-talk is not present; if ξ^(1) < T, then double-talk is present.
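The statistic of equation (3) may be estimated from data as sketched below. The sketch assumes zero-mean signals and replaces each expectation with a time average over the whole record; the function name `xi_crosscorr`, the threshold value 0.9, and the test signals are hypothetical.

```python
import numpy as np

def xi_crosscorr(x, y, L):
    """Sample estimate of max_i |c_xy,i| over lags i = 0, ..., L-1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    sigma_x = np.sqrt(np.mean(x ** 2))
    sigma_y = np.sqrt(np.mean(y ** 2))
    c = np.empty(L)
    for i in range(L):
        # Estimate of E{x(n-i) y(n)}; dividing by N keeps |c[i]| <= 1
        r = np.dot(x[:N - i], y[i:]) / N
        c[i] = r / (sigma_x * sigma_y)
    return np.max(np.abs(c))

rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
y_clean = np.concatenate([np.zeros(2), x[:-2]])   # echo path: pure 2-sample delay
y_dt = y_clean + 2.0 * rng.standard_normal(5000)  # strong near-end signal added

T = 0.9                                           # hypothetical threshold
xi_clean = xi_crosscorr(x, y_clean, L=4)
xi_dt = xi_crosscorr(x, y_dt, L=4)
double_talk_clean = xi_clean < T
double_talk_dt = xi_dt < T
```

For this particular echo path the statistic is close to one without near-end speech and drops well below the threshold when near-end speech is added.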
Although the l_∞ norm is perhaps the most natural, other scalar metrics, such as, for example, l_1 or l_2, could alternatively be used to assess the cross-correlation coefficient vectors. However, there is a fundamental problem with this approach which is not linked to the type of metric used. The problem is that these cross-correlation coefficient vectors are not well normalized. Indeed, we can only say in general that ξ^(1) ≤ 1. Thus, if v(n)=0, that does not imply that ξ^(1)=1 or any other known value. We do not know the value of ξ^(1) in general. The amount of correlation will depend a great deal on the statistics of the signals and of the echo path. As a result, the best value of T will vary a great deal from one situation to another. Thus, there is no "natural" threshold level which can be associated with the variable ξ^(1) when v(n)=0.
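The dependence on the echo path can be made concrete with a short worked case. The following is an illustrative derivation only, under assumptions not stated above: x(n) is zero-mean and white with variance σ_x², there is no background noise, v(n)=0, and the echo is y(n) = Σ_j h_j x(n−j) for an echo path h of length L.

```latex
\begin{aligned}
r_{xy,i} &= E\{x(n-i)\,y(n)\} = \sigma_x^2\, h_i, \\
\sigma_y^2 &= E\{y^2(n)\} = \sigma_x^2 \sum_{j=0}^{L-1} h_j^2
            = \sigma_x^2\, \|\mathbf{h}\|_2^2, \\
c_{xy,i}^{(1)} &= \frac{r_{xy,i}}{\sigma_x \sigma_y}
            = \frac{h_i}{\|\mathbf{h}\|_2}, \\
\xi^{(1)} &= \max_i \left| c_{xy,i}^{(1)} \right|
            = \frac{\|\mathbf{h}\|_\infty}{\|\mathbf{h}\|_2}.
\end{aligned}
```

Under these assumptions, an echo path with a single nonzero tap gives ξ^(1) = 1, whereas an echo path with L equal taps gives ξ^(1) = 1/√L (0.25 for L = 16), even though no double-talk is present in either case.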
For these reasons, it would be desirable to provide a double-talk detection scheme which employs a detection statistic and method which overcome the above limitations of prior art techniques. In particular, note that the decision variable ξ used in double-talk detection should advantageously behave as follows:
1. If v(n)=0 (double-talk is not present), ξ ≥ T;
2. If v(n)≠0 (double-talk is present), ξ < T; and
3. ξ is insensitive to variations in the echo path.
Also note that the threshold T should advantageously be a constant, independent of the data. Moreover, it is desirable that the decisions are made without introducing delay (or at least minimizing the introduced delay) in the updating of the model filter, since delayed decisions will adversely affect the performance of the AEC.
In accordance with the principles of the present invention, it has been realized that double-talk detection may be advantageously performed based on a cross-correlation between the far-end signal (illustratively, signal x(n) in FIG. 1) and the return signal (illustratively, signal y(n) of FIG. 1) which is, in particular, normalized with use of a covariance (i.e., autocorrelation) matrix of the far-end signal. More particularly, in accordance with the present invention, a detection statistic is advantageously computed based on an estimate of a cross-correlation between the far-end signal and the return signal normalized by a covariance matrix of the far-end signal. In accordance with certain illustrative embodiments of the present invention, the estimate of the cross-correlation between the far-end signal and the return signal may be further normalized with use of either an estimate of a variance of the return signal or an estimate of a covariance matrix of the return signal. In some illustrative embodiments of the invention, one or more of these quantities may be advantageously estimated based on signal samples sampled over a predetermined time window. And in other illustrative embodiments of the present invention, the coefficients of the adaptive filter employed in the acoustic echo canceller itself may be advantageously employed to compute the detection statistic.
In comparison with prior art techniques, performing double-talk detection by estimating such a cross-correlation of the far-end signal and the return signal, normalized with use of a covariance matrix of the far-end signal in accordance with certain embodiments of the present invention, achieves a more proper normalization in that the resultant detection statistic will be equal to one when the near-end signal (i.e., the double-talk) is zero. Thus, a double-talk detection procedure formulated in accordance with the principles of the present invention (i.e., using a detection statistic ξ computed in accordance with these principles) can be advantageously designed to behave according to the beneficial properties listed above. That is, given a properly chosen threshold T, which may advantageously be a constant, independent of the data, it can be ensured that:
1. If v(n)=0 (double-talk is not present), ξ ≥ T;
2. If v(n)≠0 (double-talk is present), ξ < T; and
3. ξ is insensitive to variations in the echo path.
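These properties may be illustrated numerically. The sketch below assumes one plausible form of such a statistic, namely ξ = sqrt(r_xy^T R_xx^{-1} r_xy / σ_y²), in which the cross-correlation vector r_xy is normalized by the far-end covariance matrix R_xx and by the return-signal variance σ_y². This exact formula is not stated in the text above and is an assumption made for illustration, as are the function name `xi_normalized`, the sample-average estimators, and the test signals.

```python
import numpy as np

def xi_normalized(x, y, L):
    """Assumed statistic: sqrt(r_xy^T R_xx^{-1} r_xy / var_y), from sample averages."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    # Row n of X holds [x(n), x(n-1), ..., x(n-L+1)], with zeros before the start
    X = np.column_stack(
        [np.concatenate([np.zeros(i), x[:N - i]]) for i in range(L)]
    )
    R_xx = X.T @ X / N          # estimate of the far-end covariance matrix
    r_xy = X.T @ y / N          # estimate of the cross-correlation vector
    var_y = np.mean(y ** 2)     # estimate of the return-signal variance
    return np.sqrt(r_xy @ np.linalg.solve(R_xx, r_xy) / var_y)

rng = np.random.default_rng(2)
N, L = 4000, 4
x = rng.standard_normal(N)
h_a = np.array([1.0, 0.0, 0.0, 0.0])     # hypothetical echo path: single tap
h_b = np.array([0.1, 0.1, 0.1, 0.1])     # hypothetical echo path: equal taps
echo_a = np.convolve(x, h_a)[:N]
echo_b = np.convolve(x, h_b)[:N]
v = 0.2 * rng.standard_normal(N)         # near-end signal

xi_a = xi_normalized(x, echo_a, L)       # v(n) = 0, path a
xi_b = xi_normalized(x, echo_b, L)       # v(n) = 0, path b
xi_dt = xi_normalized(x, echo_b + v, L)  # v(n) != 0
```

In this noise-free sketch the statistic equals one for both echo paths when v(n)=0 and falls below one when v(n) is added, so that a constant threshold somewhat below one separates the two cases regardless of the echo path.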