In a hands-free telephone, the far end acoustic signal can cause undesired feedback. This feedback can be neutralized by an appropriate echo suppression device. One such device, known as an acoustic echo canceller, allows full duplex communication, but it requires significant computational resources and may not always provide enough echo attenuation. For example, under optimal conditions an acoustic echo canceller may provide a maximum echo reduction of 25 to 30 dB, whereas an optimal hands-free telephone conversation needs the echo level to be reduced by 40 to 45 dB. Therefore, acoustic echo cancellers in telecommunication devices are typically complemented with a so-called post-processor.
FIG. 1 shows a generic echo canceller arrangement with such a post-processor. The input signal d(k) is a combination of acoustical echo y(k), local speech s(k), and background noise n(k):
$$d(k) = y(k) + s(k) + n(k) \qquad (1)$$
The echo cancelled residue e(k), in FIG. 1, is composed of a residual echo ε(k), the local speech signal s(k), and the background noise n(k), where ε(k)=y(k)−ŷ(k). The post-processor further suppresses the residual echo level after the echo canceller. This is commonly realized by a non-linear action, such as loss insertion, center clipping, etc. That typically means attenuating the signal at the output of the echo canceller. However, together with the residual echo level, the other signal components at the output of the post-processor are also attenuated.
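As an illustration only, the two non-linear actions named above can be sketched in Python as follows. The function names, the threshold, and the loss value are hypothetical. Center clipping is assumed here to zero every sample whose magnitude falls below the threshold, although other variants exist. Both actions operate on the whole residue e(k), so local speech and background noise are affected along with the residual echo.

```python
def center_clip(e, threshold):
    """Zero samples whose magnitude is below threshold; pass the rest unchanged.

    Assumed variant of center clipping, for illustration only.
    """
    return [x if abs(x) >= threshold else 0.0 for x in e]


def insert_loss(e, loss_db):
    """Attenuate every sample of the residue by loss_db decibels."""
    gain = 10.0 ** (-loss_db / 20.0)
    return [gain * x for x in e]


# Hypothetical residue: low-level samples (echo or soft speech) and louder ones.
residue = [0.01, -0.5, 0.2, -0.03]
clipped = center_clip(residue, threshold=0.1)
attenuated = insert_loss(residue, loss_db=20.0)
```

Neither function can tell residual echo from soft local speech, which is the limitation described above.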
To avoid attenuating the local speech signal in the send path of the echo canceller, the operation of the post-processor may be controlled by a “voice activity detector” (VAD) that attempts to determine whether the local speaker is active or not. In the former case, the post-processor is not used and the echo residue is assumed to be masked by the local speech. In the latter case, the post-processor suppresses the residual echo to an acceptably low level. However, VAD-controlled post-processors are difficult to control, and give rise to artefacts such as chopping and clipping of local speech. Also, the noise component n(k) is not taken into account in the on/off decision of the post-processor, so the performance of such post-processors in noisy circumstances is rather poor: during local speech, the background noise passes through without attenuation; when the local speaker is not active, the background noise is suddenly shut off because it is suppressed along with the residual echo.
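The on/off behavior described above can be sketched as follows, under the assumption that the VAD decisions are already available as a list of booleans. The function name and the 40 dB loss are hypothetical, and no particular VAD algorithm is implied.

```python
def vad_gated_postprocessor(e, local_speech_active, loss_db=40.0):
    """Pass e(k) unchanged while the VAD reports local speech; otherwise insert loss.

    local_speech_active holds one boolean per sample (assumed to be given).
    """
    gain = 10.0 ** (-loss_db / 20.0)
    return [x if active else gain * x
            for x, active in zip(e, local_speech_active)]


# Constant-level background noise: it passes during "speech" and is cut otherwise.
noise_only = [0.1, 0.1, 0.1, 0.1]
flags = [True, True, False, False]
gated = vad_gated_postprocessor(noise_only, flags, loss_db=20.0)
```

With constant-level noise at the input, the output level drops abruptly when the flag changes. This is the noise modulation the paragraph refers to.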
An echo shaping technique was suggested by R. Martin and S. Gustafsson in “An Improved Echo Shaping Algorithm for Acoustic Echo Control,” Proceedings of European Signal Processing Conference-96, pp. 25–28, September 11–13, Trieste, Italy, 1996, (hereinafter, “Martin and Gustafsson”), the contents of which are incorporated herein by reference. Martin and Gustafsson suggest using an adaptive echo shaping filter placed in the echo canceller send path as a post-processor. This creates a “soft decision directed” residual echo suppressor that does not exhibit the “on/off” behavior found in classical residual echo suppressors. As a result, quality of speech (i.e. the observed distortions of local speech) has been found to be better than what can be achieved with classical post-processors. As an additional advantage, the proposed echo shaping filter may largely compensate for poor performance of the echo canceller. Therefore it has been suggested to design an echo controller consisting of a relatively low order echo canceller (typically, 20 coefficients) followed by the echo shaping filter.
FIG. 2 presents a block diagram of an echo canceller EC combined with an echo shaping filter H. The echo shaping technique employs two low order finite impulse response (FIR) filters: background filter H1 is an adaptive filter that is updated in the background, and its contents are copied into the postfilter H, which filters the echo canceller EC residue e(k). The updates to background filter H1 have to be controlled so that frequencies of e(k) are attenuated only where the echo residue ε(k) has more power than the local speech s(k). Thus, echo shaping filter H has to attenuate the echo residue ε(k) at those frequencies where it is particularly audible, while at the same time the distortion of the local speech s(k) must be kept at an acceptable level. Therefore the key issue of the echo shaping technique is how the background filter H1 is updated.
Background filter H1 is a low-order (typically 20 coefficients) FIR filter that is updated by adaptation following a normalized least-mean square (NLMS) algorithm. The reference signal z(k) of the background filter H1 is synthesized as a combination of the microphone signal d(k) and the echo canceller EC residue e(k) as follows:
$$z(k) = \alpha(k)\,d(k) + \bigl(1 - \alpha(k)\bigr)\,e(k), \qquad (2)$$
where α(k) is a time varying non-negative control factor that is determined by an “adaptive control” mechanism.
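A minimal sketch of equation (2), together with one generic NLMS coefficient update, is given below for illustration. The function names, the step size mu, and the regularization constant eps are hypothetical. The exact wiring of background filter H1 in FIG. 2 is not reproduced here, so nlms_step is a textbook NLMS update for an arbitrary input vector and target sample, not the specific update used for H1.

```python
def synthesize_reference(d, e, alpha):
    """Equation (2): z(k) = alpha(k)*d(k) + (1 - alpha(k))*e(k), per sample."""
    return [a * dk + (1.0 - a) * ek for dk, ek, a in zip(d, e, alpha)]


def nlms_step(w, x, desired, mu=0.5, eps=1e-8):
    """One generic NLMS update (illustrative, not the H1-specific wiring).

    w       : current FIR coefficients
    x       : the len(w) most recent input samples, newest first
    desired : target sample for this time step
    Returns (updated coefficients, a-priori error).
    """
    y = sum(wi * xi for wi, xi in zip(w, x))      # filter output
    err = desired - y                             # a-priori error
    norm = sum(xi * xi for xi in x) + eps         # input energy, regularized
    new_w = [wi + mu * err * xi / norm for wi, xi in zip(w, x)]
    return new_w, err


# alpha = 0 reproduces e(k); alpha = 1 reproduces d(k).
z = synthesize_reference([1.0, 2.0], [0.5, 1.0], [0.0, 1.0])
```

Dividing by the input energy makes the effective step size independent of the signal level. The convergence coefficient discussed later in this section corresponds to mu in this sketch.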
Since e(k)=d(k)−ŷ(k), (2) can also be written as
$$z(k) = e(k) + \alpha(k)\,\hat{y}(k) \qquad (3)$$
Thus by changing the control factor α(k), the contribution of an estimate of the echo in the synthesized signal z(k) can be controlled. (In contrast to the echo canceller EC, the adaptation of the echo shaping filter H does not require an echo estimate that is exact in amplitude and phase. Because of the adaptive control mechanism for the control factor α(k), a rough estimate of the energy in the echo is sufficient.) When α(k)=0, z(k)=e(k) and the NLMS algorithm will adapt background filter H1 such that it changes into an all-pass filter. Thus, echo shaping filter H will have no influence on the echo canceller EC residue. This is the preferred case where only the local speaker is active, or where both speakers are active but the echo canceller EC has already achieved a significant reduction of the echo. By increasing the control factor α(k), the relative contribution of the echo to z(k) is increased. This also implies that the relative contribution of the echo is increased in the background filter error signal eh(k). Since the NLMS algorithm will adapt the background filter H1 so that it attempts to strongly attenuate this error contribution, echo shaping filter H also will strongly attenuate the residual echo in e(k).
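For reference, the step from (2) to (3) can be written out by substituting d(k) = e(k) + ŷ(k) into (2). Only the symbols already defined in the text are used:

```latex
\begin{align*}
z(k) &= \alpha(k)\,d(k) + \bigl(1-\alpha(k)\bigr)\,e(k) \\
     &= \alpha(k)\,\bigl(e(k)+\hat{y}(k)\bigr) + \bigl(1-\alpha(k)\bigr)\,e(k) \\
     &= e(k) + \alpha(k)\,\hat{y}(k)
\end{align*}
```

The terms in α(k)e(k) cancel, leaving the residue e(k) plus the echo estimate ŷ(k) weighted by α(k).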
Clearly, a key aspect is the control algorithm for the control factor α(k). During single far talk, α(k) should be as high as possible, whereas when only the local speaker is active, it should be close to zero. During double talk, an appropriate value for α(k) must be used so that attenuation of the local speech is avoided while at the same time the echo residue is attenuated at frequencies where it is not masked by local speech.
Martin and Gustafsson proposed two control algorithms for the control factor α(k). The first one, which will be referred to as MG1, was designed to explicitly account for the degree of echo attenuation already achieved by the echo canceller EC in order to avoid unnecessary local speech level modulations. It turned out, however, that this MG1 control algorithm is very sensitive to estimation errors. That is because good estimates of the echo attenuation achieved by the echo canceller EC are not easily obtained—especially during double talk. Therefore, the MG1 algorithm is not practically relevant.
The second control algorithm, which will be referred to as MG2, calculates the control factor α(k) as the ratio of the momentary power of the estimated echo to the momentary power of the echo canceller EC residue. While this algorithm is very simple to implement, it has an important drawback: α(k) tends to fluctuate very strongly and over a large range (>1e5). Theoretically, the control factor α(k) is only limited to being non-negative, so at first sight this does not seem to be very constraining. However, in practice it has been observed that an upper limit should be placed on α(k) for reasons of stability. Furthermore, due to the large fluctuations in the contributions to z(k), and hence to eh(k), the NLMS algorithm has to “work very hard” to adapt background filter H1 to the continually changing conditions. Although the background filter H1 is rather short, it has been observed that the NLMS algorithm must be run with a rather large convergence coefficient in order to achieve the necessary convergence speed. This also gives rise to many instabilities. Finally, the proposed MG2 control algorithm tends to be fairly aggressive, and often attenuates low-level local speech (e.g. it chops soft speech onsets, etc.). Another consequence of being so aggressive is that the MG2 algorithm does not work well in the presence of significant background noise, where it gives rise to annoying modulations similar in character to the switching modulations of a classical suppressor.
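The MG2 rule described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Martin and Gustafsson: the momentary power is assumed to be estimated by first-order recursive smoothing of the squared samples, and the names smoothing, alpha_max, and floor are hypothetical.

```python
def mg2_control_factor(y_hat, e, smoothing=0.9, alpha_max=None, floor=1e-12):
    """alpha(k) = (momentary power of y_hat) / (momentary power of e).

    Power is tracked by recursive smoothing (an assumption of this sketch).
    alpha_max, if given, is the upper limit suggested for stability.
    floor guards against division by zero when the residue is silent.
    """
    p_y = 0.0
    p_e = 0.0
    alphas = []
    for yk, ek in zip(y_hat, e):
        p_y = smoothing * p_y + (1.0 - smoothing) * yk * yk
        p_e = smoothing * p_e + (1.0 - smoothing) * ek * ek
        alpha = p_y / max(p_e, floor)
        if alpha_max is not None:
            alpha = min(alpha, alpha_max)
        alphas.append(alpha)
    return alphas


# A strong echo estimate against a small residue drives alpha far above 1e5.
unbounded = mg2_control_factor([1000.0], [1.0])
bounded = mg2_control_factor([1000.0], [1.0], alpha_max=100.0)
```

Because the denominator is the residue power, a well-converged echo canceller makes α(k) larger rather than smaller. This is consistent with the wide range and aggressive behavior noted in the text.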
Thus, the problem with the MG2 algorithm is that it can be far too aggressive. This can be illustrated by plotting the attenuation characteristic of the echo shaping filter using different control algorithms, and for different levels of echo cancellation (the so-called ERLE—echo return loss enhancement) achieved by the echo canceller EC.
FIG. 3 presents the attenuation characteristic of the echo shaping filter H for control algorithm MG1. The attenuation characteristic has been plotted as a function of the parameter ρ(ω), where ρ(ω) denotes the ratio of the local speech plus background noise to the echo:
$$\rho(\omega) = \frac{R_{ss}(\omega) + R_{nn}(\omega)}{R_{yy}(\omega)}, \qquad (4)$$
where Rss(ω) is the auto-power spectral density of the local speech signal, and Rnn(ω) and Ryy(ω) are the corresponding densities of the background noise and the echo. As shown in FIG. 3, the MG1 algorithm realizes near-optimal behavior: as soon as the local speech (plus background noise) level is lower than the echo level (i.e. ρ(ω)<1), the attenuation achieved by the echo shaping filter H increases. This compensates for the fact that in such a case the local speech would not efficiently mask the residual echo. Also, the lower the ERLE, the more attenuation is achieved by the echo shaping filter H, thus compensating for the shortcomings of the echo canceller EC.
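Equation (4) can be evaluated per frequency bin as sketched below. The function names and the floor constant are hypothetical, and the power spectral densities are assumed to be given as equal-length lists with one value per bin.

```python
def speech_plus_noise_to_echo_ratio(r_ss, r_nn, r_yy, floor=1e-12):
    """Equation (4): rho = (R_ss + R_nn) / R_yy, evaluated bin by bin.

    floor avoids division by zero in bins with no echo power.
    """
    return [(s + n) / max(y, floor) for s, n, y in zip(r_ss, r_nn, r_yy)]


def unmasked_bins(rho):
    """Indices of bins with rho < 1, where the echo exceeds speech plus noise."""
    return [i for i, r in enumerate(rho) if r < 1.0]


# Hypothetical densities for two bins.
rho = speech_plus_noise_to_echo_ratio([1.0, 0.5], [1.0, 0.5], [1.0, 4.0])
needs_attenuation = unmasked_bins(rho)
```

In this example the second bin has ρ(ω)<1, so it is the bin in which the local speech would not mask the residual echo.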
Unfortunately, as discussed above, it is not possible to achieve the predicted behavior for the MG1 algorithm in practice.
The attenuation characteristic of the echo shaping filter H when its updates are controlled by the MG2 algorithm is shown in FIG. 4. This shows that MG2 is a rather brute force solution that gives rise to very high attenuation in many conditions. Also, the attenuation curves start to rise when ρ(ω)>1, showing why low level local speech is sometimes chopped as well. Moreover, the attenuation achieved by the echo shaping filter H increases together with increasing ERLE, which is rather undesirable. NOTE: The attenuation characteristic presented here does not resemble the one presented by Martin and Gustafsson, which, however, has been found to be incorrect. Therefore some of the conclusions drawn by Martin and Gustafsson with respect to the MG2 algorithm are not correct either.
Thus, two deficiencies of the Martin and Gustafsson echo shaping approach have been observed: (1) the proposed algorithm is not always stable, and (2) the proposed algorithm still gives rise to annoying noise modulations. In order to cope with (2), Martin and Gardner proposed using a comfort noise generator (CNG). The CNG is run at the output of the post-processor, and adds noise to the post-processor output during local speech pauses so that the observed background noise level (and ideally, the noise spectrum) is the same during both local speech and local speech pauses. This approach has two drawbacks: the complexity of the combined echo shaping filter and CNG increases (especially if the background noise is synthesized accurately), and the operation of the CNG is again VAD-driven, so that artefacts due to mistakes made by the VAD must again be anticipated.