“Acoustic echo” arises from the acoustic properties of a room, as in a teleconferencing system. Acoustic echo originating at a near end location will be heard by a far end observer. Acoustic echo originating at a far end will be heard at the near end. An example with acoustic echo originating at the near end, while the far end transducers are acoustically isolated, is shown in FIG. 1. From that figure it can be seen that acoustic echo will be heard at the far end of a conferencing system when the far end talker's voice gets acoustically coupled with the near end talker's microphone in the near end room. This coupling is almost always unavoidable. However, the strength of the coupling is affected by the designer's setup of the room and any acoustic treatment applied to the room.
“Line echo” originates from the physical transmission of a signal between the near and far ends through the Public Switched Telephone system. From FIGS. 2a, 2b, and 2c it is apparent that the line echo arises when the Telephone Company transmits and receives two signals over a single wire while not accurately matching an internal balance impedance. The type of transmission system that allows transmit and receive signals to affect each other in this manner is referred to as a “two-wire” system. If the transmit and receive signals are isolated from each other, it is called a “four-wire” system. The line echo is said to arise at the interface point of a “two-wire” and a “four-wire” system. It is possible for a signal to go through multiple two-wire to four-wire interfaces on its trek between near and far ends.
A main factor affecting the severity of both types of echo is the amount of time before the talker hears his own voice back as an echo. The longer the delay the more perceptible the echo. For line echo the delay will increase as the phone company digitally processes a signal or as it is physically routed over longer distances. For acoustic echo both the acoustic path delay and transmission delay contribute to the delay of the echo.
Both types of echo problems (line and acoustic) are conventionally addressed using an echo cancellation system in generally the same way. However, the specific differences in the sources of echo require differences in their solutions.
In both types of echo cancellers an adaptive filter is typically used to create a model of the echo path. The model is formed by sending a reference signal through an estimated model of the echo path at the same time it is sent through the actual echo path and forming an error between the two outputs. This error is used to adapt the filter until the model becomes accurate (the error is minimized). When the model becomes accurate the echo is also minimized. The echo estimate has to be rather accurate to have a significant effect. If the model is poor the echo can actually be enhanced instead of attenuated.
The adaptive model can be formed off line when no one is talking by sending a random noise sequence through the path being modeled as well as through the adaptive filter. A gradient search can be performed on the received signal to minimize the mean squared error difference between the actual and modeled paths. When the path changes significantly this “training” process can be repeated. This process is time consuming and very annoying to those located at the near end. An acoustic path changes more often than an electrical path so this “off line” training method is used for line echo cancellers a little more often than acoustic ones.
A conventional method of forming the adaptive model is to use the actual speech signal being transmitted to the near end location as the training signal. This is much more desirable in an echo cancellation system because of its non-intrusive nature. However, great care must be taken so that the on-line training is done only when noise (which includes near end speech) is minimal. Any excessive noise will cause the filter model being formed to diverge from the actual echo path, reducing the effectiveness of the echo canceller. The primary cause of noise at the near end is speech originating at the near end. Most echo cancellers employ a “double-talk” detector to determine whether it is safe to adapt the filter model. Examples of “double talk” detectors are described in U.S. Pat. Nos. 5,535,194, entitled “Method and Apparatus for Echo Cancelling with Double Talk Immunity”, issued Jul. 9, 1996; and 5,295,136, entitled “Method of Performing Convergence in a Least Mean Square Adaptive Filter Echo Canceller”, issued Mar. 15, 1994; both assigned to Motorola, Inc., the disclosures of which are incorporated herein by this reference.
The number of filter weights (also called taps or filter coefficients) is directly related to the amount of delay that can be modeled by the adaptive filter. The amount of delay that can be handled by the adaptive filter is referred to as the echo canceller's “tail length”. An acoustic echo canceller is located at the acoustic source of echo so that it will not have to model the transmission delay. The source of line echo is electrical in nature and usually requires a smaller tail length than an acoustic echo canceller. Line echo cancellers usually require a tail length on the order of 30 ms. An acoustic echo canceller requires a tail length anywhere from four to eight times the length of the line echo canceller (120 to 240 ms).
Complete echo removal in an actual system is unachievable so echo suppressors are placed after the adaptive filter to remove any remaining echo. Suppressors work by strategically attenuating a signal being sent to the far end when far end speech is present. Echo suppression is a non-linear process.
The adaptive filter is only capable of modeling a linear system. If suppressors or other non-linearities are in the path being modeled then poor performance will result.
Sub-Band Echo Cancellers
Sometimes the adaptive filter model is broken up into multiple frequency bands using a separate adaptive filter for each band. This sub-band approach is more complex but has two potential benefits and one drawback.
A sub-band echo canceller speeds up the training process when speech is used. Speech utterances have limited frequency content. If speech is being used to form the adaptive model then the adaptive filter will seek to minimize the error of the dominant frequency components used in the training signal. This results in the “learning time” for the adaptive filter being increased when training is done with speech as opposed to training with random noise. A sub-band echo canceller uses multiple adaptive filters, each operating on a limited band of frequencies. This requires each filter to only be accurate over a limited frequency range, reducing the training time.
A second benefit is that once the processing cycles have been spent for splitting the frequency into sub-bands each band may be processed at a lower sample rate. The lower sample rate makes it possible to increase the number of filter weights processed allowing an increase in the canceller's tail length.
The main drawback to the sub-band approach is an increase in the canceller's signal throughput time. The delay through a standard echo canceller is primarily due to anti-aliasing and reconstruction filters on the converters (about 3 ms to 7 ms). A sub-band system has the delay of the analysis and synthesis sections required to split the signal into sub-bands and sample rate convert each signal (about 30 ms to 40 ms). This is a noticeable time lag that enhances undesirable echo in systems with an otherwise low delay.
Adaptive Filtering
Adaptive filtering may be done in either the time or frequency domains. There are many adaptive weight update techniques in use today. Due to its minimal computational requirements, the most common adaptive filter weight update algorithm is the time domain Least Mean Square, or LMS, adaptive update algorithm. If the training signal used has a large dynamic range (such as speech) then a Normalized Least Mean Square (NLMS) algorithm is frequently used.
A comprehensive discussion of the NLMS algorithm may be found in Adaptive Filter Theory, 3rd Ed., by Simon Haykin, Prentice Hall, 1996 (pgs 432-439). A summary of the algorithm appears on page 437 of that reference.
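The adaptive modeling described above can be illustrated with a minimal NLMS sketch. It is not taken from the cited reference; it uses the commonly stated NLMS weight update, and the step size mu, the regularization constant eps, and the three-tap echo_path are hypothetical values chosen only for the demonstration. The reference signal is random noise and no near end signal is present, so the filter is free to converge.

```python
import random

def nlms_echo_canceller(x, d, num_taps, mu=0.5, eps=1e-6):
    """Run NLMS over reference x and microphone signal d.

    Returns (error signal, final weights). mu and eps are illustrative
    assumptions, not values from the source text.
    """
    w = [0.0] * num_taps      # adaptive model of the echo path
    buf = [0.0] * num_taps    # most recent reference samples, newest first
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        echo_estimate = sum(wi * bi for wi, bi in zip(w, buf))
        en = dn - echo_estimate                   # residual after cancellation
        power = sum(b * b for b in buf) + eps     # normalization term
        step = mu * en / power
        w = [wi + step * bi for wi, bi in zip(w, buf)]
        errors.append(en)
    return errors, w

# Demonstration with a hypothetical 3-tap echo path and a noise reference.
random.seed(0)
echo_path = [0.5, -0.3, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(echo_path[k] * x[n - k] for k in range(len(echo_path)) if n - k >= 0)
     for n in range(len(x))]
errors, w = nlms_echo_canceller(x, d, num_taps=len(echo_path))
residual_power = sum(e * e for e in errors[-100:]) / 100
```

Dividing the step by the reference power is what makes the update tolerant of a training signal with a large dynamic range, such as speech.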
Coefficient Leakage
Over time, the adaptive filter coefficients may slowly drift away from their adapted solution. To ensure long term stability for an adaptive filter, “coefficient leakage” is often used. In essence, a small percentage of the filter coefficient values are reduced or leaked out over time. A discussion of coefficient leakage may be found in Adaptive Filter Theory, 3rd Ed., by Simon Haykin, Prentice Hall, 1996 (pgs 746-747).
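One common way to realize coefficient leakage is the "leaky" LMS update, in which each weight is scaled slightly toward zero before the usual correction is added. The sketch below assumes that form; the function name and the leak and mu values are hypothetical. With no excitation and no error, the weights simply decay geometrically.

```python
def leaky_lms_update(w, buf, error, mu, leak):
    """One leaky LMS step: shrink each weight by a factor (1 - leak),
    then apply the ordinary LMS correction mu * error * x."""
    return [(1.0 - leak) * wi + mu * error * bi for wi, bi in zip(w, buf)]

# With a silent reference and zero error only the leakage term acts,
# so the weights decay by (1 - leak) per update.
w = [1.0, -0.5]
silent_buf = [0.0, 0.0]
for _ in range(1000):
    w = leaky_lms_update(w, silent_buf, error=0.0, mu=0.01, leak=0.001)
```

The leak factor trades long term stability against a small bias in the converged weights, so it is normally kept very small.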
Divergence
When an echo canceller tries to adapt in the presence of near end speech (double-talk) or other sporadic or impulsive noise, the weights of the adaptive filter diverge from their solution, resulting in an increase of echo and noise artifacts. Most echo cancellers rely on a double-talk detector to determine if it is “safe” to adapt the filter weights, and halt adaptation during double-talk to minimize the filter's divergence. Once it becomes sufficiently safe the filter re-adapts, removing any divergence that took place during the time it took to detect the initial double-talk. Such an approach prohibits the use of higher adaptation gains because of the delay in detecting the double-talk. While smaller adaptation gains keep divergence at a minimum until double-talk can be detected, they also slow the adaptation process, leaving periods of increased echo when the echo path modeled by the adaptive filter changes. This solution also requires the added complexity of a double-talk detector implementation.
There is at least one other major train of thought in the literature on minimizing divergence due to double-talk or other impulsive noise. This method relies on the adaptive filter to decorrelate the input signal Xn from the error signal En, a process that happens normally during filter adaptation. A cross-correlation between Xn and En is formed (a digital signal process). This cross-correlation is monitored and used as a metric in determining if the adaptive filter has sufficiently converged. When the filter is converged there is little correlation between Xn and En. There is much stronger correlation when the adaptive filter is a poor match for the echo path. Once the filter has converged, adaptation is stopped, thus removing the possibility of divergence because it is no longer adapting. When the echo path changes it is reflected in the cross-correlation metric, and adaptation is resumed. This method ignores the presence of double-talk during adaptation and seeks to minimize divergence by adapting only when necessary. If the acoustic path changes often (the case for many acoustic echo applications), or if frequent speech, sporadic or impulsive noise is present (the case when a classroom is the target application) this method provides little benefit. This method also carries an added complexity equivalent to that of the main filter convolution.
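The cross-correlation metric described above can be sketched as follows. This is an illustrative assumption rather than the method of any cited reference: the metric is taken to be the largest normalized cross-correlation between the reference x and the error e over a few lags, and the function name, lag count, and test signals are hypothetical.

```python
import math
import random

def xcorr_metric(x, e, num_lags):
    """Largest normalized |cross-correlation| between e[n] and x[n - lag]
    for lag = 0 .. num_lags - 1. Near 0 when e is unrelated to x."""
    energy_x = sum(v * v for v in x)
    energy_e = sum(v * v for v in e)
    if energy_x == 0.0 or energy_e == 0.0:
        return 0.0
    norm = math.sqrt(energy_x * energy_e)
    best = 0.0
    for lag in range(num_lags):
        c = sum(e[n] * x[n - lag] for n in range(lag, len(e)))
        best = max(best, abs(c) / norm)
    return best

rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
# Unconverged filter: the error still contains an echo of x (delayed one sample).
e_unconverged = [0.0] + [0.5 * v for v in x[:-1]]
# Converged filter: the error is only noise unrelated to x.
e_converged = [rng.gauss(0.0, 0.1) for _ in range(4000)]

metric_unconverged = xcorr_metric(x, e_unconverged, num_lags=4)
metric_converged = xcorr_metric(x, e_converged, num_lags=4)
```

Evaluating the correlation at every lag covered by the filter costs about as much as the filter convolution itself, which is the added complexity noted above.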
A description of double-talk detectors and the use of a cross-correlation metric described above are given in the paper “New Double-Talk Detector”, IEEE Transactions on Communications, Vol. 39, No. 11, November 1991. See also U.S. Pat. No. 5,206,854, entitled “Detecting Loss of Echo Cancellation”, issued Apr. 27, 1993 and assigned to AT&T, which uses a newer method.
There is one other significant method for avoiding the problem of divergence due to adapting the filter in the presence of noise. The method is outlined in the figure labeled as prior art (FIG. 5). A description is disclosed by Ochiai et al., “Echo Canceller with Two Path Models”, IEEE Transactions on Communications, Vol. COM-25, No. 6, June 1977, pp. 589-595. FIG. 5 shows a diagram illustrating the manner in which a digital echo canceller is generally used as part of a teleconferencing system. Ochiai et al. used an adaptive background filter running in parallel with a foreground filter. Each filter produces an estimate of the echo. When the adapted background filter provides an estimate that proves better than the foreground filter, its filter coefficients are copied to the foreground filter location. The echo-attenuated signal is taken from the foreground filter so that divergence due to noise is not heard.
There are a couple of drawbacks to the method they proposed. First, when the background filter's coefficients diverged due to the presence of noise, they would either need to adapt out the diverged signal or be reset to zero and start over in the adaptation process, producing a time delay before improvements could be made to the foreground filter. The other significant problem comes when making decisions to copy the background filter coefficients to the foreground filter. If the adaptation process uses a high gain/fast convergence algorithm, the coefficients will diverge quickly in the presence of noise, causing errors in determining when it is valid to update the foreground set. If a diverged set of coefficients is placed into the foreground filter, system performance is severely degraded.
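The foreground/background arrangement can be sketched in a few lines. This is a simplified illustration, not the decision logic published by Ochiai et al.: the copy test used here (background error power below a fixed fraction of foreground error power over one block) and the names margin, mu, and eps are assumptions made for the example.

```python
import random

def fir(w, buf):
    """Echo estimate from weights w and reference history buf."""
    return sum(wi * bi for wi, bi in zip(w, buf))

def two_path_block(fg, bg, buf, x_block, d_block, mu=0.5, eps=1e-6, margin=0.5):
    """Process one block. Only the background filter adapts (NLMS);
    its weights are copied to the foreground if it did clearly better."""
    out, fg_power, bg_power = [], 0.0, 0.0
    for xn, dn in zip(x_block, d_block):
        buf = [xn] + buf[:-1]
        e_fg = dn - fir(fg, buf)      # output sent to the far end
        e_bg = dn - fir(bg, buf)      # used only for adaptation
        out.append(e_fg)
        fg_power += e_fg * e_fg
        bg_power += e_bg * e_bg
        step = mu * e_bg / (sum(b * b for b in buf) + eps)
        bg = [wi + step * bi for wi, bi in zip(bg, buf)]
    if bg_power < margin * fg_power:  # assumed copy criterion
        fg = list(bg)
    return out, fg, bg, buf

random.seed(2)
echo_path = [0.5, -0.3, 0.1]          # hypothetical echo path
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(echo_path[k] * x[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(x))]
fg, bg, buf = [0.0] * 3, [0.0] * 3, [0.0] * 3
for start in range(0, len(x), 100):
    _, fg, bg, buf = two_path_block(fg, bg, buf,
                                    x[start:start + 100], d[start:start + 100])
```

Because the foreground weights change only on a copy, any divergence of the background filter stays inaudible unless the copy test is fooled, which is the second drawback described above.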
Computational Burden
To help put the computational burden issue into perspective, a discussion of the computational requirements is in order. The computational load (number of required clock cycles) for a Finite Impulse Response (FIR) filter and an LMS update on a Digital Signal Processor (DSP) such as the Motorola 56362 is on the order of 3*N, where N is the number of taps of the FIR filter. As stated previously, a key feature of an echo canceller is its tail length. The tail length is the amount of time that can be represented by the FIR filter being used to model the speaker to microphone acoustic path where the echo arises. The longer the tail length, the larger the acoustic delay that the filter can represent and the larger the room that the echo canceller can handle. The amount of time that can be represented by the FIR filter (Tail Length) is a function of both N and the sample rate at which the FIR filter is being processed. The following equation illustrates this:
Tail Length in seconds = N*T = N/Sample Rate (Hz)
Where: N is the number of filter taps and T is the period of the sample rate in seconds.
For example, at an 8 kHz sample rate a 2000 tap FIR filter would have a Tail Length of 250 ms. This 2000 tap filter would use 3*N = 6000 clock cycles per sample.
In the world of real-time processing the constrained resource is time, represented by the number of instruction cycles available for signal processing. The number of instruction cycles, M, available per sample on a processor such as the 56362 is as follows:
M (instruction cycles available for processing) = Processor Clock Speed/Sample Rate of processing
In a DSP system, the sample rate must be increased in order to increase bandwidth (see Nyquist Rate in DSP sampling theory). From the instruction cycle equation, an increase in sample rate by some factor L reduces the instruction cycles available for processing each sample, and therefore the number of taps N that can be processed, by a factor of L. The Tail Length equation shows that, for a given N, the same increase in sample rate also decreases the Tail Length by a factor of L. Taken together, when the sample rate is increased by a factor of L the Tail Length of the echo canceller must decrease by a factor of L squared in order to perform the needed processing (for a fixed processor clock speed). Another way of viewing this is that in order to increase the Sample Rate by a factor of L and still maintain the same Tail Length, the processor clock speed would need to increase by a factor of L squared.
Increasing the Bandwidth from 3 kHz (telephone quality) to 20 kHz (professional audio quality) would require an increase in Sample Rate by a factor of 6. For the example above, the 250 ms tail length would shrink by a factor of 6 squared (36), to about 7 ms, for the same number of processing clock cycles.
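The arithmetic above can be checked with a short calculation. The sketch assumes a hypothetical budget of 48 million filter cycles per second (the figure that makes the 2000 tap, 8 kHz example fit exactly) and takes the factor of 6 to mean a move from 8 kHz to 48 kHz sampling; neither number is stated in the text.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """M = Processor Clock Speed / Sample Rate (whole cycles)."""
    return clock_hz // sample_rate_hz

def max_tail_length_s(clock_hz, sample_rate_hz, cycles_per_tap=3):
    """Longest tail length that fits the cycle budget, assuming the
    FIR filter plus LMS update costs about 3 cycles per tap."""
    taps = cycles_per_sample(clock_hz, sample_rate_hz) // cycles_per_tap
    return taps / sample_rate_hz      # Tail Length = N / Sample Rate

CLOCK_HZ = 48_000_000                 # hypothetical cycle budget
tail_8k = max_tail_length_s(CLOCK_HZ, 8_000)     # 2000 taps at 8 kHz
tail_48k = max_tail_length_s(CLOCK_HZ, 48_000)   # 333 taps at 48 kHz
print(f"{tail_8k * 1000:.1f} ms at 8 kHz, {tail_48k * 1000:.1f} ms at 48 kHz")
```

The sixfold higher sample rate leaves one sixth of the cycles per sample and makes each tap span one sixth of the time, which together give the factor of 36.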