Echo cancellation is an important element in a variety of applications. In general, echo cancellation is the digital cancellation of electrical and acoustic echoes such that the echoes are attenuated or eliminated. Echo cancellation is essential in applications such as communications systems, where it is used to improve sound quality. Echo cancellation is used to overcome several different types of echoes, including hybrid echoes, caused by an impedance mismatch along an electrical line (such as a telephone line), and acoustic echoes, caused by acoustic coupling of sound from a loudspeaker to a microphone. These types of echoes appear in several different technologies, such as wireless telephony, hands-free telephony, teleconferencing systems, Internet telephony, and speech recognition systems. By using echo cancellation, the sound quality and usefulness of these and many other technologies are improved.
One type of echo cancellation is acoustic echo cancellation, which is used to cancel out the echoes of acoustic sound waves. Typically, these echoes are formed when sounds emitted by one or more loudspeakers are picked up by one or more microphones. Acoustic echoes can be quite noticeable and even annoying to a user.
In general, acoustic echo cancellation works by obtaining one or more playback signals, each sent to a corresponding loudspeaker, and subtracting an estimate of the echo produced by each playback signal from the one or more microphone signals. More specifically, the playback signals through this echo loop are transformed and delayed, background noise and possibly near-end speech are added at the microphone, and a subtraction process for the echo cancellation is used. The signal obtained after subtraction is called the error signal, and the goal is to minimize the error signal when no near-end speech is present in the microphone signal.
The heart of an acoustic echo cancellation system is adaptive filtering. In general, an adaptive filter is used to identify or “learn” a transfer function of the room that contains the loudspeakers and microphones. This transfer function will depend a great deal on the physical characteristics of the room environment. The adaptive filter works by taking the playback signal sent to the speakers and adjusting, in a recursive manner, coefficients that represent an impulse response of the room. The error signal, obtained by subtracting the estimated echo from the actual microphone signal, is used to change the filter coefficients such that the error is minimized.
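The recursive coefficient adjustment described above can be sketched as a complex least-mean-squares (LMS) update. This is an illustrative sketch only, not the patent's implementation; the function name, the step size `mu`, and all variable names are assumptions.

```python
import numpy as np

def lms_update(weights, x_taps, mic_sample, mu=0.05):
    """One complex LMS step: estimate the echo, form the error,
    and nudge the filter coefficients to shrink that error.
    weights:    current adaptive filter coefficients W
    x_taps:     recent playback samples in the delay line
    mic_sample: microphone sample containing the actual echo"""
    echo_estimate = np.vdot(weights, x_taps)           # conj(W) . X
    error = mic_sample - echo_estimate                 # residual after subtraction
    weights = weights + mu * np.conj(error) * x_taps   # gradient step toward smaller error
    return weights, error
```

Run in a loop over frames, the weights converge toward the room's impulse response, so the error (and hence the residual echo) shrinks.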
Traditionally, the playback signals are each processed as a single stream of temporal samples, with a single delay line and a single filter. To improve upon this, the playback signal can be split into subbands and a plurality of adaptive filters can be run in parallel, one adaptive filter per subband. Changing the length of the adaptive filters in the different subbands depending on the echo length in that subband in order to reduce the computational complexity is discussed in a paper by A. Gilloire entitled “Experiments with Sub-band Acoustic Echo Cancellers for Teleconferencing” in 1987 International Conference on Acoustics, Speech, and Signal Processing, 1987, pp. 2141-2144. As that paper observed, the adaptive filters for the lower subbands can be made longer because the bass tends to reverberate longer, while in the upper subbands the filters can be shorter, saving CPU computation cycles. Thus, Gilloire's paper implied that longer adaptive filters in the lower subbands and shorter adaptive filters in the higher subbands can be used.
To cancel the echoes in a captured signal, each subband of the playback signal is stored in a digital delay line, where the delayed subband signals are separated into taps. At each tap, the playback signal is sampled. The number of taps of a filter describes the length of the digital delay line. For example, four taps means that the playback signal is sampled at the current frame, current frame-1, current frame-2, and current frame-3. Each of the delays is equal to the frame length (which can be, by way of example, approximately 16 milliseconds or 20 milliseconds). Thus, if the frame length is 16 ms, and there are four taps (or a 4-long adaptive filter), and if the adaptive filters are implemented using adaptive subband filtering in the frequency domain, the playback signal is examined at a current frame, the frame 16 ms earlier, the frame 32 ms earlier, and the frame 48 ms earlier than the current time.
Each sample is multiplied by the complex conjugate of a weight (called a tap weight, W), the products are summed, and the sum is subtracted from the microphone signal. Each tap weight is adjusted to minimize the output power. Minimizing the output power suppresses as much of the speaker signal as possible, thereby reducing echoes.
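The per-subband delay line and tap-weight arithmetic described above can be sketched as follows. The class name, the 4-tap length, and all identifiers are illustrative assumptions, not names from the text.

```python
from collections import deque
import numpy as np

NUM_TAPS = 4  # e.g. frames at the current time, -16 ms, -32 ms, -48 ms

class SubbandTapLine:
    """Digital delay line holding the last NUM_TAPS playback subband samples."""
    def __init__(self, num_taps=NUM_TAPS):
        # taps[0] is the current frame, taps[1] is the previous frame, etc.
        self.taps = deque([0j] * num_taps, maxlen=num_taps)

    def push(self, playback_sample):
        """Advance one frame: the oldest tap falls off the end."""
        self.taps.appendleft(playback_sample)

    def echo_estimate(self, weights):
        """Sum of conj(W_k) * X[n-k] over the taps."""
        return np.vdot(weights, np.array(self.taps))
```

The echo-cancelled subband sample is then the microphone sample minus `echo_estimate(weights)`.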
Acoustic echo cancellation was first used on monaural (or mono) systems. FIG. 1 illustrates a single channel, acoustic echo cancellation (AEC) system 100 used to process a mono, playback signal. A mono playback signal x 105 is copied into equal multi-channel signals and then played through a right speaker 110 and a left speaker 120. Echoes 130, 140 from each of the speakers 110, 120 are reflected off a wall 150 in a room and captured by a microphone 160. The microphone also captures desired speech 165 (such as from a teleconference participant) and background noise 170.
The echoes 130, 140, desired speech 165 and background noise 170 combine to construct a microphone signal y. The microphone signal y is processed by a first analysis filterbank 175 and the playback signal x is processed by a second analysis filterbank 180 such that signals x and y are transformed from the time domain into frequency domain signals X and Y, respectively. It is important to perform AEC in the frequency domain because the echoes in AEC are quite long and the adaptive filters converge faster and more reliably in the frequency domain than in the time domain. It should be noted that the analysis filterbanks 175, 180 can be implemented as any complex frequency domain transform such as a windowed (including the box window) fast Fourier transform (FFT) or, in an exemplary embodiment, a modulated complex lapped transform (MCLT).
The transformed X and Y signals are input to an AEC mono processor 185 that uses an adaptive filter to learn the transfer function of the room to minimize an error signal. The processed signal is sent to a synthesis filterbank 190 that transforms the echo-reduced, frequency domain signal containing near end speech back to the time domain. Note that the mono AEC processor in FIG. 1 only uses a single adaptive filter per subband.
FIG. 2 is a detailed block diagram of the mono AEC processor 185 shown in FIG. 1 for a single subband m and frame n. The mono AEC processor 185 contains a single adaptive filter 200 for each subband. An adaptive filter coefficient update 210 is used to update the coefficients of the subband adaptive filter 200. When the mono playback signal x is played to the speakers 110, 120, as shown in FIG. 1, the single adaptive filter 200 is used. In a typical embodiment, the adaptive filter uses a normalized least mean square (NLMS) algorithm with regularization. The NLMS algorithm with regularization is set forth in detail below.
When dividing one number by a second number, regularization is the process of adding a small value to the denominator (or subtracting one from a negative denominator) to ensure that the denominator never becomes zero, which in turn would cause the fraction to become infinite. An alternative way to regularize the fraction is to set the denominator equal to some threshold if the denominator is positive and less than the threshold. Likewise, if the denominator is negative but greater than the negative of the threshold, set it to the negative threshold.
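The two regularization strategies described above can be sketched as follows; the function names and the default constants are illustrative assumptions.

```python
def regularize_additive(denominator, epsilon=1e-6):
    """Keep the denominator away from zero by adding a small constant
    (or subtracting it when the denominator is negative)."""
    return denominator + epsilon if denominator >= 0 else denominator - epsilon

def regularize_threshold(denominator, threshold=1e-6):
    """Clamp a denominator whose magnitude is below the threshold
    to +threshold (if non-negative) or -threshold (if negative)."""
    if 0 <= denominator < threshold:
        return threshold
    if -threshold < denominator < 0:
        return -threshold
    return denominator
```

Either guard can be applied to the normalization term of the NLMS update before dividing.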
The single channel AEC system 100 shown in FIG. 1 is only for removing echo from a mono playback signal. One of the first papers to discuss extending AEC to stereo was a paper by M. Sondhi and D. Morgan entitled “Acoustic echo cancellation for stereophonic teleconferencing” in Proc. IEEE Workshop Appls. Signal Processing Audio Acoustics in 1991. However, while the NLMS algorithm works well for the mono AEC problem, NLMS performs poorly in the stereo (or other multi-channel) AEC problem. This is because NLMS does not consider the cross-channel correlation of the multi-channel playback signal, which significantly slows down the convergence of the adaptive filters.
Sondhi and Morgan suggested using recursive least squares (RLS) instead of NLMS to solve the stereo AEC problem. The RLS algorithm is an alternative algorithm for adjusting the parameters (or weights) of the adaptive filters. The reason RLS works better than NLMS is that RLS tends to decorrelate the playback channels. Since RLS recursively computes an estimate of the inverse of a correlation matrix of the input speaker data, it can learn the correlation between the speaker channels and quickly converge to the correct solution. Sondhi and Morgan, however, merely proposed potentially using the RLS algorithm instead of NLMS, but provided no detail.
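A minimal complex RLS update for a single subband might look like the sketch below. This is an illustration under stated assumptions, not the patent's algorithm; the variable names and the forgetting factor `lam` are assumptions.

```python
import numpy as np

def rls_update(w, P, x, d, lam=0.99):
    """One RLS step for the model d ~ conj(w) . x.
    P is the running estimate of the inverse input correlation matrix;
    tracking this inverse is what lets RLS decorrelate the channels."""
    Px = P @ x
    k = Px / (lam + np.vdot(x, Px))            # gain vector
    e = d - np.vdot(w, x)                      # a priori error
    w = w + k * np.conj(e)                     # coefficient update
    P = (P - np.outer(k, np.conj(Px))) / lam   # inverse-correlation update
    return w, P, e
```

For stereo AEC, `x` would stack the delay-line taps of both playback channels, so `P` captures the cross-channel correlation that NLMS ignores.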
FIG. 3 illustrates a stereo AEC system 300 used to process a stereo playback signal. The stereo AEC system 300 shown in FIG. 3 is a subband-based system, meaning that the speaker signal is split into a plurality of subbands and an adaptive filter is supplied for each subband. The adaptive filters are run in parallel. It should be noted that FIG. 3 illustrates the AEC system 300 having a stereo playback signal, although AEC systems can be designed that also work for multi-channel playback signals. In addition, a single microphone is illustrated in FIG. 3, but the AEC system 300 is easily extendible to multiple microphones.
Referring to FIG. 3, the stereo playback signal x is composed of two channels, a right stereo channel x(0) 302 and a left stereo channel x(1) 305, for the stereo playback case. For the multi-channel AEC case, the N channel playback case, the signal would be composed of channels x(0) to x(N−1). The playback signals 302 and 305 are converted to analog signals by a digital-to-analog converter (D/A) (not shown).
The multi-channel playback signal (which includes the stereo signal) can be created in several different ways. FIG. 4 illustrates the AEC system 300 of FIG. 3 used with a voice communications system, such as Microsoft® Windows Messenger or voice chat for internet gaming. In FIG. 4, a digital, far-end mono speech signal 400 arrives from a source. The speech signal 400 is mixed locally with some stereo audio sounds such as MusicLeft 410 and MusicRight 420, computer game sounds, or the computer's system sounds.
Alternatively, FIG. 5 illustrates the case where the multi-channel playback signal includes stereo music. In this case, the MusicLeft 410 is assigned to a multi-channel playback channel and the MusicRight 420 is assigned to another multi-channel playback channel. There is no far-end speech that is mixed with the multi-channel sound.
In another alternate case, FIG. 6 illustrates the case where the multi-channel signal includes mono speech. This situation shown in FIG. 6 may be used for a Microsoft® Windows messenger system. The mono speech 600 is copied to each of the playback channels, so the multi-channel playback signal is effectively monaural.
Referring back to FIG. 3, the playback signals 302, 305 next are played through a right speaker 310 and a left speaker 320, respectively. A first echo 330 and a second echo 340 are reflected off a wall 350 in a room (not shown) to produce echoes at the microphone 355. In the case of multiple microphones, a separate instance of the stereo AEC system 300 can process the signal captured from each microphone independently or one AEC algorithm could be processed on the mono output of a microphone array algorithm. In addition to the echo from the speakers, the audio signal that is captured by the microphone 355 is also composed of a desired speech 360 and background noise 365. The analog audio signal captured by the microphone 355 is converted into a digital microphone signal, y, by an analog-to-digital converter (A/D) (not shown).
Acoustic echo cancellation is often performed using adaptive subband filtering based on a frequency domain transform such as the windowed fast Fourier transform (FFT) or the modulated complex lapped transform (MCLT). A first filterbank 370 and a second filterbank 375 convert each of the stereo playback signals x(0) and x(1) from the time domain to the frequency domain signals X(0) and X(1), respectively. Likewise, a third analysis filterbank 380 converts the mono microphone signal y from the time domain to the frequency domain signal Y. The signals are processed by the stereo AEC processor 385 and the output Z is run through a synthesis filterbank 390. A time domain signal z with reduced echo then is output.
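As a rough illustration, a windowed-FFT analysis step (a simple stand-in for the MCLT, which the text names as an exemplary embodiment) could look like this; the function name and the choice of a Hann window are assumptions.

```python
import numpy as np

def analysis_frame(time_frame):
    """Transform one time-domain frame into complex subband samples
    using a Hann-windowed FFT (a stand-in for a full analysis filterbank)."""
    window = np.hanning(len(time_frame))
    return np.fft.rfft(time_frame * window)

# Each subband m of each frame n then feeds its own adaptive filter.
```

A matching synthesis step would apply the inverse transform and overlap-add the windowed frames back into a time-domain signal.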
FIG. 7 is a detailed block diagram of the stereo AEC processor 385 for a single subband shown in FIG. 3. The stereo AEC processor contains a first adaptive filter 700 for the first multi-channel playback signal X(0) and a second adaptive filter 710 for the second multi-channel playback signal X(1). Note that separate single channel filters are run in parallel on each subband independent of the other subbands. As described above with regard to FIG. 3, the playback signals X(0), X(1) are processed and an adaptive filter coefficient update 720 is used to update the coefficients of the single channel filters 700, 710. The frequency domain signal Z, with reduced echo, then is output.
However, one problem with the RLS algorithm for computing the adaptive filter weights is that it has a high computational complexity. This complexity is on the order of O(2N^2+6N) compared to O(2N) for the least mean squares (LMS), where N=C*L, C is the number of playback channels, and L is the adaptive filter length in the subband. Previously, this computational complexity of RLS prohibited its use in AEC in practical systems. A paper by B. Hatty entitled “Recursive Least Squares Algorithms using Multirate Systems for Cancellation of Acoustical Echoes” in 1990 International Conference on Acoustics, Speech, and Signal Processing, 3-6 Apr. 1990, vol. 2, pp. 1145-1148, was one of the first papers that discussed using a fast RLS (FRLS) for mono AEC. FRLS increases the speed and decreases the complexity of RLS by avoiding the use of a correlation matrix (or any other types of matrices). One problem, however, with FRLS is that it is quite unstable. As a result of this instability, the FRLS algorithm can quickly diverge. There have been several attempts to improve the stability of FRLS. However, to date, no one has come up with a satisfactory solution for the multi-channel AEC problem. Hatty, in an attempt to improve the stability of FRLS, proposed using a round robin scheme that periodically resets the entire FRLS algorithm in a band-by-band fashion. What Hatty did was to completely reinitialize a band by throwing away the entire state of the algorithm and restarting it from scratch.
The problem with this reset technique, however, is that the resetting caused echo leakthrough for the band being reset, due to the FRLS algorithm having to reconverge and relearn the transfer function of the room after each reset. In addition, the Hatty technique caused distortion on the playback signal due to the fact that at any given time at least a portion of the algorithm was being reset.
In 1995, J. Benesty et al., in a paper entitled “Adaptive Filtering Algorithms for Stereophonic Acoustic Echo Cancellation” in Proc. ICASSP'95, pp. 3099-3102, used fast RLS (FRLS) to try to solve the stereo AEC problem. However, the Benesty paper suggested using FRLS in the time domain instead of using adaptive subband filtering.
In another paper by J. Benesty, D. Morgan, and M. Sondhi entitled “A Better Understanding and an Improved Solution to the Problems of Stereophonic Acoustic Echo Cancellation” in Proc. ICASSP'97, pp. 303-306, an update was proposed. In the Benesty '97 paper, in order to decorrelate the left channel from the right channel (which were very similar), Benesty added a nonlinearity to both channels. In one implementation, Benesty added the positive portion of the nonlinearity to one channel and the inverse (or negative) portion of the nonlinearity to the other channel. This introduced nonlinearity forced the channels to be different enough that the adaptive filters could learn the individual paths. In this way, the channels were decorrelated, avoiding the non-uniqueness problem that arises from having to track both the far-end transfer functions (from the far-end person to the far-end stereo microphones) and the near-end transfer functions (from the near-end speakers to the near-end microphones).
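The decorrelating nonlinearity can be sketched as half-wave rectifiers of opposite sign mixed back into the two channels. This is an illustrative form only; the function name and the mixing constant `alpha` are assumptions, not values from the Benesty '97 paper.

```python
import numpy as np

def decorrelate(left, right, alpha=0.5):
    """Add the positive half-wave of one channel's signal to that channel
    and the negative half-wave to the other, making the channels differ
    enough for the adaptive filters to identify each echo path."""
    left_out = left + alpha * np.maximum(left, 0.0)     # positive portion
    right_out = right + alpha * np.minimum(right, 0.0)  # negative portion
    return left_out, right_out
```

The cost of this decorrelation is precisely the distortion discussed next: the rectified component is audible nonlinear distortion added to the playback signal.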
The problem with adding a nonlinearity to the signal (as is done in the Benesty '97 paper) is that adding any type of nonlinearity tends to distort the signal. Basically, adding a nonlinearity is adding distortion to the signal. Adding distortion, however, is undesirable if the AEC system is to work well with a system that involves music playback. Ideally, for music playback, the signal should be free of distortion so that the music is played back faithfully.
In the paper by P. Eneroth, S. Gay, T. Gansler, and J. Benesty entitled “A Real-Time Implementation of a Stereophonic Acoustic Echo Canceller” in IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 5, July 2001, pp. 513-523, a solution to the stereophonic AEC problem is proposed using FRLS in subbands and adding nonlinearities to the playback channels. This paper attempts to increase stability by running parallel structures of the FRLS algorithm, so that when one of the structures “blows up” or goes unstable, they can fall back on another, less-than-optimal structure. This implementation helps them reinitialize the algorithm.
In 1990, when the Hatty '90 paper proposed using FRLS for adaptive subband AEC processing, microprocessors were much slower than today's microprocessors. As a result, RLS was not a practical solution for the multi-channel AEC problem. With the significant increase in speed of modern microprocessors, however, RLS can now be used. Nevertheless, the RLS algorithm will become unstable and diverge if the correlation matrix of the multi-channel playback signal becomes singular.
Therefore, what is needed is an echo cancellation system and method that can be used for a multi-channel playback signal. In addition, what is needed is a multi-channel echo cancellation system and method that avoids the use of FRLS to prevent the system from becoming unstable. In addition, what is needed is a multi-channel echo cancellation system and method that avoids adding distortion to the playback signal. What is also needed is a multi-channel echo cancellation system and method that avoids and overcomes the problems of the RLS algorithm discussed above to effectively eliminate echo while retaining a faithful reproduction of the original signal.