To date, most speakerphones operate in a half-duplex mode; i.e., only one caller can be heard at a given instant. A half-duplex arrangement imparts an annoying chopping of speech when the near-end caller and the far-end caller attempt to speak at the same time. A full-duplex speakerphone, on the other hand, allows the near-end and far-end callers to talk simultaneously, thus avoiding the distracting interruption of speech.
Full-duplex speakerphones, however, suffer a problem due to the regenerative effects of acoustic echo paths that occur between the speaker and the microphone. The problem manifests itself as audible echo and possibly a howlback condition as the echo is retransmitted and re-amplified between the near-end and far-end speakerphones. An acoustic echo canceller (AEC) is commonly employed to eliminate these problems.
In general, an AEC is an adaptive filter which models the acoustic response of a room. A modeling component generates an estimate of the echo signal that will be formed by the room using an incoming signal from a far-end caller. The filter operates on an outgoing signal which includes a speech signal of the near-end caller and echo signals resulting from acoustic reflections of the incoming signal within the room. A "clean" signal, formed by subtracting the estimated echo signal from the outgoing signal, is then transmitted to the far-end caller. By comparing the "clean" signal to the incoming signal, an adaptation component of the AEC adapts the filter to more accurately approximate the room response.
An AEC performs a computationally demanding task. To model the room response, it is typical to have a 2000-tap filter which must compute a new output sample at the 8000 Hz sampling rate of a normal telephone channel. The AEC is generally implemented on some type of digital processor such as a microprocessor, a digital signal processor (DSP), a microcontroller or an application specific digital integrated circuit (digital ASIC).
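The size of this task can be estimated with simple arithmetic. The sketch below uses the tap count and sampling rate quoted above; the doubling for a coefficient update is an assumption about an LMS-style adaptation and is not a figure taken from the text.

```python
# Rough cost estimate for a time-domain FIR echo canceller.
TAPS = 2000             # filter length quoted in the text
SAMPLE_RATE_HZ = 8000   # telephone-channel sampling rate quoted in the text


def macs_per_second(taps, sample_rate_hz, passes=1):
    """Multiply-accumulates per second; `passes` counts sweeps over the taps per sample."""
    return taps * sample_rate_hz * passes


filter_only = macs_per_second(TAPS, SAMPLE_RATE_HZ)             # echo estimate only
# Assumption: adapting every coefficient costs one more sweep over the taps.
with_update = macs_per_second(TAPS, SAMPLE_RATE_HZ, passes=2)
echo_span_ms = 1000 * TAPS / SAMPLE_RATE_HZ                     # time span the taps cover

print(filter_only, with_update, echo_span_ms)
```

Under these assumptions the filter alone requires 16 million multiply-accumulates per second, and the 2000 taps span 250 ms of room response.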
The room response modeled by AECs is generally a linear model, consisting of a series of coefficients which represent the strength of the acoustic signal over a period of time. Practical AECs have used the finite impulse response (FIR) filter as the model. The coefficients of the FIR filter are usually adapted by a least mean squares (LMS) technique to match the room response. This is referred to as the time-domain LMS technique. Time-domain LMS has the advantage of operating without imposing any significant delay between accepting the outgoing signal from the near-end speakerphone, which contains the desired near-end speech and the undesired room echo, and generating the "clean" signal for subsequent transmission to the far-end speakerphone. However, this quick response time is obtained at the expense of a computationally demanding process. Moreover, convergence, i.e., the adaptation of the filter parameters until they adequately model the acoustic characteristics of the room, is very slow with the time-domain LMS approach because voice signals are so highly correlated.
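A minimal sketch of the time-domain LMS technique may make the per-sample structure concrete. The function name, the 4-tap "room", the step size `mu` and the white-noise far-end signal are all hypothetical values chosen for illustration; a practical AEC would use far more taps and would have to cope with near-end speech, which is omitted here.

```python
import random


def lms_echo_canceller(far_end, mic, taps, mu):
    """Time-domain LMS: returns the "clean" signal and the final coefficients."""
    w = [0.0] * taps         # FIR model of the room response
    history = [0.0] * taps   # most recent far-end samples, newest first
    clean = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, history))
        e = d - echo_estimate            # "clean" sample, also the error signal
        # LMS update: move each coefficient by mu * error * input sample.
        w = [wi + mu * e * xi for wi, xi in zip(w, history)]
        clean.append(e)
    return clean, w


# Toy data (assumed): a made-up 4-tap room and a white-noise far-end signal.
random.seed(0)
room = [0.5, -0.3, 0.2, 0.1]
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [sum(room[k] * far_end[n - k] for k in range(len(room)) if n - k >= 0)
       for n in range(len(far_end))]

clean, w = lms_echo_canceller(far_end, mic, taps=4, mu=0.05)
```

Each output sample is available as soon as its input sample arrives, which is the low-delay property noted above, but every sample costs two sweeps over all of the taps. The white-noise input used here is uncorrelated, so this toy case converges far more readily than real speech would.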
The most popular alternative to the time-domain LMS technique which exhibits improved convergence performance is a method known as subband filtering. Subband filtering divides the input signal into separate frequency bands for subsequent processing. This divide-and-conquer approach converges faster than the standard time-domain LMS method because there is less correlation between samples in each subband. However, the tradeoff is an increase in delay due to the necessary additional processing of a polyphase filter at the front end of the subband filter to compute the initial subband filter banks.
It is well known that dramatic computational savings can be realized by performing the computations of the FIR filter in the frequency domain instead of the time domain. See generally Clark et al., "A Unified Approach to Time- and Frequency-Domain Realization of FIR Adaptive Digital Filters," Vol. ASSP-31, No. 5, IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 1073-1083 (October 1983) and U.S. Pat. No. 4,807,173 to Sommen et al. Frequency-domain filtering employs the same basic approach as described above, except that the signals are processed in the frequency domain. Thus, a time-domain incoming (input) signal is sampled and converted to the frequency domain, using for example a particular implementation of the Discrete Fourier Transform (DFT) known as the fast Fourier transform (FFT). A frequency-domain model of the room response is used to generate a frequency-domain estimate of the expected echo, which is converted to the time domain and subtracted from a time-domain representation of the outgoing signal. The subtracted signal is 1) sent to the far-end caller and 2) used by the AEC as an error signal to adapt the frequency-domain model of the room response. The frequency-domain quantities, called DFT vectors, are complex vectors, each element of which corresponds to a frequency. The individual elements of each vector are commonly called bins. The basic approach just described suffers from the long delay needed to acquire a block of the input signal large enough to compute the necessary FFT. Such delays would result in noticeable periods of silence which would tend to be very distracting to the human listener.
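The filtering half of this approach can be sketched as follows. This is an illustrative overlap-save arrangement, not a method taken from the cited references: the helper names, block size and sample values are hypothetical, and a naive DFT stands in for the FFT so that the sketch needs nothing beyond the standard library. The sketch assumes the room response is no longer than one block and the input length is a multiple of the block size.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) DFT, standing in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def overlap_save(far_end, room, block):
    """Echo estimate computed one block at a time in the frequency domain."""
    n = 2 * block
    H = dft(list(room) + [0.0] * (n - len(room)))   # frequency-domain room model
    previous = [0.0] * block
    estimate = []
    for start in range(0, len(far_end), block):
        current = far_end[start:start + block]
        X = dft(previous + current)                 # DFT vector of the input
        y = dft([h * x for h, x in zip(H, X)], inverse=True)   # bin-by-bin product
        estimate.extend(v.real for v in y[block:])  # second half has no wrap-around
        previous = current
    return estimate


# Toy data (assumed values).
room = [0.5, -0.3, 0.2, 0.1]
far_end = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0, 0.25, -0.75, 1.0, 0.5]
fast = overlap_save(far_end, room, block=4)
direct = [sum(room[k] * far_end[i - k] for k in range(len(room)) if i - k >= 0)
          for i in range(len(far_end))]
```

No output can be produced until a whole block of input has been collected, so a single block long enough to cover the entire room response carries a correspondingly long delay.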
To minimize delay while achieving efficiency, a better implementation of a frequency-domain adaptive filter is to have a multiplicity of smaller blocks of DFT vectors, and to perform the filtering operation using these smaller blocks. The processing steps are essentially the same as in the non-blocked approach. However, the room response in this block frequency-domain adaptive filter is modeled using an array of smaller frequency-domain coefficient vectors to provide an estimate of the echo response. Each of the smaller vectors represents a shorter period of time and therefore approximates only a portion of the echo response. Furthermore, each vector approximates the portion of the echo response for a different window of time, such that the complete echo response is a composite of the partial echo responses. See generally, Asharif, M. R. et al., "Frequency Bin Adaptive Filtering (FBAF) Algorithm and Its Application to Acoustic Echo Cancelling," I.E.I.C.E. Transactions, Vol. E 74, No. 8, August 1991, pp. 2276-2282 and Soo, J. et al., "Multidelay Block Frequency Domain Adaptive Filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 2, February 1990, pp. 373-376.
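The composite nature of the block approach can be illustrated with a short sketch. It is written under assumptions and is not drawn from the cited papers: the function names, the 8-tap response, the block size of 4 and the naive DFT (in place of an FFT) are all hypothetical choices made to keep the example small, and the input length is assumed to be a multiple of the block size.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) DFT, standing in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def partitioned_estimate(far_end, room, block):
    """Echo estimate built from several short frequency-domain coefficient vectors."""
    n = 2 * block
    # Each partition models `block` taps, i.e. one window of time of the room response.
    parts = [room[i:i + block] for i in range(0, len(room), block)]
    H = [dft(p + [0.0] * (n - len(p))) for p in parts]
    X_history = [[0j] * n for _ in parts]    # input DFT vectors, newest first
    previous = [0.0] * block
    estimate = []
    for start in range(0, len(far_end), block):
        current = far_end[start:start + block]
        X_history = [dft(previous + current)] + X_history[:-1]
        # Partition p is applied to the input from p blocks ago; sum the partial echoes.
        Y = [sum(H[p][k] * X_history[p][k] for p in range(len(parts)))
             for k in range(n)]
        y = dft(Y, inverse=True)
        estimate.extend(v.real for v in y[block:])
        previous = current
    return estimate


# Toy data (assumed values): an 8-tap response handled as two 4-tap partitions.
room = [0.5, -0.3, 0.2, 0.1, 0.05, -0.04, 0.02, 0.01]
far_end = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0,
           0.25, -0.75, 1.0, 0.5, -0.5, 2.5, 0.0, 1.0]
blocked = partitioned_estimate(far_end, room, block=4)
direct = [sum(room[k] * far_end[i - k] for k in range(len(room)) if i - k >= 0)
          for i in range(len(far_end))]
```

In this sketch the buffering delay is one block of 4 samples even though the modeled response is 8 taps long, which is the trade the block approach is meant to offer.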
Adaptation of the block frequency-domain coefficient vectors involves a correlation between the error signal and the input signal. The frequency-domain coefficient vectors are adjusted by addition of the resulting correlation vectors, so that the filter characteristics will move in the direction to minimize residual echo in the error signal.
To date, frequency-domain adaptation has proceeded in one of two ways: constrained and unconstrained. A constrained adaptation involves additional processing of the N frequency-domain correlation terms. In the constraint operation, the frequency-domain correlation terms are transformed into N corresponding time-domain terms. The last N/2 time-domain terms are set to zero to eliminate the circular component. These constrained time-domain terms are then transformed back to the frequency domain. An unconstrained adaptation omits this operation and applies the frequency-domain correlation terms directly.
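The constraint operation maps directly onto a few lines of code. The following is a minimal sketch with hypothetical names and values, again using a naive DFT in place of an FFT; the two transforms inside `constrain` are the additional per-vector cost of constrained adaptation.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) DFT, standing in for an FFT to keep the sketch dependency-free."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def constrain(correlation):
    """Constraint: inverse DFT, zero the last N/2 time-domain terms, DFT back."""
    n = len(correlation)
    time_terms = dft(correlation, inverse=True)
    time_terms[n // 2:] = [0.0] * (n - n // 2)
    return dft(time_terms)


# Toy check (assumed values): N = 8 correlation terms built from a known time vector.
terms = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
constrained = constrain(dft(terms))
back = [v.real for v in dft(constrained, inverse=True)]
```

Transforming the constrained vector back to the time domain gives the first four terms unchanged and zeros in the last four.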
From a theoretical point of view, it is preferred that the frequency-domain coefficient vectors be adapted in a constrained manner. The reason is that unconstrained adaptation results in a build-up of a "circular convolution" component in the frequency-domain coefficients, which corrupts the coefficients. Constraining the time-domain correlation terms yields a linear convolution, thus avoiding circular convolution altogether. Constrained adaptation, however, involves two additional DFTs per coefficient vector, and therefore imposes additional computational burdens on the AEC. On the other hand, the unconstrained technique converges at a slower rate and, in the steady state, converges to less accurate coefficient vectors, resulting in a less accurate model of the room response. Moreover, with the unconstrained approach, the circular convolution component may be large enough to produce distracting audible artifacts.
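The effect of a circular component can be seen in a small time-domain example. Multiplying two length-N DFT vectors bin by bin corresponds to a circular convolution of the underlying time-domain sequences, so the sketch below computes that circular convolution directly. The buffer contents and the single "leaked" coefficient are hypothetical values chosen for illustration.

```python
def circular_convolve(h, x):
    """Length-N circular convolution, the time-domain equivalent of a bin-by-bin DFT product."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(n)) for i in range(n)]


N = 8
x = [1, 2, 3, 4, 5, 6, 7, 8]   # previous block followed by the current block

# Constrained coefficients: the last N/2 time-domain terms are zero.
h_constrained = [1, 1, 0, 0, 0, 0, 0, 0]
# Unconstrained coefficients (assumed): one nonzero term has built up in the last half.
h_unconstrained = [1, 1, 0, 0, 0, 0, 0, 1]

good_tail = circular_convolve(h_constrained, x)[N // 2:]
bad_tail = circular_convolve(h_unconstrained, x)[N // 2:]

print(good_tail)   # x[i] + x[i-1] for i = 4..7
print(bad_tail)    # same sums plus samples pulled in through the modular index
```

With the constrained coefficients, the second half of the output equals the linear convolution, [9, 11, 13, 15]. With the leaked term it becomes [15, 18, 21, 16]; the extra contributions enter through the modular index and do not belong to the linear convolution of the intended two-tap response.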
A further consideration is the fact that the echoes resulting from reflections off the walls, the furniture and other objects in the room exhibit a decaying response. While the echo signal decays over time, the strength of the noise component remains substantially undiminished, so that the noise increasingly masks the later portions of the echo. Left uncompensated, this masking effect will tend to destabilize the filter, thus slowing the rate of convergence and degrading the long-term accuracy of the filter.
An approximation to a constrained adaptation approach is described in U.S. Pat. No. 4,807,173 to Sommen et al., which relies on the fact that multiplication by a window function in the time domain is equivalent to convolution with the transform of that window function in the frequency domain. Sommen et al. define a specialized time-domain window function to approximate the effect of a DFT-based constraint operation such that the frequency-domain convolution operation reduces to three multiplication operations. Sommen et al. do not disclose a method which addresses the masking effect of the additive noise due to the presence of a decaying echo response.
An unconstrained adaptive filter is described in U.S. Pat. No. 5,117,418 to Chaffee et al. However, Chaffee et al. discuss the analogous situation of cancellation of echoes originating from the imperfections found in the equipment located at the local telephone switching office. The technique is commonly known as line echo cancellation (LEC).
An unconstrained adaptive filter is advocated in an article by J. M. P. Borrallo et al., "On the Implementation of the Partitioned Block Frequency Domain Adaptive Filter (PBFDAF) for Long Acoustic Echo Cancellation," Vol 27, No. 3, Signal Processing, pages 301-315 (June 1992). Borrallo et al. teach that the unconstrained approach is computationally efficient, and that under some favorable conditions, the approach converges to the Wiener solution. However, given that speakerphones are used in a wide variety of operating environments, it cannot be assumed that the favorable conditions anticipated by Borrallo et al. will be present in any particular situation. The paper also addresses the destabilizing effect of the additive noise masking the echo signal as it decays over time, and describes a progressive attenuation method applied during the cross-correlation step as a way of ensuring filter stability.
It is an object of the present invention to provide an efficient system and method of echo cancellation for use in a full-duplex speakerphone, which exhibits a fast convergence and is less sensitive to noise.
It is yet another object of the present invention to provide a system and method of echo cancellation which can be performed with minimal computational overhead.