1. Field of the Invention
The present invention relates to an audio communication system and method with improved acoustic characteristics, and particularly to a video conferencing system including an improved audio echo cancellation system.
2. Description of the Related Art
In a background conferencing system set-up that uses loudspeakers, two or more communication units are placed at separate sites. A signal transmitted from one site to another site using a conference system experiences several delays, including a transmission delay and a processing delay. For a video conferencing system, the processing delay for video signals is considerably larger than the processing delay for the audio signals. Because the video and audio signals have to be presented simultaneously, in phase, a lip sync delay is purposefully introduced to the audio signal, in both the transmitting and receiving signal paths, to compensate for the longer video signal delay.
In a background conferencing system, one or more microphones capture a sound wave at a site A and transform the sound wave into a first audio signal. The first audio signal is transmitted to a site B, where a television set, or an amplifier and loudspeaker, reproduces the original sound wave by converting the first audio signal generated at site A back into a sound wave. The sound wave produced at site B is partially captured by the audio capturing system at site B, converted to a second audio signal, and transmitted back to the system at site A. This problem of having a sound wave captured at one site, transmitted to another site, and then transmitted back to the initial site is referred to as an acoustic echo. In its most severe manifestation, when the loop gain exceeds unity, the acoustic echo might cause a feedback sound. The acoustic echo also causes the participants at both sites A and B to hear themselves, making a conversation over the conferencing system difficult, particularly if there are delays in the system set-up, as is common in video conferencing systems, especially due to the above-mentioned lip sync delay. The acoustic echo problem is usually solved using an acoustic echo canceller, described below.
In more detail, FIG. 1 shows a background conferencing system set-up. For simplicity, FIG. 1 shows the conferencing system set-up distributed at two sites A and B. The two sites are connected through a transmission channel 1300 and each site has a loudspeaker 1100 and 1200, respectively, and a microphone 1111 and 1211, respectively. The arrows in FIG. 1 indicate the direction of propagation for an acoustic signal, usually from the microphone to the loudspeaker.
Further, FIG. 2 is an overall view of a video conferencing system. This system is distributed at two sites A and B. As with the conferencing system set-up, a video conferencing module can be distributed at more than two sites, and the system set-up is also functional when only one site has a loudspeaker. At site A, the video module has a video capturing system 2141 that captures a video image and a video subsystem 2150 that encodes the video image. In parallel, a sound wave is captured by an audio capturing system 2111, and an audio subsystem 2130 encodes the sound wave to the acoustic signal. Due to processing delays in the video encoding system, the control system 2160 introduces additional delays to the audio signal by use of a lip sync delay 2163 so as to achieve synchronization between the video and audio signals. The video and audio signals are mixed together in a multiplexer 2161, and the resulting audio-video signal is sent over the transmission channel 2300 to site B. Additional lip sync delay 2262 is inserted at site B. Further, the audio signal presented by the audio presenting device 2221 is materialized as a sound wave at site B. Part of the sound wave presented at site B arrives at the audio capturing device 2211 either as a direct sound wave or as a reflected sound wave. Capturing the sound at site B and transmitting this sound back to site A, together with the associated delays, forms the echo. All the delays described add up to a considerable total, and therefore the quality requirements for an echo canceller in the video conferencing system are particularly high.
Next, FIG. 3 shows an example of an acoustic echo canceller subsystem, which may be a part of the audio system in the video conferencing system of FIG. 2. At least one of the participant sites has the acoustic echo canceller subsystem to reduce the echo in the communication system. The acoustic echo canceller subsystem 3100 is a full band model of a digital acoustic echo canceller. A full band model processes a complete audio band (e.g., up to 20 kHz; for video conferencing the band is typically up to 7 kHz, in audio conferencing the band is up to 3.4 kHz) of the audio signals directly.
As already mentioned, compensation of acoustic echo is normally achieved by an acoustic echo canceller. The acoustic echo canceller is either a stand-alone device or an integrated part of the communication system.
The acoustic echo canceller transforms the acoustic signal transmitted from site A to site B, for example, using a linear or non-linear mathematical model, and then subtracts the mathematically modelled acoustic signal from the acoustic signal transmitted from site B to site A. In more detail, referring for example to the acoustic echo canceller subsystem 3100 at site B, the acoustic echo canceller passes the first acoustic signal 3131 from site A through the mathematical modeller of the acoustic system 3121, calculates an estimate 3133 of the echo signal, subtracts the estimated echo signal from the second audio signal 3132 captured at site B, and transmits the second audio signal 3135, less the estimated echo, back to site A. The echo canceller subsystem of FIG. 3 also uses an estimation error, i.e., a difference between the estimated echo and the actual echo, to update or adapt the mathematical model to background noise and to changes of the environment at the position where the sound is captured by the audio capturing device.
The model of the acoustic system 3121 used in most echo cancellers is an FIR (Finite Impulse Response) filter, approximating the transfer function of the direct sound and most of the reflections in the room. Mainly because of processing power constraints, the FIR filter will preferably not provide echo cancellation for an infinite time after the signal was presented by the loudspeaker. Instead, it is accepted that the echo after a given time, the so-called tail length, will not be cancelled, but will appear as a residual echo.
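The effect of a finite tail length can be illustrated with a minimal sketch. The room response, the number of modelled taps, and the function name fir_filter below are hypothetical values chosen only for illustration; they are not taken from the figures.

```python
def fir_filter(x, h):
    """Return y[n] = sum over k of h[k] * x[n - k], with zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


# Hypothetical room response: direct sound followed by decaying reflections.
room = [0.0, 0.6, 0.3, 0.15, 0.08, 0.04]

# The FIR model covers only the first 4 taps, i.e. a shortened tail length.
model = room[:4]

# Unit impulse presented by the loudspeaker (far-end signal).
far_end = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

echo = fir_filter(far_end, room)        # actual echo at the microphone
estimate = fir_filter(far_end, model)   # echo estimated by the truncated model
residual = [e - s for e, s in zip(echo, estimate)]

# Echo inside the modelled tail is cancelled; later reflections remain.
print(residual)
```

In this sketch the residual is zero for the samples covered by the model and equals the unmodelled reflections afterwards, which corresponds to the residual echo described above.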
To estimate the echo over the complete tail length, the FIR filter will need a length $L = F_s \cdot \text{tail length}$, where $F_s$ is the sampling frequency in Hz, and where the tail length is given in seconds.
The number of multiplications, and likewise the number of additions, required to calculate one single output sample of the filter equals the filter length, and the output of the filter should be calculated once per sample. That is, the total number of multiplications and additions per second is $F_s \cdot L = F_s \cdot F_s \cdot \text{tail length} = \text{tail length} \cdot F_s^2$.
A typical value for the tail length is 0.25 sec. The number of multiplications and additions per second will then be 16 million for an $F_s = 8$ kHz system, 64 million for 16 kHz, and 576 million for 48 kHz.
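The figures above follow directly from the relation between filter length, tail length, and sampling frequency. A minimal sketch of the arithmetic, in which the function name fir_ops_per_second is a hypothetical helper introduced only for this illustration:

```python
def fir_ops_per_second(fs_hz, tail_length_s):
    """Multiplications (and, equally, additions) per second for full band FIR filtering."""
    filter_length = fs_hz * tail_length_s   # L = Fs * tail length
    return fs_hz * filter_length            # one length-L output per sample


for fs in (8_000, 16_000, 48_000):
    ops = fir_ops_per_second(fs, 0.25)
    print(f"Fs = {fs // 1000} kHz: {ops / 1e6:.0f} million per second")
# Fs = 8 kHz: 16 million per second
# Fs = 16 kHz: 64 million per second
# Fs = 48 kHz: 576 million per second
```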
Similar calculations could be performed for the filter update algorithm. The simplest algorithm, LMS (Least Mean Square), has a complexity proportional to the filter length, which implies a processing power requirement proportional to $F_s^2$, while more complex algorithms have processing power proportional to the square of the filter length, which implies a processing power requirement proportional to $F_s^3$.
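As an illustration of why the LMS update costs a number of operations proportional to the filter length for every sample, the following is a minimal sketch of a textbook LMS adaptive FIR filter. It is not taken from the figures; the echo path, step size, and test signal are hypothetical values chosen only so that the example converges.

```python
import random


def lms_echo_canceller(far_end, mic, num_taps, step_size):
    """Adapt an FIR echo model with LMS; return (error signal, final weights)."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps          # recent far-end samples, newest first
    errors = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        # Filtering: num_taps multiplications and additions per sample.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate            # estimation error drives the adaptation
        # LMS update: again num_taps multiplications and additions per sample.
        weights = [w + step_size * error * h for w, h in zip(weights, history)]
        errors.append(error)
    return errors, weights


# Hypothetical echo path and noise-free test signal, for illustration only.
true_path = [0.5, -0.3, 0.1]
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [
    sum(true_path[k] * far_end[n - k] for k in range(len(true_path)) if n - k >= 0)
    for n in range(len(far_end))
]

errors, weights = lms_echo_canceller(far_end, mic, num_taps=3, step_size=0.05)
print([round(w, 3) for w in weights])
```

Under these assumptions the adapted weights approach the hypothetical echo path, and both the filtering and the update loop run once over the filter taps for each input sample.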
One way of reducing the processing power requirements of an echo canceller is to introduce sub-band processing, i.e., the signal is divided into bands with a smaller bandwidth, which can be represented using a lower sampling frequency. An example of such a system is illustrated in FIG. 4.
Analyze filters 4125, 4131 divide the full band signals from the far end and the near end, respectively, into N sub-bands. The echo cancellation and miscellaneous sub-band processing (typically, but not limited to, non-linear processing and noise reduction) are performed in each sub-band, and thereafter a synthesize filter 5127 recreates the modified full band signals. Note that in the following complexity calculations, many minor processing blocks are omitted, as their contribution to the overall processing power requirements is small.
The analyze filters 4125, 4131 include a filter bank and a decimator, while the synthesize filter 5127 includes a filter bank and an interpolator. The full band signals have a sampling frequency $F_{s,\text{fullband}}$. The sub-band signals will have a sampling frequency of $F_{s,\text{sub-band}} = (K/N) \cdot F_{s,\text{fullband}}$. K is an oversampling factor, introduced to simplify and reduce the processing power requirements of the filter bank. K is always larger than one, but most often relatively small, typically less than two.
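The relation between the full band and sub-band sampling frequencies can be sketched as follows. The values N = 32 and K = 1.5 and the function name sub_band_sampling_rate are assumptions made only for this example.

```python
def sub_band_sampling_rate(fs_fullband_hz, num_sub_bands, oversampling_factor):
    """Return the sub-band sampling frequency (K / N) * Fs_fullband in Hz."""
    if oversampling_factor <= 1:
        # The text states that K is always larger than one.
        raise ValueError("oversampling factor K must be larger than one")
    return oversampling_factor / num_sub_bands * fs_fullband_hz


# Hypothetical example: 48 kHz full band signal, N = 32 sub-bands, K = 1.5.
fs_sub = sub_band_sampling_rate(48_000, 32, 1.5)
print(fs_sub)  # 2250.0
```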
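The processing power for the filtering and adaptation (assuming FIR and LMS) for the sub-band case is: $O_{\text{sub-band}} = c_1 \cdot \text{tail length} \cdot F_{s,\text{sub-band}}^2 = c_1 \cdot \text{tail length} \cdot \left((K/N) \cdot F_{s,\text{fullband}}\right)^2$, where $c_1$ is a proportionality constant.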
Thus, for a high N, the processing power requirements of the filtering can be reduced. However, for the total processing power, the overhead of the analyze and synthesize filters must be added.
Effective methods of analyzing and synthesizing the signals are based on a transform, for example an FFT. The methods have complexity $O_{\text{overhead}} = c_2 \cdot N \cdot \log_2 N$, where N is the number of sub-bands, and $c_2$ is a proportionality constant. The number of sub-bands will be proportional to $F_{s,\text{fullband}}$, and thus $O_{\text{overhead}} = c_3 \cdot F_{s,\text{fullband}} \cdot \log_2 F_{s,\text{fullband}}$.
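That is, the total complexity is: $O = O_{\text{sub-band}} + O_{\text{overhead}} = c_1 \cdot \text{tail length} \cdot \left((K/N) \cdot F_{s,\text{fullband}}\right)^2 + c_3 \cdot F_{s,\text{fullband}} \cdot \log_2 F_{s,\text{fullband}}$.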
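The trade-off between the two terms can be explored numerically. In the following sketch the proportionality constants are unknown and are set to 1.0 purely as an assumption, so only the scaling with N is meaningful, not the absolute numbers. The overhead is written in its N-dependent form, with a constant times N times the base-2 logarithm of N, and the function names are hypothetical.

```python
import math


def filtering_cost(tail_length_s, k, n, fs_fullband_hz, c1=1.0):
    """Sub-band filtering/adaptation term: c1 * tail length * ((K/N) * Fs_fullband)^2."""
    return c1 * tail_length_s * (k / n * fs_fullband_hz) ** 2


def overhead_cost(n, c2=1.0):
    """Transform-based analysis/synthesis overhead: c2 * N * log2(N)."""
    return c2 * n * math.log2(n)


# Hypothetical settings: 0.25 s tail, K = 1.5, 48 kHz full band signal.
for n in (32, 64, 128):
    print(n, filtering_cost(0.25, 1.5, n, 48_000), overhead_cost(n))
```

Doubling N divides the filtering term by four while the overhead term grows slightly faster than linearly, which is the trade-off described in the text.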
The echo filtering/adaptation is proportional to $F_{s,\text{fullband}}^2$. It is possible to reduce the filtering/adaptation part by increasing the number of sub-bands, but at the expense of increased overhead for the calculations of the sub-band signals. Still, by using a large number of sub-bands, i.e. using a large fast transform, it is possible to obtain a complexity which increases with $F_{s,\text{fullband}} \cdot \log_2 F_{s,\text{fullband}}$.
Though theoretically possible, this may be difficult to achieve in practical implementations, due to cache inefficiency in signal processing when applying large transforms.
Thus, efforts have been made for providing a system allowing reduction in the number of sub-bands without increasing the sub-bandwidths.