In a conventional conferencing system set-up that uses loudspeakers, two or more communication units are placed at separate sites. A signal transmitted from one site to another over the conference system experiences several delays, including a transmission delay and a processing delay. For a video conferencing system, the processing delay for the video signals is considerably larger than the processing delay for the audio signals. Because the video and audio signals have to be presented simultaneously, in phase, a lip sync delay is purposely introduced into the audio signal, in both the transmitting and receiving signal paths, in order to compensate for the longer video signal delay.
In a conventional conferencing system, one or more microphones capture a sound wave at a site A and transform it into a first audio signal. The first audio signal is transmitted to a site B, where a television set, or an amplifier and loudspeaker, reproduces the original sound wave by converting the first audio signal generated at site A back into a sound wave. The sound wave reproduced at site B is partially captured by the audio capturing system at site B, converted into a second audio signal, and transmitted back to the system at site A. This problem of having a sound wave captured at one site, transmitted to another site, and then transmitted back to the initial site is referred to as acoustic echo. In its most severe manifestation, the acoustic echo can cause feedback (howling) when the loop gain exceeds unity. The acoustic echo also causes the participants at both site A and site B to hear themselves, making conversation over the conferencing system difficult, particularly when there are delays in the system set-up, as is common in video conferencing systems, especially due to the above-mentioned lip sync delay. The acoustic echo problem is usually addressed using an acoustic echo canceller, described below.
FIGS. 1A and 1B are an overall view of a video conferencing system distributed at two sites, A and B. As for the conferencing system set-up, the video conferencing modules can be distributed at more than two sites, and the system set-up is functional even when only one site has a loudspeaker. At site A, the video module has a video capturing system 2141 that captures a video image and a video subsystem 2150 that encodes the video image. In parallel, a sound wave is captured by an audio capturing system 2111, and an audio subsystem 2130 encodes the sound wave into an audio signal. Due to processing delays in the video encoding system, the control system 2160 introduces an additional delay into the audio signal by means of a lip sync delay 2163 so as to achieve synchronization between the video and audio signals. The video and audio signals are combined in a multiplexer 2161, and the resulting audio-video signal is sent over the transmission channel 2300 to site B. An additional lip sync delay 2262 is inserted at site B. Further, the audio signal presented by the audio presenting device 2221 is materialized as a sound wave at site B. Part of the sound wave presented at site B arrives at the audio capturing device 2211, either as a direct sound wave or as a reflected sound wave. Capturing the sound at site B and transmitting it back to site A, together with the associated delays, forms the echo. The described delays sum up to be considerable, and the quality requirements for an echo canceller in a video conferencing system are therefore particularly high.
FIG. 2 shows an example of an acoustic echo canceller subsystem, which may be part of the audio system in the video conferencing system of FIGS. 1A and 1B. At least one of the participant sites has the acoustic echo canceller subsystem in order to reduce the echo in the communication system. The acoustic echo canceller subsystem 3100 is a full band model of a digital acoustic echo canceller. A full band model processes the complete audio band of the audio signals directly (e.g., up to 20 kHz; for video conferencing the band is typically up to 7 kHz, and in audio conferencing up to 3.4 kHz). The acoustic echo canceller subsystem 3100 is shown coupled to an acoustic system 3200 that includes an audio capturing system 3210 (microphone 3211) and an audio presenting system 3220 (amplifier 3221 and loudspeaker 3222). A direct sound wave 3241 and a reflected sound wave 3242 originating from the loudspeaker 3222 are captured by the microphone 3211 along with other sound waves 3251.
As already mentioned, compensation of acoustic echo is normally achieved by an acoustic echo canceller. The acoustic echo canceller is either a stand-alone device or an integrated part of the communication system. The acoustic echo canceller transforms the audio signal transmitted from site A to site B, for example using a linear or non-linear mathematical model, and then subtracts the mathematically modelled signal from the audio signal transmitted from site B to site A. In more detail, referring for example to the acoustic echo canceller subsystem 3100 at site B, the acoustic echo canceller passes the first audio signal 3131 from site A through the mathematical modeller of the acoustic system 3121, calculates an estimate 3133 of the echo signal, subtracts the estimated echo signal from the second audio signal 3132 captured at site B, and transmits the second audio signal 3135, less the estimated echo, back to site A. The echo canceller subsystem of FIG. 2 also uses an estimation error 3134, i.e., the difference between the estimated echo and the actual echo, to update or adapt the mathematical model at 3141 to background noise and changes of the environment at the position where the sound is captured by the audio capturing device.
The model of the acoustic system 3121 used in most echo cancellers is a FIR (Finite Impulse Response) filter, approximating the transfer function of the direct sound and most of the reflections in the room. A full-band model of the acoustic system 3121 is relatively complex and requires considerable processing power, so alternatives to full-band models are normally preferred.
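The FIR modelling and subtraction described above can be sketched as follows (a minimal illustration in Python/NumPy; the function names are illustrative and do not correspond to the figure numerals):

```python
import numpy as np

def estimate_echo(far_end, h_hat):
    """Convolve the loudspeaker (far end) signal with the estimated
    FIR taps h_hat to produce the modelled echo."""
    return np.convolve(far_end, h_hat)[:len(far_end)]

def cancel_echo(mic, far_end, h_hat):
    """Subtract the modelled echo from the captured microphone signal."""
    return mic - estimate_echo(far_end, h_hat)
```

If the taps match the true room impulse response, the residual after subtraction is (ideally) zero; in practice the taps must be adapted continuously, as described below.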
One way of reducing the processing power requirements of an echo canceller is to introduce sub band processing, i.e., the signal is divided into bands with smaller bandwidth, which can be represented using a lower sampling frequency. An example of such a system is illustrated in FIG. 3. The loudspeaker and microphone signals are divided by the analyze filter into sub bands, each representing a smaller range of frequencies of the original loudspeaker and microphone signals, respectively. Similar echo cancelling and other processing are performed on each sub band before all bands of the modified microphone signal are merged together by the synthesize filter to form the full band signal.
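The analyze/synthesize split can be illustrated with the simplest possible two-band filter bank, a Haar QMF pair with decimation by two (real systems use longer filters and more bands; this is only a sketch of the principle):

```python
import numpy as np

def analyze(x):
    """Split x into two half-rate sub bands (low and high) using a
    Haar quadrature-mirror pair with decimation by two."""
    x = x[:len(x) // 2 * 2]                  # truncate to even length
    low = (x[0::2] + x[1::2]) / np.sqrt(2)   # sum branch: low frequencies
    high = (x[0::2] - x[1::2]) / np.sqrt(2)  # difference branch: high frequencies
    return low, high

def synthesize(low, high):
    """Merge the (possibly processed) sub bands back into a full band
    signal; for this bank the reconstruction is exact."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x
```

Each sub band runs at half the original sampling rate, so an echo model operating per band needs fewer taps and fewer operations per second than a full-band model.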
In some cases, it may be convenient to combine sub band and full band processing. Some sub-algorithms can be performed in full band, in sub bands, or in a combination of both.
The core component in an echo canceller is the already mentioned acoustic model (most commonly implemented as a FIR filter). The acoustic model attempts to imitate the transfer function of the far end signal from the loudspeaker to the microphone. This adaptive model is updated by a gradient search algorithm, which tries to minimize an error function, namely the power of the signal after the echo estimate is subtracted. For a mono echo canceller, this approach works well, as the error function has a single, unique minimum.
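A common gradient search variant for this adaptation is normalized LMS (NLMS); the following sketch (Python/NumPy, names illustrative) shows how the FIR taps are nudged along the negative gradient of the instantaneous squared error at every sample:

```python
import numpy as np

def run_nlms(far_end, mic, n_taps, mu=0.5, eps=1e-8):
    """Adapt a FIR echo model with normalized LMS: each step moves the
    taps along the negative gradient of the squared residual,
    normalized by the far-end signal power in the tap buffer."""
    h_hat = np.zeros(n_taps)                  # adaptive FIR taps
    residual = np.empty_like(mic)
    for n in range(len(mic)):
        # Most-recent-first buffer of loudspeaker samples,
        # zero-padded before the start of the signal.
        x_buf = far_end[max(0, n - n_taps + 1):n + 1][::-1]
        x_buf = np.pad(x_buf, (0, n_taps - len(x_buf)))
        e = mic[n] - np.dot(h_hat, x_buf)     # error after echo estimate
        residual[n] = e
        h_hat += mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)
    return h_hat, residual
```

In the mono, noise-free case the taps converge to the true impulse response and the residual tends to zero, which is precisely the unique-minimum property referred to above.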
However, in high quality communications, it is often desirable to transmit and present high quality multi channel audio, e.g., stereo audio. Stereo audio includes audio signals from two separate channels representing different spatial parts of a certain sound composition. Playing each channel on its respective loudspeaker creates a more faithful audio reproduction, as the listeners perceive a spatial difference between the audio sources from which the sound composition is created.
The signal that is played on one loudspeaker differs from the signal presented on the other loudspeaker(s). Thus, for a stereo (or multi channel) echo canceller, the transfer function from each respective loudspeaker to the microphone needs to be compensated for. This is a somewhat different situation compared to mono audio echo cancellation, as there are two different but correlated signals to compensate for.
In addition, the correlation between the different channels tends to be significant. This causes the normal gradient search algorithms to suffer. Mathematically expressed, the correlation introduces several false minimum solutions to the error function. This is described, inter alia, in Steven L. Gay and Jacob Benesty, “Acoustic Signal Processing for Telecommunication”, Boston: Kluwer Academic Publishers, 2000. The fundamental problem is that when multiple channels carry linearly related signals, the normal equation corresponding to the error function solved by the adaptive algorithm is singular. This implies that there is no unique solution to the equation, but an infinite number of solutions, and it can be shown that all but the true one depend on the impulse responses of the transmission room (in this context, the transmission room may also include a synthesized transmission room, e.g., recorded or programmed material played back at the far end side). The gradient search algorithm may then be trapped in a minimum that is not the true minimum solution.
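The singularity can be made concrete in the simplest, single-tap-per-channel case: if the right channel is an exact scaled copy of the left, the far-end covariance matrix appearing in the normal equations loses rank, so those equations admit infinitely many solutions (a sketch, assuming Python/NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
left = rng.standard_normal(5000)
right = 0.7 * left                    # channels are linearly related

X = np.stack([left, right])           # stacked far-end signals, 2 x N
R = X @ X.T / X.shape[1]              # 2 x 2 covariance estimate
rank = np.linalg.matrix_rank(R)       # 1 rather than full rank 2
```

Because R is singular, any tap pair whose combination along the deficient direction is arbitrary yields the same error power, which is exactly the family of false solutions described above.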
Another common way of expressing this stereo echo canceller adaptation problem is that it is difficult to distinguish between a room response change and an audio “movement” in the stereo image. For example, the acoustic model has to reconverge if a talker starts speaking at a different location at the far end side. No adaptive algorithm can track such a change sufficiently fast, and a mono echo canceller applied to the multi-channel case does not give satisfactory performance.
A typical approach for overcoming the above-mentioned false minimum solutions problem is shown in FIG. 4. Compared to the mono case, the analyze filter is duplicated, dividing both the right and the left loudspeaker signal into sub bands. The acoustic model is divided into two models (per sub band), one for the right channel transfer function and one for the left channel transfer function.
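With two models, the echo estimate becomes the sum of two convolutions, one per channel (an illustrative sketch; names are not from the figures):

```python
import numpy as np

def stereo_echo_estimate(left, right, h_left, h_right):
    """Echo estimate for the stereo case: the microphone picks up both
    loudspeakers, so one FIR model is kept per channel and the two
    modelled echoes are summed before subtraction."""
    n = len(left)
    return np.convolve(left, h_left)[:n] + np.convolve(right, h_right)[:n]
```

When the two channels are sufficiently different, both models can be identified; when the channels are strongly correlated, only their combined effect is observable, which is why the decorrelation step described next is needed.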
To overcome the false minimum solutions introduced by the correlation between the left and right channel signals, a decorrelation algorithm is introduced. This decorrelation makes it possible to correctly update the acoustic models. However, the decorrelation technique also modifies the signals that are presented on the loudspeakers. While quality preserving modifications could be acceptable, the decorrelation techniques according to the prior art severely distort the audio.
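One widely cited decorrelation technique of this kind, described by Benesty et al., adds a small half-wave rectified component with opposite sign to each channel; the sketch below illustrates that idea in general (it is not the specific prior-art implementation criticized here):

```python
import numpy as np

def decorrelate(left, right, alpha=0.5):
    """Nonlinear decorrelation: add a half-wave rectified component with
    opposite sign to each channel so the channels are no longer linearly
    related. alpha trades decorrelation strength against audible
    distortion (large alpha decorrelates more but distorts more)."""
    left_out = left + alpha * (left + np.abs(left)) / 2
    right_out = right + alpha * (right - np.abs(right)) / 2
    return left_out, right_out
```

After this preprocessing the far-end covariance matrix regains full rank, so the per-channel models can be identified, but at the cost of the added nonlinear component being audible.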
Therefore, these techniques may solve the stereo echo problem, but they do not preserve the necessary audio quality.