1. Field of the Invention
This invention relates to echo cancellation, in particular for use in spatialised audio systems such as are used in videoconference systems, that is, systems in which two or more users at different locations participate in a conversation using audiovisual telecommunications systems. At present most systems use a wideband monophonic audio system with the sound reproduced using either an internal loudspeaker fitted to the vision monitor, or an auxiliary hi-fi loudspeaker. This provides good quality audio, but there is no method of co-locating the video and sound images. For a small Visual Display Unit this is not a problem, but as the move to more lifelike and immersive environments continues it will be necessary to have spatialised sound.
2. Description of Related Art
In any two-way audio system, in which sound travels both to and from the same user, there is a problem of acoustic feedback or “echo” between the loudspeaker of the input channel and the microphone of the output channel. As shown in FIG. 1, which illustrates a simple mono system, echo occurs when there is an acoustic path, represented by h1(t), between the loudspeaker 22 and the microphone 12 in Room “B”. The talker 31 in Room “A” hears an echo of his own voice, which passes from the microphone 11 in Room “A” over the outward channel 41 to the loudspeaker 22 in Room “B”, through the acoustic path h1(t) to the microphone 12 in Room “B”, and back over the return channel 42 to the loudspeaker 21 in Room “A”. The listener 32 in Room “B” also hears an echo of the talker 31, which has passed through the path described above and then through the acoustic path h2(t) between the loudspeaker 21 and the microphone 11 in Room “A”, to pass a second time over the outward channel 41 to the loudspeaker 22 in Room “B”. These are called ‘talker echo’ and ‘listener echo’ respectively.
Depending on their size and time delays, echoes may be anything from unnoticeable to a devastating impairment resulting in instability and howling.
The creation of echo may be avoided by the use of body-mounted devices. Headphones can be used to isolate the incoming sound from the microphone. Alternatively, “close” microphones (e.g. clipped to the user's clothing) can be used. These have a sensitivity which falls off rapidly with distance, so that the wearer's voice is detected clearly whilst signals emanating from a loudspeaker some distance away are detected only at a low volume. However, body-mounted devices are not always practical: they are inconvenient for the user, and each user at a given location must be provided for individually.
In situations where the creation of echo cannot be avoided, some form of echo control or cancellation is desirable, to avoid the echo signal returning to the source of the original signal. A simple form of echo control, called echo suppression, is illustrated in FIG. 2. Echoes are prevented by only allowing signal transmission if the person in the room is speaking. Switches 51, 52, or variable attenuators in more sophisticated systems, are controlled by analysing the send and receive signals 41, 42 and using a decision algorithm to determine when transmission is permitted. This is a very effective echo control method, but can be very intrusive. Double talk, i.e. both parties talking at the same time, is not possible.
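The switching behaviour described above may be sketched as follows. This is a minimal illustration only: the decision rule (a frame-by-frame energy comparison), the `threshold` value and the function name `suppress_echo` are assumptions introduced here and are not taken from FIG. 2.

```python
def suppress_echo(send_frames, receive_frames, threshold=2.0):
    """Gate the send path frame by frame (hypothetical decision rule).

    A frame is transmitted only if the microphone (send) energy clearly
    exceeds the loudspeaker (receive) energy; otherwise it is muted.
    """
    def energy(frame):
        return sum(v * v for v in frame)

    gated = []
    for send, recv in zip(send_frames, receive_frames):
        if energy(send) > threshold * energy(recv):
            gated.append(list(send))            # switch closed: transmit
        else:
            gated.append([0.0] * len(send))     # switch open: mute
    return gated


# Frame 0: only the local user speaks. Frame 1: only echo of the far end.
# Frame 2: double talk, where the local speech is lost along with the echo.
send_frames = [[0.5, -0.5], [0.1, 0.1], [0.5, 0.5]]
receive_frames = [[0.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
gated = suppress_echo(send_frames, receive_frames)
```

In this sketch the third frame is muted even though the local user is speaking, which illustrates why such a scheme does not permit double talk.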
The principle behind echo cancellation may be seen from FIG. 3. The echo signal in the return path 42, represented by d(t), is caused by the acoustic path between loudspeaker and microphone. Cancellation is achieved in a canceller 62 by creating a synthetic model ĥ(t) of the signal path such that the echo may be removed by subtraction in a combiner 90 in the return path 42. The signal e(t) is now free of echoes, containing only sounds that originated in room B. The modelling of ĥ(t) is usually achieved by adaptation techniques which drive e(t) towards zero, the simplest and commonest being the ‘least mean squares’ (LMS) algorithm. Its drawback is that the time taken to adapt is dependent on the signal characteristics. Other algorithms such as ‘recursive least squares’ (RLS) or ‘affine projection’ (AP) give better performance, but at the cost of greatly increased processing requirements.
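By way of illustration, a basic LMS adaptation loop may be sketched as below. The filter length, step size `mu`, the white-noise test signal and the echo path `h_true` are assumptions chosen for the example; no local speech is present, so e(t) should tend towards zero as the model converges.

```python
import random


def lms_echo_canceller(x, d, num_taps=8, mu=0.05):
    """Adapt a model h_hat of the echo path; return (e, h_hat).

    x: samples sent to the loudspeaker; d: microphone samples containing echo.
    """
    h_hat = [0.0] * num_taps      # synthetic model of the echo path
    buf = [0.0] * num_taps        # recent loudspeaker samples, newest first
    e = []
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        y_n = sum(h * b for h, b in zip(h_hat, buf))    # synthetic echo
        e_n = d_n - y_n                                 # subtraction (combiner)
        # LMS update: step each coefficient along the error gradient.
        h_hat = [h + mu * e_n * b for h, b in zip(h_hat, buf)]
        e.append(e_n)
    return e, h_hat


random.seed(0)
h_true = [0.5, -0.3, 0.2, 0.1]    # hypothetical echo path, for the test only
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
d = [sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
     for n in range(len(x))]

e, h_hat = lms_echo_canceller(x, d)
```

With a white input the model converges quickly; with speech, whose spectrum is far from flat, convergence under the same update would be expected to be slower, which is the signal dependence referred to above.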
Monophonic echo cancellation is a mature technology, and widely used in telecommunications; loud-speaking telephones, teleconference systems, network echo control and data transmission are good examples. For “artificial” spatialised sound systems, in which a mono signal is reproduced in two or more loudspeakers, manipulated linearly in gain and delay to seem to originate from the required direction, monophonic echo cancellation techniques can be used. However, ‘real’ spatial sound, in which a multi-channel signal is transmitted and the sending room characteristics are recreated in the receiving room, presents fundamental echo cancellation problems.
For multiple channel echo cancellation the number of echo paths is the product of the number of microphones and the number of loudspeakers in room B, as shown in FIG. 4 for a system with two of each. FIG. 4 shows the echo adaptation units 62L, 62R associated with each of two input channels 41L, 41R for a microphone 12R, whose outputs are combined in a combiner unit 91 to provide a cancellation signal ĥ(t). Note that further adaptation units will be required for the other path 42L (not shown) associated with the other microphone 12L.
The problem may thus be seen as one of characterising an unknown system having several inputs (gL(t), gR(t)) for each output ĥ(t). The result is that adaptation algorithms can only converge to a correct solution if xL(t) and xR(t) are completely independent. This is unlikely for a stereo (or multichannel) signal in which the original sources are associated. One approach for a motionless source is to assume that one channel may be derived from the other, allowing a single channel echo canceller 62 to model both echo paths hL(t), hR(t). This is illustrated in FIG. 5. Causality is maintained by a selector 60 taking the canceller input from the channel (gL(t) or gR(t)) with less delay. The drawback here is that the canceller 62 is now modelling both h(t) and g(t). If the source moves, or a different speaker starts talking, g(t) will change (the relative gains and delays of gL(t) and gR(t)), requiring re-adaptation. It also assumes both channels of g(t) have stable inverses.
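The non-uniqueness problem may be illustrated numerically. In the sketch below the sending-room paths gL(t), gR(t) are reduced to simple gains and the receiving-room paths `h_l`, `h_r` are short hypothetical impulse responses; all values are assumptions chosen for the example. A model that is wrong for each individual path still cancels the echo perfectly while the two channels are derived from one source, and fails as soon as the gains change.

```python
import random


def convolve(h, x):
    """Causal FIR filtering of x by h, output the same length as x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def echo(h_l, h_r, x_l, x_r):
    """Echo at one microphone from the left and right loudspeakers."""
    return [a + b for a, b in zip(convolve(h_l, x_l), convolve(h_r, x_r))]


def max_residual(d, model_l, model_r, x_l, x_r):
    """Largest absolute error after subtracting the modelled echo."""
    return max(abs(a - b) for a, b in zip(d, echo(model_l, model_r, x_l, x_r)))


random.seed(1)
s = [random.uniform(-1.0, 1.0) for _ in range(200)]   # single source
h_l, h_r = [0.4, 0.2], [0.3, -0.1]                    # hypothetical true paths

# Sending room reduced to gains: gL = 1.0, gR = 0.5.
x_l = [1.0 * v for v in s]
x_r = [0.5 * v for v in s]
d = echo(h_l, h_r, x_l, x_r)

# An incorrect model that lumps everything into the left path.
wrong_l = [a + 0.5 * b for a, b in zip(h_l, h_r)]
wrong_r = [0.0, 0.0]
residual_before = max_residual(d, wrong_l, wrong_r, x_l, x_r)

# The talker moves: gL = 0.5, gR = 1.0. The receiving room is unchanged.
x_l2 = [0.5 * v for v in s]
x_r2 = [1.0 * v for v in s]
d2 = echo(h_l, h_r, x_l2, x_r2)
residual_true_after = max_residual(d2, h_l, h_r, x_l2, x_r2)
residual_wrong_after = max_residual(d2, wrong_l, wrong_r, x_l2, x_r2)
```

Here `residual_before` is zero to within rounding error, so an adaptation algorithm has no reason to prefer the true paths over the incorrect model, whereas `residual_wrong_after` is clearly non-zero and re-adaptation would be needed.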
Another approach is to try to reduce the cross correlation between channels. This must be done without degrading the audio quality or stereo image. Several methods have been tried. Adding independent noise sources to each channel has been proposed, but the noise is audible. Frequency-shifting one channel relative to the other disturbs the stereo image. Decorrelating filters have also been tried, but adequate decorrelation cannot be obtained. Adding non-linear distortion has been claimed to be inaudible due to psycho-acoustic masking effects. The use of interleaving comb filters, in which at any given frequency only one channel contains energy, is also reported to work, but performance is poor below 1 kHz.
Unless the decorrelation is very good, a fast adaptation algorithm such as stereo fast RLS (recursive least squares) will be required. Such algorithms are computationally very expensive, requiring of the order of 28 multiplications and 28 additions for each filter coefficient. For example, a 75 ms stereo fast RLS echo canceller sampling at 16 kHz requires more than 10⁹ multiplications per second, whereas even the most modern DSP devices (e.g. Texas TMS320C6X) only just manage 2×10⁸ multiplications per second.
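The figure quoted above can be checked with a short calculation. The sketch assumes that the 28 multiplications are incurred per coefficient per sample, and that the stereo canceller for one microphone comprises one 75 ms filter per channel; both are interpretations made for this example.

```python
sample_rate_hz = 16_000
tail_length_ms = 75
channels = 2               # stereo: one filter per loudspeaker channel
mults_per_coefficient = 28

# 75 ms at 16 kHz gives 1200 coefficients per filter.
taps_per_channel = sample_rate_hz * tail_length_ms // 1000
coefficients = taps_per_channel * channels

# Every coefficient is updated once per sample period.
mults_per_second = coefficients * mults_per_coefficient * sample_rate_hz

dsp_mults_per_second = 2 * 10**8
load_ratio = mults_per_second / dsp_mults_per_second

print(taps_per_channel, coefficients, mults_per_second)
# prints: 1200 2400 1075200000
```

Under these assumptions the requirement is roughly five times the stated capacity of a single such device, for one microphone alone.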