The present invention relates to acoustic echo cancellation and more specifically to stereophonic acoustic echo cancellation.
The evolution of teleconferencing to a more lifelike and transparent audio/video medium depends upon, among other things, the evolution of teleconferencing audio capabilities. The more realistic the sound, the more lifelike a teleconference will be and the more people and businesses will use it. Some present-day teleconferencing systems have already evolved to the point of including high-fidelity audio systems (100-7000 Hz bandwidth). These systems provide a significant improvement over older telephone systems (200-3200 Hz bandwidth). However, such high fidelity systems are by no means the limits of audio evolution in teleconferencing.
Spatial realism is highly desirable for audio/video teleconferencing. This is because of the need of a listener to follow, for example, a discussion among a panel of dynamic, multiple, and possibly simultaneous talkers. The need for spatial realism leads to consideration of multi-channel audio systems in teleconferencing, which, at a minimum, involves two channels (i.e., stereophonic).
Many present-day teleconferencing systems have a single (monophonic) full-duplex audio channel for voice communication. These systems, which range from simple speaker-phones to modern video teleconferencing equipment, typically employ acoustic echo cancelers (AECs) to remove undesired echoes that result from acoustic coupling. This coupling results when sound, emitted from the teleconference loudspeaker (in response to a signal from a remote location), arrives at the teleconference microphone in the same room. The microphone generates a signal in response to this sound (i.e., this echo). This microphone signal is then transmitted to the remote location. An AEC employs an adaptive filter to estimate the impulse response from the loudspeaker to the microphone in a room in which an echo occurs and to generate a signal which is subtracted from the receiver signal to cancel that echo electrically. Like monophonic teleconferencing, high-quality stereophonic teleconferencing requires AEC. (See, e.g., M. M. Sondhi and D. R. Morgan, "Acoustic echo cancellation for stereophonic teleconferencing," Proc. IEEE ASSP Workshop Appls. Signal Processing Audio Acoustics, 1991, which is incorporated herein by reference.)
Stereophonic AEC presents a problem which does not exist in the monophonic context. In monophonic teleconferencing systems, a single adaptive filter is used to estimate a single impulse response from the loudspeaker to the microphone in the room experiencing an echo. There is only one impulse response to estimate because there is only one loudspeaker and one microphone in the room. As the adaptive filter impulse response estimate approaches the true impulse response of the room, the difference between these responses approaches zero. Once their difference is very small, the effects of echo are reduced. The ability to reduce echo is independent of the signal from the loudspeaker, since the real and estimated impulse responses are equal (or nearly so) and both the room (with its real impulse response) and the adaptive filter (with its estimated impulse response) are excited by the same signal.
In multi-channel stereophonic teleconferencing systems, multiple (e.g., two) adaptive filters are used to estimate the multiple (e.g., two) impulse responses of the room. Each adaptive filter is associated with a distinct acoustic path from a loudspeaker to a microphone in the receiving room. Rather than being able to independently estimate the individual impulse responses of the room, conventional stereophonic AEC systems derive impulse responses which have a combined effect of reducing echo. This limitation on independent response derivation is due to the fact that the AEC system can measure only a single signal per microphone. This signal is the sum of multiple acoustic signals arriving at a single microphone through multiple acoustic paths. Thus, the AEC cannot observe the individual impulse responses of the room. The problem with deriving impulse response estimates based on the combined effect of reduced echo is that such combined effect does not necessarily mean that the actual individual impulse responses are accurately estimated. When individual impulse responses are not accurately estimated, the ability of the AEC system to be robust to changes in the acoustic characteristics of the remote location is limited, commonly resulting in undesirable lapses in performance. (See, e.g., M. M. Sondhi, D. R. Morgan, and J. L. Hall, "Stereophonic Acoustic Echo Cancellation - An Overview of the Fundamental Problem," IEEE Signal Processing Lett., Vol. 2, No. 8, August 1995, pp. 148-151, which is incorporated herein by reference.)
FIG. 1 presents a schematic diagram of a conventional stereophonic (two-channel) AEC system in the context of stereo teleconferencing between two locations. A transmission room 1 is depicted on the right of the figure. Transmission room 1 includes two microphones 2, 3 which are used to pick up signals from an acoustic source 4 (e.g., a speaking person) via two acoustic paths that are characterized by the impulse responses g1(t) and g2(t). (For clarity of presentation, all acoustic paths are assumed to include the corresponding loudspeaker and/or microphone responses.) Output from microphones 2, 3 are stereophonic channel source signals x2(t) and x1(t), respectively. These stereophonic channel source signals, x2(t) and x1(t), are then transmitted via a telecommunications network (such as a telephone or an ATM network) to loudspeakers 11, 12 in a receiving room 10 (shown on the left). For convenience, this direction will herein be termed the upstream direction and transmissions in the opposite direction, i.e., from room 10 to room 1, will be termed the downstream direction. The terms upstream and downstream are intended to have no particular connotation other than to differentiate between two directions. Loudspeakers 11, 12 are acoustically coupled to microphone 14 in receiving room 10 via the paths indicated with impulse responses h1(t) and h2(t). These are the paths by which acoustic echo signals arrive at microphone 14.
The output of the microphone 14 is signal y(t), which is a signal representing acoustic signals in the receiving room impinging on the microphone. These acoustic signals include the acoustic echo signals. Loudspeakers 11, 12 are also acoustically coupled to microphone 13 by other acoustic paths. For clarity of presentation, however, only the coupling to microphone 14 and AEC with respect to its output will be discussed.
Further, those of ordinary skill in the art will recognize that the analysis concerning AEC for the output of microphone 14 is applicable to the output of microphone 13 as well. Similarly, those skilled in the art will recognize that AEC as performed for the outputs of microphones 13 and 14 in receiving room 10 also may be advantageously performed for the outputs of microphones 2 and 3 in transmitting room 1, wherein the functions of receiving room 10 and transmitting room 1 are swapped.
If nothing were done to cancel the acoustic echo signals in receiving room 10, these echoes would pass back to loudspeaker 5 in transmission room 1 (via microphone 14 and the telecommunications network) and would be circulated repeatedly, producing undesirable multiple echoes, or even worse, howling instability. This, of course, is the reason that providing AEC capability is advantageous.
Conventional AECs typically derive an estimate of the echo with use of a finite impulse response (FIR) filter with adjustable coefficients. This "adaptable" filter models the acoustic impulse response of the echo path in the receiving room 10. FIG. 1 generally illustrates this technique with use of AEC 20 using two adaptive FIR filters 16, 15 having impulse responses, ĥ1(t) and ĥ2(t), respectively, to model the two echo paths in the receiving room 10. Filters 16, 15 may be located anywhere in the system (i.e., at the transmitting room 1, in the telecommunications network, or at the receiving room 10), but are preferably located at the receiving room 10.
Driving these filters 16, 15 with the upstream loudspeaker signals x1(t) and x2(t) produces signals ŷ1(t) and ŷ2(t), which are components of a total echo estimate. The sum of these two echo estimate component signals yields the total echo estimate signal, ŷ(t), at the output of summing circuit 17. This echo estimate signal, ŷ(t), is subtracted from the downstream signal y(t) by subtraction circuit 18 to form an error signal e(t). Error signal e(t) is intended to be small (i.e., driven towards zero) in the absence of near-end speech (i.e., speech generated in the receiving room).
In most conventional AEC applications, the coefficients of adaptive filters 15, 16 are derived using well-known techniques, such as the LMS (or stochastic gradient) algorithm, familiar to those of ordinary skill in the art. The coefficients are updated in an effort to reduce the error signal to zero. As such, the coefficients ĥ1(t) and ĥ2(t) are a function of the stereophonic signals, x2(t) and x1(t), and the error signal, e(t).
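The adaptation loop described above can be sketched as follows. This is a minimal illustration, not the implementation of AEC 20: it uses the normalized variant of LMS (chosen here only so that one step size works without tuning), and the function name `stereo_lms`, the step size `mu`, the regularizer `eps`, and the three-tap room responses are all assumptions made for the example. The test inputs are independent white-noise sequences, which is a favorable case chosen for illustration.

```python
import random

def stereo_lms(x1, x2, y, N, mu=0.5, eps=1e-8):
    """Two-channel normalized LMS sketch; returns (h1_hat, h2_hat, errors)."""
    h1_hat = [0.0] * N
    h2_hat = [0.0] * N
    errors = []
    for t in range(len(y)):
        # Vectors of the N most recent loudspeaker samples (zero before t = 0).
        v1 = [x1[t - k] if t - k >= 0 else 0.0 for k in range(N)]
        v2 = [x2[t - k] if t - k >= 0 else 0.0 for k in range(N)]
        # Total echo estimate: sum of the two filter outputs.
        y_hat = (sum(a * b for a, b in zip(h1_hat, v1))
                 + sum(a * b for a, b in zip(h2_hat, v2)))
        # Error signal: microphone signal minus echo estimate.
        e = y[t] - y_hat
        # Normalized stochastic-gradient update of both coefficient sets.
        norm = sum(v * v for v in v1) + sum(v * v for v in v2) + eps
        step = mu * e / norm
        h1_hat = [h + step * v for h, v in zip(h1_hat, v1)]
        h2_hat = [h + step * v for h, v in zip(h2_hat, v2)]
        errors.append(e)
    return h1_hat, h2_hat, errors

def echo_at(h, x, t):
    """Output of FIR path h driven by x, evaluated at time index t."""
    return sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)

# Hypothetical receiving-room echo paths and independent test signals.
random.seed(0)
h1 = [0.5, -0.3, 0.2]
h2 = [0.4, 0.1, -0.2]
x1 = [random.gauss(0.0, 1.0) for _ in range(3000)]
x2 = [random.gauss(0.0, 1.0) for _ in range(3000)]
y = [echo_at(h1, x1, t) + echo_at(h2, x2, t) for t in range(3000)]

h1_hat, h2_hat, errors = stereo_lms(x1, x2, y, N=3)
```

With independent inputs such as these, both coefficient sets approach the true responses. When x1(t) and x2(t) are derived from a common source, the error can still be driven toward zero without the individual estimates being accurate.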
As mentioned above, unlike monophonic AECs, conventional stereophonic AECs do not independently estimate the individual impulse responses of a room. Rather, conventional stereophonic AEC systems derive impulse responses which have a combined effect of reducing echo. Unless individual impulse responses are accurately estimated, the ability of the AEC system to be robust to changes in the acoustic characteristics of the remote location is limited and undesirable lapses in performance may occur.
To see this problem in terms of the operation of the stereophonic teleconferencing system of FIG. 1, consider the following. The signal output from microphone 14 may be described as
$y(t) = h_1(t) * x_1(t) + h_2(t) * x_2(t),$    (Eq. 1)
where h1 and h2 are the loudspeaker-to-microphone impulse responses in receiving room 10, x1 and x2 are stereophonic source signals provided to loudspeakers 11, 12, and "*" denotes convolution. (Sampled signals are assumed throughout so that the time index, t, is an integer.) The error signal, e(t), may be written as
$e(t) = y(t) - \hat{\mathbf{h}}_1^T \mathbf{x}_1 - \hat{\mathbf{h}}_2^T \mathbf{x}_2,$    (Eq. 2a)
where $\hat{\mathbf{h}}_1$ and $\hat{\mathbf{h}}_2$ are N-dimensional vectors of the adaptive filter coefficients and where $\mathbf{x}_1 = [x_1(t), x_1(t-1), \ldots, x_1(t-N+1)]^T$ and $\mathbf{x}_2 = [x_2(t), x_2(t-1), \ldots, x_2(t-N+1)]^T$ are vectors comprising the N most recent source signal samples, with superscript T denoting a transpose operation. The error signal can be written more compactly as
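$e(t) = y(t) - \hat{\mathbf{h}}^T \mathbf{x},$    (Eq. 2b)
where $\hat{\mathbf{h}} = [\hat{\mathbf{h}}_1^T \,|\, \hat{\mathbf{h}}_2^T]^T$ is the concatenation of $\hat{\mathbf{h}}_1$ and $\hat{\mathbf{h}}_2$, and likewise, $\mathbf{x} = [\mathbf{x}_1^T \,|\, \mathbf{x}_2^T]^T$.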
Assuming that N is large enough, the signal y(t) can be written as
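$y(t) = \mathbf{h}_1^T \mathbf{x}_1 + \mathbf{h}_2^T \mathbf{x}_2 = \mathbf{h}^T \mathbf{x}$    (Eq. 3)
where $\mathbf{h}_1$ and $\mathbf{h}_2$ are the true impulse response vectors in the receiving room and where $\mathbf{h} = [\mathbf{h}_1^T \,|\, \mathbf{h}_2^T]^T$. In terms of $\mathbf{h}$, we may rewrite (Eq. 2b) as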
$e(t) = (\mathbf{h} - \hat{\mathbf{h}})^T \mathbf{x} = \tilde{\mathbf{h}}^T \mathbf{x}$    (Eq. 4)
where
$\tilde{\mathbf{h}} = \mathbf{h} - \hat{\mathbf{h}}$    (Eq. 5)
is the impulse response misalignment vector.
Assume that e(t) has been driven to be identically zero. From (Eq. 4), it follows that
$\tilde{h}_1 * x_1 + \tilde{h}_2 * x_2 = 0.$    (Eq. 6)
For the single-talker situation depicted in FIG. 1, for example, this further implies
$[\tilde{h}_1 * g_1 + \tilde{h}_2 * g_2] * s(t) = 0,$    (Eq. 7)
where s(t) is the acoustic signal generated by the talker in the transmission room. In the frequency domain, (Eq. 7) becomes
$[\tilde{H}_1(j\omega) G_1(j\omega) + \tilde{H}_2(j\omega) G_2(j\omega)]\, S(j\omega) = 0,$    (Eq. 8)
where the Fourier transforms of time functions are denoted by corresponding uppercase letters.
Consider first a single-channel situation, say G2=0. In that case, except at zeroes of G1S, (Eq. 8) yields $\tilde{H}_1 = 0$. Thus, complete alignment (i.e., ĥ1=h1) is achieved by ensuring that G1S does not vanish at any frequency. Of course, if the receiving room impulse response, h1, changes, then the adaptation algorithm of adaptive filters 15, 16 must track these variations.
In the stereophonic situation, on the other hand, even if S has no zeroes in the frequency range of interest, the best that can be achieved is
$\tilde{H}_1 G_1 + \tilde{H}_2 G_2 = 0.$    (Eq. 9)
This equation does not imply that $\tilde{H}_1 = \tilde{H}_2 = 0$, which is the condition of complete alignment. The problem with stereo echo cancelers is apparent from (Eq. 9): even if the receiving room impulse responses, h1 and h2, are fixed, any change in G1 or G2 requires adjustment of $\tilde{H}_1$ and $\tilde{H}_2$ (except in the special case where $\tilde{H}_1 = \tilde{H}_2 = 0$). Thus, not only must the adaptation algorithm of filters 15, 16 track variations in the receiving room, it must also track variations in the transmission room. The latter variations are particularly difficult to track. For instance, if one talker stops talking and another starts talking at a different location in the room, the impulse responses, g1 and g2, change abruptly and by very large amounts.
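The non-uniqueness expressed by (Eq. 9) can be checked numerically. The sketch below is illustrative only: the short impulse responses, the constant `c`, and the helper names `convolve` and `max_residual` are assumptions made for the example. It builds a pair of misaligned filter estimates by adding a multiple of g2 to h1 and subtracting the same multiple of g1 from h2, so that the two misalignment terms cancel for a single talker, and then shows that the cancellation breaks when g1 changes.

```python
import random

def convolve(a, b):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

random.seed(1)
s = [random.gauss(0.0, 1.0) for _ in range(200)]  # talker signal s(t)

# Hypothetical transmission-room paths (g) and receiving-room paths (h).
g1 = [1.0, 0.5, 0.0]
g2 = [0.6, -0.4, 0.3]
h1 = [0.5, -0.3, 0.2]
h2 = [0.4, 0.1, -0.2]

# Deliberately wrong estimates whose errors cancel for this g1, g2:
# (h1 - h1_hat) * g1 + (h2 - h2_hat) * g2 = -c g2*g1 + c g1*g2 = 0.
c = 0.7
h1_hat = [h + c * g for h, g in zip(h1, g2)]
h2_hat = [h - c * g for h, g in zip(h2, g1)]

def max_residual(ga, gb):
    """Largest |e(t)| when the talker reaches the microphones via ga, gb."""
    x1 = convolve(ga, s)
    x2 = convolve(gb, s)
    y = [a + b for a, b in zip(convolve(h1, x1), convolve(h2, x2))]
    y_hat = [a + b for a, b in zip(convolve(h1_hat, x1), convolve(h2_hat, x2))]
    return max(abs(a - b) for a, b in zip(y, y_hat))

# Euclidean norm of the misalignment between true and estimated responses.
misalignment = sum((a - b) ** 2
                   for a, b in zip(h1 + h2, h1_hat + h2_hat)) ** 0.5

residual_same_talker = max_residual(g1, g2)                 # echo cancelled
residual_after_move = max_residual([0.2, 0.9, -0.1], g2)    # g1 has changed
```

In this example the echo is cancelled to within rounding error even though the misalignment is far from zero, and the residual echo reappears as soon as the transmission-room response changes while the receiving room stays fixed.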
J. Benesty, A. Gilloire, Y. Grenier, "A frequency domain stereophonic acoustic echo canceller exploiting the coherence between the channels," Acoustic Research Letters Online, Jul. 21, 1999, discloses a frequency domain algorithm for use in a stereophonic echo canceller that exploits the coherence between the channels.
As can be seen from the above discussion, therefore, the challenge is to devise an approach which (as in the case of a single-channel echo canceler) converges independently of variations in the transmission room. Also, note that if x1 and x2 in (Eq. 6) are uncorrelated, then (Eq. 6) implies that $\tilde{h}_1 = \tilde{h}_2 = 0$. For this reason, it is desirable to decorrelate x1 and x2.
FIG. 2 is a schematic diagram of a stereophonic teleconferencing system that includes circuitry for decorrelating x1 and x2 in accordance with the teachings of U.S. Pat. No. 5,828,756, incorporated herein by reference.
The system of FIG. 2 is identical to that of FIG. 1 except for the presence of non-linear signal transformation modules 25, 30 (NL), which have been inserted in the paths between microphones 3, 2 of transmission room 1 and loudspeakers 11, 12 of receiving room 10. By operation of non-linear transformation modules 25, 30, stereophonic source signals x1(t) and x2(t) are transformed to signals x1′(t) and x2′(t), respectively, where "′" indicates a transformed signal which (in this case) advantageously has a reduced correlation with the other transformed signal of the stereophonic system.
As with the system presented in FIG. 1, the filters of AEC 20 may be located anywhere within the system, but are preferably located at receiving room 10. Non-linear transformation modules 25, 30 also may be located anywhere (so long as receiving room 10 and AEC 20 both receive the transformed signals as shown), but are preferably located at transmitting room 1.
Specifically, in accordance with one embodiment of the device disclosed in U.S. Pat. No. 5,828,756, the signals x1(t) and x2(t) are advantageously partially decorrelated by adding to each a small non-linear function of the corresponding signal itself. It is well-known to those skilled in the art that the coherence magnitude between two processes is equal to one (1) if and only if they are linearly dependent. Therefore, by adding a "noise" component to each signal, the coherence is reduced. However, by combining the signal with an additive component which is similar to the original signal, the audible degradation may be advantageously minimized, as compared to the effect of adding, for example, a random noise component. This is particularly true for signals such as speech, where the harmonic structure of the signal tends to mask the distortion.
FIG. 3 presents a schematic diagram of an illustrative non-linear transformation module which may be used to implement non-linear transformation modules 25, 30 of the system of FIG. 2 in accordance with the teachings of U.S. Pat. No. 5,828,756. In the schematic shown in FIG. 3, non-linear function module 32 is applied to the original signal, x(t), and the result is multiplied by a (small) factor, α, with use of multiplier 34. The result is combined with the original signal, x(t), to produce the transformed signal, x′(t), as shown. In other words, for i=1, 2,
$x_i'(t) = x_i(t) + \alpha f_i[x_i(t)],$    (Eq. 10)
where functions f1 and f2 are advantageously non-linear. Thus, a linear relation between x1′(t) and x2′(t) is avoided, thereby ensuring that the coherence magnitude will be smaller than one. As will be obvious to those skilled in the art, such a transformation reduces the coherence and hence the condition number of the covariance matrix, thereby improving the misalignment. Of course, the use of this transformation is particularly advantageous when its influence is inaudible and does not have any deleterious effect on stereo perception. For this reason, it is preferable that the multiplier, α, be relatively small.
In one illustrative embodiment of U.S. Pat. No. 5,828,756, the non-linear functions f1 and f2 as applied by non-linear function module 32 are each half-wave rectifier functions, defined as:
$f(x) = \frac{x + |x|}{2} = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$    (Eq. 11)
In this case, the multiplier α of equation (Eq. 10) may advantageously be set to a value less than 0.5, and, preferably, to a value in the range 0.1 to 0.3.
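The transformation of (Eq. 10) with the half-wave rectifier of (Eq. 11) can be sketched in a few lines. The function names, the default value of 0.2 for the multiplier, and the test signals below are assumptions made for illustration. The zero-lag correlation coefficient is used only as a simple stand-in for the coherence magnitude, and the two channels are made exactly linearly dependent with a negative scale factor, a contrived case chosen so that the effect is visible without any filtering.

```python
import random

def half_wave(x):
    """Half-wave rectifier of (Eq. 11): x if x > 0, otherwise 0."""
    return (x + abs(x)) / 2.0

def transform(signal, alpha=0.2):
    """Apply (Eq. 10): add alpha times the rectified sample to each sample."""
    return [x + alpha * half_wave(x) for x in signal]

def correlation(a, b):
    """Zero-lag correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(2)
x1 = [random.gauss(0.0, 1.0) for _ in range(5000)]
# Contrived linearly dependent second channel (negative scale factor).
x2 = [-0.5 * v for v in x1]

corr_before = correlation(x1, x2)
corr_after = correlation(transform(x1), transform(x2))
```

Before the transformation the correlation magnitude is one; afterwards it falls slightly below one, which is consistent with the small value of the multiplier. Note that a purely positive scale factor between the channels would survive this particular memoryless function applied identically to both channels, since rectifying a positively scaled signal simply scales the rectified signal; real microphone signals differ by more than a scale factor.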
The above-described solution proposed in U.S. Pat. No. 5,828,756 is a simple and efficient solution that overcomes the above-discussed problems by adding a small non-linearity into each channel. The distortion due to the non-linearity is hardly perceptible and does not affect the stereo effect, yet reduces interchannel coherence, thereby allowing reduction of misalignment to a low level. However, because the introduced distortion is so small (so as not to significantly affect sound quality), the echo cancellation algorithm must be very powerful in order to converge to a solution within a reasonably small period of time when conditions in the room change. A least mean squares (LMS) solution does not converge fast enough. A much more powerful algorithm is necessary in order to make the system illustrated in FIG. 2 work. Particularly, the solution of FIG. 2 has been fruitful only when combined with a two-channel fast recursive least-squares (FRLS) algorithm. However, an FRLS algorithm requires a high level of computational complexity and, therefore, a powerful processor to implement. Further, it is unstable and may diverge under certain conditions. Accordingly, the real-time implementation of this algorithm, which is necessary for use in a real-world teleconferencing system, is difficult to achieve.
The invention is a multiple channel teleconferencing system employing a stereophonic acoustic echo canceller that exploits the coherence between multiple channels. In accordance with the invention, a small non-linearity is incorporated into each channel path between the microphone and the speaker, and an efficient frequency domain adaptive algorithm is implemented in the echo canceler circuit. The frequency domain algorithm converges to a solution much more quickly than, for instance, a time domain FRLS solution.