Acoustic echo is a common phenomenon occurring in two-way voice communication when open speakers are used. For example, FIG. 1 illustrates one end 100 of a typical two-way communication system. The other end is exactly the same. In such a system, the far-end voice is played through a loudspeaker 160, captured by the microphone 110 in the system, and sent back to the far end. The far-end user then hears his or her own voice with a certain delay.
There are a number of known approaches to reducing acoustic echo in two-way communication systems. However, these known approaches face particular problems when applied to voice communication systems using personal computers, such as internet telephony and voice chat applications on personal computers.
1. Acoustic Echo Cancellation
Acoustic Echo Cancellation (AEC) is a digital signal processing technology which is used to remove the acoustic echo from a speaker phone in two-way (full duplex) or multi-way communication systems, such as traditional telephone or modern internet audio conversation applications.
With reference again to the example near end 100 of a typical two-way communication system illustrated in FIG. 1, Acoustic Echo Cancellation is used to remove the echo of the far end user's voice. The example near end 100 includes a capture stream path and a render stream path for the audio data in the two directions. The far end of the two-way communication system is exactly the same. In the capture stream path in the figure, an analog to digital (A/D) converter 120 continuously converts the analog sound mic(t) captured by microphone 110 to digital audio samples at a sampling rate fs_mic. The digital audio samples are saved in capture buffer 130 sample by sample. The samples are retrieved from the capture buffer in frame increments (herein denoted as “mic[n]”). A frame here means a number (N) of digital audio samples. The index ‘n’ is used to indicate relative sampling instants for the frames. Finally, the samples in mic[n] are processed, including encoding via a voice encoder 170, and sent to the other end.
In the render stream path, the system receives the encoded voice signal from the other end, decodes audio samples via voice decoder 180, and places the audio samples into a render buffer 140 in periodic frame increments (labeled “spk[n]” in the figure). Then the digital to analog (D/A) converter 150 reads audio samples from the render buffer sample by sample and continuously converts them to an analog signal at a sampling rate fs_spk. Finally, the analog signal is played by speaker 160.
In systems such as that depicted by FIG. 1, the near end user's voice is captured by the microphone 110 and sent to the other end. At the same time, the far end user's voice is transmitted through the network to the near end, and played through the speaker 160 or headphone. In this way, both users can hear each other and two-way communication is established. A problem occurs, however, if a speaker is used instead of a headphone to play the other end's voice. For example, if the near end user uses a speaker as shown in FIG. 1, his or her microphone captures not only his or her voice but also an echo of the sound played from the speaker (labeled as “echo(t)”). In this case, the mic[n] signal that is sent to the far end user includes an echo of the far end user's voice. As a result, the far end user would hear a delayed echo of his or her voice, which is likely to cause annoyance and provide a poor user experience to that user.
Practically, the echo echo(t) can be represented by speaker signal spk(t) convolved by a linear response g(t) (assuming the room can be approximately modeled as a finite duration linear plant) as per the following equation:
$$\mathrm{echo}(t) = \mathrm{spk}(t) * g(t) = \int_{0}^{T_e} g(\tau) \cdot \mathrm{spk}(t - \tau) \, d\tau$$
where * means convolution, and T_e is the echo length or filter length of the room response.
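As an illustration only, the convolution above can be sketched in discrete time, with the integral replaced by a finite sum over the taps of a sampled room response. The function name simulate_echo and the example values of spk and g are hypothetical and chosen just to make the behavior easy to follow.

```python
def simulate_echo(spk, g):
    """Discrete-time counterpart of echo(t) = spk(t) * g(t).

    spk: list of speaker samples.
    g:   list of room response taps (hypothetical values, finite length).
    Returns echo[n] = sum over k of g[k] * spk[n - k].
    """
    echo = []
    for n in range(len(spk)):
        acc = 0.0
        for k, g_k in enumerate(g):
            if n - k >= 0:  # samples before the start of the stream count as silence
                acc += g_k * spk[n - k]
        echo.append(acc)
    return echo


# A single click on the speaker comes back as delayed, attenuated copies.
spk = [1.0, 0.0, 0.0, 0.0, 0.0]
g = [0.0, 0.5, 0.25]  # assumed response: one-sample delay, then a decaying tail
echo = simulate_echo(spk, g)
```

With a unit impulse as the speaker signal, the echo simply reproduces the assumed room response, which is why g is also called the impulse response of the room.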
In order to remove the echo for the remote user, AEC 210 is added in the system as shown in FIG. 2. When a frame of samples in the mic[n] signal is retrieved from the capture buffer 130, the frame is sent to the AEC 210. At the same time, when a frame of samples in the spk[n] signal is sent to the render buffer 140, the frame is also sent to the AEC 210. The AEC 210 uses the spk[n] signal from the far end to predict the echo in the captured mic[n] signal. Then, the AEC 210 subtracts the predicted echo from the mic[n] signal. This difference or residual is the clear voice signal (voice[n]), which is theoretically echo free and very close to the near end user's voice (voice(t)).
FIG. 3 depicts an implementation of the AEC 210 based on an adaptive filter 310. The AEC 210 takes two inputs, the mic[n] and spk[n] signals. It uses the spk[n] signal to predict the echo in the mic[n] signal. The prediction residual (difference of the mic[n] signal from the prediction based on spk[n]) is the voice[n] signal, which will be output as echo free voice and sent to the far end.
The actual room response (that is represented as g(t) in the above convolution equation) usually varies with time, for example due to changes in the position of the microphone 110 or speaker 160, body movement of the near end user, and even room temperature. The room response therefore cannot be pre-determined, and must be calculated adaptively at run time. The AEC 210 commonly is based on adaptive filters such as Least Mean Square (LMS) adaptive filters 310, which can adaptively model the varying room response.
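The adaptive prediction described above can be illustrated with a textbook sample-by-sample LMS update. This is a minimal sketch, not the implementation of the AEC 210: the function name lms_echo_canceller, the tap count, the step size, and the synthetic room response are all assumptions, and the test signal contains no near end speech and no noise.

```python
import random


def lms_echo_canceller(mic, spk, num_taps, step_size):
    """Textbook LMS: predict the echo in mic from spk and return the residual."""
    weights = [0.0] * num_taps   # current estimate of the room response
    history = [0.0] * num_taps   # most recent speaker samples, newest first
    voice = []
    for mic_sample, spk_sample in zip(mic, spk):
        history = [spk_sample] + history[:-1]
        predicted_echo = sum(w * x for w, x in zip(weights, history))
        residual = mic_sample - predicted_echo  # this is voice[n]
        # Move each weight along the negative gradient of the squared residual.
        weights = [w + step_size * residual * x
                   for w, x in zip(weights, history)]
        voice.append(residual)
    return voice, weights


# Synthetic check with an assumed 4-tap room response and a noise-like far end signal.
random.seed(0)
room = [0.0, 0.6, 0.3, -0.1]
spk = [random.uniform(-1.0, 1.0) for _ in range(5000)]
mic = [sum(room[k] * spk[n - k] for k in range(len(room)) if n - k >= 0)
       for n in range(len(spk))]
voice, weights = lms_echo_canceller(mic, spk, num_taps=4, step_size=0.05)
```

In this idealized setting the weights approach the assumed room response and the residual decays toward zero. The sketch also assumes that mic and spk are already aligned sample for sample.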
The nature of adaptive filtering requires that the microphone signal and the reference or speaker signal be accurately aligned. In basic terms, the AEC has to determine which samples in the speaker signal (spk[n]) are needed to predict the echo at a given sample in the microphone signal (mic[n]). In practical terms, the AEC operates on two streams (the microphone and speaker samples), which generally are sampled by two different sampling clocks and may each be subject to delays. Accordingly, the same indices in the two streams may not necessarily be aligned in physical time. On personal computers, timestamps are typically used to align the microphone and speaker signals, since the timestamp represents the physical time at which a sample is rendered (in the speaker stream) or captured (in the microphone stream). Frames of the speaker spk[n] and microphone mic[n] signals are stored in separate data queues, and the timestamps are used to make adjustments to the speaker (or microphone) data queues in order to align the speaker and microphone signals. A difference in render and capture sampling (clock) rates is called drift, and to compensate for this, periodic single sample adjustments commensurate with the drift rate are made to the speaker data queue. Also, when a glitch occurs (i.e., data loss of one or multiple samples in the speaker or microphone streams), an adjustment of many samples of data may be made at once in the speaker data queue.
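The arithmetic behind timestamp alignment and drift compensation can be sketched as follows. Both helper functions and their parameters are hypothetical, and the sign convention (a negative adjustment removes a sample from the speaker data queue, a positive one inserts a sample) is an assumption made for this illustration.

```python
def alignment_offset(mic_timestamp, spk_timestamp, sample_rate):
    """Offset, in samples, between a microphone frame and a speaker frame.

    Timestamps are in seconds of physical time. A positive result means the
    microphone frame was captured after the speaker frame was rendered.
    """
    return round((mic_timestamp - spk_timestamp) * sample_rate)


def drift_adjustment(fs_mic, fs_spk):
    """Return (interval, step) for periodic single sample drift corrections.

    interval: number of microphone samples between corrections, or None
              when the two clocks run at the same rate.
    step:     -1 to remove one sample from the speaker data queue,
              +1 to insert one, 0 for no correction (assumed convention).
    """
    drift = fs_spk - fs_mic  # surplus speaker samples per second
    if drift == 0:
        return None, 0
    interval = fs_mic / abs(drift)
    step = -1 if drift > 0 else 1
    return interval, step


# Example: capture at a nominal 16000 Hz while the render clock runs at 16002 Hz.
offset = alignment_offset(mic_timestamp=1.250, spk_timestamp=1.240,
                          sample_rate=16000)
interval, step = drift_adjustment(fs_mic=16000, fs_spk=16002)
```

A glitch would be handled with the same offset computation, except that the resulting adjustment to the speaker data queue may span many samples at once.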
However, in practice, these timestamps are noisy and sometimes can be very wrong. One reason for this is that major operating systems, such as the Microsoft Windows XP operating system, support numerous different audio devices. It is quite common that an audio device and its driver cannot provide accurate timestamps. In such cases, the signals are often out of alignment, and the AEC fails to properly cancel echoes.
2. Voice Switching
Voice switching is a method used for half-duplex two-way communication. A typical example of such a communication system has two signal channels: an incoming channel that receives the voice signal coming from the far end, and an outgoing channel that sends the near end voice signal to the far end. In a person-to-person scenario, the far end may be another end user device. Alternatively, in a conference or multi-user scenario, the far end may be a server that hosts the multiple user conference. Based on voice activity being present at the two ends, the channels are selectively turned on or off. In other words, whenever there is voice activity in one channel, the other channel is turned off. By selectively switching off either the incoming or the outgoing channel based on voice activity in this way, the echo path is broken, which effectively removes acoustic echoes. The drawback of voice switching, however, is that it provides only a half-duplex mode of communication, resulting in loss of easy interruptibility in conversations.
Voice switching is commonly used on low-end desktop phones in speaker phone mode. A basic voice switching algorithm simply compares the strength of near-end and far-end voices and turns on the communication channel for the end with the stronger voice. It is relatively simple to compare voice activity on a standalone or dedicated phone device, because the microphone and speaker gains are known. During double talk scenarios (i.e., in which both ends are talking simultaneously), it is easy to estimate echo strength and thus easy to compare which voice is stronger. However, for voice communication applications on personal computers, any microphone or speaker may be connected to the computer, and the gains could be adjusted by the users at any time. This complicates the ability to estimate the echo strength, and therefore to compare the voice strength on the channels to accurately determine which channel should be switched on.
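The basic comparison described above can be sketched as a per-frame energy test. This is an illustrative sketch only: the function names, the use of mean squared amplitude as the measure of voice strength, and the margin parameter (a simple hysteresis that avoids rapid toggling when the two levels are close) are assumptions rather than details of any particular phone.

```python
def frame_energy(frame):
    """Mean squared amplitude of a frame; an empty frame counts as silence."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def select_active_channel(near_frame, far_frame, current="far", margin=2.0):
    """Pick the channel to keep open for this frame.

    Returns "near" (outgoing channel on) or "far" (incoming channel on).
    The channel changes only when one side is stronger than the other by
    the assumed margin; otherwise the current selection is kept.
    """
    near = frame_energy(near_frame)
    far = frame_energy(far_frame)
    if near > margin * far:
        return "near"
    if far > margin * near:
        return "far"
    return current


# Example frames: a loud near end talker against a quiet far end signal.
loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.01, -0.01, 0.01, -0.01]
active = select_active_channel(loud, quiet, current="far")
```

The comparison is only meaningful when the two energies are on a comparable scale. On a personal computer with unknown and user-adjustable microphone and speaker gains, that assumption does not hold, which is the difficulty noted above.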