Acoustic echo is a common problem with full duplex audio systems, such as audio conferencing systems and/or speech recognition systems. Acoustic echo originates in a local audio loopback that occurs when an input transducer, such as a microphone, picks up audio signals from an audio output transducer, for example, a speaker, and sends them back to an originating participant. The originating participant will then hear the echo of the participant's own voice as the participant speaks. Depending on the delay, the echo may continue to be heard for some time after the originating participant has stopped speaking.
Consider the scenario where a first participant at a first physical location with a microphone and speaker and a second participant at a second physical location with a microphone and speaker are taking part in a call or conference. When the first participant speaks into the microphone at the first physical location, the second participant hears the first participant's voice played on speaker(s) at the second physical location. However, the microphone at the second physical location then picks up and transmits the first participant's voice back to the first participant's speakers. The first participant will then hear an echo of the first participant's own voice with a delay due to the round-trip transmission time. The delay before the first participant starts hearing the echo of the first participant's own voice, as well as how long the first participant continues to hear the first participant's own echo after the first participant has finished speaking depends on the time it takes to transmit the first participant's voice to the second participant, how much reverberation occurs in the second participant's room, and how long it takes to send the first participant's voice back to the first participant's speakers. This delay may be several seconds when the Internet is used for international voice conferencing.
Acoustic echo can be caused or exacerbated when sensitive microphone(s) are used, as well as when the microphone and/or speaker gain (volume) is turned up to a high level, and also when the microphone and speaker(s) are positioned so that the microphone is close to one or more of the speakers. In addition to being annoying, acoustic echo can prevent normal conversation among participants in a conference. In full duplex systems without acoustic echo cancellation, it is possible for the system to get into a feedback loop which makes so much noise the system is unusable.
Conventionally, acoustic echo is reduced using audio headset(s) that prevent an audio input transducer (e.g., microphone) from picking up the audio output signal. Additionally, special microphones with echo suppression features can be utilized. However, these microphones are typically expensive as they may contain digital signal processing electronics that scan the incoming audio signal and detect and cancel acoustic echo. Some microphones are designed to be very directional, which can also help reduce acoustic echo.
Acoustic echo can also be reduced through the use of a digital acoustic echo cancellation (AEC) component. This AEC component can remove the echo from a signal while minimizing audible distortion of that signal. The AEC component must have access to digital samples of both the audio input and output signals. It processes the input and output samples in the digital domain in such a way as to reduce the echo in the input (capture) samples to a level that is normally inaudible.
An analog waveform is converted to digital samples through a process known as analog to digital (A/D) conversion. Devices that perform this conversion are known as analog to digital converters, or A/D converters. Digital samples are converted to an analog waveform through a process known as digital to analog (D/A) conversion. Devices that perform this conversion are known as digital to analog converters, or D/A converters. Most A/D and D/A conversions are performed at a constant sampling rate. Inexpensive silicon chips that do both A/D and D/A conversion on the same chip are widely available. Usually these chips are designed to be connected to a crystal which is used to generate a stable, fixed-frequency clock signal. This clock signal is used to drive the A/D and/or D/A conversion process. Normally this clock runs at a very high frequency and is divided down to a much lower rate, which is the sampling rate driving the conversion process. The rate at which digital samples are produced by an A/D converter is determined by the frequency of the clock driving the A/D converter and by the divider used to reduce that frequency to the desired sampling rate. Likewise, the rate at which digital samples are consumed by a D/A converter is determined by the frequency of the clock driving the D/A converter and by the divider used to reduce that frequency to the desired sampling rate. As long as the A/D and D/A converters are driven by a single clock and that clock is divided down by the same divider for both, they will sample at the same frequency and the relationship between the input and output samples will not change over time. In any period of time, the A/D will produce exactly the same number of samples as are consumed by the D/A.
Crystals have varying levels of performance. Some of the parameters that can be specified for a crystal are frequency, stability, accuracy (in parts per million, or ppm), as well as limits on the variation in the above parameters due to temperature changes. In general, no two crystals are exactly the same. They will oscillate at slightly different frequencies, and their other characteristics will differ as well. This means that if the A/D and D/A converters are driven by clock signals derived from different crystals, there will be a slight difference in the rate at which those converters will run, even when the crystals run at the same nominal frequency, and the dividers for the A/D and D/A match. In this case, the number of samples produced over time by the A/D will not match the number of samples consumed in the same period of time by the D/A. The longer this period of time during which the number of samples generated by the A/D is compared to the number of samples consumed by the D/A, the greater the difference in the number of samples processed by the A/D and D/A.
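The growth of the sample-count mismatch can be illustrated with a minimal sketch. The function name `sample_count_drift` and the values used (a 16 kHz nominal rate and a 50 ppm offset between the two clocks) are illustrative assumptions, not figures for any particular device.

```python
def sample_count_drift(nominal_rate_hz: float, offset_ppm: float, seconds: float) -> float:
    """Approximate number of extra samples the faster converter processes
    over `seconds`, given the relative clock offset in parts per million."""
    return nominal_rate_hz * offset_ppm * 1e-6 * seconds


# Illustrative values only: 16 kHz nominal rate, clocks 50 ppm apart.
for seconds in (1, 60, 3600):
    drift = sample_count_drift(16_000, 50, seconds)
    print(f"after {seconds:>4} s the sample counts differ by about {drift:.1f}")
```

Under these assumptions the mismatch grows linearly with time, so even a small ppm offset accumulates to thousands of samples within an hour.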
This clock drift can also occur when the A/D and D/A are driven by the same clock, but are running at different sample rates. If those differing rates are generated by dividers that only approximate the nominal sample rates, and the resulting slightly inexact rates are then converted to a common rate by sample rate converters in an AEC system that assume the nominal rates, there will be a drift between the capture and render sample rates even though the A/D and D/A are driven by the same clock. For example, many modern inexpensive codecs used on computer sound cards are driven by a clock signal of 14.318184 MHz. This is a clock frequency that has been supported in personal computers for over 20 years. Crystals for this frequency are therefore very inexpensive. However, the standard sampling rates of 44100 Hz and 48000 Hz do not divide evenly into 14.318184 MHz. This means that this type of codec cannot sample at those rates with high accuracy. An example calculation of the actual rates produced by such codecs follows below. Unfortunately, the resulting rates are much less accurate than the crystals themselves, which are normally accurate to within 100 ppm.
Acoustic echo cancellation components work by subtracting a filtered version of the audio samples sent to the output device from the audio samples received from the input device. This processing assumes that the output and input sampling rates are exactly the same. Because there is a wide variety of input and output devices available for PCs, it is important that AEC work even when the input and output devices are not the same device. For example, many USB cameras have a built-in microphone that can be used for capturing audio. It is important that AEC be able to use this capture signal even when the playback device is one that was shipped with the computer and is generally not a USB device. Unless the AEC component can function properly in these types of scenarios, effective acoustic echo cancellation will be difficult or impossible, resulting in a frustrating experience for end user(s).
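The subtraction described above can be sketched with a normalized least-mean-squares (NLMS) adaptive filter. NLMS is a commonly used technique chosen here for illustration; the text does not say which adaptive algorithm a given AEC component uses. The function name `nlms_echo_cancel`, the tap count, the step size, and the simulated echo path are all assumptions. The sketch also assumes that render and capture samples arrive at exactly the same rate, which is the very assumption that clock drift violates.

```python
import random


def nlms_echo_cancel(render, capture, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered copy of `render` from `capture`.

    Returns the residual (echo-reduced) capture samples.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent render samples, newest first
    residual = []
    for x, d in zip(render, capture):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                       # capture minus estimated echo
        norm = sum(h * h for h in history) + eps    # input power, for normalization
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def energy(samples):
    return sum(v * v for v in samples)


# Simulated test signal: white noise played on the render device.
random.seed(0)
render = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Hypothetical echo path: delayed, attenuated copies of the render signal.
echo_path = [0.0, 0.6, 0.0, -0.3]
capture = [
    sum(echo_path[k] * render[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(render))
]

residual = nlms_echo_cancel(render, capture)
print("echo energy, last 500 samples:    ", energy(capture[-500:]))
print("residual energy, last 500 samples:", energy(residual[-500:]))
```

In this idealized setup, with no near-end speech, no noise, and perfectly matched sample rates, the residual energy falls far below the echo energy once the filter has converged.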
A full duplex audio system has a render device and a capture device. The render device has a digital to analog converter (D/A) that converts digital samples to an analog voltage level at a rate set by a render clock. The capture device has an analog to digital converter (A/D) that converts an analog voltage level to digital samples at a rate set by a capture clock.
When the D/A and the A/D are driven by the same clock signal, and are sampling at the same sample rate, there is no need to compensate for differences in the sample rates, because they are exactly identical. However, when the D/A is driven by a first clock signal and the A/D is driven by a second clock signal, the first clock signal and the second clock signal will not be running at exactly the same rate. The rates may differ by only 1 part per million (1 ppm) or even by only 1 part per billion (1 ppb), but over time this means that the number of samples consumed by the D/A will differ from the number of samples produced by the A/D. Most AEC algorithms are not designed to operate properly for long periods of time when the D/A and A/D sample rates are not exactly the same. In addition, most clock signals derived from separate crystals differ by much more than 1 ppm. This means that it takes only a few minutes before the number of samples consumed by the D/A differs significantly from the number of samples produced by the A/D. For example, assume that an A/D and D/A are both running at a nominal sample rate of 16 kHz, but that their clocks differ by 80 ppm. This means that for every 1600000 samples produced by the A/D, the D/A consumes 1600128 samples if it is running faster than the A/D. At 16 kHz, 1600000 samples correspond to 100 seconds, so every 100 seconds the difference in the number of samples increases by another 128. In another example, assume the A/D and D/A are driven by the same clock, but are running at different sample rates, and that the clock frequency is not exactly divisible by the sample rates. If the common crystal frequency of 14.318184 MHz and the common sample rates of 44100 Hz and 48000 Hz are chosen, then the dividers with the least amount of error for those two rates are 325 and 298. This means actual sample rates of about 44055.95 Hz and about 48047.60 Hz are obtained.
If these rates are both converted to a nominal 16000 Hz rate on the assumption that they really are 44100 Hz and 48000 Hz, rates of about 15984.02 Hz and 16015.87 Hz are obtained. These sample rates differ by about 1992 ppm, which is a difference of about 31.85 samples every second.
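The divider example can be reproduced with a short calculation. This is a minimal sketch: the helper name `best_divider` and the choice to express the ppm difference relative to the slower of the two rates are assumptions made for illustration.

```python
CLOCK_HZ = 14_318_184    # crystal frequency used in the example above
COMMON_RATE_HZ = 16_000  # nominal rate both streams are converted to


def best_divider(clock_hz: int, target_rate_hz: int) -> int:
    """Integer divider whose resulting sample rate is closest to the target."""
    low = clock_hz // target_rate_hz
    return min((low, low + 1), key=lambda d: abs(clock_hz / d - target_rate_hz))


converted = {}
for nominal in (44_100, 48_000):
    divider = best_divider(CLOCK_HZ, nominal)
    actual = CLOCK_HZ / divider
    # A converter that assumes the nominal rate scales by COMMON_RATE_HZ / nominal.
    converted[nominal] = actual * COMMON_RATE_HZ / nominal
    print(f"nominal {nominal} Hz: divider {divider}, actual {actual:.2f} Hz, "
          f"converted {converted[nominal]:.2f} Hz")

difference_hz = converted[48_000] - converted[44_100]
difference_ppm = difference_hz / converted[44_100] * 1e6
print(f"difference: {difference_hz:.2f} samples per second, about {difference_ppm:.0f} ppm")
```

Measured against the 16000 Hz nominal rate instead of the slower converted rate, the same difference works out to roughly 1990 ppm, so the figure depends slightly on the reference chosen.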
In both of the above cases, if the sample rate differences are not properly compensated for, the AEC algorithm will be unable to properly cancel the echo over extended periods of time. The larger the difference between the actual sample rates of the A/D and D/A the quicker the AEC algorithm will fail to cancel the echo. With a good clock drift compensation algorithm, the AEC algorithm can properly cancel the echo indefinitely.