1. Field of the Invention
The present invention relates to clock drift compensation, and in particular to clock drift compensation methods, apparatuses and computer program products for echo cancellers implemented on computer based systems.
2. Discussion of the Background
In a conventional conferencing system or any hands free (e.g. not a handheld telephone held to a person's ear) system, one or more microphones capture a sound wave at a site A, and transform the sound wave into a first audio signal (i.e., an electric signal that conveys the audio content). The first audio signal is transmitted to a site B, where a television set, or an amplifier and loudspeaker, reproduces the original sound wave by converting the first audio signal generated at site A back into a sound wave, which is audible to the human ear.
FIG. 1 illustrates a typical echo problem in a hands free communication system. A digital audio signal 1101 from a far end (site A) is converted from the digital domain into the analog electronic domain by the digital to analog converter (DAC) 1301, amplified in the loudspeaker amplifier 1302 and then converted to acoustic signals by the loudspeaker 1303. Both the direct acoustic signal 1304 and reflected versions 1306, reflected by walls, ceilings, etc. 1305, are undesirably picked up by the microphone 1308. The microphone also picks up the desired near end signal 1307 (e.g. a person's voice in the same room as microphone 1308). The microphone signal is amplified in the microphone amplifier 1309 and digitized by the analog to digital converter 1310, outputting an uncancelled microphone signal 1202.
If the uncancelled microphone signal 1202 were transmitted to the far end, the participants at the far end site would hear one or more echoes of themselves, and if a similar system were present at the far end, even howling/feedback might occur.
To deal with this problem, it has been proposed to add an acoustic echo canceller 1203 to the digital microphone signal path. This canceller 1203 uses the digital loudspeaker signal 1101 as a signal reference 1201, estimates all of the loudspeaker to microphone paths 1304/1306, and subtracts these estimates from the uncancelled microphone signal 1202, producing the cancelled microphone signal 1204, which is transmitted to the far end as signal 1102.
Two main approaches are widely used for acoustic echo cancellers today: a full band canceller and a sub band canceller. Both of these approaches normally use adaptive FIR (finite impulse response) filters for echo path estimation, applied in the full band domain and the sub band domain, respectively.
An acoustic echo canceller used in a product will typically include several further sub blocks not shown in the figures in this document, such as a double talk algorithm, a non-linear processing unit, comfort noise generation, etc. For simplicity, these sub blocks are omitted. These blocks may vary and are well documented in papers, patents and literature. For a person skilled in the acoustic processing art, integrating these blocks into a signal processing stream is straightforward.
FIG. 2 illustrates features of a basic full band acoustic echo canceller.
The digital signal from far end 2101 is passed to the loudspeaker as signal 2102 and is also used as the loudspeaker reference signal 2103.
The loudspeaker reference signal 2103 is filtered through the adaptive FIR filter 2104. This adaptive filter converges to and tracks the impulse response of the room in which the microphone is located. For the initial convergence, and for any acoustic changes in the room (door opens, people move, etc.), the adaptive FIR filter 2104 has to adapt. Many different adaptive algorithms can be used for this purpose, from the inexpensive (low processing power) least mean square (LMS) algorithm to more sophisticated and more resource demanding algorithms such as the affine projection algorithm (APA) and recursive least squares (RLS). What all these algorithms have in common is that they use the FIR filter update loop 2108 for adapting.
The adaptive FIR filter outputs an inverted echo estimate 2105, which is added to the uncancelled microphone signal 2106, yielding the echo cancelled microphone signal 2107.
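The full band structure described above can be illustrated with a minimal sketch. It assumes normalised LMS (a common LMS variant, chosen here only for illustration), a made-up four-tap room response, and far-end signal only (no near end speech or noise). The names `simulate_echo` and `nlms_echo_cancel`, the tap count and the step size are hypothetical and do not come from this document.

```python
import random


def simulate_echo(reference, room_response):
    """Convolve the loudspeaker reference with a (hypothetical) room response."""
    mic = []
    for n in range(len(reference)):
        acc = 0.0
        for k, h in enumerate(room_response):
            if n - k >= 0:
                acc += h * reference[n - k]
        mic.append(acc)
    return mic


def nlms_echo_cancel(reference, microphone, taps=8, mu=0.5, eps=1e-8):
    """Full band canceller sketch: adaptive FIR filter updated with NLMS."""
    w = [0.0] * taps        # FIR coefficients, i.e. the echo path estimate
    history = [0.0] * taps  # latest reference samples, newest first
    cancelled = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate  # equivalent to adding the inverted estimate
        power = sum(h * h for h in history) + eps
        step = mu * e / power  # normalised step used by the update loop
        w = [wi + step * hi for wi, hi in zip(w, history)]
        cancelled.append(e)
    return cancelled, w


random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
room = [0.0, 0.6, 0.3, -0.1]  # made-up echo path for the demonstration
microphone = simulate_echo(reference, room)
cancelled, weights = nlms_echo_cancel(reference, microphone)
echo_energy = sum(v * v for v in microphone[-200:])
residual_energy = sum(v * v for v in cancelled[-200:])
print("residual/echo energy ratio:", residual_energy / echo_energy)
```

In this idealised setting the filter coefficients approach the simulated room response. The sketch relies on the capture and playout streams advancing at exactly the same sample rate.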
The other approach uses subband processing. FIG. 3 illustrates this approach.
The digital signal from the far end 3101 is passed to the loudspeaker as signal 3102. It is also divided into a chosen number of subbands using the analyze filter 3301.
The uncancelled microphone signal 3106 is divided into subbands using another (but equal) analyze filter 3302.
For each subband, the loudspeaker analyze filter 3301 outputs a subband reference signal 3203, which is filtered through a subband FIR filter 3204, yielding an inverted subband echo estimate 3205. The microphone analyze filter 3302 outputs a subband uncancelled signal 3206, which is added to the inverted echo estimate, outputting a subband echo cancelled microphone signal 3207. The echo cancelled microphone signal is used to adapt the FIR filter, shown as the subband FIR filter update loop 3208.
The echo cancelled microphone signals from all subbands are also merged together to a fullband cancelled microphone signal 3107 by the synthesize filter 3303.
Both the fullband and subband echo cancellers estimate the response from output digital samples (2102/3102) to input digital samples (2106/3106). This response is affected by any software or hardware that the signal passes through, including but not limited to sampling rate converters, mixers, the D/A converter, the loudspeaker, the acoustic coupling, the microphone and the A/D converter. It is inherent in the design that the rate of samples in the input signal 2106/3106 equals the rate of samples in the output signal 2102/3102. For best performance, the response (including the delay) of controllable parts should be kept constant.
In well designed systems, the equal sampling rate is ensured by using the same clock source for the D/A and A/D converters, whereas constant (or at least predictable) delay is maintained by proper hardware and software design.
However, in some designs, as recognized by the present inventors, different clock sources are used for the A/D and the D/A converter. This is, for example, the case in personal computers (PCs), where the A/D converter and D/A converter can be placed on different cards, with conversion clocks generated locally on each card respectively. A typical, and widely used situation, is the case where audio is captured (A/D converted) by a Web-camera, while the audio is played out (D/A converted) by the PC's audio card.
Any difference in rates between the A/D converter and the D/A converter may cause several problems:
1. Frequency shift: There may be a frequency shift from the 2102/3102 signal to the 2106/3106 signal. The linear echo canceller is not designed for such a shift, and thus the maximum obtainable instantaneous performance suffers.
2. Time drift: The time between the same samples in the speaker signal 2102/3102 and the microphone signal 2106/3106 may change slowly, requiring the echo canceller to constantly readapt. The echo canceller can only readapt when the speaker signal 2102/3102 has adequately high power. Therefore, although the time delay changes slowly, the effective time shift in the response after a period of silence (low 2102/3102 power) can be large enough to result in considerable residual echo.
3. Overproduction/underproduction of samples: Since the production of samples by the A/D converter differs from the consumption by the D/A converter, there may be a congestion or a lack of samples at one or more places in the system.
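The first two effects can be put into numbers with a small sketch. It assumes that the application interprets both streams at the same nominal rate, and it uses an illustrative 48 kHz nominal rate with an A/D converter running 100 ppm fast. These figures and the function names are assumptions, not values from this document.

```python
def apparent_frequency_hz(tone_hz, dac_rate_hz, adc_rate_hz):
    """Frequency of a played tone as it appears in the captured samples.

    Both rates are the actual converter rates. The application is assumed
    to interpret playout and capture samples at one common nominal rate.
    """
    return tone_hz * dac_rate_hz / adc_rate_hz


def delay_drift_ms(dac_rate_hz, adc_rate_hz, nominal_rate_hz, seconds):
    """Change in loudspeaker-to-microphone delay accumulated over `seconds`."""
    surplus_samples = (adc_rate_hz - dac_rate_hz) * seconds
    return 1000.0 * surplus_samples / nominal_rate_hz


nominal = 48000.0
dac_rate = nominal                   # D/A assumed to run exactly at nominal
adc_rate = nominal * (1 + 100e-6)    # A/D assumed to run 100 ppm fast

shift_hz = apparent_frequency_hz(1000.0, dac_rate, adc_rate) - 1000.0
drift_ms = delay_drift_ms(dac_rate, adc_rate, nominal, 60.0)
print("frequency shift of a 1 kHz tone [Hz]:", shift_hz)
print("delay drift after 60 s of silence [ms]:", drift_ms)
```

Under these assumptions a one minute pause in the far-end signal moves the echo path by several milliseconds, which an adaptive filter that could not track during the pause has to relearn.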
Two types of drift may be present between the A/D conversion rate and D/A conversion rate. Both may be present at one time.
Drift occurs due to the clock source (crystal, oscillator, etc.) deviation from its nominal value. Crystals have varying levels of performance. Some of the parameters that can be specified for a crystal are frequency, stability, accuracy (in parts per million, or ppm), as well as limits on the variation in the above parameters due to temperature changes. In general, no two crystals are exactly the same. They will oscillate at slightly different frequencies, and their other characteristics will differ as well. This means that if the A/D and D/A converters are driven by clock signals derived from different crystals, there will be a slight difference in the rate at which those converters will run, even when the crystals run at the same nominal frequency, and the dividers for the A/D and D/A match. In this case, the number of samples produced over time by the A/D will not match the number of samples consumed in the same period of time by the D/A. The longer this period of time during which the number of samples generated by the A/D is compared to the number of samples consumed by the D/A, the greater the difference in the number of samples processed by the A/D and D/A.
Drift can also occur due to incompatible sample rates. When a capture/playout device does not support the sample rate of the audio stream, a software sample rate converter is inserted by the operating system. However, this sample rate converter may have a limited resolution, and thus the effective sampling frequency will deviate from its nominal value. The difference is constant over time, but can be considerable. A typical value often experienced is 0.625%, i.e. 6250 ppm.
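To give a sense of scale for both kinds of drift, the sketch below converts a relative rate error into surplus samples per second and into the time until a buffer with a given headroom is exhausted. The 48 kHz rate, the 50 ppm crystal offset and the 4800-sample headroom are illustrative assumptions; only the 0.625% figure comes from the text above.

```python
def percent_to_ppm(percent):
    """Convert a relative error in percent to parts per million."""
    return percent * 10_000.0


def surplus_samples_per_second(nominal_rate_hz, offset_ppm):
    """Samples produced in excess of (or short of) consumption each second."""
    return nominal_rate_hz * offset_ppm * 1e-6


def seconds_until_fifo_limit(headroom_samples, nominal_rate_hz, offset_ppm):
    """Time until `headroom_samples` of slack is used up by the drift."""
    rate = surplus_samples_per_second(nominal_rate_hz, abs(offset_ppm))
    if rate == 0.0:
        return float("inf")  # no drift, the headroom is never exhausted
    return headroom_samples / rate


src_ppm = percent_to_ppm(0.625)  # the sample rate converter case from the text
crystal_ppm = 50.0               # assumed crystal mismatch, for comparison
for label, ppm in (("SRC", src_ppm), ("crystal", crystal_ppm)):
    print(label, ppm, "ppm:",
          surplus_samples_per_second(48000.0, ppm), "samples/s,",
          seconds_until_fifo_limit(4800, 48000.0, ppm), "s to limit")
```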
FIG. 4 illustrates a typical setup of a playout and capture system (4100 and 4200 respectively) in a PC. It should be noted that the exact setup varies with the chosen application programming interface (API) and with the audio playout/capture device driver. Often, to make systems work, double buffering or similar techniques are used. This is not shown in FIG. 4. The figure is made only to explain the main properties of the playout and capture system, as seen from the software application.
In the playout system 4100, the DAC (digital to analog converter) 4101 is clocked by the DACCLK 4102, i.e. the DAC 4101 processes samples at a rate dictated by DACCLK 4102. The DACCLK 4102 is usually derived from the much higher frequency of a crystal oscillator. The DAC 4101 processes one sample at a time from the DAC FIFO 4103. The DAC FIFO can be implemented either in hardware or in software. When the DAC FIFO 4103 is empty, it retrieves Nplayout samples from the playout SRC (sample rate converter) 4104, which in turn takes a number of samples from the playout ring buffer 4112, which is part of the playout FIFO 4110. Nplayout can be as low as one, but larger numbers (groups of samples) are also common (e.g., Nplayout=128 has been observed). Each situation varies depending on the make, model, software and components used in the respective PCs. The playout read pointer 4113 is updated with the same number of samples taken from the ring buffer 4112 of the playout FIFO 4110. It is the software application's task to ensure that it fills the correct number of samples in the playout FIFO 4110, from the playout write pointer 4111. One exemplary software application is a software based echo cancellation application employed at an endpoint terminal used with a videoconference system. This software application may be preinstalled on the PC, distributed on a physical medium, or downloaded over a network from a server.
Similarly, in the capture portion of the system, the ADC 4201 is clocked by the ADCCLK 4202, i.e. it produces samples at the rate of ADCCLK 4202. ADCCLK 4202 is usually derived from a much higher frequency provided by a crystal oscillator, but as stated before, not necessarily the same oscillator as the one from which the DACCLK is derived. The ADC delivers one sample at a time to the ADC FIFO 4203. The ADC FIFO can be implemented in hardware, in software, or as a hybrid. When the ADC FIFO 4203 is full, it delivers Ncapture samples to the capture SRC (sample rate converter) 4204, which in turn delivers the calculated number of samples to the capture ring buffer 4212 (directly or indirectly), which is part of the capture FIFO 4210. Ncapture can be as low as one, but higher numbers are also common. The capture write pointer 4213 is updated with the same number of samples delivered to the capture ring buffer 4212 of the capture FIFO 4210. It is the task of the software application 4400 to ensure that it processes the correct number of samples from the capture FIFO 4210, read from the capture read pointer 4211.
The software application 4400 transmits and receives samples to/from the playout and capture FIFOs, respectively.
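The ring buffer with separate read and write pointers used by both FIFOs can be sketched as follows. This is a simplified model seen from the application side: the class name `RingFifo` and its methods are hypothetical, and real drivers typically add double buffering and locking, which are left out here.

```python
class RingFifo:
    """Fixed-size ring buffer with independent read and write pointers."""

    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.read_ptr = 0   # next position to be read
        self.write_ptr = 0  # next position to be written
        self.count = 0      # samples currently stored

    def write(self, samples):
        """Store as many samples as fit; return how many were accepted."""
        n = min(len(samples), self.capacity - self.count)
        for s in samples[:n]:
            self.buf[self.write_ptr] = s
            self.write_ptr = (self.write_ptr + 1) % self.capacity
        self.count += n
        return n  # fewer than len(samples) signals an overrun

    def read(self, n):
        """Remove and return up to n samples, oldest first."""
        n = min(n, self.count)
        out = []
        for _ in range(n):
            out.append(self.buf[self.read_ptr])
            self.read_ptr = (self.read_ptr + 1) % self.capacity
        self.count -= n
        return out  # fewer than n samples signals an underrun


fifo = RingFifo(4)
accepted_first = fifo.write([1.0, 2.0, 3.0])
taken = fifo.read(2)
accepted_second = fifo.write([4.0, 5.0, 6.0, 7.0])  # only three slots free
remaining = fifo.read(10)
print(accepted_first, taken, accepted_second, remaining)
```

In the playout direction the application would call `write` and the converter side `read`; in the capture direction the roles are swapped. When the two sides run from different clocks, `count` slowly grows or shrinks until one of the calls returns fewer samples than requested.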
For applications reading/writing audio data to a file or other unclocked sources, it is usually simple to produce/consume the correct number of samples to/from the playout/capture FIFO. Even for simplex applications getting/delivering audio data from/to another clocked source/sink, correcting the number of samples delivered/processed to/from the playout/capture FIFO is usually rather straightforward, by either inserting or removing one or more samples. Such insertions or removals can be performed without audible degradations, and techniques for this are well known. As these techniques either insert or remove samples, there will be time delay changes, but in most applications this is acceptable.
However, as recognized by the present inventors, for applications where an exact relationship between the samples delivered and the samples processed by the software application is critical, another solution must be found. This is the case for echo canceling. It should be pointed out that there are also other applications with this demand, for instance measuring applications, etc.
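As a rough illustration of changing the sample count of a block by one, the sketch below stretches or shrinks a block by linear interpolation. This is a naive stand-in chosen for brevity, not one of the well known techniques referred to above, and no claim is made about its audio quality. The function name and the 480-sample block size are assumptions.

```python
import math


def stretch_block(block, target_len):
    """Resample `block` to `target_len` samples by linear interpolation."""
    if len(block) < 2 or target_len < 2:
        raise ValueError("need at least two input and two output samples")
    scale = (len(block) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale                      # fractional position in the input
        k = min(int(pos), len(block) - 2)    # left neighbour index
        frac = pos - k
        out.append(block[k] * (1.0 - frac) + block[k + 1] * frac)
    return out


# A 10 ms block of a 440 Hz tone at an assumed 48 kHz rate.
block = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(480)]
longer = stretch_block(block, 481)   # one sample inserted
shorter = stretch_block(block, 479)  # one sample removed
print(len(block), len(longer), len(shorter))
```

Each such correction shifts the timing of all following samples by one sample period, which is the delay change mentioned above.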