Referring to FIG. 1, at present, a person's voice may be captured by using a microphone 102 and, after being processed by a computer sound card 104, played by using a loudspeaker 106. The played voice is then transmitted and reflected in the environment and is captured again by the microphone. In this way, an echo is formed, which is also referred to as an aftersound. The echo degrades voice quality and makes it difficult for a listener to understand speech from a loudspeaker. Because this impairs accurate expression of the speech, measures need to be taken to cancel the echo.
At present, echo cancellation may be implemented by using an acoustic echo cancellation (AEC) algorithm. A basic principle of the AEC algorithm is to subtract the echo from a captured near-end voice. The echo generation model may be complex. In a simplified model, it may be roughly considered that the echo is equal to the played far-end voice. In this way, the echo cancellation process is to subtract the far-end voice from the near-end voice. Herein, the near-end voice refers to voice data captured from the microphone, and the far-end voice refers to voice data played by the loudspeaker.
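The simplified model above can be illustrated with a minimal sketch. The function name and the sample values are hypothetical, and a practical AEC algorithm would estimate the echo rather than assume that it equals the far-end voice exactly.

```python
def cancel_echo_simplified(near_end, far_end):
    """Subtract the far-end samples from the near-end samples, one by one.

    Assumes the simplified model in which the echo equals the played
    far-end voice; both frames must have the same length.
    """
    if len(near_end) != len(far_end):
        raise ValueError("frames must have the same length")
    return [n - f for n, f in zip(near_end, far_end)]


# Hypothetical frame: the microphone captures local speech plus the echo.
far_end = [100, -50, 25, 0]
local_speech = [10, 20, -30, 40]
near_end = [s + f for s, f in zip(local_speech, far_end)]

# Under the simplified model, only the local speech remains.
residual = cancel_echo_simplified(near_end, far_end)
```

Under this idealized assumption, the residual equals the local speech only when the two frames are aligned sample for sample.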
At present, for a process of echo cancellation using the AEC algorithm, refer to FIG. 2. After voice data from the far end is received, the voice data is placed in a receive buffer. Then the voice data is extracted from the receive buffer and is placed in a sound card play buffer for playing, and at the same time, the extracted voice data is also placed in a reference audio buffer. After being captured by the microphone together with input voice to be transmitted to the far end, the voice data played by using the loudspeaker is placed in a sound card capture buffer, and then the voice data in the sound card capture buffer is sent to a near-end audio buffer. Audio frames are extracted from both the reference audio buffer and the near-end audio buffer by using a synchronization control module, and are sent to an echo cancellation module for echo cancellation processing using the AEC algorithm. Finally, echo-cancelled voice data is sent out to the far end.
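The buffer flow described for FIG. 2 may be sketched as follows. The class name, the method names, and the subtraction used in place of the echo cancellation module are assumptions made for illustration only; the sketch ignores the sound card delay and the synchronization control, and simply pairs frames in arrival order.

```python
from collections import deque


def subtract(near, reference):
    """Stand-in for the echo cancellation module (simplified model)."""
    return [n - r for n, r in zip(near, reference)]


class EchoPipeline:
    """Hypothetical model of the buffers described for FIG. 2."""

    def __init__(self, cancel):
        self.receive_buffer = deque()
        self.play_buffer = deque()
        self.reference_buffer = deque()
        self.capture_buffer = deque()
        self.near_end_buffer = deque()
        self.cancel = cancel

    def on_far_end_frame(self, frame):
        # Voice data received from the far end enters the receive buffer.
        self.receive_buffer.append(frame)

    def play_next(self):
        # The extracted frame goes to the play buffer and, at the same
        # time, to the reference audio buffer.
        frame = self.receive_buffer.popleft()
        self.play_buffer.append(frame)
        self.reference_buffer.append(frame)
        return frame

    def on_captured_frame(self, frame):
        # Captured data passes through the capture buffer into the
        # near-end audio buffer.
        self.capture_buffer.append(frame)
        self.near_end_buffer.append(self.capture_buffer.popleft())

    def process_next(self):
        # One frame from each buffer is handed to echo cancellation.
        reference = self.reference_buffer.popleft()
        near = self.near_end_buffer.popleft()
        return self.cancel(near, reference)


pipeline = EchoPipeline(subtract)
pipeline.on_far_end_frame([5, 6, 7])
played = pipeline.play_next()
# Hypothetical capture: played voice plus local speech [1, 1, 1].
pipeline.on_captured_frame([p + s for p, s in zip(played, [1, 1, 1])])
sent_to_far_end = pipeline.process_next()
```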
A function of the synchronization control module is to keep the near-end voice in the near-end audio buffer aligned with the far-end voice in the reference audio buffer, so that the echo cancellation algorithm achieves an optimal effect. The alignment herein does not require the near-end voice to be completely aligned with the far-end voice; instead, it requires that the delay between them be kept at a stable value, to avoid jitter and offset.
It is assumed that a total delay of audio data from entry into the sound card play buffer to being captured and staying in the sound card capture buffer is a sound card delay value sndCardDelayMs, a queue length of the reference audio buffer is refBufLen, a queue length of the near-end audio buffer is nearBufLen, and a length of the audio frame is kFrameSize. If audio data buffered in the reference audio buffer and audio data buffered in the near-end audio buffer are completely synchronous, Formula (1) is true:
(refBufLen − nearBufLen)/kFrameSize = sndCardDelayMs.  Formula (1)
Therefore, an objective of ensuring relative synchronization between the audio data buffered in the reference audio buffer and the audio data buffered in the near-end audio buffer is to make Formula (2) true:
(refBufLen − nearBufLen)/kFrameSize = kRatio*sndCardDelayMs.  Formula (2)
where kRatio in Formula (2) is a coefficient greater than or equal to 1.
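As a worked illustration, Formula (2) can be evaluated directly. The function names, the tolerance parameter, and all numeric values below are hypothetical. The source does not state the units of refBufLen, nearBufLen, or kFrameSize, so the sketch evaluates the expression as written, and the example numbers are chosen only so that the quotient can be read in milliseconds.

```python
def buffer_offset(ref_buf_len, near_buf_len, k_frame_size):
    """Left-hand side of Formula (1) and Formula (2)."""
    return (ref_buf_len - near_buf_len) / k_frame_size


def is_synchronized(ref_buf_len, near_buf_len, k_frame_size,
                    snd_card_delay_ms, k_ratio=1.0, tolerance=1e-9):
    """Check whether Formula (2) holds within a small tolerance.

    With k_ratio equal to 1, this reduces to Formula (1).
    """
    if k_ratio < 1:
        raise ValueError("kRatio must be greater than or equal to 1")
    target = k_ratio * snd_card_delay_ms
    offset = buffer_offset(ref_buf_len, near_buf_len, k_frame_size)
    return abs(offset - target) <= tolerance


# Hypothetical values: (3040 - 640) / 16 = 150 = 1.5 * 100.
offset = buffer_offset(3040, 640, 16)
holds = is_synchronized(3040, 640, 16, snd_card_delay_ms=100, k_ratio=1.5)
```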
In Formula (2), refBufLen and nearBufLen are computable values in a running process of a computer. Formula (2) ensures that an audio frame contained in the reference audio buffer can be found for effectively cancelling echo in an audio frame from the near-end audio buffer being processed. The problem is that for a particular device, the sound card delay value sndCardDelayMs is constant, but for different devices, sound card delay values sndCardDelayMs are generally different. For mobile terminals such as mobile phones in particular, the difference between devices is more obvious. At present, during echo cancellation processing, the sound card delay value sndCardDelayMs is set to a constant value in, for example, audio software installed on different mobile terminals. For compatibility with different devices, the constant value chosen is generally less than an actual value of the sound card delay value sndCardDelayMs of a known device.
However, if the sound card delay value sndCardDelayMs uses a relatively small constant value, a length of the voice data in the reference audio buffer is kept at a relatively low level. For a device whose sound card timing jitters severely, the reference audio buffer may become empty, and the echo cannot be cancelled. For a device whose sound card delay value is relatively large, a processing range of the AEC algorithm may be exceeded, and as a result the echo may not be effectively cancelled. If the sound card delay value sndCardDelayMs uses a relatively large constant value, then for a device whose sound card delay value is relatively small, the constant value used may be greater than the actual delay of the sound card. This may result in a negative delay, and consequently, the echo also cannot be effectively cancelled. Therefore, echo cancellation performed by setting the sound card delay value to a constant value has poor compatibility and effectiveness, and echo cancellation may fail easily.
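Two of the failure cases above can be illustrated with a small sketch. It assumes, as a simplification not stated in the source, that the delay remaining after compensation is the actual sound card delay minus the assumed constant, and that the AEC algorithm tolerates a remaining delay up to some processing range. The function name, the range of 64 ms, and the device delays are hypothetical. The case of an emptied reference audio buffer under jitter is not modelled.

```python
def classify_constant_delay(assumed_ms, actual_ms, aec_range_ms):
    """Classify the outcome of using one constant sound card delay value.

    Assumption: remaining delay = actual delay - assumed constant.
    """
    remaining = actual_ms - assumed_ms
    if remaining < 0:
        # The constant is greater than the actual delay of the sound card.
        return "negative delay"
    if remaining > aec_range_ms:
        # The remaining delay exceeds what the AEC algorithm can process.
        return "outside AEC range"
    return "ok"


AEC_RANGE_MS = 64  # hypothetical processing range

# A small constant on a device with a large actual delay.
small_constant_case = classify_constant_delay(40, 200, AEC_RANGE_MS)
# A large constant on a device with a small actual delay.
large_constant_case = classify_constant_delay(150, 60, AEC_RANGE_MS)
# A constant that happens to suit the device.
matched_case = classify_constant_delay(40, 80, AEC_RANGE_MS)
```

In this hypothetical setting, no single constant suits both the device with a 60 ms delay and the device with a 200 ms delay, which is the compatibility problem described above.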