A device for audio-based communication, such as a voice-controlled audio playback system, typically includes both a loudspeaker and a microphone. The loudspeaker may be used to play back audio signals received from a remote (“far-end”) source, while the microphone is used to capture audio signals from a local (“near-end”) source. In the case of a voice-controlled audio playback system, for example, the far-end source may include video content from a network source or a disk, and the near-end source may include a viewer's speech commands. As another example, in the case of a telephone call, the near- and far-end sources may be people engaged in a conversation, and the audio signals may contain speech. An acoustic echo occurs when the far-end signal emitted by the loudspeaker is captured by the microphone, after undergoing reflections in the local environment.
An acoustic echo canceller (“AEC”) may be used to remove acoustic echo from an audio signal captured by a microphone in order to facilitate improved communication. For example, the AEC may filter the microphone signal by determining an estimate of the acoustic echo (e.g., the remote audio signal emitted from the loudspeaker and reflected in the local environment). The AEC can then subtract the estimate from the microphone signal to produce an approximation of the true local signal (e.g., the user's utterance). The estimate can be obtained by applying a transformation to a reference signal that corresponds to the remote signal emitted from the loudspeaker. In addition, the transformation can be implemented using an adaptive algorithm. For example, an adaptive transformation may rely on a feedback loop that continuously adjusts a set of coefficients used to calculate the estimated echo from the far-end signal. Different environments produce different acoustic echoes from the same loudspeaker signal, and any change in the local environment may change the way that echoes are produced. By using a feedback loop to continuously adjust the coefficients, an AEC can adapt its echo estimates to the local environment in which it operates.
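As an illustration of the feedback loop described above, the following is a minimal sketch of an adaptive echo canceller. It assumes a normalized least-mean-squares (NLMS) update, which is one common adaptive algorithm and is not named in the text. The function name, the tap count, the step size, and the simulated room response are all hypothetical choices made for the example.

```python
import random


def nlms_echo_cancel(reference, mic, num_taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo from the microphone signal.

    NLMS is an assumed choice of adaptive algorithm; the parameters are
    illustrative only.
    """
    weights = [0.0] * num_taps   # adaptive coefficients (echo path estimate)
    history = [0.0] * num_taps   # recent reference samples, newest first
    output = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        # Estimated echo: transformation applied to the reference signal.
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        # Approximation of the near-end signal; also the feedback error.
        error = d - echo_estimate
        # Feedback loop: adjust the coefficients to reduce future error.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        output.append(error)
    return output, weights


# Simulated far-end playback and a hypothetical 4-tap room response.
random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
room = [0.6, 0.3, -0.2, 0.1]
echo = [
    sum(room[k] * far_end[n - k] for k in range(len(room)) if n - k >= 0)
    for n in range(len(far_end))
]

# With no near-end speech, the microphone captures only the echo.
residual, weights = nlms_echo_cancel(far_end, echo)
```

In this idealized simulation the coefficients settle toward the simulated room response. In a real environment, noise, nonlinearity, and changes in the room generally leave some echo in the output.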
In addition, communication devices may include a residual echo suppressor (“RES”). Various factors, including nonlinearity and noise, can prevent an acoustic echo canceller from completely eliminating an echo. A residual echo suppressor may be used to further reduce the level of echo that remains after processing by an acoustic echo canceller. For example, residual echo suppressors may use non-linear processing to further reduce the echo level. However, even after processing by a residual echo suppressor, some residual echo may remain.
Residual echo that remains after an echo cancellation process may interfere with speech recognition. For example, when “double talk” is present, a microphone signal will include both the near-end speech signal and the acoustic echo. If the residual echo is too large relative to the speech signal, recognition of the near-end speech may be difficult.
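One way to make "too large relative to the speech signal" concrete is to express the relationship as a power ratio in decibels. The sketch below assumes the near-end speech and the residual echo are available as separate sample lists, which holds in a simulation but not in a real microphone signal, where the two are mixed. The function names are hypothetical, and the text does not specify any threshold for this ratio.

```python
import math


def mean_power(samples):
    """Average power of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def speech_to_echo_ratio_db(near_end_speech, residual_echo, floor=1e-12):
    """Ratio of near-end speech power to residual echo power, in dB.

    The floor guards against log-of-zero for silent blocks; its value is
    an arbitrary choice for this sketch.
    """
    speech = max(mean_power(near_end_speech), floor)
    echo = max(mean_power(residual_echo), floor)
    return 10.0 * math.log10(speech / echo)
```

A lower ratio means the residual echo is stronger relative to the speech.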
If near-end speech is detected in an audio input signal, a controller may attenuate the audio playback signal in order to reduce the residual echo that may interfere with speech recognition. For example, when near-end speech is detected, the controller may attenuate the audio playback signal by a fixed amount (e.g., by N dB). However, if the attenuation amount is too great, the disruption to the playback signal may be noticeable to the listener. If the attenuation amount is too small, the remaining residual echo may continue to interfere with speech recognition.
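The fixed-amount approach can be sketched as follows. The function name and the `near_end_speech_detected` flag are hypothetical, and the detector that would set the flag is outside the scope of the example. An attenuation of N dB corresponds to a linear gain of 10^(-N/20).

```python
def attenuate_by_db(playback, attenuation_db, near_end_speech_detected):
    """Attenuate the playback signal by a fixed number of decibels.

    The signal is returned unchanged when no near-end speech is detected.
    """
    if not near_end_speech_detected:
        return list(playback)
    gain = 10.0 ** (-attenuation_db / 20.0)  # dB to linear amplitude gain
    return [sample * gain for sample in playback]
```

Because the gain does not depend on the playback level, the same N dB is applied whether the playback is loud or already quiet, which is why a single fixed amount can be either too great or too small.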
Alternatively, when near-end speech is detected, the controller may attenuate the audio playback signal to a fixed target level. However, similar problems may result. If the target level for the audio playback signal is too low, the disruption to the playback signal may be noticeable to the listener. If the target level for the audio playback signal is too high, the remaining residual echo may continue to interfere with speech recognition.
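The fixed-target approach can be sketched in a similar way. The example assumes the playback level is measured as the RMS of a block of samples in dB relative to full scale, and that the controller only attenuates and never boosts. Both are assumptions made for the example, as are the function names.

```python
import math


def rms_dbfs(samples, floor=1e-12):
    """RMS level of a block in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, floor))


def attenuate_to_target(playback, target_dbfs):
    """Scale the playback block down so its RMS level meets the target.

    A block that is already at or below the target is left unchanged.
    """
    gain_db = min(0.0, target_dbfs - rms_dbfs(playback))  # never boost
    gain = 10.0 ** (gain_db / 20.0)
    return [sample * gain for sample in playback]
```

Here the applied gain depends on the current playback level, but the target itself is fixed and does not take into account how much residual echo actually remains.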