1. Technical Field
The present disclosure relates generally to electronic devices with audio speakers and microphones, and more particularly to electronic devices that incorporate acoustic echo cancellation.
2. Description of the Related Art
Audio playback systems of electronic devices are increasingly designed to produce high sound pressure output levels. In contrast to earpiece audio output levels for traditional handheld phone usage, these high sound pressure levels are sufficient for the device to serve as a primary means of consuming multimedia content and for hands-free communication. In addition, the microphone sensitivity and the audio gain lineup for received audio are chosen such that the electronic device can be voice controlled from a distance of a meter or even multiple meters. The sensitivity and gain are configured to compensate for source-to-microphone path loss, which can exceed 20 dB. With loud playback and sensitive microphones in the same device, an echo cancellation system is often incorporated into the electronic device. The demands on the echo control system in such an electronic device can approach, or in some cases exceed, those imposed on stationary teleconferencing systems. For example, stationary teleconferencing systems, once installed, are calibrated for the specific acoustic conditions of a particular placement in a room. By contrast, the electronic devices are generally used in continually changing locations, and thus have to operate under unknown echo return conditions.
In voice-recognition-driven user devices with a closely spaced loudspeaker and microphone system, a large raw echo from the loudspeaker is picked up by the microphone. The conventional way to cancel the echo is to use an adaptive filtering (AF)-based acoustic echo canceler (AEC). The conventional AEC models the acoustic path between the loudspeaker output and the microphone input with a linear filter and subtracts the resulting echo replica from the microphone input signal. Using this conventional AEC, the best attenuation achieved is about 25 dB to 30 dB, and only if the system is linear and the echo path magnitude and phase are static or vary very slowly. However, a portable or mobile loudspeaker and microphone system is more often positioned in an environment where the relative positions of the electronic device, reflecting structures, and users are changing. In addition, system nonlinearity introduced by the transducers, by vibrations in the body of the device, and by other factors can render the conventional AEC inadequate. The problem is made more acute for small electronic devices, such as speakerphones, which produce high sound pressure levels while incorporating voice control. The effects caused by nonlinearity and vibrations cannot be modeled completely by linear adaptive filters, and thus the conventional AEC cannot remove all of the echo. The residual echo left by the conventional AEC is a non-stationary, noise-like signal that is correlated with, and bears the same characteristics as, the downlink signal. This residual echo can be very disruptive when mixed in with user speech as an input to a voice recognition (VR) engine. Consequently, the speech of a user often cannot be recognized or can be mis-recognized by the VR engine. The residual echo presents challenges in voice communications too, as it reduces call quality and can give rise to user complaints.
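The conventional linear AEC described above can be illustrated with a minimal sketch. The source names only adaptive filtering in general; the normalized least-mean-squares (NLMS) update used here is one common choice and is an assumption, as are the function names, the four-tap echo path, the filter length, the step size, and the cubic term standing in for transducer nonlinearity. The sketch compares echo attenuation for a purely linear echo path against the same path driven by a mildly distorted loudspeaker signal.

```python
import math
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated linear echo replica from the mic signal."""
    w = [0.0] * taps  # current estimate of the echo-path impulse response
    x = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for ref, d in zip(far_end, mic):
        x = [ref] + x[:-1]
        echo_hat = sum(wi * xi for wi, xi in zip(w, x))  # linear echo replica
        e = d - echo_hat                                 # echo-cancelled output
        norm = sum(xi * xi for xi in x) + eps
        step = mu * e / norm
        w = [wi + step * xi for wi, xi in zip(w, x)]     # NLMS coefficient update
        residual.append(e)
    return residual, w


def power_db(signal):
    # Small floor avoids log10(0) when cancellation is essentially perfect.
    return 10.0 * math.log10(sum(s * s for s in signal) / len(signal) + 1e-20)


def erle_db(nonlinear, n=4000, seed=0):
    """Echo attenuation in dB, measured over the second half of the run."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    path = [0.6, -0.3, 0.15, 0.05]  # hypothetical static echo path
    hist = [0.0] * len(path)
    mic = []
    for s in far_end:
        # Hypothetical cubic distortion standing in for loudspeaker nonlinearity.
        drive = s + 0.3 * s ** 3 if nonlinear else s
        hist = [drive] + hist[:-1]
        mic.append(sum(h * v for h, v in zip(path, hist)))
    residual, _ = nlms_echo_cancel(far_end, mic)
    half = n // 2
    return power_db(mic[half:]) - power_db(residual[half:])


linear_erle = erle_db(nonlinear=False)
distorted_erle = erle_db(nonlinear=True)
```

In this noise-free toy setup the linear filter can match the linear path almost exactly, whereas the distorted case leaves a residual that no choice of linear coefficients removes. The absolute figures depend entirely on the assumed parameters and should not be read as measurements of any real device.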
In an attempt to address the deficiencies of linear modeling for echo cancellation, another conventional way to further reduce residual echo is to use a nonlinear processor (NLP) based on a voice activity detection (VAD) signal. NLP performed in the time domain tends to be very complicated and inaccurate, and as a result can attenuate a user's speech. The NLP method is effective in reducing echoes in the downlink single-talker case, when the near-end talker is silent; however, the NLP method cannot reduce residual echo from mixed speech. In addition, the NLP method cannot improve the echo-to-speech ratio (ESR). Thus, the recognition accuracy of the VR engine will not be increased. Moreover, recognition accuracy may even be decreased because time-domain NLP reduces the overall level of the mixed speech and residual echo. Another clear drawback is that the delay between the residual echo and the loudspeaker signal is unknown and, because the echo path changes in real time, the spectra of the residual echo and the loudspeaker signal cannot be precisely aligned. Therefore, frequency-dependent information such as attenuation gain will not be accurate.
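The point that a time-domain NLP cannot improve the ESR during mixed speech can be checked with a small numerical sketch. The gain policy below (one broadband gain per frame, chosen from two VAD flags) is a hypothetical simplification, and the attenuation values, frame length, and synthetic sinusoidal "signals" are assumptions made only for illustration.

```python
import math


def energy(frame):
    return sum(s * s for s in frame)


def ratio_db(num, den):
    """Energy ratio of two frames in dB."""
    return 10.0 * math.log10(energy(num) / energy(den))


def nlp_gain(far_end_active, near_end_active,
             echo_only_db=-30.0, double_talk_db=-6.0):
    """Pick one broadband gain per frame from two VAD flags (assumed policy)."""
    if far_end_active and near_end_active:
        return 10.0 ** (double_talk_db / 20.0)  # mild attenuation in double talk
    if far_end_active:
        return 10.0 ** (echo_only_db / 20.0)    # strong attenuation, echo only
    return 1.0                                  # pass near-end speech unchanged


def apply_gain(frame, gain):
    return [gain * s for s in frame]


# Synthetic double-talk frame: residual echo and near-end speech overlap.
residual_echo = [0.2 * math.sin(0.9 * n) for n in range(160)]
near_speech = [0.5 * math.sin(0.3 * n) for n in range(160)]
mixture = [e + s for e, s in zip(residual_echo, near_speech)]

g = nlp_gain(far_end_active=True, near_end_active=True)

# The same gain scales both components, so their ratio is unchanged...
esr_before = ratio_db(residual_echo, near_speech)
esr_after = ratio_db(apply_gain(residual_echo, g), apply_gain(near_speech, g))

# ...while the overall level of the mixture drops by the applied gain.
level_change_db = ratio_db(apply_gain(mixture, g), mixture)
```

Because a single time-domain gain multiplies the echo and the speech alike, the ratio between them is preserved while the level presented to the VR engine falls, which matches the drawback described above.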