In audio conferencing between different sites, conference participants (also called talkers) at different sites communicate with each other by sending audio signals between the sites. Each site has a microphone to receive sound from that site and each site transmits the received audio signal to the other site(s). Each site also receives audio signals from the other site(s) and has a loudspeaker to generate sound from the received audio signal. As used herein, a loudspeaker is an audio output device (an electromechanical device), not a person speaking. At a local site (also called a near end), the remote audio signal, sent by a remote site (also called a far end), is rendered by the loudspeaker. The rendered remote audio signal may be picked up by the microphone at the local site and retransmitted to the remote site. The echo and other sound distortions present at the local site, together with the signal transmission between sites, may result in a remote conference participant hearing his/her own voice return over the audio conferencing system. This return echo degrades the audio quality of the conference and leads to participant dissatisfaction.
To reduce the effects of return echo, audio conferencing systems typically apply acoustic echo cancelling techniques. In acoustic echo cancellation (AEC), one filters the incoming audio signals from the local microphone(s) to reduce the influence of sound rendered by the local loudspeakers. Acoustic echo cancellation estimates and substantially attenuates the effects of the remote audio signal. Acoustic echo cancellation typically includes adaptive elements (which respond differently according to input conditions) to accommodate changing conditions such as different talkers and moving objects at the local site.
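As an illustration only, the adaptive filtering described above may be sketched with a normalized least-mean-squares (NLMS) filter, one common adaptive technique; the text does not specify any particular algorithm. The function name, the tap count, the step size `mu`, and the simulated echo path below are all assumptions chosen for the example.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Illustrative NLMS echo canceller (assumed algorithm, not from the text).

    far_end: samples sent to the local loudspeaker (the remote audio signal)
    mic:     samples captured by the local microphone
    Returns the echo-cancelled signal and the final filter weights.
    """
    w = [0.0] * taps        # adaptive estimate of the echo path
    x_buf = [0.0] * taps    # most recent far-end samples, newest first
    out = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x_buf))   # estimated echo
        e = d - y                                      # residual after cancellation
        norm = sum(xi * xi for xi in x_buf) + eps      # input power normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
        out.append(e)
    return out, w

# Hypothetical test signal: white noise rendered through a short simulated echo path.
random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, 0.3, -0.2, 0.1]
mic = [sum(h * far[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(far))]

residual, w = nlms_echo_cancel(far, mic)
```

In this simplified setting the microphone signal contains only echo, so the residual decays toward zero as the weights approach the simulated echo path. With a local talker present, the residual would instead carry the talker's voice, and a practical system would typically also need to manage adaptation during simultaneous near-end and far-end speech.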
Audio conferencing systems also may employ beamforming and multiple microphones to capture the participants' voices. Beamforming (BF) uses a group of microphones, such as a microphone array, to improve voice acquisition as compared to use of a single microphone. The combination of several microphone audio signals may form a directed beam of audio sensitivity that is more directional and selective than any of the individual microphones. The audio signal thus combined (which may be called the beam audio signal) may have better noise rejection of noise sources outside of the beam than the individual audio signals from the individual microphones. Beamforming systems typically are adaptive in that the beamforming is responsive to the input. Hence, changing talkers, movement of participants, and changing noise sources may be accommodated.
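The combination of microphone signals into a directed beam may be illustrated with a fixed delay-and-sum beamformer, which is perhaps the simplest form of beamforming; the adaptive beamformers mentioned above would additionally update the delays or weights in response to the input. The four-microphone geometry, the integer sample delays, and the test tone below are assumptions made for the sketch.

```python
import math

def delay_and_sum(channels, delays):
    """Average the channels after advancing each by its integer delay in samples.

    Samples that would fall outside a channel are treated as zero.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d
            acc += ch[idx] if 0 <= idx < n else 0.0
        out.append(acc / len(channels))
    return out

def mean_power(signal):
    return sum(s * s for s in signal) / len(signal)

# Hypothetical scene: a tone with a period of 8 samples reaches four microphones
# with arrival delays of 0, 2, 4 and 6 samples.
n = 64
arrival = [0, 2, 4, 6]
channels = [[math.sin(2 * math.pi * (t - d) / 8) for t in range(n)] for d in arrival]

steered = delay_and_sum(channels, arrival)         # beam aimed at the source
unsteered = delay_and_sum(channels, [0, 0, 0, 0])  # beam aimed elsewhere

# Ignore the last few samples, where the steered beam runs off the end of the data.
on_power = mean_power(steered[: n - max(arrival)])
off_power = mean_power(unsteered[: n - max(arrival)])
```

When the steering delays match the arrival delays, the channels add coherently and the tone passes at close to full power. With mismatched delays the same tone largely cancels, which is the directional selectivity described above.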
Some audio conferencing systems are configured for both beamforming and acoustic echo cancellation. However, integrating the technologies may involve system performance trade-offs. The two primary approaches to integrating beamforming and acoustic echo cancellation are called ‘AEC first’ and ‘beamforming first’ according to the order of operations.
In an AEC-first approach, acoustic echo cancellation is performed on each of the plurality of input audio signals coming from the microphones (e.g., directly from the microphones). Beamforming is performed on the plurality of echo-cancelled audio signals output from the plurality of acoustic echo cancellation operations. This approach has the benefit that each acoustic echo cancellation operation is performed on an audio signal from a fixed beam or microphone and, hence, the acoustic echo cancellation performance for each input is similar to that of a single-input system. However, as the number of input audio signals increases, the computing resource demand increases correspondingly. With large numbers of audio inputs (i.e., more than a few, e.g., 5), the computing resource demand may be impractically large.
In a beamforming-first approach, beamforming is performed on the plurality of input audio signals coming from the microphones to generate a single or a few beamformed audio signals. Acoustic echo cancellation is then performed on the beamformed audio signals, either individually or following a beam combination or mixing stage. This approach has the advantage that the computing resource demand does not significantly change according to the number of audio inputs. However, because the beamformer may change the direction and/or gain of the signal in the resulting beamformed audio signal, the acoustic echo cancellation operation may need to adapt to changing conditions (e.g., different echo paths) more often and more acutely than if a beamformer were not used.
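The difference in computing resource demand between the two approaches may be illustrated with a rough cost model. The model below is an assumption made for the example: it counts only the adaptive-filter work, approximated as two multiply-accumulate operations per filter tap per sample (one to filter, one to update), and it ignores the cost of the beamformer itself. The function names and the example figures are hypothetical.

```python
def aec_macs_per_sample(num_filters, taps):
    """Approximate multiply-accumulates per sample for num_filters adaptive filters.

    Assumed cost model: about `taps` operations to filter plus about `taps`
    operations to update the weights, for each filter.
    """
    return num_filters * 2 * taps

def compare_orderings(num_mics, num_beams, taps):
    """Echo-cancellation cost under each ordering, by the assumed model."""
    return {
        # AEC first: one echo canceller per microphone signal.
        "aec_first": aec_macs_per_sample(num_mics, taps),
        # Beamforming first: one echo canceller per beamformed signal.
        "bf_first": aec_macs_per_sample(num_beams, taps),
    }

# Hypothetical system: 16 microphones, 1 beam, 2048-tap echo cancellers.
costs = compare_orderings(num_mics=16, num_beams=1, taps=2048)
```

Under this model the AEC-first cost grows linearly with the number of microphones, while the beamforming-first cost depends only on the number of beams. The model says nothing about the other side of the trade-off, namely how often the echo canceller must re-adapt when the beamformer changes.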
Many audio conferencing systems are implemented with a beamforming-first approach and consequently suffer from the potential destabilization of acoustic echo cancellation due to changes driven by beamforming (rather than primarily by talkers). Hence, there is a need for systems and methods of combined beamforming and acoustic echo cancellation which overcome the limitations of prior systems.