In an audio and video conferencing system with multiple participants attending from different locations, the ability to detect an active speaker accurately can substantially improve the users' audio experience, especially when there is undesirable background noise from participants who are not actively engaged in the conversation. For example, if a participant joins a conference call from a noisy environment, the noise is sent to all other participants and can degrade call quality or even render the conference call unintelligible. In this case, if the system could reliably detect the participant as non-active, it could either automatically mute the participant or send the participant a message regarding the undesirable background noise.
Traditionally, active speaker detection is performed by measuring the energy level of the speech signal and applying voice activity detection (VAD). In a controlled environment with stationary and predictable background noise, this traditional approach yields reasonably good performance. However, in live calls, the background noise is rarely stationary or predictable. An additional problem is that in a two-way audio conferencing system, echo is commonly misdetected as an active speaker. Thus, there is a need to accurately determine whether speaker detection is contaminated by background noise and/or echo in a two-way audio system.
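As an illustration of the traditional approach only, the following minimal Python sketch marks each audio frame as active when its mean-square energy exceeds a fixed threshold. The function names, the 160-sample frame length, and the -40 dB threshold are assumptions chosen for the example; they are not taken from any particular system, and practical VADs use more elaborate features.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB relative to a full scale of 1.0."""
    if not frame:
        return -math.inf
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else -math.inf


def detect_active_frames(samples, frame_len=160, threshold_db=-40.0):
    """Return one boolean per complete frame: True if energy exceeds the threshold.

    frame_len and threshold_db are illustrative assumptions, not tuned values.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags
```

Because the decision depends on energy alone, a frame of loud non-stationary noise or returned echo exceeds the threshold just as speech does, which is the failure mode described above.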