Multi-endpoint videoconferencing allows participants from multiple locations to collaborate in a meeting. For example, participants from multiple geographic locations can join a meeting and communicate with each other to discuss issues, share ideas, etc. These collaborative meetings often include a videoconference system with two-way audio-video transmissions. Thus, virtual meetings using a videoconference system can simulate in-person interactions between people.
However, videoconferencing consumes a large amount of both computational and bandwidth resources. To conserve those resources, many videoconferencing systems allocate resources according to how heavily the videoconference uses each video source. For example, the videoconference system will expend more resources on a participant who is actively speaking than on a participant who is listening or not directly engaged in the conversation, often by using low-resolution video for the non-speaking participant and high-resolution video for the actively speaking participant. When the speaking participant changes, the videoconferencing server will switch from the previous speaker's video source to the current speaker's video source, and/or will increase the prominence of the new speaker in the videoconference display.
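The allocation scheme described above can be illustrated with a minimal sketch. It is not taken from any particular videoconferencing system. The `Participant` type, the normalized `audio_level` field, the 0.3 threshold, and the two resolution constants are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical resolutions; real systems negotiate these per stream.
HIGH_RES = (1280, 720)
LOW_RES = (320, 180)


@dataclass
class Participant:
    name: str
    audio_level: float  # assumed normalized loudness in the range 0..1


def pick_active_speaker(participants, threshold=0.3):
    """Return the loudest participant at or above the threshold, or None."""
    candidates = [p for p in participants if p.audio_level >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.audio_level)


def allocate_resolutions(participants, threshold=0.3):
    """Give the active speaker high resolution and everyone else low resolution."""
    speaker = pick_active_speaker(participants, threshold)
    return {
        p.name: (HIGH_RES if p is speaker else LOW_RES)
        for p in participants
    }


if __name__ == "__main__":
    room = [Participant("alice", 0.8), Participant("bob", 0.1)]
    print(allocate_resolutions(room))
```

Because a sketch like this selects a speaker purely from the current audio level, it can only react once someone is already making sound, and any sufficiently loud noise is treated the same as speech.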
Current methods of speaker detection and video switching, moreover, are slow and depend on detecting a participant who is already speaking. Common problems include attention delay caused by the time needed to process the active speakers, confusion among audio sources (e.g., mistakenly identifying a closing door or voices from another room as a speaking participant), and/or failure to pick up on other cues (e.g., the speaker pauses to draw on a whiteboard). Thus, there is a need to improve the accuracy and speed of in-room speaker detection and switching.