Systems that analyze video data are becoming increasingly popular. Video conferencing systems are one example: they allow meetings to occur with visual interaction even though the participants may be in different geographic locations. The visual aspect of video conferencing typically makes it more appealing than telephone conferences, while it is also a lower-cost alternative to in-person meetings (and can typically occur on shorter notice) when one or more participants must travel to the meeting location.
Some current video conferencing systems use automated audio-based detection techniques and/or presets to move the camera (e.g., pan or tilt the camera). However, many problems exist with current video conferencing systems. One such problem is that the accuracy of audio-based speaker detection techniques can be low. Additionally, the video conferencing system typically does not know how many participants there are in the meeting (including when participants join or leave the meeting), where the participants are located (sitting or standing), or which participant is currently talking. While some systems may be manually programmed with participant information (e.g., the number of participants and their locations), this requires user entry of the information being programmed, which tends to restrict participants' ability to move about the room, as well as the ability of participants to join the conference.
The automatic detection and tracking of multiple individuals described herein helps solve these and other problems.