The cameras of a videoconferencing system often have mechanical pan, tilt, and zoom controls. Ideally, these controls are continuously adjusted to achieve optimal video framing of the people in the room based on where they are seated and who is talking. Unfortunately, because performing these adjustments is difficult, the camera is often left at a fixed, wide-angle view of the entire room. When this is the case, far-end participants lose much of the value of the video captured by the camera because the near-end participants appear too small on the far-end display. In some cases, the far-end participants cannot see the facial expressions of the near-end participants and may have difficulty identifying speakers. These problems give the videoconference an awkward feel and make it hard for the participants to have a productive meeting.
To deal with poor framing, participants may have to intervene and perform a series of manual operations to pan, tilt, and zoom the camera to capture a better view. As expected, manually directing the camera can be cumbersome even when a remote control is used. Sometimes, participants do not bother adjusting the camera's view and simply use the default wide view. Of course, when a participant does manually frame the camera's view, the procedure has to be repeated if participants change positions during the videoconference or use a different seating arrangement in a subsequent videoconference.
An alternative to manual intervention is to use voice-tracking technology. Voice-tracking cameras having microphone arrays can help direct the cameras during the videoconference toward participants who are speaking. Although the voice-tracking camera is usually very accurate, it can still encounter some problems. When a speaker turns away from the microphones, for example, the voice-tracking camera may lose track of the speaker. Additionally, a very reverberant environment can cause the voice-tracking camera to aim at a reflection point rather than at the actual sound source, i.e., the person speaking. Typical reflections can be produced, for example, when the speaker turns away from the camera or sits at an end of a table. If the reflections are troublesome enough, the voice-tracking camera may be guided to point at a wall, a table, or another surface instead of the actual speaker.
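The source does not specify how such voice tracking is implemented, but microphone-array steering of this kind typically rests on the time-difference-of-arrival (TDOA) principle. The following is a minimal illustrative sketch, assuming a hypothetical two-microphone array with a 0.1 m spacing and a 16 kHz sample rate; it is not taken from any particular product or patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (approximate, at room temperature)
MIC_SPACING = 0.1       # meters between the two microphones (assumed)
SAMPLE_RATE = 16000     # Hz (assumed)

def estimate_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate the bearing of a sound source from a two-microphone
    array using the time difference of arrival (TDOA).

    Cross-correlate the two channels to find the lag (in samples) at
    which they best align, convert that lag to a time delay, and then
    to an angle relative to the array's broadside direction.
    """
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # lag in samples
    delay = lag / SAMPLE_RATE                  # lag in seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Synthetic check: identical channels mean zero delay, i.e. a source
# directly broadside to the array.
t = np.linspace(0, 0.02, 320)
signal = np.sin(2 * np.pi * 440 * t)
print(estimate_bearing(signal, signal))  # → 0.0 (no inter-channel delay)
```

A reflection from a wall or table arrives from a different direction than the talker, which is why, as noted above, a reverberant room can pull such an estimator toward a surface rather than the speaker.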
An excellent earlier solution to these issues is set forth in U.S. Pat. No. 8,842,161 to Jinwei Feng et al. That patent discloses a videoconference apparatus and method that coordinates a stationary view, obtained with a stationary camera, with an adjustable view, obtained with an adjustable camera. The stationary camera can be a web camera, while the adjustable camera can be a pan-tilt-zoom (PTZ) camera. As the stationary camera obtains video, faces of participants are detected, and a boundary in the view is determined that contains the detected faces. The absence or presence of motion associated with a detected face is used to verify whether the face is reliable. In Feng, in order to capture and output video of the participants for the videoconference, the view of the adjustable camera is adjusted to a framed view based on the determined boundary. U.S. Pat. No. 8,842,161 combined sound source location (SSL), participant detection, and motion detection to locate the meeting attendees, decided what the optimal view would be based on that location information, and then controlled the adjunct PTZ camera to pan, tilt, and zoom to obtain the desired view.
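The framing logic described above, a boundary that contains all motion-verified faces, can be sketched roughly as follows. This is an illustrative reconstruction, not code from the patent: the `Face` record, the `has_motion` flag, and the padding factor are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Face:
    x: int   # top-left corner of the detected face box
    y: int
    w: int   # width and height of the face box
    h: int
    has_motion: bool  # whether motion was observed near this face

def frame_faces(faces: List[Face], pad: float = 0.2) -> Optional[Tuple[int, int, int, int]]:
    """Return a padded bounding box (x, y, w, h) containing every face
    verified by motion, or None if no reliable face was found.

    The box can then be mapped to pan/tilt/zoom targets for the
    adjustable camera."""
    reliable = [f for f in faces if f.has_motion]
    if not reliable:
        return None
    x0 = min(f.x for f in reliable)
    y0 = min(f.y for f in reliable)
    x1 = max(f.x + f.w for f in reliable)
    y1 = max(f.y + f.h for f in reliable)
    w, h = x1 - x0, y1 - y0
    # Pad the union box so heads and shoulders are not clipped.
    return (round(x0 - pad * w), round(y0 - pad * h),
            round(w * (1 + 2 * pad)), round(h * (1 + 2 * pad)))

# Two verified faces yield one framed view covering both participants.
box = frame_faces([Face(100, 80, 40, 40, True), Face(300, 90, 40, 40, True)])
print(box)  # → (52, 70, 336, 70)
```

The motion check mirrors the reliability test described in the patent: a static false detection (a face-like pattern on a wall, say) produces no motion and is excluded from the boundary.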
Due to the popularity of videoconference apparatuses such as those disclosed in U.S. Pat. No. 8,842,161, it has become common to extend the range of such apparatuses by connecting two such devices, with one controlling the other. As a result, two views of a meeting presenter are often captured, one by each of the adjustable cameras. The issue then becomes how to ensure that the better of the two views is selected for transmission to a remote endpoint.