The camera for a videoconferencing system often has mechanical pan, tilt, and zoom control. Ideally, these controls should be continuously adjusted to achieve optimal video framing of the people in the room based on where they are seated and who is talking. Unfortunately, due to the difficulty of performing these adjustments, the camera may often be set to a fixed, wide-angle view of the entire room and may not be adjusted. If this is the case, far-end participants may lose much of the value from the video captured by the camera because the size of the near-end participants displayed at the far end may be too small. In some cases, the far-end participants cannot see the facial expressions of the near-end participants, and may have difficulty identifying speakers. These problems give the videoconference an awkward feel and make it hard for the participants to have a productive meeting.
To deal with poor framing, participants may have to intervene and perform a series of manual operations to pan, tilt, and zoom the camera to capture a better view. As expected, manually directing the camera can be cumbersome even when a remote control is used. Sometimes, participants do not bother adjusting the camera's view and simply use the default wide view. Of course, when a participant does manually frame the camera's view, the procedure has to be repeated if participants change positions during the videoconference or use a different seating arrangement in a subsequent videoconference.
Voice-tracking cameras having microphone arrays can help direct the camera during the videoconference toward participants who are speaking. Although the voice-tracking camera is very useful, it can still encounter some problems. When a speaker turns away from the microphones, for example, the voice-tracking camera may lose track of the speaker. Additionally, a very reverberant environment can cause the voice-tracking camera to point at a reflection point rather than at the actual sound source, namely the person speaking. For example, typical reflections can be produced when the speaker turns away from the camera or when the speaker sits at an end of a table. If the reflections are troublesome enough, the voice-tracking camera may be guided to point to a wall, a table, or other surface instead of the actual speaker.
One solution to the problem of directing a camera during a videoconference is set forth in U.S. Pat. No. 6,894,714 to Gutta et al., which discloses an apparatus and methods that use acoustic and visual cues to predict when a participant is going to speak or stop speaking. As shown in FIG. 1, an adaptive position locator 30 of Gutta includes a wide-angle camera 20, a microphone array 22, and a pan-tilt-zoom (PTZ) camera 34. During a videoconference, the locator 30 processes audio and video to locate a speaker.
To locate the speaker, the wide-angle camera 20 and the microphone array 22 generate signals at initial startup. The signals from the wide-angle camera 20 pass to a face recognition module 32, which has a face detector to determine whether or not a given region of interest (window) can be labeled as a face region so that a unique identifier can be assigned to a given face. Likewise, signals from the microphone array 22 pass to a speaker identification module 33 and an audio locator 36, which obtains directional information that identifies pan and tilt angles associated with a participant who is speaking.
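By way of illustration only, the relationship between directional information and pan and tilt angles can be sketched as follows. This is a minimal sketch and is not taken from Gutta: the function name `direction_to_pan_tilt` and the coordinate convention (x to the right, y up, z straight ahead of the camera) are assumptions made for the example.

```python
import math


def direction_to_pan_tilt(x: float, y: float, z: float) -> tuple[float, float]:
    """Convert a direction vector into pan and tilt angles in degrees.

    Assumed convention (hypothetical, not from the cited patent):
    x points right, y points up, z points straight ahead of the camera.
    Pan is the horizontal angle from the z axis; tilt is the elevation
    above the horizontal plane.
    """
    if x == 0 and y == 0 and z == 0:
        raise ValueError("direction vector must be non-zero")
    pan = math.degrees(math.atan2(x, z))
    tilt = math.degrees(math.atan2(y, math.hypot(x, z)))
    return pan, tilt
```

Under this assumed convention, a sound source located straight ahead yields a pan of zero and a tilt of zero, and a source to the right of the camera yields a positive pan.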
Then, the images from the wide-angle camera 20 along with the results of face recognition and their locations are stored in a frame buffer 39 along with the audio signals from the microphone array 22 and the results of the speaker identification. The audio and video signals are accumulated for a predefined interval, and a motion detector 35 detects motion in the video frames occurring during this interval. In the end, a space transformation module 37 receives position information from the motion detector module 35 and directional information from the audio locator 36 and then maps the position and direction information to compute a bounding box used to focus the PTZ camera 34.
At this point, a predictive speaker identifier 40 identifies one or more acoustic and visual cues to predict the next speaker. In particular, the predictive speaker identifier 40 processes the video from the PTZ camera 34 and the contents of the frame buffer 39 and speaker identification module 33. As noted above, the contents of the frame buffer 39 include the wide-angle images from the wide-angle camera 20 and the corresponding face recognition results, the audio signals from the microphone array 22, and the corresponding speaker identification results. Based on this information, the predictive speaker identifier 40 can identify the visual and acoustic cues of each non-speaking participant from the wide-angle image and audio signals. Ultimately, the speaker predictions generated by the predictive speaker identifier 40 are used to focus the PTZ camera 34 at the next predicted speaker.
As can be seen above, systems that use voice tracking and participant detection may require complex processing and hardware to control a camera during a videoconference. Moreover, such systems can have practical limitations. For example, such systems may require an operator to manually initiate the automated operation by pressing a button. This is the case because such systems require a sufficient period of time for training to operate properly. For example, such a system has to work in a training mode first and then has to switch to an active mode, such as a predictive mode to predict who will speak. Switching from the training mode to the active mode requires manual user intervention.
Yet, requiring manual initiation of the automated functions can cause problems when people walk in or out of a room during a meeting. Additionally, for the automated control of the camera to operate properly, all of the participants need to face the camera. For example, the automated control of the camera fails when a participant turns his head away from the camera, which can happen quite often in a videoconference.
Another solution is set forth in U.S. Pat. No. 8,842,161 to Jinwei Feng et al. That patent discloses a videoconference apparatus and method which coordinate a stationary view obtained with a stationary camera with an adjustable view obtained with an adjustable camera. The stationary camera can be a web camera, while the adjustable camera can be a pan-tilt-zoom camera. As the stationary camera obtains video, faces of participants are detected, and a boundary in the view is determined to contain the detected faces. The absence or presence of motion associated with a detected face is used to verify whether the face is reliable. In Jinwei, in order to capture and output video of the participants for the videoconference, the view of the adjustable camera is adjusted to a framed view based on the determined boundary. Jinwei combines sound source location (SSL), participant detection, and motion detection to locate the meeting attendees, decides what the optimal view would be based on the location information, and then controls the adjunct pan-tilt-zoom (PTZ) camera to pan, tilt, and zoom to obtain the desired view. The methods set forth in Jinwei work very well in most videoconferencing situations. However, there are certain situations in which these methods may underperform. There is thus room for improvement in the art.
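By way of illustration only, the step of determining a boundary that contains the detected faces can be sketched as follows. This is a simplified sketch and not the method of Jinwei: the names `Box`, `Face`, and `framing_boundary`, the `motion_seen` flag, and the `margin` parameter are hypothetical, and the sketch assumes that a face is treated as reliable only when motion has been observed near it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Face:
    box: Box
    motion_seen: bool  # hypothetical flag: motion was observed near this face


def framing_boundary(faces: List[Face], frame_w: float, frame_h: float,
                     margin: float = 0.1) -> Optional[Box]:
    """Return a padded boundary around all reliable faces, or None.

    Assumption for this sketch: a face without associated motion is
    treated as unreliable (for example, a picture on a wall) and ignored.
    """
    reliable = [f.box for f in faces if f.motion_seen]
    if not reliable:
        return None  # nothing reliable to frame; caller keeps the current view

    # Union of the reliable face boxes.
    left = min(b.left for b in reliable)
    top = min(b.top for b in reliable)
    right = max(b.right for b in reliable)
    bottom = max(b.bottom for b in reliable)

    # Pad by a fraction of the union size, then clamp to the frame.
    pad_x = margin * (right - left)
    pad_y = margin * (bottom - top)
    return Box(max(0.0, left - pad_x), max(0.0, top - pad_y),
               min(frame_w, right + pad_x), min(frame_h, bottom + pad_y))
```

In such a sketch, the returned boundary in the stationary view would then be translated into pan, tilt, and zoom settings for the adjustable camera; that translation depends on the calibration between the two cameras and is omitted here.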