Videoconferencing resides in a middle ground between face-to-face, in-person meetings and telephone calls. The cameras for a videoconferencing system often have mechanical pan, tilt, and zoom control. Ideally, these controls should be continuously adjusted to achieve optimal video framing of the people in the room based on where they are seated and who is talking. Unfortunately, due to the difficulty of performing these adjustments, the camera may be set to a fixed, wide-angle view of the entire room and left unadjusted. If this is the case, far-end participants may lose much of the value of the video captured by the camera because the near-end participants may appear too small when displayed at the far end. In some cases, the far-end participants cannot see the facial expressions of the near-end participants and may have difficulty identifying speakers. These problems can give a videoconference an awkward feel and make it hard for the participants to have a productive meeting.
To deal with poor framing, participants may have to intervene and perform a series of manual operations to pan, tilt, and zoom the camera to capture a better view. However, manually directing a camera can be cumbersome, even when a remote control is used. Sometimes, participants do not bother adjusting the camera's view and simply use the default wide view. Of course, when a participant does manually frame the camera's view, the procedure has to be repeated if participants change positions during the videoconference or use a different seating arrangement in a subsequent videoconference.
An alternative to manual intervention is to use sound source location technology to control camera direction. An example of sound source location technology is voice-tracking technology. Voice-tracking cameras having microphone arrays can help direct the cameras during the videoconference toward participants who are speaking. A participant can be framed within a zoomed view. The video captured by the near-end camera(s) is included in a video stream. The framed images of participants within the video stream, which is sent to a far end, can change throughout a meeting. Pickup microphones at the near endpoint capture audio. Each microphone generates an audio stream (output signal) that corresponds to the audio captured by that microphone. Conventional methods of combining such video stream(s) with such audio stream(s) are unsatisfactory. There is thus room for improvement in the art.