Video conferencing has become an increasingly effective way to facilitate discussion among physically remote participants. A video input device, such as a camera, generally provides the video input signal portion of a video conference. Many conventional systems employ an operator to manually operate (e.g., move) the video input device.
Other systems employ a tracking system to facilitate tracking of speakers. However, in many conventional systems that process digital media, audio and video data are generally treated separately. Such systems usually have subsystems that are specialized for the different modalities and are optimized for each modality separately. The two modalities are then combined at a higher level, a process that generally requires scenario-dependent treatment, including precise and often manual calibration. A tracker using only video data may mistake the background for the object or lose the object altogether due to occlusion. Further, a tracker using only audio data can lose the object when it stops emitting sound or is masked by background noise.