Detection and tracking of a person or other object of interest is an important aspect of video-camera-based systems such as video conferencing systems, video surveillance and monitoring systems, and human-machine interfaces. For example, in a video conferencing system, it is often desirable to frame the head and shoulders of a particular conference participant in the resultant output video signal.
A conventional boardroom-type video conferencing system will typically include a pan-tilt-zoom (PTZ) camera mounted on top of a monitor. The PTZ camera may be operated via an infrared remote control by one of the participants, that participant being designated as a de facto cameraman, or by a non-participant cameraman. The cameraman generally tries to control the pan, tilt and zoom settings of the camera so as to keep the current speaker in view, and sufficiently in close-up, such that participants at the remote receiving end can see the speaker's facial expressions. When the speaker gets up, writes on a whiteboard, or points at an object, the cameraman has to follow the speaker's movements accordingly. In some cases, the cameraman may have to react to explicit commands issued by the speaker, such as “Zoom in more.”
However, even for a human cameraman, it is not always easy to produce a satisfying video conference experience, as the conference is a live event without a script. The cameraman has to react to unexpected movements or commands by the speaker, and to interruptions and short utterances by other participants, who are often outside his field of vision. The cameraman's reactions to the situation largely determine the quality of the video conference experience for the remote participants, i.e., whether the remote participants see the correct persons on their monitor, at the correct time and with the correct zoom, and whether the movement of the picture is distracting, disorienting or shows excessive artifacts.
The pattern of movement of the camera can also have an effect on the local participants. For example, the local participants might attribute a “personality” to the camera, such as dominant, nervous, attentive, etc.
These and other factors make it difficult for a human cameraman to provide the requisite tracking function in a video conferencing system.
A number of techniques are known in the art for providing automated tracking of speakers or other objects in a video conferencing system. For example, U.S. Pat. No. 6,005,610 issued Dec. 21, 1999 to S. Pingali describes an audio-visual object localization and tracking system in which audio and video information are combined to implement a tracking function. Another audio-video tracking system known in the art is the PictureTel SwiftSite-II set-top video conferencing system, as described in A. W. Davis, “Image Recognition and Video Conferencing: A New Role for Vision in Interactive Media?,” Advanced Imaging, pp. 30-32, February 2000. A problem with these and other conventional techniques is that they generally fail to combine the audio and video information in a manner which avoids unnecessary or awkward camera movements to the greatest extent possible.
A need therefore exists for improved techniques for efficiently automating the tracking process in video conferencing and other applications, so as to free a participant or other human cameraman from this task, without degrading the quality of the resulting video conference.