The present invention relates generally to the field of video signal processing, and more particularly to techniques for identifying the location of persons or other objects of interest using a video camera such that a desired video output can be achieved.
The tracking of a person or another object of interest in an image is an important aspect of video-camera-based systems such as video conferencing systems and video surveillance systems. For example, in a video conferencing system, it is often desirable to frame the head and shoulders of a particular conference participant in the resultant output video signal.
Video conferencing systems often utilize a pan-tilt-zoom (PTZ) camera to track an object of interest. The PTZ camera allows the system to position and optically zoom the camera to perform the tracking task. A problem with this approach is that, in some cases, the tracking mechanism is not sufficiently robust to adapt to sudden changes in the position of the object of interest. This may be because the camera is often zoomed in too far to react to the sudden changes. For example, it is not uncommon in a video conferencing system for participants to move within their seats, such as by leaning forward or backward, or to one side or the other. If a PTZ camera is zoomed in too far on a particular participant, a relatively small movement of the participant may cause the PTZ camera to lose track of that participant, necessitating zoom-out and re-track operations that will be distracting to a viewer of the resultant output video signal.
Initially, control systems for PTZ cameras in a video conferencing system required an operator to make manual adjustments to the camera to maintain the focus on the current speaker. Increasingly, however, users of video conferencing systems demand hands-free operation, where the control of the PTZ camera must be fully automatic. A number of techniques have been proposed or suggested for automatically detecting a person based on audio and video information. An audio locator processes audio information obtained from an array of microphones and determines the position of a speaker. Specifically, when the relative microphone positions are known, the position of the sound source can be determined from the estimated propagation time differences of sound waves from a single source using well-known triangulation techniques.
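As an illustration of how a propagation time difference maps to a direction, the following minimal sketch estimates a bearing from a single microphone pair under a far-field assumption. The function name, the speed-of-sound constant and the sign convention are assumptions made for this example and are not taken from any particular audio locator; a full locator would combine bearings from several pairs of the array to triangulate a position.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 C (assumed value)


def bearing_from_delay(delay_s, mic_spacing_m, c=SPEED_OF_SOUND):
    """Far-field bearing, in radians, measured from the broadside of a mic pair.

    delay_s is the arrival time at microphone B minus the arrival time at
    microphone A, so a positive delay means the source lies toward A.
    """
    # The path-length difference equals spacing * sin(bearing).
    ratio = c * delay_s / mic_spacing_m
    # Clamp so that measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.asin(ratio)


if __name__ == "__main__":
    spacing = 0.2  # metres between the two microphones (hypothetical)
    delay = 0.1 / SPEED_OF_SOUND  # a 0.1 m path difference
    print(math.degrees(bearing_from_delay(delay, spacing)))
```

A zero delay corresponds to a source directly in front of the pair, and a delay equal to the spacing divided by the speed of sound corresponds to a source on the axis through the two microphones.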
Similarly, a video locator locates one or more objects of interest in a video image. In the context of a video conferencing system, the objects of interest are the head and shoulders of the speakers. The video locator frames the head and shoulders of the speaker using information about the head size and location of the speaker in the image. A number of well-known techniques are available for detecting the location of a person in an image, including skin tone detection, face detection and background subtraction. For a more detailed discussion of these techniques for detecting the location of a person in an image, see, for example, "Face Recognition: From Theory to Applications" (NATO ASI Series, Springer Verlag, New York, H. Wechsler et al., editors, 1998), incorporated by reference herein.
A need therefore exists for an improved technique that can detect persons in image processing systems, such as video conferencing systems. A further need exists for methods and apparatus for detecting persons in such image processing systems with a reduced computational load.
Generally, methods and apparatus are disclosed for tracking an object of interest in a video processing system, using clustering techniques. Specifically, the present invention partitions an area into approximate regions, referred to as clusters, that are each associated with an object of interest. Each cluster has associated with it average pan, tilt and zoom values. In an illustrative video conference implementation, audio or video information, or both, is used to identify the cluster associated with a speaker. Once the cluster of the speaker is identified, the camera is focused on the cluster, using the recorded pan, tilt and zoom values, if available.
In one implementation, an event accumulator initially accumulates audio (and optionally video) events for a specified time, such as approximately 3 to 5 seconds, to allow several speakers to speak. The accumulated audio events are then used by a cluster generator to generate clusters associated with the various objects of interest. The illustrative cluster generator utilizes two stages, namely, an unsupervised clustering stage, such as a subtractive clustering technique, and a supervised clustering stage, such as an iterative optimization-based clustering technique (e.g., K-means clustering). Once the initial clusters are formed, they are indexed into a position history database, with the pan and tilt values (and the zoom factor, if available) recorded for each cluster set equal to the corresponding cluster mean values.
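The two-stage cluster generation can be sketched as follows, under several assumptions. Events are taken to be (pan, tilt) pairs in degrees. The first stage is a simplified subtractive clustering that stops when the highest remaining potential falls below a fixed fraction of the first peak, rather than using a full accept/reject test. The second stage is a plain K-means refinement seeded with the first-stage centers. All function names, the radius and the stopping ratio are hypothetical choices for illustration only.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def subtractive_centers(points, radius, stop_ratio=0.15):
    """Stage 1: estimate the number of clusters and their initial centers."""
    if not points:
        return []
    alpha = 4.0 / radius ** 2
    beta = 4.0 / (1.5 * radius) ** 2  # wider radius for the subtraction step
    # The potential of a point grows with the number of close neighbours.
    potential = [sum(math.exp(-alpha * sq_dist(p, q)) for q in points)
                 for p in points]
    first_peak = max(potential)
    centers = []
    while True:
        best = max(range(len(points)), key=potential.__getitem__)
        peak = potential[best]
        if peak < stop_ratio * first_peak:
            break
        centers.append(points[best])
        # Suppress the potential around the center that was just chosen.
        potential = [pot - peak * math.exp(-beta * sq_dist(p, points[best]))
                     for p, pot in zip(points, potential)]
    return centers


def kmeans(points, centers, iterations=20):
    """Stage 2: refine the seeded centers into cluster means."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        updated = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


if __name__ == "__main__":
    # Hypothetical accumulated events from two speakers, as (pan, tilt).
    events = [(-30, 5), (-31, 4), (-29, 6), (-30, 4),
              (25, 4), (26, 5), (24, 3), (25, 5)]
    seeds = subtractive_centers(events, radius=10.0)
    print(kmeans(events, seeds))
```

The resulting means are the values that would be stored as the initial entries of a position history database. A zoom value, where available, could be carried as a third coordinate without changing the code.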
After initialization of the clusters, the illustrative event accumulator gathers events at periodic intervals, such as every 2 seconds. The mean of the pan and tilt values (and zoom value, if available) occurring in each time interval is then used by a similarity estimator to compute the distance (e.g., Euclidean distance) to each of the various clusters in the database, and the distance is compared against an empirically set threshold. If the distance to the nearest cluster is greater than the established threshold, then a new cluster is formed, corresponding to a new speaker, and indexed into the database. Otherwise, the camera is focused on the identified cluster.
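The threshold test can be sketched as below. The sketch reads the distance as the Euclidean distance between the mean of the events in one interval and each stored cluster mean, which is an interpretation. The function names, the in-memory list standing in for the position history database and the threshold value are assumptions for illustration.

```python
import math


def mean_event(events):
    """Mean of the (pan, tilt) events gathered during one interval."""
    count = len(events)
    return tuple(sum(axis) / count for axis in zip(*events))


def match_or_create(clusters, events, threshold):
    """Return (cluster_index, created) for one interval of events.

    clusters is a list of stored cluster means and is extended in place
    when no stored cluster lies within the threshold.
    """
    observed = mean_event(events)
    distances = [math.dist(observed, center) for center in clusters]
    if distances and min(distances) <= threshold:
        # Close enough to a known cluster: the camera would be aimed at it.
        return distances.index(min(distances)), False
    # Otherwise treat the events as a new speaker and index a new cluster.
    clusters.append(observed)
    return len(clusters) - 1, True


if __name__ == "__main__":
    database = [(-30.0, 4.75), (25.0, 4.25)]  # hypothetical stored means
    print(match_or_create(database, [(24, 4), (26, 5)], threshold=8.0))
    print(match_or_create(database, [(0, 10), (2, 12)], threshold=8.0))
    print(database)
```

In practice the threshold would be tuned empirically, for example against the typical angular spacing between seated participants.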
In a further variation, fuzzy clustering techniques are employed to focus the camera on more than one cluster at a given time, when the object of interest may be located in one or more clusters. Generally, a membership value is assigned to each cluster that indicates the likelihood that a given data point belongs to the cluster. If the membership value does not clearly suggest a particular cluster, then the camera may be simultaneously focused on the plurality of clusters with the highest membership values.
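One way to compute such membership values is the fuzzy c-means membership formula, shown in the sketch below. The choice of that formula, the fuzziness exponent, the dominance cutoff and the decision to frame the two highest-ranked clusters are all assumptions made for this example and are not stated requirements of the technique described above.

```python
import math


def memberships(point, centers, fuzziness=2.0):
    """Fuzzy c-means style membership of one point in every cluster.

    The values lie between 0 and 1 and sum to 1 across the clusters.
    """
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:
        # The point coincides with one or more centers: share membership.
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (fuzziness - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [value / total for value in inverse]


def clusters_to_frame(point, centers, dominance=0.7):
    """Indices of the cluster, or clusters, the camera should cover."""
    u = memberships(point, centers)
    ranked = sorted(range(len(u)), key=u.__getitem__, reverse=True)
    if u[ranked[0]] >= dominance:
        return [ranked[0]]  # one cluster clearly dominates
    return sorted(ranked[:2])  # ambiguous: cover the two most likely


if __name__ == "__main__":
    centers = [(-30.0, 5.0), (0.0, 5.0), (30.0, 5.0)]  # hypothetical means
    print(clusters_to_frame((29.0, 5.0), centers))
    print(clusters_to_frame((15.0, 5.0), centers))
```

When two clusters are returned, a camera controller could choose pan, tilt and zoom values that cover both regions, for instance by aiming midway between them and zooming out.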