1. Field of the Invention
The present invention relates in general to object detection and tracking, and in particular to a system and method for visually tracking three-dimensional (3-D) articulated self-occluded objects, or multiple occluded objects, in real time using dense disparity maps.
2. Related Art
Efficient and accurate tracking of non-rigid motion of three-dimensional (3-D) objects in image sequences is very desirable in the computer vision field. Non-rigid tracking is generally divided into studies of two categories of motion: deformable object motion and the motion of an articulated object. The latter is of great interest to the HCI (human-computer interaction) community because the human body is an articulated object. Current commercially available motion capture systems are based on magnetic or optical trackers. These trackers usually require a human subject (the articulated object to be tracked) to wear a special suit with markers on it or require the human subject to be directly attached to the system by cumbersome cables. Therefore, what is needed is a passive sensing object tracking system that is convenient, less constraining, and suitable for a variety of object tracking uses.
Although there have been several vision-based approaches to human body tracking, ranging from detailed model-based approaches to simplified but fast statistical algorithms and cardboard models, there is a need for an accurate and efficient system for real-time tracking of articulated objects. For example, in one previous system, “Pfinder: real-time tracking of the human body” by C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780-5, July 1997, two-dimensional (2-D) tracking was achieved with Gaussian modeling of a set of pixels in the image sharing some kind of similarity. However, this system was limited due to the 2-D based Gaussian modeling. As such, what is also needed is an object tracking system that uses 3-D Gaussian modeling. It should be noted that in this prior system and in most object tracking systems each set of pixels is commonly referred to as a “blob”.
In “Real-time self-calibrating stereo person tracking using 3-D shape estimation from blob features” by A. Azarbayejani, A. Pentland, Proceedings of the 13th International Conference on Pattern Recognition, vol. 3, pp. 627-32, 1996, in order to track human motion in full 3-D, the above approach was extended by using input from two cameras for upper body tracking. However, in this system, only the hands and head were tracked while the position and orientation of the torso and lower and upper arms were ambiguous, and the two cameras used in this system were not used to calculate a dense disparity map, but rather to estimate 2-D blob parameters in each image. Thus, what is additionally needed is an articulated object tracking system that calculates disparity maps and 3-D object maps.
In “Dynamic models of human motion” by C. Wren, A. Pentland, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 22-7, 1998, an Extended Kalman Filter was used to impose articulation constraints on portions of the body to provide a guess about the full posture. However, since only three points are measured on a human body, there was not enough information for unambiguous posture tracking. Typically, knowledge of the dynamics of human motion is helpful for tracking, as discussed in the “Dynamic models of human motion” reference. Therefore what is further needed is an object tracking system that can provide enough information for unambiguous posture tracking.
In “Model-based tracking of self-occluding articulated objects” by J. Rehg and T. Kanade, Proceedings 1995 International Conference on Computer Vision, pp. 35-46, 1995, a model-based approach to tracking self-occluding articulated structures was proposed. However, the algorithm was based on template matching and was sensitive to lighting changes. Hence, what is also needed is a tracking system that uses stereo cues so that the disparity computed based on correlation is less sensitive to the intensity changes.
Whatever the merits of the above-mentioned systems and methods, they do not achieve the benefits of the present invention.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention is embodied in a system and method for digitally tracking objects in real time. The present invention visually tracks three-dimensional (3-D) objects in real-time using dense disparity maps.
The present invention digitally tracks articulated objects, such as a human body, by digitally segmenting and modeling different body parts using statistical models defined by multiple size parameters, position and orientation. In addition, the present invention is embodied in a system and method for recognizing mutual occlusions of body parts and filling in data for the occluded parts while tracking a human body. The body parts are preferably tracked from frame to frame in image sequences as an articulated structure in which the body parts are connected at the joints instead of as individual objects moving and changing shape and orientation freely.
Specifically, the present invention uses input from multiple cameras suitably spaced apart. A disparity map is computed at a predefined frame rate with well known and readily available techniques, such as the technique described in “Background modeling for segmentation of video-rate stereo sequences” by C. Eveland, K. Konolige, R. Bolles, Proceedings 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 266-71, 1998, which is herein incorporated by reference. The present invention uses this information based on a generative statistical model of image formation specifically tuned to fast tracking in the presence of self-occlusions among the articulated 3-D Gaussian models.
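As an illustration only, the relationship between a disparity value and a 3-D point can be sketched as follows. The sketch assumes an idealized, rectified two-camera pinhole setup, which the specification does not prescribe; the function name and the parameters `focal_px`, `baseline_m`, `cx`, and `cy` are hypothetical and chosen for this example.

```python
def disparity_to_point(u, v, disparity, focal_px, baseline_m, cx, cy):
    """Back-project pixel (u, v) with the given disparity to a 3-D point.

    Assumes rectified pinhole cameras: depth Z = focal_px * baseline_m / disparity.
    All parameter names are illustrative assumptions, not taken from the source.
    """
    if disparity <= 0:
        return None  # no valid stereo match at this pixel
    z = focal_px * baseline_m / disparity
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


# Example: a pixel 80 columns right of the principal point, disparity of 20 pixels.
point = disparity_to_point(u=400, v=240, disparity=20.0,
                           focal_px=500.0, baseline_m=0.1, cx=320.0, cy=240.0)
```

Applying such a conversion to every pixel with a valid disparity would yield the set of 3-D points that the Gaussian models are fitted to.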
This model is a graphical model, such as a Bayesian network, of a kind used in diverse applications to formalize generative processes in ways that allow probabilistic inference. Preferably, a maximum likelihood estimate of the posture of an articulated structure is achieved by a simplified, but very fast inference technique that consists of two stages. In the first stage, the disparity map is segmented into different parts of the articulated structure based on the estimated state of the Gaussian mixture using the maximum likelihood principle with an additional mechanism for filling in the missing data due to occlusions. The statistical properties of the individual parts are then re-estimated. In the second stage of the inference technique of the present invention, an extended Kalman Filter (EKF) enforces the articulation constraints and can also improve the tracking performance by modeling the dynamics of the tracked object. Also, it should be noted that the present invention can be simplified and used to track independent self-occluding objects without articulation constraints.
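The first stage described above can be sketched, under simplifying assumptions, as a maximum likelihood assignment of 3-D points to Gaussian blobs followed by re-estimation of each blob's statistics. This is a minimal sketch and not the claimed implementation: it assumes axis-aligned (diagonal covariance) blobs, so blob orientation is not modeled, and it omits both the mechanism for filling in occluded data and the second-stage EKF. The names `log_gaussian`, `segment`, and `reestimate` are hypothetical.

```python
import math


def log_gaussian(point, mean, var):
    """Log-density of a 3-D Gaussian with diagonal covariance (assumption)."""
    total = 0.0
    for x, m, v in zip(point, mean, var):
        total += -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
    return total


def segment(points, blobs):
    """Assign each 3-D point to the blob with the highest log-likelihood."""
    labels = []
    for p in points:
        scores = [log_gaussian(p, b["mean"], b["var"]) for b in blobs]
        labels.append(scores.index(max(scores)))
    return labels


def reestimate(points, labels, blobs, min_var=1e-4):
    """Re-estimate each blob's mean and per-axis variance from its points."""
    updated = []
    for k, blob in enumerate(blobs):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(blob)  # no supporting points: keep previous estimate
            continue
        n = len(members)
        mean = tuple(sum(p[i] for p in members) / n for i in range(3))
        # Floor the variance so a blob never collapses to zero width.
        var = tuple(
            max(sum((p[i] - mean[i]) ** 2 for p in members) / n, min_var)
            for i in range(3)
        )
        updated.append({"mean": mean, "var": var})
    return updated


# Two small clusters of 3-D points and two initial blob estimates.
points = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (1.0, 1.0, 2.0), (1.1, 1.0, 2.0)]
blobs = [
    {"mean": (0.0, 0.0, 1.0), "var": (0.1, 0.1, 0.1)},
    {"mean": (1.0, 1.0, 2.0), "var": (0.1, 0.1, 0.1)},
]
labels = segment(points, blobs)
blobs = reestimate(points, labels, blobs)
```

In a tracking loop, the re-estimated blob statistics from one frame would serve as the measurements passed to the second stage, where the articulation constraints are enforced.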