In order to track a real-world object, it has long been proposed to use data processing devices connected to imaging devices and programmed so as to track the object in a video sequence produced by the imaging device and comprising a sequence of successive frames, each frame comprising a pixel array.
For instance, the article “Tracking by Cluster Analysis of Feature Points using a Mixture Particle Filter”, by Wei Du and Justus Piater, disclosed a method for tracking an object in a video sequence, using the Harris corner detector and the Lucas-Kanade tracker. However, since this method is applied to a bidimensional video sequence without pixel depth information, its performance is limited despite considerable data processing requirements.
Some other relevant papers disclosing methods for tracking one or several objects in video sequences with bidimensional pixel arrays are:
S. McKenna, S. Jabri, Z. Duric and H. Wechsler, “Tracking Groups of People”, Computer Vision and Image Understanding, 2000.
F. Brémond and M. Thonnat, “Tracking multiple nonrigid objects in video sequences”, IEEE Trans. on Circuits and Systems for Video Technology, 1998.
I. Haritaoglu, “A Real Time System for Detection and Tracking of People and Recognizing Their Activities”, University of Maryland, 1998.
G. Pingali, Y. Jean and A. Opalach, “Ball Tracking and Virtual Replays for Innovative Tennis Broadcasts”, 15th Int. Conference on Pattern Recognition.
However, since these tracking methods are carried out on 2D video sequences without any direct pixel depth information, their performance is necessarily limited: image segmentation can only be based on other object attributes such as colour, shape or texture.
It has already been proposed, for instance in International Patent Application WO 2008/128568, to use 3D imaging systems providing video sequences wherein a depth value is associated with each pixel of each frame. Such a tracking method generates more, and more useful, positional information about a tracked object than one based on purely two-dimensional images. In particular, the use of 3D imaging systems facilitates the discrimination between foreground and background. However, the disclosed method does not address the problem of tracking more than one object, and in particular that of tracking an object at least partially occluded by another object in the field of view of the 3D imaging system. In WO 2008/128568, a method for recognising a volume within three-dimensional space is disclosed in which three-dimensional image data comprises a plurality of points within the three-dimensional space. These points are clustered and a cluster is selected as a point of interest. The points within the selected cluster are re-grouped into sub-clusters, each of which has a centroid and a volume associated with the centroid. Centroids can be connected to form a network indicative of an object, and each extremity is identified as a centroid that is connected to only one other centroid.
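The extremity criterion described for WO 2008/128568 (a centroid connected to only one other centroid) can be illustrated with a minimal Python sketch. This is not the implementation from that application: the function name `find_extremities`, the edge-list representation and the centroid names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_extremities(edges):
    """Return, in sorted order, the centroids linked to exactly one other centroid.

    `edges` is an iterable of (centroid_a, centroid_b) pairs describing an
    undirected network of sub-cluster centroids (a hypothetical representation).
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore degenerate self-links
            neighbours[a].add(b)
            neighbours[b].add(a)
    return sorted(node for node, adj in neighbours.items() if len(adj) == 1)


# Hypothetical centroid network loosely resembling a human body.
edges = [
    ("head", "torso"),
    ("torso", "left_hand"),
    ("torso", "right_hand"),
    ("torso", "hips"),
    ("hips", "left_foot"),
    ("hips", "right_foot"),
]
extremities = find_extremities(edges)
```

In this example the torso and hips each have several neighbours, so only the head, hands and feet are reported as extremities.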
Other tracking methods using 3D video sequences, but which fail to address the occlusion problem, have been disclosed by A. Azarbayejani and C. Wren in “Real-Time 3D Tracking of the Human Body”, Proc. of Image'com, 1996; and by T. Olson and F. Brill in “Moving Object Detection and Event Recognition Algorithms For Smart Cameras”, Proc. Image Understanding Workshop, 1997.
A number of other disclosures have addressed this occlusion problem. Various methods have been presented by Pierre F. Gabriel, Jacques G. Verly, Justus H. Piater, and André Genon of the Department of Electrical Engineering and Computer Science of the University of Liège in their review “The State of the Art in Multiple Object Tracking Under Occlusion in Video Sequences”.
A. Elgammal and L. S. Davis in “Probabilistic framework for segmenting people under occlusion”, Proc. of IEEE 8th International Conference on Computer Vision, 2001; I. Haritaoglu, D. Harwood and L. Davis in “Hydra: Multiple People Detection and Tracking”, Workshop of Video Surveillance, 1999; S. Khan and M. Shah in “Tracking People in Presence of Occlusion”, Asian Conference on Computer Vision, 2000; H. K. Roh and S. W. Lee in “Multiple People Tracking Using an Appearance Model Based on Temporal Color”, International Conference on Pattern Recognition, 2000; and A. W. Senior, A. Hampapur, L. M. Brown, Y. Tian, S. Pankanti and R. M. Bolle in “Appearance Models for Occlusion Handling”, 2nd International Workshop on Performance Evaluation of Tracking and Surveillance Systems, 2001, have disclosed tracking methods addressing this occlusion problem. However, as all these methods are based on 2D or stereo video sequences comprising only bidimensional pixel arrays without any depth data, their performance is limited.
A. F. Bobick et al. in “The KidsRoom: A perceptually based interactive and immersive story environment”, Teleoperators and Virtual Environment, 1999; R. T. Collins, A. J. Lipton, and T. Kanade in “A System for Video Surveillance and Monitoring”, Proc. 8th International Topical Meeting on Robotics and Remote Systems, 1999; W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee in “Using adaptive tracking to classify and monitor activities in a site”, Computer Society Conference on Computer Vision and Pattern Recognition; as well as A. Bevilacqua, L. Di Stefano and P. Tazzari in “People tracking using a time-of-flight depth sensor”, IEEE International Conference on Video and Signal Based Surveillance, 2006, disclosed object tracking methods based on a top-down scene view. However, as a result, the information available about the tracked object, in particular when it is a human user, is limited.
Dan Witzner Hansen, Mads Syska Hansen, Martin Kirschmeyer, Rasmus Larsen, and Davide Silvestre, in “Cluster tracking with time-of-flight cameras”, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, disclosed an object tracking method in which the objects are also tracked in a homographic plane, i.e. in a “top-down” view. This method uses an Expectation Maximisation algorithm. However, it is also insufficiently adapted for gesture recognition if the tracked objects are human users.
Leila Sabeti, Ehsan Parvizi and Q. M. Jonathan Wu also presented an object tracking method using a 3D video sequence with pixel depth data in “Visual Tracking Using Colour Cameras and Time-of-Flight Range Imaging Sensors”, Journal of Multimedia, Vol. 3, No. 2, June 2008. However, this method, which uses a Monte-Carlo-based “particle filter” tracking method, also requires considerable data processing resources.
US 2006/239558 discloses a three-dimensional imaging system that produces an image of a scene. Pixels within the image of the scene are labelled according to which object in the scene they are related to, and are assigned a value. Groups of pixels having the same label are grouped to form “blobs”, each blob corresponding to a different object. Once the blobs are defined, they are modelled or quantised into variously shaped primitives, such as circles or rectangles, or other predefined objects, such as a person, an animal or a vehicle. Clustering of the pixels in the scene, together with their associated depth values, is used to determine whether a pixel belongs to a particular cluster in accordance with its depth value. If the pixel is at the same depth as a neighbouring pixel, it is therefore assigned the same label as the cluster to which the neighbouring pixel belongs.
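The depth-based labelling rule described for US 2006/239558 (a pixel takes the label of a neighbouring pixel at the same depth) can be sketched as a flood fill over a depth map. The following Python sketch is only an illustration under stated assumptions, not the method of that publication: the function name `label_by_depth`, the use of 4-connectivity, and the tolerance `tol` used to decide that two depths are "the same" are all hypothetical.

```python
from collections import deque


def label_by_depth(depth, tol=0.1):
    """Label pixels so that 4-connected neighbours of similar depth share a label.

    `depth` is a rectangular list of rows of depth values; `tol` is an assumed
    tolerance below which two neighbouring depths are treated as equal.
    """
    rows, cols = len(depth), len(depth[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means "not yet labelled"
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and abs(depth[ny][nx] - depth[y][x]) <= tol):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels


# Hypothetical depth map in metres: a far wall, a near object, one isolated pixel.
depth = [
    [3.0, 3.0, 1.0, 1.0],
    [3.0, 3.0, 1.0, 1.0],
    [3.0, 3.0, 3.0, 2.0],
]
labels = label_by_depth(depth)
```

Each resulting label corresponds to one "blob" in the sense used above. Note that two objects touching at the same depth would merge into a single blob under this rule.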
U.S. Pat. No. 6,771,818 discloses a method for identifying and locating people and objects of interest in a scene by selectively clustering distinct three-dimensional regions or “blobs” within the scene and comparing the “blob” clusters to a model for object recognition. An initial three-dimensional depth image of a scene of interest is generated. The spatial coordinates of the three-dimensional image pixels lie within the three-dimensional volume represented by the image. The identification and location of people or objects is determined by processing a working image obtained from a background subtraction process using the initial three-dimensional depth image and a live depth image, so that any pixel in the live depth image that differs significantly from the initial three-dimensional depth image becomes part of the working image, which contains a number of distinct three-dimensional regions or “blobs”. The “blobs” are processed to identify to which person or object each of the blobs belongs.
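The background subtraction step described for U.S. Pat. No. 6,771,818 can be illustrated with a minimal Python sketch that marks the pixels of a live depth image which differ significantly from an initial depth image. This is a sketch under stated assumptions rather than the patented process: the function name `working_mask` and the fixed `threshold` used to define a "significant" difference are hypothetical.

```python
def working_mask(baseline, live, threshold=0.2):
    """Return a boolean mask of pixels whose live depth differs from the baseline.

    `baseline` and `live` are equally sized lists of rows of depth values;
    `threshold` is an assumed minimum depth difference for a pixel to be
    kept in the working image.
    """
    return [
        [abs(l - b) > threshold for b, l in zip(base_row, live_row)]
        for base_row, live_row in zip(baseline, live)
    ]


# Hypothetical depth images in metres: an empty room, then a person in the middle column.
baseline = [[4.0, 4.0, 4.0],
            [4.0, 4.0, 4.0]]
live = [[4.0, 2.5, 4.05],
        [4.0, 2.4, 4.0]]
mask = working_mask(baseline, live)
```

Here only the middle column is retained, since the small deviation of 0.05 in the top-right pixel falls below the assumed threshold and is treated as sensor noise.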