Several video tracking systems are well known in the art. However, the video tracking systems heretofore known lack many of the functional, performance, and robustness capabilities of the present invention.
The method of Harakawa, U.S. Pat. No. 6,434,255, also utilizes two video sensors, but requires specialized infrared cameras. Furthermore, additional hardware is required to provide infrared illumination of the user. Finally, the system needs a large mechanized calibration apparatus that involves moving a large marking plate through the space that is later occupied by the user. During the calibration procedure, the movement of the plate has to be precisely controlled by the computer.
The method of Hildreth et al., International Patent WO 02/07839 A2, determines the 3D locations of objects in the view of cameras by first extracting salient features from each image and then pairing up these two sets to find points in each of the two images that correspond to the same point in space. It is well known in the art that this feature matching approach takes a lot of computational resources and that it easily fails in situations where no or very few clean feature sets can be extracted, or where occlusion prevents pairing of a feature in one image with a point in the second image. It is common to pair two features that do not correspond to the same location in space, yielding an entirely incorrect 3D location estimate. Finally, their method requires additional processing based on the calculated stereo information to determine the actual location of the object to be tracked, with many more computational steps than the present invention.
The method of Darrell et al., US 2001/0000025 A1, is likewise based on two cameras but requires the calculation of a disparity image, which faces exactly the same challenges as the above-described method of Hildreth et al.
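By way of illustration only, the sensitivity of a disparity-based approach to incorrect feature pairing may be sketched as follows. The sketch assumes a rectified stereo pair and the standard relation Z = f * B / d between depth Z, focal length f, baseline B, and disparity d; the function name and the numeric values for f, B, and d are hypothetical and are not taken from any of the cited references.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (meters) of a point seen in a rectified stereo pair.

    Uses the standard relation Z = f * B / d. All parameter values
    passed in below are hypothetical, chosen only for illustration.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera rig: 800-pixel focal length, 10 cm baseline.
FOCAL_PX = 800.0
BASELINE_M = 0.1

# A correctly paired feature with a disparity of 40 pixels.
correct_depth = depth_from_disparity(40.0, FOCAL_PX, BASELINE_M)

# The same feature paired with the wrong point in the second image,
# so that the measured disparity is 8 pixels instead of 40.
mismatched_depth = depth_from_disparity(8.0, FOCAL_PX, BASELINE_M)

print(correct_depth, mismatched_depth)
```

Under these assumed values, a single incorrect pairing moves the depth estimate from about 2 meters to about 10 meters, which is the kind of entirely incorrect 3D location estimate referred to above.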
The methods of Bradski, U.S. Pat. No. 6,394,557 and U.S. Pat. No. 6,363,160, are based on using color information to track the head or hand of a person in the view of a single camera. The use of a single camera does not yield any 3D coordinates of the objects that are being tracked. Furthermore, it is well known that the use of only color information and a single camera is in general insufficient to track small, fast-moving objects in cluttered environments; their method is hence much less general and only workable in certain specialized environments. In particular, their method will fail if, for example, the user holds his hand in front of his face.
The method of Crabtree et al., U.S. Pat. No. 6,263,088, is also based on a single camera and is designed to track people in a room seen from above. The use of a single camera does not yield any 3D coordinates of the objects that are being tracked.
The method of Jolly et al., U.S. Pat. No. 6,259,802, is also based on a single camera and requires a means to extract and process contour information from an image. Contour extraction is both time consuming and prone to error.
The method of Qian et al., U.S. Pat. No. 6,404,900, is designed to track human faces in the presence of multiple people. The method is also based on a single camera, yielding no 3D information, utilizes only color information, and is highly specialized to head tracking, making it unsuitable for alternative application domains and targets.
The method of Sun et al., U.S. Pat. No. 6,272,250, is also based on a single camera or video and requires an elaborate color clustering approach, making their method computationally expensive and not suitable for tracking general targets in 3D.
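The loss of 3D information in a single camera view may be illustrated with a minimal sketch, assuming an ideal pinhole camera model. The model, the function name, the focal length, and the point coordinates are assumptions made here for illustration and are not described in the cited references.

```python
def project_pinhole(point_xyz, focal_px):
    """Project a 3D point (camera coordinates, meters) to pixel offsets.

    Ideal pinhole model: u = f * X / Z, v = f * Y / Z. The model and
    the values used below are assumptions made for illustration.
    """
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_px * x / z, focal_px * y / z)


FOCAL_PX = 800.0  # hypothetical focal length in pixels

# Two different 3D points on the same viewing ray: the second is
# twice as far from the camera as the first.
near_point = (0.5, 0.25, 1.0)
far_point = (1.0, 0.5, 2.0)

near_pixel = project_pinhole(near_point, FOCAL_PX)
far_pixel = project_pinhole(far_point, FOCAL_PX)

# Both points land on the same pixel, so the image alone cannot
# distinguish them.
print(near_pixel, far_pixel, near_pixel == far_pixel)
```

Because every point along a viewing ray maps to the same image location under this model, the distance of a tracked object from the camera cannot be recovered from one image without additional constraints, such as a second camera or a model of the object.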
The method of Moeslund T. B., et al., 4th IEEE Int. Conf. Automatic Face and Gesture Rec., 2000, p. 362-367, utilizes color segmentation of the hand and the head in two cameras. This approach fails if the segments of head and hand come too close to each other.
The methods of Goncalves L., et al., Proc. International Conference on Computer Vision, 1995, p. 764-770, and Filova V., et al., Machine Vision and Application, 1998, 10: p. 223-231, perform model-based tracking of a human arm in a single camera view. This approach obtains 3D information even from a single camera image; however, model-based tracking as described in their papers is computationally extremely expensive and not suitable for practical application. Furthermore, the operating conditions are very constrained, requiring the person whose arm is tracked to assume a very specific pose with respect to the camera.
The method of Wu A., et al., 4th IEEE Int. Conf. Automatic Face and Gesture Rec., 2000, p. 536-542, is also a model-based approach and requires the detection of a user's elbow and shoulder, which is difficult to perform outside of very constrained environments. More specifically, their method is based on skin color cues and implicitly assumes that the user whose arm is being tracked wears a short-sleeved shirt, thus very much limiting the domain in which their method would be useful.
The method of Ahmad S., A Usable Real-Time 3D Hand Tracker, IEEE Asian Conference, 1994, is able to track a human hand held between a camera and a table, where the camera is pointed at the table with the imaging sensor parallel to the table surface. Their method is very specific in that it is only usable in a situation where the user, whose hand is being tracked, is sitting at a table with his hand at a particular location held in a particular pose, and thus lacks generality.