Tracking in unconstrained and cluttered environments is a difficult task, especially when the object of interest is articulated. When a tracker fails, re-initialization is needed. In systems that are not self-initialized, the structure obtained from the manual initialization is assumed to be preserved over time so that the tracker can correct itself. Even if an online model of the object is kept, it is not easy to determine automatically whether, and when, the tracking has failed.
Tracking is a well-researched area in computer vision. There are numerous tracking systems, such as the Kanade-Lucas-Tomasi (KLT) Feature Tracker described in C. Tomasi and T. Kanade, Detection and tracking of point features, Technical Report CMU-CS-91-132, Carnegie Mellon University (April 1991), and the mean shift approach described in D. Comaniciu and P. Meer, Mean shift: A robust approach toward feature space analysis, Pattern Analysis and Machine Intelligence, 24(5):603-619, (2002). The common problems of the prior art algorithms are initialization and self-correction.
Another problem with prior art detectors is that they have a high number of false positives (approximately 2 false positives per detection), and detection is not possible in some frames because of self-occlusion or because the object of interest, such as finger primitives in low-resolution images, cannot be detected.
Hand tracking is important in many applications, including HCI, surveillance, gesture recognition and understanding human-human interactions. Prior art systems fail to track hands in uncontrolled environments, which limits them to simple gesture recognition problems with a uniform background. In these systems, initialization usually comes from stationary hands, either by observing the hand for a short time frame, by building a pre-computed skin color histogram through training, or by manual initialization. Most prior art systems are unable to recover when the tracking fails.
Although some prior art describes methods for hand detection alone, such as the methods described in M. Kolsch and M. Turk, Robust hand detection, In Proc. Intl. Conf. Hand and Gesture Recognition, (2004) pp. 614-619 and Q. Yuan, S. Sclaroff, and V. Athitsos, Automatic 2d hand tracking in video sequences, In Proc. IEEE work, Motion and Video Computing, (2005) pp. 250-256, in most prior art systems the hand location is either assumed to be known or found through simple skin detection.
The skin detector is trained either on skin color obtained from the face, as described in R. Rosales, V. Athitsos, L. Sigal, and S. Sclaroff, 3d hand pose reconstruction using specialized mappings, In Proc. Intl. Conf. Computer Vision, (2001), pp. 378-387, or, more frequently, on skin color extracted from training images, as described in Bretzner et al., Hand gesture recognition using multi-scale color features, hierarchical models and particle filtering, In Proc. Intl. Conf. Hand and Gesture Recognition, pp. 423-428, (2002). However, these prior art systems fail under poor lighting conditions or when tracking skin colors for which they were not trained.
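As an illustration only, the kind of histogram-based skin detector discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any cited reference: the function names, the 8-bin RGB quantization, the 0.05 threshold and the sample pixel values are all hypothetical and chosen for illustration.

```python
from collections import Counter

def quantize(pixel, bins=8):
    """Map an (r, g, b) pixel with 0-255 channels to a coarse color bin."""
    step = 256 // bins
    r, g, b = pixel
    return (r // step, g // step, b // step)

def train_skin_histogram(skin_pixels, bins=8):
    """Build a normalized color histogram from sample skin pixels
    (for example, pixels taken from a detected face region)."""
    counts = Counter(quantize(p, bins) for p in skin_pixels)
    total = sum(counts.values())
    return {color_bin: n / total for color_bin, n in counts.items()}

def is_skin(pixel, histogram, bins=8, threshold=0.05):
    """Label a pixel as skin if its color bin was frequent in training."""
    return histogram.get(quantize(pixel, bins), 0.0) >= threshold

# Hypothetical training samples: skin pixels under good lighting.
face_pixels = [(210, 160, 140), (205, 155, 135), (215, 165, 145), (200, 150, 130)]
histogram = train_skin_histogram(face_pixels)

well_lit = is_skin((208, 158, 138), histogram)   # close to the training colors
poorly_lit = is_skin((90, 60, 50), histogram)    # similar skin, dim lighting
background = is_skin((30, 30, 30), histogram)    # dark background pixel
```

In this sketch the dimly lit pixel falls into a color bin that never occurred in training, so it is rejected along with the background, which illustrates why detectors of this kind are sensitive to lighting and to skin colors absent from the training data.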
Aside from the many systems that employ skin-color-based hand detection (V. I. Pavlovic, R. Sharma, and T. Huang, Visual interpretation of hand gestures for human-computer interaction: A review, Pattern Analysis and Machine Intelligence, pp. 677-695, (1997); Rosales et al., 3d Hand pose reconstruction using specialized mappings, In Proc. Intl. Conf. Computer Vision, (2001), pp. 378-387; and Y. Wu and T. Huang, View-independent recognition of hand postures, In Proc. Intl. Conf. Computer Vision and Pattern Recognition, (2000), pp. 2088-2094), there are few systems that operate in an appearance-based detection framework.
Kolsch and Turk described a system for detecting hands based on AdaBoost. Their approach is view-specific and limited to a few postures of the hand. Yuan et al. disclosed a hand detector in which successful detection depends on continuous movement of the hands and which therefore requires video as input. Moreover, the hand posture is required to change within this video, such that there is no correspondence between the hand regions in the frames being compared. Bretzner et al. use a hand detector before recognizing the hand posture; however, their system performs poorly if a prior on skin color is not used.
Stenger et al. apply hand detection and tracking in the same framework (B. Stenger, Model-Based Hand Tracking Using A Hierarchical Bayesian Filter, PhD thesis, University of Cambridge, Cambridge, UK, (March 2004) and B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, Hand pose estimation using hierarchical detection, In Proc. Intl. work. Human Computer Interaction, (2004) pp. 105-116), but the detection is limited to both a specific skin color and a specific hand pose. Another tracking method, which builds a 3-D model of the hand, is described by Lin et al., Hand tracking using spatial gesture modeling and visual feedback for a virtual dj system, In Proc. Fourth intl. conf. Multimodal Interfaces, (2002), pp. 197-202. That method also applies detection and tracking, but skin color is used to extract the hand, thus reducing generality.
Moreover, the hand is assumed to exist in the scene and to extend from the right-hand side of the image. “U”-shaped fingertip patterns are used to detect fingertips in already detected hands. Another hand tracker is disclosed by Rehg et al., Visual tracking of high dof articulated structures: An application to human hand tracking, In Proceedings of the 3rd European Conference on Computer Vision (ECCV '94), volume II, (May 1994), pp. 35-46, where the authors describe methods and systems that track simple features of the hand instead of tracking the hand in its entirety. However, their tracking relies on a simple background.