The human hand has 27 degrees of freedom (DoF): four in each finger (three for extension and flexion and one for abduction and adduction), five in the thumb, which is more complicated, and six for the rotation and translation of the wrist. Capturing hand and finger motion in video sequences is highly challenging because of the large number of DoF in the hand kinematics. The task is even harder on hand-held smart devices, where power is limited and the required computations are expensive.
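As a quick check of the count given above, the following sketch tallies the DoF from the per-part figures in the text. The constant and function names are illustrative only, and the split of the six wrist DoF into three rotational and three translational is the usual rigid-body assumption rather than something stated in the source.

```python
# Tally of the hand's degrees of freedom, using the figures quoted in the text.
FINGERS = 4                    # index, middle, ring and little finger
FLEXION_DOF_PER_FINGER = 3     # extension/flexion
ABDUCTION_DOF_PER_FINGER = 1   # abduction/adduction
THUMB_DOF = 5
WRIST_DOF = 3 + 3              # assumed split: 3 rotation + 3 translation


def hand_dof() -> int:
    """Return the total DoF of the hand model described above."""
    per_finger = FLEXION_DOF_PER_FINGER + ABDUCTION_DOF_PER_FINGER
    return FINGERS * per_finger + THUMB_DOF + WRIST_DOF


total_dof = hand_dof()
print(total_dof)
```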
Common existing solutions follow the steps illustrated in FIG. 1. The query image sequence captured by one or more sensors is analyzed to segment the user's hand and fingers. Image analysis algorithms, such as background removal, classification, and feature detection, are used to detect the hand and fingers. Existing algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model based approaches (US2010053151A1, US2010159981A1, WO2012135545A1, and US2012062558A1).
The former are based on a direct comparison of hand gestures with 2D image features. Popular image features used to detect human gestures include hand colors and shapes, local hand features, and so on. The drawback of such feature-based approaches is that clean image segmentation is generally required in order to extract the hand features. This is not a trivial task when, for instance, the background is cluttered. Furthermore, human hands are highly articulated. It is often difficult to find local hand features because of self-occlusion, and some form of heuristics is needed to handle the large variety of hand gestures.
Instead of employing 2D image features to represent the hand directly, 3D hand model based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by the 3D hand model with the observed image from the camera. Generally, it is easier to achieve real-time performance with appearance-based approaches because the 2D image features are simpler. However, this type of approach can handle only simple hand gestures, such as detection and tracking of fingertips. In contrast, 3D hand model based approaches offer a rich description that potentially allows a wide class of hand gestures. The main challenge is that the 3D hand is a complex 27-DoF deformable model.
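The analysis-by-synthesis idea can be illustrated with a deliberately small sketch. It is not the method of any cited reference: a planar three-joint finger stands in for the 27-DoF hand, synthetic joint positions stand in for the camera image, and a simple coordinate pattern search is assumed as the optimizer. All names and link lengths are hypothetical.

```python
import math

# Hypothetical link lengths (cm) for a planar three-joint finger; a toy
# stand-in for the full 27-DoF hand model.
LINK_LENGTHS = (4.0, 3.0, 2.0)


def render(angles):
    """Synthesis: forward kinematics from joint angles to 2D joint positions."""
    x = y = heading = 0.0
    points = []
    for length, angle in zip(LINK_LENGTHS, angles):
        heading += angle  # each joint rotates relative to the previous link
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points


def discrepancy(angles, observed):
    """Analysis: squared distance between rendered and observed positions."""
    return sum((px - ox) ** 2 + (py - oy) ** 2
               for (px, py), (ox, oy) in zip(render(angles), observed))


def fit_pose(observed, step=0.5, min_step=1e-6, max_iters=10_000):
    """Coordinate pattern search: nudge one angle at a time, keep improvements."""
    angles = [0.0] * len(LINK_LENGTHS)
    best = discrepancy(angles, observed)
    for _ in range(max_iters):
        if step < min_step:
            break
        improved = False
        for i in range(len(angles)):
            for delta in (step, -step):
                trial = list(angles)
                trial[i] += delta
                err = discrepancy(trial, observed)
                if err < best:
                    angles, best, improved = trial, err, True
        if not improved:
            step /= 2.0  # refine the search once no neighbor is better
    return angles, best


# Synthetic "observation": joint positions produced by a known pose.
true_angles = [0.3, 0.4, 0.2]
observed = render(true_angles)
initial_error = discrepancy([0.0, 0.0, 0.0], observed)
estimated, final_error = fit_pose(observed)
print([round(a, 3) for a in estimated], final_error)
```

Even in this toy setting every candidate pose requires a full render and comparison, which hints at why the same loop becomes costly when the model has 27 parameters and the observation is an image.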
Covering all the characteristic hand images under different views therefore requires a very large database. Matching the query images from the video input against all hand images in the database is time-consuming and computationally expensive. This is why most existing 3D hand model based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions.
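A rough sense of the database problem can be had from the following sketch. It assumes, purely for illustration, that each DoF is sampled on a uniform grid and that matching is a brute-force nearest-neighbor scan over feature vectors; neither assumption comes from the source, and the function names are hypothetical.

```python
def grid_database_size(dof, samples_per_dof):
    """Entries needed if every DoF is sampled independently on a grid."""
    return samples_per_dof ** dof


def brute_force_match(query, database):
    """Linear scan: compare the query against every stored feature vector."""
    best_index, best_dist = -1, float("inf")
    comparisons = 0
    for index, entry in enumerate(database):
        comparisons += 1
        dist = sum((q - e) ** 2 for q, e in zip(query, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, comparisons


# Database size grows exponentially with the number of DoF.
for samples in (2, 3, 5):
    print(samples, grid_database_size(27, samples))

# A three-entry toy database; real entries would be hand-image descriptors.
toy_database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
match_index, comparisons = brute_force_match([0.9, 1.2], toy_database)
print(match_index, comparisons)
```

The scan touches every entry once per query frame, so its cost scales linearly with a database whose size, under the grid assumption, scales exponentially with the DoF count.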