Numerous applications in the field of computer-human interaction (CHI) have begun to incorporate modes of interaction with human users that go beyond the well-known keyboard and mouse interface. In particular, many virtual reality (VR) and augmented reality (AR) applications provide an interface through which a human user supplies multiple forms of input, including hand gestures. Hand gesture inputs include a wide range of hand movements, including both linear and rotational movements of the hand as well as movements of the individual fingers. Earlier input systems either received two-dimensional gestures through touch interfaces that track hand and finger movements in two dimensions, or required instrumented gloves with complex sensors that directly measure the pose of the hand in three dimensions. However, newer input device technologies, including three-dimensional depth cameras, now enable the generation of three-dimensional depth map data for a hand of a user that moves freely in a three-dimensional space without requiring the user to wear a glove input device, which enables a far greater range of gesture movements to serve as inputs to a computing device.
One component of processing hand gestures in a three-dimensional space is identifying the pose of the hand as the hand moves through various positions in an input sequence. The “pose” refers to the angular orientation and shape of the hand, which are affected by the movements of the muscles and the bones in the skeleton of the hand as the user moves the hand. For example, the pose is affected by the rotation of the wrist, the shape of the palm, and the positions of the fingers on the hand. Existing techniques for tracking hand poses in three dimensions, which are currently the norm, extend traditional two-dimensional (2D) image processing techniques into a three-dimensional (3D) space. However, these techniques neglect critical affordances provided by the depth sensing camera. First, deep learning, which is the current state of the art for 2D image classification, is directly adapted for 3D regression and hence loses structural information and is oblivious to the articulation constraints of the hand and fingers. Second, the latent information contained in frequently used hand poses that are temporally near and similar to an input depth map is lost by optimizing a single-objective function in the hand fitting module. Third, the metrics used to assess the fidelity of the hand tracking system are not focused on interactive applications as desired by the CHI community. Furthermore, the machine learning techniques are tailored to those specific error metrics and do not address the broader goal of developing a robust hand tracking method for the next generation of computer-human systems (CHS). Consequently, improvements to processes and systems that perform computer-human interaction that improve the speed and accuracy of the recognition of hand poses in three-dimensional space would be beneficial.