In the field of human-computer interaction (HCI), i.e., the study of the interfaces between people (users) and computers, understanding how the user intends and wishes to interact with the computer is a very important problem. When handled properly, HCI enables user-friendly interactions, e.g., via multi-modal inputs such as voice, touch, body gestures, and graphical user interfaces (GUIs), as well as other input peripherals, such as keyboards, mice, and styluses.
The ability to understand human gestures, and, in particular, hand gestures, as they relate to HCI, is a very important part of understanding the intentions and desires of the user in a wide variety of applications. In this disclosure, a novel system and method for three-dimensional hand tracking are described.
Existing hand tracking applications typically rely on “depth maps” in some fashion. A number of different methods and systems are known in the art for creating depth maps, some of which are described, e.g., in the commonly-assigned U.S. Pat. No. 8,582,867 (“the '867 patent”), which is hereby incorporated by reference in its entirety. In the present patent application, the term “depth map” will be used to refer to the representation of a scene as a two-dimensional matrix of pixels, in which each pixel corresponds to a respective location in the scene and has a respective pixel depth value, indicative of the distance from a certain reference location to the respective scene location. In other words, the depth map has the form of an image in which the pixel values indicate topographical information, rather than brightness and/or color of the objects in the scene. Depth maps may equivalently be referred to herein as “3D maps,” “depth images,” “depth sequences,” or “3D images.”
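By way of illustration only, the depth map definition above may be sketched in a few lines of Python. This is a minimal sketch and not an implementation taken from this disclosure or from the '867 patent. The names `DepthMap`, `make_depth_map`, and `nearest_pixel` are hypothetical, as are the assumptions that depth values are expressed in meters and that a non-positive value marks a pixel with no valid measurement.

```python
from typing import List, Optional, Tuple

# A depth map as a two-dimensional matrix: depth_map[row][col] holds the
# distance from the reference location to the scene point seen at that pixel.
# Assumption (not from the source): values are in meters.
DepthMap = List[List[float]]


def make_depth_map(rows: int, cols: int, background: float) -> DepthMap:
    """Create a rows x cols depth map filled with a constant background depth."""
    return [[background for _ in range(cols)] for _ in range(rows)]


def nearest_pixel(depth_map: DepthMap) -> Optional[Tuple[int, int, float]]:
    """Return (row, col, depth) of the valid pixel closest to the reference.

    Assumption (not from the source): a depth <= 0 means "no measurement"
    and is skipped. Returns None if no pixel is valid.
    """
    best: Optional[Tuple[int, int, float]] = None
    for r, row in enumerate(depth_map):
        for c, depth in enumerate(row):
            if depth <= 0:
                continue  # skip pixels without a valid depth reading
            if best is None or depth < best[2]:
                best = (r, c, depth)
    return best


# A 4 x 6 scene with a wall 3.0 m away and a small, nearer object.
scene = make_depth_map(4, 6, background=3.0)
scene[1][2] = 0.8
scene[2][2] = 0.7
scene[0][0] = 0.0  # simulated missing measurement

closest = nearest_pixel(scene)
print(closest)  # (2, 2, 0.7)
```

In this sketch the pixel values carry only topographical information: the nearest object is located from the stored distances alone, without reference to brightness or color.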
Depth maps may be processed in order to segment, identify, and localize objects and their components in the scene. In particular, descriptors (e.g., so-called “features,” as will be discussed in further detail below) may be extracted from the depth map based on the depth values of the pixels in a plurality of patches (i.e., areas) distributed in respective positions over the objects in the scene that are to be identified (e.g., a human hand). Identification of humanoid forms (i.e., 3D shapes whose structure resembles that of parts of a human being) in a depth map, and of the exact poses of these parts, which may change from frame to frame, may be used as a means for controlling computer applications.
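The general idea of extracting a descriptor from the depth values in a plurality of patches may be sketched, purely for illustration, as follows. This is a simplified stand-in and not the feature extraction technique of this disclosure: the descriptor shown here (mean patch depth relative to the depth at a reference pixel), the function names `patch_mean` and `patch_descriptor`, the square patch shape, and the chosen offsets are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

DepthMap = List[List[float]]  # depth_map[row][col] = distance to the scene point


def patch_mean(depth_map: DepthMap, top: int, left: int, size: int) -> float:
    """Mean depth over a size x size patch, clipped to the map boundaries.

    Returns 0.0 if the patch lies entirely outside the map (an arbitrary
    convention chosen for this sketch).
    """
    rows, cols = len(depth_map), len(depth_map[0])
    values = [
        depth_map[r][c]
        for r in range(max(top, 0), min(top + size, rows))
        for c in range(max(left, 0), min(left + size, cols))
    ]
    return sum(values) / len(values) if values else 0.0


def patch_descriptor(
    depth_map: DepthMap,
    center: Tuple[int, int],
    offsets: Sequence[Tuple[int, int]],
    size: int = 2,
) -> List[float]:
    """One value per patch: the patch's mean depth minus the center pixel depth.

    Each offset (d_row, d_col) places a patch's top-left corner relative to
    the center pixel.
    """
    center_row, center_col = center
    center_depth = depth_map[center_row][center_col]
    return [
        patch_mean(depth_map, center_row + d_row, center_col + d_col, size)
        - center_depth
        for d_row, d_col in offsets
    ]


# A 6 x 6 scene: background at 3.0, with a 2 x 2 near object at 1.0.
scene = [[3.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        scene[r][c] = 1.0

descriptor = patch_descriptor(scene, center=(2, 2), offsets=[(0, 0), (0, 2), (-2, 0)])
print(descriptor)  # [0.0, 2.0, 2.0]
```

The patch covering the near object yields a value of zero, while the patches falling on the background yield large positive values, so the resulting vector reflects the local shape of the object around the reference pixel.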
As will be described further herein, novel techniques have been developed by the inventors to detect, track, and verify the presence and location of human hands within a video stream of image data by leveraging background-invariant depth image features and bi-directional tracking heuristics.