Two-dimensional (2D) inputs have been used for many years to control graphical user interfaces. A 2D input device, such as the common computer mouse, works well with a user interface that is simplified to represent primarily 2D objects, such as text on a sheet of paper, drawings, and photographs. Because of limited computing power and limited display capabilities, the computing industry largely accepted this limited input means.
Computing power has vastly increased over the years, and costs have dramatically decreased. Applications that operate in three dimensions (3D) have become much more common. However, 3D input devices have lagged behind.
There are technologies for tracking fingers and body parts in three dimensions. For example, a “dataglove” or “cyberglove” system may use wired sensors, such as magnetic or inertial tracking devices, to directly capture physical data such as the bending of fingers. A motion capture system may use active markers, such as light emitting diodes (LEDs), or passive markers coated with a retro-reflective material to reflect light, so that body parts can be easily located in images from multiple 2D views and their 3D locations can be computed. However, the requirement of attaching sensors and markers has slowed adoption of these technologies.
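The computation of a 3D location from multiple 2D views can be sketched with standard linear triangulation (direct linear transform). The camera intrinsics, baseline, and marker position below are illustrative assumptions, not values from the text:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Linear (DLT) triangulation: each view contributes two rows
    # of the homogeneous system A @ X = 0; solve via SVD.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # de-homogenize to a 3D point

# Two cameras 0.1 m apart along x, identical (assumed) intrinsics.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# Project a hypothetical marker into both views, then recover it.
X_true = np.array([0.05, -0.02, 1.5])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]

X_est = triangulate(P1, P2, x1, x2)
print(np.allclose(X_est, X_true, atol=1e-6))  # marker position recovered
```

With noiseless marker detections the position is recovered exactly; in practice the retro-reflective markers make the 2D detections reliable enough for this step to be accurate.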
Recently, depth sensors such as KINECT have emerged as a new class of user input device and have been successfully used to track human body movement. However, due to the limitations of the underlying technologies (structured light, time-of-flight, etc.), the resolution of the depth map is low, and the sensors have difficulty detecting close-up objects. Therefore, such sensors are not suitable for tracking the subtle movement of small objects such as fingers.
Traditional stereo vision systems have numerous limitations. One drawback is that two or more cameras are needed. For high resolution and high frame rate cameras, bandwidth may also pose a problem. To handle fast motion, stereo vision systems need synchronization hardware to synchronize images from the different cameras. The two cameras usually must be aligned to be coplanar, and an image rectification step is required. Moreover, stereo vision systems have to choose between a small baseline (with small sensor size and large field of view (FOV), but large error in depth estimation) and a large baseline (with small error in depth, but large sensor size and small FOV).
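The baseline trade-off follows from the rectified-stereo depth formula Z = f·B/d: a disparity error Δd maps to a depth error of roughly ΔZ ≈ Z²·Δd/(f·B), so depth error shrinks in proportion to the baseline B. A minimal sketch, with assumed focal length and disparity noise values:

```python
def depth_error(Z, f_px, B, disparity_noise_px):
    # Rectified stereo: Z = f*B/d, so a disparity error of dd pixels
    # gives a depth error of roughly dZ = Z**2 / (f*B) * dd.
    return Z**2 / (f_px * B) * disparity_noise_px

f_px = 800.0   # focal length in pixels (assumed)
Z = 1.0        # object 1 m away
noise = 0.25   # quarter-pixel disparity uncertainty (assumed)

err_small = depth_error(Z, f_px, B=0.05, disparity_noise_px=noise)
err_large = depth_error(Z, f_px, B=0.20, disparity_noise_px=noise)
print(err_small, err_large)  # 0.00625 m vs 0.0015625 m
```

Quadrupling the baseline cuts the depth error by a factor of four, but at the cost of a physically larger sensor assembly and a smaller shared FOV, which is exactly the trade-off described above.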