1. Field
The subject matter disclosed generally relates to the field of human-machine interfacing. More particularly, the subject matter relates to a method and system for interfacing with a computer using a vision based mechanism.
2. Related Prior Art
Using some form of gesture recognition to interact with a machine/computer has been the subject of research and development for decades, but the technical difficulties associated with this subject have imposed various limitations which impair the usefulness of the existing solutions.
Existing solutions use either distant vision (on the order of meters) that focuses on very specific gestures of objects no smaller than a hand, or short distance vision (on the order of centimeters) that focuses on either a single finger or some simple convex posture of the fingers in a simplified scene.
The content of the scene is also a determining factor in how easily gesture recognition can be implemented. Scene segmentation has always been an issue. The most sophisticated object recognition solutions are able to deal with two-dimensional images only, while many other simplified solutions benefit from the availability of affordable sensors delivering depth information. Solutions based on depth sensors have become very popular because they ease most of the work of artificial image recognition and interpretation. These sensors allow simplified scene segmentation using methods as crude as defining a “volume of engagement”, which merely starts the analysis at the most luminous object. In the case of depth sensors, the concept of luminosity is a transposition of distance, so that the most luminous object designates the closest one. Several issues exist with depth sensors, as well as with structured-light (or stereoscopic) sensors, and their limitations extend to both long distance and short distance usage. With regard to long distance usage, the lack of precision is incompatible with small objects such as fingers; the smallest viable object is the size of a hand. Additionally, the long distance resolution of depth is a restriction that cannot be overcome as easily as it can with a high resolution 2D sensor or lens. Short distance is also an issue: interaction close to a monitor or a touchpad requires that the fingers be at short distance, within 50 cm or even 20 cm of the sensors. This is too close to the sensors to allow take-up.
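The crude depth-based segmentation described above can be illustrated with a minimal sketch. It is not taken from any of the cited references: the function name `segment_closest`, the 10 cm band, and the sample depth values are all assumptions chosen for illustration. The sketch keeps only the pixels that lie within a fixed band behind the nearest valid depth sample, which is one simple reading of a “volume of engagement”.

```python
from typing import List, Optional

# Depth in metres; None marks a pixel with no valid depth reading.
Depth = Optional[float]


def segment_closest(depth_map: List[List[Depth]], band_m: float = 0.10) -> List[List[bool]]:
    """Mask the pixels lying within band_m of the nearest valid depth sample.

    Hypothetical helper: the band width is an assumed parameter, not a value
    from the source.
    """
    valid = [d for row in depth_map for d in row if d is not None]
    if not valid:
        # No usable depth data: nothing is segmented.
        return [[False] * len(row) for row in depth_map]
    limit = min(valid) + band_m
    return [[d is not None and d <= limit for d in row] for row in depth_map]


# Assumed sample scene: a hand about 0.4 m away in front of a wall about 1.2 m away.
depth_map = [
    [1.20, 1.21, None, 1.19],
    [1.20, 0.42, 0.45, 1.18],
    [1.22, 0.40, 0.44, 1.20],
]
mask = segment_closest(depth_map)
for row in mask:
    print("".join("#" if m else "." for m in row))
```

The sketch also shows why such a scheme is fragile: whatever happens to be closest to the sensor is selected, whether or not it is the object of interest.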
Accordingly, the requirements of take-up in the prior art systems are always an issue. The technical difficulties associated with gesture recognition have increased the number of prior art solutions. The prior art can typically be divided between user interface methods and object recognition methods.
User interface prior art generally includes wide-scope descriptions of usage. The intention is described, but the analysis is always vague, which raises questions regarding the enablement of such systems. Examples of user interface references include: U.S. Pat. No. 6,359,572; U.S. Pat. No. 7,877,706; U.S. Pat. No. 7,821,531; and US20100199228.
Object recognition prior art puts the main focus on the harsh reality of the problems related to object and gesture recognition, and takes into consideration the gestures of the body, the hands, or a single finger. This type of prior art, however, does not extend to the individual or collective movement of the fingers. Examples of object recognition references include: U.S. Pat. No. 6,256,598; US200902524231; and U.S. Pat. No. 6,788,809.
Solutions which use distant vision associate a body gesture with a function, key, or word. See for example US20100199228. The hand gesture shown in FIG. 5a of this reference signals the character “a”, and the hand gesture shown in FIG. 5b signals the word “cat”. The major limitation of this type of solution is the lack of efficiency and speed in interfacing with the machine, not to mention the fatigue associated with moving the entire hand or head to enter a single character or function.
On the other hand, solutions which use short distance vision limit the user to a specific location (physical and non-physical) for interfacing with the machine.
For example, U.S. Pat. No. 5,767,842 (Korth) describes a system in which the keyboard is optically produced on a surface. Korth assumes a clean background and contrasts the color of the hand/wrist with the color of the background in order to detect the presence of the hand. The hand contour is then followed in order to detect the fingers and the location of their tips.
In addition to being limited to a specific location, the method of Korth is subject to inherent ambiguities arising from the reliance upon relative luminescence data, an adequate source of ambient lighting, and a clean background. For example, from the vantage point of Korth's video camera, it would be very difficult to detect typing motions along the axis of the camera lens. Therefore, multiple cameras having different vantage points would be needed to adequately capture the complex keying motions. Also, as suggested by Korth's FIG. 1, it can be difficult merely to acquire an unobstructed view of each finger on a user's hands, e.g., acquiring an image of the right forefinger is precluded by the image-blocking presence of the right middle finger, and so forth. In short, even with good ambient lighting and a good vantage point for his camera, Korth's method still has many shortcomings, including ambiguity as to which row on a virtual keyboard a user's fingers are touching.
The Korth approach may be replicated using multiple two-dimensional video cameras, either for stereoscopic reconstruction or for implementing different search methods from different points of view, each camera being aimed toward the subject of interest from a different viewing angle. As simple as this proposal sounds, it is not practical. The setup of the various cameras is cumbersome and potentially expensive, since duplicate cameras are deployed. Each camera must be calibrated accurately relative to the object viewed and relative to the other cameras. To achieve adequate accuracy, the stereo cameras would have to be placed at the top left and top right positions relative to the keyboard. The principle of stereo reconstruction, which requires analysis of tonal differences, is also too sensitive to lighting conditions to deliver the required accuracy. Yet even with this configuration, the cameras would be plagued by fingers obstructing other fingers within the view of at least one of the cameras. Further, the computation required to create three-dimensional information from the two-dimensional video image information output by the various cameras adds to the processing overhead of the computer system used to process the image data. Understandably, using multiple cameras would substantially complicate Korth's signal processing requirements and increase power consumption.
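The accuracy concern for stereoscopic reconstruction can be made concrete with the standard rectified pinhole stereo relation Z = f·B/d, where Z is depth, f the focal length in pixels, B the baseline between the cameras, and d the disparity in pixels. The sketch below is only an illustration under assumed values (a focal length of 800 pixels and a baseline of 0.10 m, neither of which comes from the cited references); it estimates how much the reconstructed depth shifts when the disparity is wrong by a single pixel.

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from the rectified pinhole stereo relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error_per_pixel(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Depth shift caused by underestimating the disparity by one pixel at depth_m."""
    true_disparity = focal_px * baseline_m / depth_m
    if true_disparity <= 1.0:
        raise ValueError("depth too large for a one-pixel disparity error")
    return stereo_depth(focal_px, baseline_m, true_disparity - 1.0) - depth_m


# Assumed camera parameters, for illustration only.
FOCAL_PX = 800.0
BASELINE_M = 0.10

for depth in (0.5, 1.0, 2.0):
    err_mm = depth_error_per_pixel(FOCAL_PX, BASELINE_M, depth) * 1000.0
    print(f"depth {depth:.1f} m: one-pixel disparity error shifts depth by {err_mm:.1f} mm")
```

Under these assumptions the depth error grows roughly with the square of the distance, which is consistent with the need for accurate calibration of each camera relative to the others.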
Another solution is provided in U.S. Pat. Nos. 6,614,422 and 6,710,770 (Rafii). Similar to Korth, Rafii uses a projector for projecting the image of a keyboard on a surface, and uses a three-dimensional sensor in order to capture the position of the finger on the keyboard projected on the surface. In addition to requiring a surface to type on and limiting the user to a specific location, the system of Rafii has at least two major drawbacks. The first drawback is the difficulty of detecting the hit, due to sensor precision constraints on the user's vertical axis. In particular, since the camera's point of view is above the user in order to track the displacement of the fingers across rows, detection of the hit movement is in an unfavorable situation and requires very high precision, which in practice imposes wide vertical movements on the user; otherwise spurious hits can be detected. The second drawback is the power consumption. The camera uses the time of flight (TOF) of light and requires high speed electronics. This leads to unacceptable tradeoffs between high power consumption for precise (higher speed) electronics and lower consumption, which dramatically reduces the choice of technology available to measure an interval as brief as the time for light to travel from the sensor's emitter to the fingers and back to the receiver, which is a few nanoseconds at these distances and must be resolved to within picoseconds. Additionally, the Rafii approach requires controlled illumination for the sensors, similar to the radar principle. This approach is also prone to errors when used outdoors or in brightly lit places, where the ambient light tends to increase the background noise for the sensors, which for safety and cost reasons operate in an extension of the visible light spectrum that is still part of the light emitted by the sun.
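A quick calculation gives a sense of the time scales involved in time-of-flight sensing. The sketch below is an illustration only and does not describe Rafii's electronics: the function names are hypothetical, the 20 cm and 50 cm distances are the short-distance figures mentioned earlier in this section, and the 1 mm depth resolution is an assumed target. The round trip time is t = 2d/c, and resolving a depth difference Δd requires a timing resolution of 2Δd/c.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def round_trip_time_s(distance_m: float) -> float:
    """Time for light to travel from emitter to target and back: t = 2 * d / c."""
    return 2.0 * distance_m / C_M_PER_S


def timing_resolution_s(depth_resolution_m: float) -> float:
    """Timing resolution needed to distinguish two depths depth_resolution_m apart."""
    return 2.0 * depth_resolution_m / C_M_PER_S


for distance in (0.20, 0.50):
    ns = round_trip_time_s(distance) * 1e9
    print(f"round trip at {distance:.2f} m: {ns:.2f} ns")

# Assumed target: 1 mm depth resolution, chosen for illustration.
ps = timing_resolution_s(0.001) * 1e12
print(f"timing resolution for 1 mm of depth: {ps:.2f} ps")
```

The round trip itself takes on the order of nanoseconds at these distances, while millimeter-scale depth resolution calls for timing on the order of picoseconds, which is what drives the need for high speed electronics.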
A further solution is described in US20100231522 (Li). Li describes different implementations in this reference. In the implementation where only one camera is used, such as in FIG. 5, the user's finger should contact a predetermined input region ABCD on a physical surface. In the implementation where the user types in the air, the system requires two video capturing devices as shown in FIG. 22. In the latter implementation, the user is also limited to typing in an input region which lies in an input plane. The input plane (which can be likened to a virtual desktop) has to be perpendicular to the line of sight of the video capturing devices. Accordingly, in both implementations the user is limited to a specific input region for typing and/or interfacing with the computer. As for the second implementation, which allows for typing in the air, it would not be realistic to use Li because the necessary input plane is unseen and must be perpendicular to the line of sight of the video capturing devices. Therefore, it is very easy for the user to type in the wrong place and enter characters other than those intended. Although the paradigm appears close to keyboard emulation, it in fact requires entirely new training from the user, since the wrist needs to be bent and the hit plane is not materialized as it is in Korth. Therefore, merely considering the movement of the hit is not sufficient. Furthermore, Li does not address the various technical difficulties associated with typing in the air. Moreover, the use of a stereovision system is cumbersome, expensive and complicated, as discussed above in connection with Korth.
What is needed is a method and system by which a user may input data to a machine/computer using a virtual keyboard or other virtual input device with low computation cost and without being limited to a specific location. The present embodiments provide such a method and system.