It is useful to enable a user to interact with the display of an electronic device by touching regions of the display, for example with a finger or a stylus. Existing so-called touch screens may be implemented with sensors and receptors arranged to provide a virtual (x,y) grid on the device display surface. Such mechanisms can sense where on the display user contact was made. Newer touch screens may be implemented using more advanced capacitive or resistive sensing, or acoustic wave sensing, to provide better touch resolution. Some prior art displays can sense multiple user touch points and implement user commands such as zoom, pan, rotate, etc. However, these known systems require placing a sense layer over the display layer, which is typically an LCD.
Understandably the cost of the resultant system will increase with increases in the display size, i.e., the LCD layer. Retrofitting touch sensing to an existing device LCD can be difficult, if not impossible.
Rather than use touch sensing, camera-based optical sensing can be used to implement a two-dimensional planar touch screen system. Using an LCD screen as an example, a camera and an optical emitter, perhaps an IR LED, are disposed at each upper corner region of the screen with (x,y) fields of view (FOV) that ideally encompass all of the screen, i.e., 90° FOV in a perfect system. The emissions from the two optical emitters and the FOVs of the two cameras ideally overlap. Along the z-axis, normal to the (x,y) plane of the LCD, the FOV is very narrow. The vertical sides and the horizontal bottom of the inner surfaces of the display bezel are lined with retro-reflective strips that reflect back energy from the two optical emitters. Understandably, these retro-reflective strips add to the overall thickness and cost of the display and bezel, and typically cannot be retrofitted to an existing LCD.
In such systems, when the user touches a region of the LCD screen within the overlapping FOVs of the emitted optical energy and the two cameras, the user's finger (or other object) blocks or interrupts camera detection of optical energy that normally is reflected back by the retro-reflective strips. This interruption is sensed by the two cameras as a “blob” and provides a go/no-go indication that a region (x,y) of the LCD screen surface has been touched. Any color information associated with the object that blocked the reflected-back optical energy is ignored. Each camera has a sensor with a row-column array of pixels. An exemplary camera sensor array might comprise 10-20 rows×500-600 columns and provide good detection in a fairly large (x,y) plane but a very narrow detection range along the z-axis. One can determine the (x,y) location of the touch on the display screen surface by triangulating the blob centroids seen by the two cameras, provided that information is present from both cameras. Thus, a user interaction involving two fingers (x1,y1), (x2,y2) will not be properly sensed if one finger (or object) occludes the other finger. Note that such systems do not detect any information in a three-dimensional hovering region spaced apart from the display screen surface, i.e., z>0. Thus any gesture(s) attempted by the user prior to actually touching the screen surface do not result in useful detection information or interaction.
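By way of illustration only, the triangulation step can be sketched as follows. The sketch assumes that each corner camera has already converted its blob centroid into a viewing angle measured from the top edge of the screen, that the left camera sits at (0, 0) and the right camera at (screen_width, 0), and that y increases down the screen. The function name, parameter names, and coordinate convention are hypothetical and are not taken from any particular prior art system.

```python
import math
from typing import Optional, Tuple


def triangulate_touch(
    angle_left_deg: Optional[float],
    angle_right_deg: Optional[float],
    screen_width: float,
) -> Optional[Tuple[float, float]]:
    """Estimate the (x, y) touch location from two corner-camera angles.

    Each angle is measured from the top edge of the screen (the baseline
    joining the two cameras) toward the blob centroid. None means that
    the camera saw no blob, e.g. because the finger was occluded.
    """
    # Triangulation needs information from both cameras.
    if angle_left_deg is None or angle_right_deg is None:
        return None
    # Assumed valid range: strictly between 0 and 90 degrees.
    if not (0.0 < angle_left_deg < 90.0 and 0.0 < angle_right_deg < 90.0):
        return None

    tan_left = math.tan(math.radians(angle_left_deg))
    tan_right = math.tan(math.radians(angle_right_deg))

    # From tan_left = y / x and tan_right = y / (screen_width - x):
    x = screen_width * tan_right / (tan_left + tan_right)
    y = x * tan_left
    return x, y


if __name__ == "__main__":
    # Both cameras see the blob at 45 degrees on a 400 mm wide screen.
    print(triangulate_touch(45.0, 45.0, 400.0))
    # One camera is occluded, so no location can be computed.
    print(triangulate_touch(None, 45.0, 400.0))
```

The occluded case returns no location at all, which mirrors the two-finger limitation noted above: if one finger hides the other from either camera, that finger cannot be triangulated.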
In many systems it is desirable to allow the user to interact with a display both in a three-dimensional hover region that is spaced apart from the display surface (z>0) and on the (x,y) surface of the display screen. So-called time-of-flight (TOF) systems can implement such true three-dimensional sensing, and many U.S. patents for TOF systems have been awarded to Canesta, Inc., formerly of Sunnyvale, Calif. Such TOF systems emit active optical energy and determine distance (x,y,z) to a target by measuring how long it takes for reflected-back emitted optical energy to be sensed, or by examining phase shift in the reflected-back emitted optical energy. The TOF sensor is an array of pixels, each of which produces a depth (z) signal and a brightness signal for the imaged scene. The pixel array density will be relatively low, in the QVGA or VGA class, yet the silicon size will be rather large because a typical TOF pixel is many times larger than a typical RGB camera pixel. TOF systems acquire true three-dimensional data, and triangulation is not needed to detect an (x,y,z) location of an object on the surface of a display (x,y,0) or in a three-dimensional hover region (x,y,z), z>0, spaced apart from the display surface.
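The two TOF distance measurements mentioned above follow from well-known relations, sketched below for a single pixel. For a directly timed round trip, distance is c·t/2. For a phase-based measurement with continuous-wave modulation, distance is c·φ/(4π·f). The modulation frequency used in the example is an assumed value chosen for illustration, and the function names are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def distance_from_round_trip(round_trip_s: float) -> float:
    """Distance in meters from the emit-to-detect round-trip time."""
    # The light travels to the target and back, hence the division by 2.
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0


def distance_from_phase(phase_rad: float, mod_freq_hz: float) -> float:
    """Distance in meters from the phase shift of modulated light.

    Unambiguous only for phase_rad in [0, 2*pi); beyond that range the
    measurement wraps around.
    """
    return SPEED_OF_LIGHT_M_S * phase_rad / (4.0 * math.pi * mod_freq_hz)


if __name__ == "__main__":
    # A 2 ns round trip corresponds to roughly 0.3 m.
    print(distance_from_round_trip(2e-9))
    # Assumed 50 MHz modulation; a phase shift of pi is roughly 1.5 m.
    print(distance_from_phase(math.pi, 50e6))
```

Either calculation is performed for every pixel in the TOF array, which is why the sensor yields a full depth map rather than a handful of selected points.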
Although they can provide true three-dimensional (x,y,z) data, TOF systems can be relatively expensive to implement and can require substantial operating power. Environmental factors such as high ambient light, system temperature, pixel blooming, electronic noise, and signal saturation can all affect the accuracy of the acquired (x,y,z) data. Operational overhead associated with acquiring three-dimensional data can be high for a touchscreen hovering application. Identifying a user's finger in an (x,y,z) hover zone for purposes of recognizing a gesture need only require identifying perhaps ten points on the finger. However, a TOF system cannot provide three-dimensional data for only ten points; it must instead image the user's entire hand. If the pixel array in the TOF system comprises, say, 10,000 pixels, then the cost of acquiring 10,000 three-dimensional data points must be borne, even though only perhaps ten data points (0.1% of the acquired data) need be used to identify (x,y,z) and (x,y,0) information.
So-called structured-light systems are an alternative to TOF systems. Structured-light systems can be employed to obtain a three-dimensional cloud of data for use in detecting a user's hovering interactions with a display screen. A structured light system projects a stored, known, calibrated light pattern of spots on the target, e.g., the display surface. As the user's hand or object approaches the display surface some of the projected spots will fall on and be distorted by the non-planar hand or object. Software algorithms can compare the internally stored known calibrated light pattern of spots with the sensed pattern of spots on the user's hand to calculate an offset. The comparison can produce a three-dimensional cloud of the hover zone that is spaced-apart from the display surface. A group of pixels is used to produce a single depth pixel, which results in low x-y resolution. Unfortunately structured light solutions require special components and an active light source, and can be expensive to produce and require substantial operating power. Furthermore, these systems require a large form factor, and exhibit high latency, poor far depth resolution, and unreliable acquired close distance depth data as a function of pattern projector and lens architecture. Other system shortcomings include pattern washout under strong ambient light, a need for temperature management, difficulty with sloped object surfaces, severe shadowing, and low field of view.
Common to many prior art hover detection systems is the need to determine and calculate (x,y,z) locations for thousands, tens of thousands, or many hundreds of thousands of points. For example, a stereo-camera or TOF prior art system using a VGA-class sensor would acquire (640×480) or 307,200 (x,y) pixel locations, from which such systems might produce perhaps 80,000 to 300,000 (x,y,z) location points. If a high definition (HD-class) sensor were used, there would be (1280×720) or 921,600 (x,y) pixel locations to cope with. Further, stereoscopic cameras produce a poor quality three-dimensional data cloud, particularly in regions of the scene where there is no texture or where there is a repetitive pattern, e.g., a user wearing a striped shirt. The resultant three-dimensional data cloud will have missing data, which makes it increasingly difficult for detection software to find objects of interest. As noted above, the overhead cost of producing three-dimensional data for every pixel in the acquired images is immense. The computational overhead and data throughput requirements associated with such a large quantity of calculations can be quite substantial. Further, special hardware including ASICs may be required to handle such massive computations.
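The scale of this overhead can be checked with simple arithmetic. The sketch below reproduces the pixel counts quoted above and compares them with the roughly ten landmark points that, as noted earlier, suffice to identify a finger. The helper names are hypothetical.

```python
def pixel_count(width: int, height: int) -> int:
    """Number of (x, y) pixel locations in a width x height sensor."""
    return width * height


def used_fraction(landmark_points: int, total_points: int) -> float:
    """Fraction of acquired points actually needed by the application."""
    return landmark_points / total_points


vga_pixels = pixel_count(640, 480)    # VGA-class sensor
hd_pixels = pixel_count(1280, 720)    # HD-class sensor

if __name__ == "__main__":
    print(vga_pixels)   # 307200
    print(hd_pixels)    # 921600
    # Ten landmark points out of a 10,000-pixel array.
    print(f"{used_fraction(10, 10_000):.1%}")  # 0.1%
    # The same ten points against a full VGA-class frame.
    print(f"{used_fraction(10, vga_pixels):.4%}")
```

The fraction of useful points shrinks further as sensor resolution grows, which is the motivation for reconstructing (x,y,z) data only at landmark points.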
Occlusion remains a problem with the various prior art systems used to implement natural user interface applications with single optical axis three-dimensional data acquisition cameras. Occlusion occurs when a part of the scene cannot be seen by the camera sensor. In TOF systems and in structured light systems, depth (x,y,z) calculations can only be performed on regions of the scene visible both to the actively emitted optical energy and to the camera sensor. Occluded objects can be less troublesome for systems that employ multiple cameras, as the scene is simultaneously viewed from multiple different vantage points. In general, traditional multi-camera systems, including those employing a baseline, also have problems producing a three-dimensional cloud of data efficiently, especially when the imaged scene includes repeated patterns, is texture-free, or has surface reflections.
Regardless of its implementation, a system to detect user interaction with the surface of a display screen and with the adjacent three-dimensional hover region must meet industry specifications to be commercially viable. For example, Microsoft® Corp. Windows® 7 touch WHQL qualification for a touch application requires accuracy of an initial user touch to be within 2.5 mm of the displayed target. Further, line drawing accuracy must remain within a 2.5 mm boundary of a guide line, with line jitter less than 0.5 mm in a 10 mm interval. Presently, there is no minimum requirement for the accuracy and jitter of a pointer object, e.g., a user's finger, in the hover region.
What is needed is a method and system to sense user interaction in a three-dimensional hover zone adjacent to the surface of the display on a monitor, as well as optionally sensing user interaction with the monitor display surface itself. The system preferably should meet industry accuracy and resolution standards without incurring the cost, large form factor, and power consumption associated with current commercial devices that acquire three-dimensional data. Such method and system should function without specialized components, and should acquire data from at least two vantage points using inexpensive ordinary imaging cameras without incurring the performance cost, limitations, and failure modes of current commercial multi-view optical systems. Computationally, such system should expend resources to determine and reconstruct (x,y,z) data points only for those relatively few landmark points relevant to the application at hand, without incurring the overhead and cost to produce a three-dimensional cloud. Preferably such system should be compatible with existing imaging applications such as digital photography, video capture, and three-dimensional capture. Preferably such system should be usable with display sizes ranging from cell phone displays to tablet displays to large TV displays. Preferably such system should provide gesture recognition in a hover zone that can be quite close to a display screen surface, or may be many feet away. Preferably such system should have the option to be retrofitted to an existing display system.
The present invention provides such systems and methods for implementing such systems.