Direct user interaction with video displays has become relatively common in recent years. A decade ago it was common for users to interact with information shown on a computer video display by manipulating a mouse, perhaps to select an option from a displayed menu, or perhaps to drag a user-selected object or object region from position to position on the video display. But such interaction was not necessarily intuitive, in that a user new to computers would not necessarily know what to do without some instruction. Further, although such prior art user interaction techniques could often detect and distinguish between a user selection manipulation and a user object drag manipulation, such techniques could not readily detect when the user or a user-controlled stylus hovered near, but did not yet contact, the video display screen.
More recently, a class of small computing devices has gained favor in which a user can directly touch a display to manipulate information being displayed, e.g., to select or to drag, but which cannot detect when the user or a user-controlled stylus merely hovers near the display surface without touching the surface. One very popular such device is the handheld iPod Touch® unit, produced by Apple Inc. This device has a small, approximately 8 cm diagonal, active screen whose display responds to gestures made by a user's fingers. It is believed that this screen contains an X-Y grid of capacitive sensing lines that are not visible to the user. FIG. 1 depicts an exemplary such device 10 and active screen 12, within which an X-Y grid of capacitive sensing lines 14 is formed at time of manufacture. Electronics within device 10 can cause images such as virtual menu buttons 16 to be displayed dynamically on the screen. As the user-object, here finger(s) 18, touches the surface of display 12, finger capacitance is sensed by the grid lines 14, and the appropriate quantized region(s) of interaction, such as regions 20A, 20B, 20C, 20D, can be detected by electronics 22 within the device. Understandably, as the pitch or density of the grid lines increases, a smaller region of user interaction, e.g., a rectangle bounded by grid lines, can be defined and sensed.
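By way of illustration, the quantization of a sensed touch location to a grid-bounded region can be sketched as follows; the coordinates, pitch value, and function name are hypothetical and are not taken from any actual device:

```python
import math

def touch_region(x_mm: float, y_mm: float, pitch_mm: float):
    """Quantize a sensed touch coordinate to the grid cell it falls in.

    Returns (column, row) indices of the rectangle bounded by adjacent
    grid lines; pitch_mm is the assumed spacing between sense lines.
    """
    return (math.floor(x_mm / pitch_mm), math.floor(y_mm / pitch_mm))

# A finer grid pitch yields a smaller, more precise interaction region.
print(touch_region(23.7, 41.2, pitch_mm=5.0))  # -> (4, 8)
print(touch_region(23.7, 41.2, pitch_mm=1.0))  # -> (23, 41)
```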
Execution of software associated with electronics 22 enables device 10 to recognize and respond to gestures made by user fingers interacting with the surface of the screen. For example, two fingers 18 moving up or down can scroll the screen image up or down, e.g., to display additional menu buttons 16 in FIG. 1. Two fingers moving toward each other can shrink the size of an image displayed on the screen; if the two fingers are moved away from each other, the image size expands, in zoom fashion. If the two fingers are rotated, the displayed image rotates in the same direction. Such user interaction and the associated gesture recognition are extremely intuitive, in that little or no instruction is required to master use of the device. In other applications, gestures using more than two fingers may be recognized. Such active gesture recognition has the advantage of not relying upon a light source for detection; the capacitive interaction could be detected in a dark room, although to be useful the display itself must be illuminated.
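The zoom and rotation gestures described above can be sketched with a simple model: zoom is the ratio of finger separations before and after the motion, and rotation is the change in angle of the line joining the two fingers. The function name and point values below are hypothetical illustrations, not any actual device's algorithm:

```python
import math

def two_finger_gesture(p1_before, p2_before, p1_after, p2_after):
    """Estimate zoom scale and rotation (degrees) from two finger
    positions, each given as an (x, y) tuple, before and after motion."""
    dx0, dy0 = p2_before[0] - p1_before[0], p2_before[1] - p1_before[1]
    dx1, dy1 = p2_after[0] - p1_after[0], p2_after[1] - p1_after[1]
    # Ratio of finger separations gives the zoom factor.
    scale = math.hypot(dx1, dy1) / math.hypot(dx0, dy0)
    # Change in the angle of the inter-finger line gives the rotation.
    rotation = math.degrees(math.atan2(dy1, dx1) - math.atan2(dy0, dx0))
    return scale, rotation

# Fingers move apart along the same line: 2x zoom, no rotation.
s, r = two_finger_gesture((0, 0), (10, 0), (-5, 0), (15, 0))
```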
Although the above-described gesture recognition technology seems to work quite well, production economics dictate that the size of screen 12 be relatively small, e.g., about 3.5″ (9 cm) diagonally. Further, the user-object that interacts with the specialized screen must actually touch the display surface to be detected, and cannot merely hover near the display surface. Finally, the user-object must exhibit capacitance to be detected as it interacts with the screen surface. Thus, while a user's fingers touching the screen are recognized by the device, if the user wore gloves, perhaps in cold weather, no capacitance would be sensed through the gloves, and no user gesture would be recognized. A similar result occurs if the user-object were a passive wooden or plastic stylus held by the user. Thus, sensing here requires active, not passive, interacting contact with the screen by a user-object, e.g., an object that possesses a meaningful magnitude of capacitance. Note too that the screen itself must be dedicated to this type of gesture sensing, e.g., a grid of capacitive (or other) sense lines must be provided when the display screen is manufactured. As such, it is difficult to apply this technology retroactively to an existing off-the-shelf display.
Another approach to recognizing user gestures made with at least two user-objects (e.g., fingers, styli) with respect to an image on a video display involves the use of stereographic cameras and triangulation. U.S. Pat. No. 6,266,048 (2001) to Carau, entitled “Method and Apparatus for a Virtual Display Keyboard for a PDA”, describes a stereographic method by which user interaction with a virtual keyboard is said to be feasible. However, as will be described with respect to exemplary FIG. 2A, stereographic acquisition of image data can result in acquisition of fictitious and ambiguous data, which can render the stereographic approach less than ideal. FIG. 2A depicts a display screen 32 to which are attached two spaced-apart cameras 34A, 34B, whose two fields of view (FOV-A, FOV-B) attempt to encompass the working surface of the display screen. Corner-mounting of the cameras advantageously requires the smallest possible usable FOV, namely 90°. Wherever they are mounted, the two cameras should not have their optical axes in coaxial juxtaposition; otherwise blocking interference will result. In FIG. 2A, cameras 34A and 34B try to simultaneously capture images of the user-objects, e.g., fingers 18-1, 18-2, and the display, including images presented on the display such as virtual menu buttons 36. The display screen surface and user-object interaction may be passive, in that there is no requirement for embedded sense lines or other detectors of user-object physical properties such as resistance, force, or the like. Of course there must be sufficient light for the cameras to capture images of the user-objects and the display screen. The video images captured by the two cameras are processed, and triangulation is used to try to determine where each user-object contacted the display. This approach can avoid requiring a dedicated display, e.g., the display need not be manufactured with any sensing mechanism (grid lines, resistive lines, force or pressure sensors), but other limitations are present.
For example, the cameras cannot reliably see, and thus cannot distinguish between, actual user-object contact with the display surface as opposed to the user-object merely hovering or grazing close to the surface of the display.
A more severe problem with stereographic cameras is that when two user-objects, e.g., fingers 18-1, 18-2 in FIG. 2A, touch or otherwise interact with the display, ambiguities can exist as to where on the display the interaction occurred. In stereographic acquisition, every location on the acquired image is the logical “AND” of the imagery acquired from both cameras. Understandably, then, problems arise when the first user-object blocks or occludes view of the second user-object for one of the cameras. In this example, one stereo camera will image the object closer to that camera, but cannot image the other object, whose view is blocked by the first object. The second camera can see both objects, but the data from which the positions of the two objects relative to the display are computed is ambiguous, and the point(s) of interaction with the display cannot be correctly determined using triangulation. FIG. 2B depicts such ambiguous data, shown as falsely sensed positions (drawn in phantom) for two real user-objects 18-1, 18-2. Thus, the two-finger gesture made in FIG. 2B may be misinterpreted by cameras 34A, 34B due to false reporting of the ambiguous image locations. Obviously, gestures made with three fingers (or user-objects) would further increase the likelihood of occlusion and resultant ambiguity. Thus, three-finger gestures are even more likely to be erroneously determined as to where on the display the fingers or other user-objects interacted. As a result, a gesture commanding rotation of a displayed image might be wrongly interpreted by the stereographic cameras as a gesture to resize the image. Another problem associated with data acquired from spaced-apart stereographic two-dimensional cameras is that if the system is jarred or bumped, mechanical misalignment can readily result, in which case substantial errors, above and beyond ambiguity-type errors, can occur in the acquired data.
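The ambiguity can be illustrated numerically: with two corner cameras, each reporting only a bearing angle per object, triangulating every pairing of bearings yields four candidate positions, of which two are phantoms. The geometry below (camera positions, object positions) is entirely hypothetical:

```python
import itertools
import math

def intersect(cam_a, ang_a, cam_b, ang_b):
    """Intersect two bearing rays given as (camera position, angle in
    radians); returns the 2-D intersection point, or None if parallel."""
    ax, ay = math.cos(ang_a), math.sin(ang_a)
    bx, by = math.cos(ang_b), math.sin(ang_b)
    # Solve cam_a + t*(ax, ay) = cam_b + s*(bx, by) via Cramer's rule.
    det = ax * (-by) - ay * (-bx)
    if abs(det) < 1e-12:
        return None
    rx, ry = cam_b[0] - cam_a[0], cam_b[1] - cam_a[1]
    t = (rx * (-by) - ry * (-bx)) / det
    return (cam_a[0] + t * ax, cam_a[1] + t * ay)

cam_a, cam_b = (0.0, 0.0), (100.0, 0.0)     # hypothetical corner cameras
objects = [(40.0, 30.0), (60.0, 30.0)]      # two real fingers
angs_a = [math.atan2(y - cam_a[1], x - cam_a[0]) for x, y in objects]
angs_b = [math.atan2(y - cam_b[1], x - cam_b[0]) for x, y in objects]

# Each camera reports two bearings; all four pairings triangulate.
candidates = [intersect(cam_a, a, cam_b, b)
              for a, b in itertools.product(angs_a, angs_b)]
# Two candidates are the real fingers; the other two, here near
# (50, 37.5) and (50, 25), are phantom positions.
```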
A structured light approach to discerning user interaction with a virtual keyboard was described by Tomasi in U.S. Pat. No. 6,710,770 (2004), entitled “Quasi-Three-Dimensional Method and Apparatus to Detect and Localize Interaction of User-Object and Virtual Transfer Device”, assigned to the assignee herein. The '770 patent described a standalone device that projected an image of a keyboard, and detected user interaction with virtual keys on that keyboard using structured light. A thin fan beam of light, perhaps infrared light, was projected parallel to, and spaced perhaps 1 mm above, the projected image of the virtual keyboard (and virtual keys). A sensor array disposed near the top of the device looked down upon the projected fan beam. The array only sensed light when a user-object penetrated the fan beam, and thus reflected energy toward the sensor array. The geometry between the sensor array, the fan beam, and the virtual keyboard was known a priori, and triangulation enabled reliable identification as to which virtual keys were contacted by the user-objects, e.g., fingers, and in what temporal order. Advantageously, this device did not rely upon ambient light to function, and would work with passive user-controlled objects, e.g., styli.
FIG. 3 depicts how a Tomasi type system 40 might be used to discern interaction between displayed objects, e.g., menu keys 36, an image 38, etc., on a display screen 32 and a single user-object, e.g., a finger 18-1. Within system 40, a laser or LED device 42 emits a very thin fan beam 44 of optical energy in a plane parallel to the flat surface of display screen 32, and spaced apart from that surface by a mm or so, so as to barely graze the airspace above the display screen. The emitted laser or LED optical energy need not be visible to the human user. System 40 further includes a single sensor array 46 that looks toward the display screen. Sensor array 46 detects optical energy of the emitted wavelength that is reflected back by anything protruding through the thickness of the fan beam, e.g., reflected optical energy 48. If nothing penetrates the fan beam, then no optical energy is reflected back to be sensed by system 40. But if an object, here user finger 18-1, touches something on the display screen, at least a tip portion of the finger will have penetrated the thickness of the fan beam, and will reflect back some of the emitted laser or LED optical energy. As soon as sensor array 46 within system 40 detects reflected-back optical energy of the appropriate wavelength, it is known that a user-object has touched some region of the display screen. The geometry and relationship of emitter 42, fan beam 44, and sensor array 46 vis-à-vis objects, e.g., 36, 38, appearing on display 32 is known a priori.
A Tomasi system 40 as shown in FIG. 3 uses triangulation to determine the (x,z) coordinates of the touched area. A processor within system 40 knows what objects appear at what (x,z) coordinates on display surface 32 and can determine the appropriate response to make to the user interaction with that object. However, it will be appreciated that user-object obstruction can occur if, for example, one finger blocks system 40's view of a second finger. Thus, a two-finger gesture might be misinterpreted by system 40. One might dispose a second system 40′ at another corner of the display screen, which system emits a second fan beam 44′, spaced apart vertically from the first fan beam 44. Preferably, different wavelengths would be used to generate fan beams 44, 44′, such that the sensors within respective systems 40, 40′ would respond only to the appropriate reflections. Further details as to design considerations associated with a Tomasi type system may be found in the referenced '770 patent, and will not be repeated here.
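The localization step can be sketched as intersecting a known sensor sight ray with the known fan-beam plane; the geometry, units, and names below are a hypothetical illustration, not the '770 patent's actual implementation:

```python
def touch_point(sensor_pos, ray_dir, beam_height=0.001):
    """Intersect a sensor sight ray with the fan-beam plane y = beam_height.

    sensor_pos and ray_dir are (x, y, z) tuples in metres; x and z lie
    in the display plane, y is height above the display surface. Returns
    the (x, z) touch coordinates, or None if the ray never descends to
    the beam plane (i.e., nothing penetrated the beam along that ray).
    """
    sx, sy, sz = sensor_pos
    dx, dy, dz = ray_dir
    if dy >= 0:
        return None
    t = (beam_height - sy) / dy   # parameter where the ray meets the plane
    return (sx + t * dx, sz + t * dz)

# Sensor mounted 15 cm above the 1-mm-high fan-beam plane; one pixel's
# sight ray locates a touch at roughly (0.3 m, 0.4 m) on the display.
print(touch_point((0.0, 0.151, 0.0), (0.3, -0.15, 0.4)))
```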
Such a dual system might work in theory, but problems exist. The geometry of the Tomasi systems does not lend itself well to detecting user interactions with a large video display. In the virtual keyboard applications described in the '770 patent, the spaced-apart distance Y between fan beam emitter and sensor array was perhaps 3 cm. But to achieve reasonably accurate triangulation with a truly large display screen, say 36″ (91 cm) diagonally, the distance Y between fan beam emitter and sensor array would have to be about 15 cm. Such an implementation would not be very robust, in that system(s) 40, 40′ would project outwardly from the display and be vulnerable to damage from being bumped or vibrated. Further, the a priori geometry needed for successful triangulation would be altered each time the outwardly projecting system(s) 40, 40′ were bumped or vibrated. Thus, on one hand, use of a fan beam to detect the occasion of user interaction with an object displayed on the screen is feasible. But on the other hand, a large diagonal screen renders accurate triangulation difficult unless there is a relatively large spaced-apart distance Y between fan beam emitter and sensor array, e.g., perhaps 20%-25% of the display screen diagonal dimension. Further, mechanical vibration or bumping of the large screen display would cause undesired mechanical movement of the sensor array, with resultant errors in performance due to loss of good calibration.
One attempt to implement a touch screen that includes hover detection is described in published U.S. patent application Ser. No. 12/101,527, publication no. 2008/0259053 to Newton, entitled “Touch Screen System With Hover and Click Input Methods”. This method appears to require that the user's finger or stylus “flatten out” as it contacts the touch screen surface, such that the finger or stylus area of contact becomes larger upon contact than before contact is made. Newton's method uses first and second detectors in proximity to the touch screen that generate images of the user's finger or stylus interacting with the touch screen. The outer edges of the imaged finger or stylus are determined, as is the estimated cross-sectional area of the finger or stylus. This area will be smaller before contact with the screen is made, because contact tends to increase the area. Newton's FIGS. 3B and 4B depict this increase in area at time of contact for a user's finger, while FIGS. 6A and 6B depict an area increase when using a spring-loaded stylus. If the estimated area does not exceed a threshold area, it is assumed the object interaction denotes a tracking state; if the estimated area exceeds the threshold area, it is assumed a selection state is occurring. How well this method functions is unknown to applicants herein, but absent a physical change in estimated cross-sectional area at the time of actual contact, the method will not function. Thus, if a user manipulated an ordinary, e.g., rigid, stylus whose contact area could not increase upon contact with the display surface, the described method would appear unable to discern between a tracking state and a selection state.
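The tracking/selection decision described above amounts to a simple threshold test on estimated contact area; the threshold value and area figures below are hypothetical, chosen only for illustration:

```python
def interaction_state(contact_area_mm2: float,
                      threshold_mm2: float = 50.0) -> str:
    """Classify a Newton-style interaction: a fingertip flattens on
    contact, so an estimated cross-sectional area above the (assumed)
    threshold is read as a selection; below it, mere hover/tracking."""
    return "selection" if contact_area_mm2 > threshold_mm2 else "tracking"

print(interaction_state(30.0))   # hovering fingertip  -> tracking
print(interaction_state(80.0))   # flattened fingertip -> selection
```

A rigid stylus whose imaged area never changes would always return the same state, which is precisely the failure mode noted above.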
Consider now other approaches to ascertaining distance to an object. It is known in the art to use so-called time-of-flight (TOF) technology to ascertain distance to an object. FIG. 4 shows a TOF system as described in U.S. Pat. No. 6,323,942 to Bamji, et al., entitled “CMOS-Compatible Three-Dimensional Image Sensor IC”, and assigned to the assignee herein. In FIG. 4, TOF system 50 emits optical radiation S1 toward a target object 52 and counts the roundtrip time t required for at least some of the emitted radiation to reflect off the target object as reflected radiation S2 and be detected by the system. The distance Z between system 50 and the target object is given by equation (1) as:

Z=C·t/2  (1)

where C is the speed of light, 300,000 km/sec, and t is the roundtrip time. For systems that instead emit modulated optical energy, distance Z is known modulo 2·π·C/(2·ω)=C/(2·f), where f is the modulation frequency. Thus there can be inherent ambiguity between detected values of phase shift θ and distance Z, and methods including the use of multiple modulation frequencies can be employed to disambiguate or dealias the data.
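The roundtrip-time relationship can be sketched numerically; the emitted pulse traverses the distance Z twice, hence the factor of two:

```python
C = 299_792_458.0  # speed of light in m/s

def pulsed_tof_distance(roundtrip_s: float) -> float:
    """Distance Z from a counted pulse roundtrip time (equation (1)):
    the light covers 2*Z in roundtrip_s seconds, so Z = C*t/2."""
    return C * roundtrip_s / 2.0

# A counted 10 ns roundtrip corresponds to a target roughly 1.5 m away.
print(pulsed_tof_distance(10e-9))
```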
As described in the '942 patent, much if not all of system 50 may advantageously be fabricated on a single IC 54 without need for any moving parts. System 50 includes an array 56 of pixel detectors 58, each of which has dedicated circuitry 60 for processing detection charge output by the associated detector. In a typical application, array 56 might include 100×100 pixels 58, and thus include 100×100 processing circuits 60. Preferably IC 54 also includes a microprocessor or microcontroller unit 62, memory 64 (which preferably includes random access memory or RAM and read-only memory or ROM), a high speed distributable clock 66, and various computing and input/output (I/O) circuitry 68. Among other functions, controller unit 62 may perform distance-to-object and object velocity calculations. Preferably the two-dimensional array 56 of pixel sensing detectors is fabricated using standard commercial silicon technology, which advantageously permits fabricating circuits 62, 64, 66, 68 on the same IC 54. Understandably, the ability to fabricate such circuits on the same IC with the array of pixel detectors can shorten processing and delay times, due to shorter signal paths.
In overview, system 50 operates as follows. At time t0, microprocessor 62 commands light source 70 to emit a pulse of light of known wavelength (λ) that passes through focus lens 72′ and travels to object 52 at the speed of light (C). At the surface of the object being imaged, at least some of the light may be reflected back toward system 50 to be sensed by detector array 56. In one embodiment, counters within system 50 can commence counting when the first light pulse emission S1 is generated, and can halt counting when the first reflected light pulse emission S2 is detected. The '942 patent describes various techniques for such counting, but the further away object 52 is from system 50, the greater will be the count number representing the roundtrip time interval. The fundamental nature of system 50 is such that reflected light S2 from a point on the surface of imaged object 52 will only fall upon the pixel (58-x) in array 56 that is focused upon such point.
Light source 70 is preferably an LED or a laser that emits energy with a wavelength of perhaps 800 nm, although other wavelengths could instead be used. The use of emitted light pulses having a specific wavelength, together with optional lens filters 74, enables TOF system 50 to operate with or without ambient light, including operating in total darkness.
Within array 56, each pixel detector has a unique (x,y) location on the detection array, and the count output from the high speed counter associated with each pixel detector can be uniquely identified. Thus, TOF data gathered by two-dimensional detection array 56 may be signal processed to provide distances to a three-dimensional object surface. It will be appreciated that output from CMOS-compatible detectors 58 may be accessed in a random manner if desired, which permits outputting TOF data in any order.
FIGS. 5A-5C depict a so-called phase-shift type TOF system 50′. In such a system, distances Z to a target object are detected by emitting modulated optical energy Sout of a known phase, and examining the phase shift in the optical signal Sin reflected from the target object 52. Exemplary such phase-type TOF systems are described in several U.S. patents assigned to Canesta, Inc., assignee herein, including U.S. Pat. Nos. 6,515,740 “Methods for CMOS-Compatible Three-Dimensional Imaging Sensing Using Quantum Efficiency Modulation”, 6,906,793 “Methods and Devices for Charge Management for Three Dimensional Sensing”, 6,678,039 “Method and System to Enhance Dynamic Range Conversion Useable With CMOS Three-Dimensional Imaging”, 6,587,186 “CMOS-Compatible Three-Dimensional Image Sensing Using Reduced Peak Energy”, and 6,580,496 “Systems for CMOS-Compatible Three-Dimensional Image Sensing Using Quantum Efficiency Modulation”. FIG. 5A is based upon the above-referenced patents, e.g., the '186 patent.
In FIG. 5A, exemplary phase-shift TOF depth imaging system 50′ may be fabricated on an IC 54 that includes a two-dimensional array 56 of single-ended or differential pixel detectors 58, and associated dedicated circuitry 60 for processing detection charge output by the associated detector. Similar to the system of FIG. 4, IC 54 preferably also includes a microprocessor or microcontroller unit 62, memory 64 (which preferably includes random access memory or RAM and read-only memory or ROM), a high speed distributable clock 66, and various computing and input/output (I/O) circuitry 68. Among other functions, controller unit 62 may perform distance-to-object and object velocity calculations.
In system 50′, under control of microprocessor 62, optical energy source 70 is periodically energized by an exciter 76, and emits modulated optical energy toward a target object 52. Emitter 70 preferably is at least one LED or laser diode emitting a low power (perhaps 1 W) periodic waveform, producing optical energy emissions of known frequency (perhaps a few dozen MHz) for a time period known as the shutter time (perhaps 10 ms). Similar to what was described with respect to FIG. 4, emitter 70 typically operates in the near IR range, with a wavelength of perhaps 800 nm. A lens 72 may be used to focus the emitted optical energy.
Some of the emitted optical energy (denoted Sout) will be reflected (denoted Sin) off the surface of target object 52. This reflected optical energy Sin will pass through an aperture field stop and lens, collectively 74, and will fall upon two-dimensional array 56 of pixel photodetectors 58. When reflected optical energy Sin impinges upon the photodetectors, photons release charge within the photodetectors, which is converted into tiny amounts of detection current. For ease of explanation, incoming optical energy may be modeled as Sin=A·cos(ω·t+θ), where A is a brightness or intensity coefficient, ω·t represents the periodic modulation frequency, and θ is the phase shift. As distance Z changes, phase shift θ changes, and FIGS. 5B and 5C depict a phase shift θ between emitted and detected signals. The phase shift θ data can be processed to yield the desired Z depth information. Within array 56, pixel detection current can be integrated to accumulate a meaningful detection signal, used to form a depth image. In this fashion, TOF system 50′ can capture and provide Z depth information at each pixel detector 58 in sensor array 56 for each frame of acquired data.
As described in the above-cited phase-shift type TOF system patents, pixel detection information is captured at at least two discrete phases, preferably 0° and 90°, and is processed to yield Z data.
System 50′ yields a phase shift θ at distance Z due to time-of-flight given by:

θ=2·ω·Z/C=2·(2·π·f)·Z/C  (2)
where C is the speed of light, 300,000 km/sec. From equation (2) above it follows that distance Z is given by:

Z=θ·C/(2·ω)=θ·C/(2·2·f·π)  (3)

And when θ=2·π, the aliasing interval range associated with modulation frequency f is given by:

ZAIR=C/(2·f)  (4)
In practice, changes in Z produce changes in phase shift θ, but eventually the phase shift begins to repeat, e.g., θ and θ+2·π yield the same measurement. Thus, distance Z is known modulo 2·π·C/(2·ω)=C/(2·f), where f is the modulation frequency. There can thus be inherent ambiguity between detected values of phase shift θ and distance Z. In practice, multi-frequency methods are used to disambiguate or dealias the phase shift data.
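Equations (2)-(4) and the multi-frequency dealiasing idea can be sketched as follows; the brute-force search below is only an illustrative stand-in for actual dealiasing methods, and the modulation frequencies chosen are hypothetical:

```python
import math

C = 299_792_458.0  # speed of light in m/s

def wrapped_phase(z_m: float, f_hz: float) -> float:
    """Phase shift of equation (2), wrapped into [0, 2*pi)."""
    return (4.0 * math.pi * f_hz * z_m / C) % (2.0 * math.pi)

def z_from_phase(theta: float, f_hz: float, n: int = 0) -> float:
    """Invert equation (3); n is the unknown aliasing interval index."""
    return (theta + 2.0 * math.pi * n) * C / (4.0 * math.pi * f_hz)

def dealias(theta1, f1, theta2, f2, z_max, tol=1e-3):
    """Find the Z below z_max consistent with both wrapped phases, by
    trying each aliasing interval of f1 (equation (4)) in turn."""
    z_air1 = C / (2.0 * f1)
    for n in range(int(z_max / z_air1) + 1):
        z = z_from_phase(theta1, f1, n)
        if abs(wrapped_phase(z, f2) - theta2) < tol:
            return z
    return None

f1, f2 = 44e6, 31e6
z_true = 5.0   # metres; beyond the aliasing interval of both frequencies
z = dealias(wrapped_phase(z_true, f1), f1,
            wrapped_phase(z_true, f2), f2, z_max=20.0)
```

Either frequency alone would report an aliased distance; the pair of wrapped phases is jointly consistent with only one Z within the search range.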
FIG. 6 depicts what is purported to be a TOF method of implementing a large scale virtual interactive screen, as noted in a publication entitled “SwissRanger SR3000 and First Experiences Based on Miniaturized 3D-TOF Cameras”, by Thierry Oggier, et al., Swiss Center for Electronics and Microtechnology (CSEM), Zurich, Switzerland. (Applicants do not know the date of publication, other than that the date is 2005 or later.) Unfortunately, the publication provides little disclosure other than that the system, shown as 50, includes a projector 52 that projects a large 1 m×1.5 m display 54 presenting user-interactable objects that apparently can include display objects, menu keys, etc. At the upper left of the display screen is mounted a TOF camera 56 that is said to capture the scene in front of the displayed screen. The publication says that the task of the TOF camera is to detect and locate movements of the user's hand touching the screen, and that based on those movements, control sequences are sent to a computer. Other than professing to use TOF technology and a single TOF camera, nothing else is known about this CSEM approach. Whether this CSEM system can recognize multi-finger gestures is not stated in the publication.
What is needed is a preferably retrofittable system and method by which a user can passively interact with a large screen video display, and can manipulate objects presented on the display using gestures comprising one, two, or more fingers (or other user-controlled objects). Preferably such a system should also detect hovering-type user interactions with the large screen video display, i.e., interactions in which the user-manipulated object(s) is in close proximity to the surface of the video display without being sufficiently close to actually contact the surface, including user-object interaction with virtual scroll regions defined adjacent the video display.
The present invention provides such a system and method.