Remote control of devices including video devices has evolved from use of IR or acoustic type remote controls held by a user to control television sets and the like equipped with IR or acoustic recognition systems, to imaging systems that attempt to image the user in two or preferably three dimensions to recognize movements or gestures intended to control the television or other device. FIG. 1 depicts a generic prior art system 10 in which a device 20, here a television, is remotely controlled by a user 30, whose head is shown from the back in the figure. System 10 includes at least one imaging system, here a camera, e.g., 40-1, 40-2, coupled electronically to a signal processor unit 50, whose processor output can control operation of television 20.
The field of view of camera(s) 40-1, 40-2 encompasses at least a portion of three-dimensional space in which the user can make gestures, for example with at least one hand (e.g., left hand 60) to control television 20. If conventional RGB or gray scale images are acquired, then typically two spaced-apart cameras 40-1, 40-2 will be employed. Ideally, allowable gestures would include moving user hand(s) towards or away from television 20, but RGB or gray scale cameras, including a pair of such cameras disposed stereographically, might not correctly discern such movement relative to system 10. RGB or gray scale cameras are readily confused by ambient lighting including light generated by the television display itself, by the clothing of the user, e.g., a white hand in front of a user's white shirt, by reflectivity of objects within the field of view, etc.
Various imaging systems that seek to acquire three-dimensional images of a user creating gestures intended to control a device are known in the art. Some three-dimensional imaging systems use so-called parallax techniques and may include two cameras, such as shown in FIG. 1. Various two-camera implementations include so-called passive stereo, in which a sparse depth map is created, i.e., only some sensor pixels in the depth map actually contain depth information. Another two-camera approach to acquiring depth images is texture patterned stereo, in which the depth system creates a pattern that generates texture but does not encode depth information. If a speckle-like randomly patterned illumination is used, there may be sufficient texture in the imaged scene to enable creation of a dense depth map. Yet another type of two-camera imaging system is depth-coded patterned stereo, in which a patterned illumination source codes depth information and can provide a dense depth map. Problems common to many two-camera systems are occlusion and so-called correspondence ambiguity. It can be challenging to combine the imagery acquired by two spaced-apart cameras to unambiguously determine depth in an imaged scene.
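By way of illustration only, the relation between a two-camera arrangement and a sparse depth map can be sketched as follows. The sketch assumes an idealized, rectified pinhole stereo pair, for which depth is given by the standard triangulation relation Z = f·B/d. The function names, the focal length f (in pixels), the baseline B (in meters), and the sample disparities d are hypothetical and do not appear in the description above. A pixel for which no correspondence was found is represented as None, so that only some entries carry depth information.

```python
from typing import List, Optional


def depth_from_disparity(disparity_px: float,
                         focal_length_px: float,
                         baseline_m: float) -> Optional[float]:
    """Return depth Z = f * B / d for a rectified stereo pair.

    Returns None when the disparity is not positive, i.e., when no
    usable correspondence exists for this pixel.
    """
    if disparity_px <= 0:
        return None
    return focal_length_px * baseline_m / disparity_px


def sparse_depth_row(disparities: List[Optional[float]],
                     focal_length_px: float,
                     baseline_m: float) -> List[Optional[float]]:
    """Convert one row of disparities into one row of a sparse depth map.

    Entries that are None (no match found between the two cameras)
    remain None in the output.
    """
    row = []
    for d in disparities:
        if d is None:
            row.append(None)
        else:
            row.append(depth_from_disparity(d, focal_length_px, baseline_m))
    return row


if __name__ == "__main__":
    # Hypothetical values: 600-pixel focal length, 0.1 m camera spacing.
    row = sparse_depth_row([30.0, None, 60.0, None], 600.0, 0.1)
    print(row)
```

In such a sketch, the unmatched entries correspond to pixels affected by occlusion or correspondence ambiguity, for which no depth value can be assigned.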
Some parallax imaging methods use a single camera with a patterned source of illumination. So-called structured light systems can create a near-far qualitative depth map, but may suffer from an imprecise baseline. PrimeSense, an Israeli company, markets such structured light systems. So-called active stereo single camera systems can acquire a dense depth map with a precise baseline.
Another and somewhat superior method of three-dimensional imaging uses time-of-flight (TOF) information to create a dense depth map. Canesta, Inc. of Sunnyvale, Calif. (assignee herein) has received several dozen U.S. patents directed to methods and systems that can acquire true depth images. Exemplary such U.S. patents received by Canesta, Inc. include U.S. Pat. No. 6,323,942 (2001) CMOS-Compatible Three-Dimensional Image Sensor IC, U.S. Pat. No. 6,515,740 (2003) Methods for CMOS-Compatible Three-Dimensional Image Sensing Using Quantum Efficiency Modulation, U.S. Pat. No. 6,522,395 (2003) Noise Reduction Techniques Suitable for Three-Dimensional Information Acquirable with CMOS-Compatible Image Sensor ICs, U.S. Pat. No. 6,614,422 (2003) Methods for Enhancing Performance and Data Acquired from Three-Dimensional Image Systems, U.S. Pat. No. 6,674,895 (2004) Methods for Enhancing Performance and Data Acquired from Three-Dimensional Image Systems, U.S. Pat. No. 6,678,039 (2004) Method and System to Enhance Dynamic Range Conversion Useable with CMOS Three-Dimensional Imaging, U.S. Pat. No. 6,710,770 (2004) Quasi-Three-Dimensional Method and Apparatus to Detect and Localize Interaction of User-Object and Virtual Transfer Device, U.S. Pat. No. 6,906,793 (2005) Methods and Devices for Charge Management for Three-Dimensional Sensing, U.S. Pat. No. 7,151,530 (2006) System and Method for Determining an Input Selected by a User Through a Virtual Interface, U.S. Pat. No. 7,176,438 (2007) Method and System to Differentially Enhance Sensor Dynamic Range Using Enhanced Common Mode Reset, U.S. Pat. No. 7,212,663 (2007) Coded-Array Technique for Obtaining Depth and Other Position Information of an Observed Object, U.S. Pat. No. 7,321,111 (2008) Method and System to Enhance Differential Dynamic Range and Signal/Noise in CMOS Range Systems Using Differential Sensors, U.S. Pat. No. 7,340,077 (2008) Gesture Recognition System Using Depth Perceptive Sensors, U.S. Pat. No. 7,352,454 (2008) Methods and Devices for Improved Charge Management for Three-Dimensional and Color Sensing, and U.S. Pat. No. 7,507,947 (2009) Method and System to Differentially Enhance Sensor Dynamic Range Using Enhanced Common Mode Reset.
Typically a TOF system emits optical energy and determines how long it takes until at least some of that energy is reflected by a target object and arrives back at the system to be detected by an array of pixel detectors. If t1 denotes roundtrip TOF time, then the distance between the target object and the TOF system is Z1, where Z1=t1·C/2, and where C is the velocity of light. Most Canesta TOF systems are phase-based and compare the phase of the modulated emitted optical energy with the phase of the reflected energy to determine depth Z. Canesta TOF systems are operable with or without ambient light, have no moving parts, and can be mass produced using CMOS techniques. Phase-based TOF systems are also believed available from PMD Technology of Siegen, Germany, Mesa Imaging, AG of Zurich, Switzerland, and possibly Optrima NV of Brussels, Belgium.
Another type of TOF system, one that does not measure phase shift, is the shutter-type TOF system. The shutter may be an active optic device, perhaps GaAs as developed by 3DV Corp. of Israel, or perhaps an electronic shutter, e.g., CMOS, as developed by TriDiCam GmbH of Germany.
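By way of illustration only, the roundtrip relation Z1=t1·C/2 can be evaluated numerically as sketched below. The sketch also includes the standard relation for a phase-based system, Z = C·φ/(4π·f), where φ is the measured phase shift in radians and f is the modulation frequency. That relation and the 50 MHz modulation frequency are assumptions introduced for this example and are not taken from the description above. The function names are likewise hypothetical.

```python
import math

C = 299_792_458.0  # velocity of light in meters per second


def depth_from_round_trip(t1_seconds: float) -> float:
    """Depth Z1 = t1 * C / 2, where t1 is the roundtrip time of flight."""
    return t1_seconds * C / 2.0


def depth_from_phase(phase_rad: float, mod_freq_hz: float) -> float:
    """Depth from phase shift, Z = C * phi / (4 * pi * f).

    Assumes a single modulation frequency f; the phase wraps every
    2*pi, so the result is only unambiguous within C / (2 * f).
    """
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)


def unambiguous_range(mod_freq_hz: float) -> float:
    """Maximum depth measurable before the phase wraps, C / (2 * f)."""
    return C / (2.0 * mod_freq_hz)


if __name__ == "__main__":
    # A roundtrip time of 20 ns corresponds to a target about 3 m away.
    print(depth_from_round_trip(20e-9))
    # Hypothetical 50 MHz modulation: a phase shift of pi radians lies
    # halfway through the unambiguous range.
    print(depth_from_phase(math.pi, 50e6), unambiguous_range(50e6))
```

The factor of two in both relations reflects the fact that the optical energy travels to the target object and back.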
Three-dimensional imaging may be accomplished without using a parallax method or a TOF method, for example by using spaced-apart cameras from whose images relative or inferred depth Z information may be had. Such systems are believed to be developed by XTR 3D Company of Israel. Alternative methods for inferring depth may rely upon camera motion, so-called structure-from-motion analysis, but these methods are not deemed sufficiently fast for use in a gesture recognition system. Other methods for inferring depth include so-called depth-from-focus techniques, in which the focal plane of an imaging camera is changed to create a depth map. However, such techniques may not be adequately fast or accurate for real-time gesture recognition.
Having briefly reviewed the various methods known in the art for obtaining depth or Z images, consider now an exemplary prior art approach to gesture recognition with reference to FIG. 1. Assume that system 10 includes a display 20 whose characteristic(s) a user 30 will attempt to influence or alter using user-made gestures that are imaged here by spaced-apart cameras 40-1, 40-2. System 10 is what may be termed device-centric and typically requires closed-loop visual feedback between user 30 and a portion of what is displayed on television system 20. FIG. 1 shows, for example, a cursor 70 near the upper left corner on the television display, and also shows a double-arrow icon 80 near the right edge of the television display. In this example, if the user can cause cursor 70 to move to the right, in the direction of phantom cursor 70′, and overlie the upper or lower portion of arrow 80, the user can thus cause an increase or decrease in the sound volume from television 20.
In practice, system 10 will have several pre-defined gestures that the user will know a priori. For example, to move cursor 70 to the right, the user may move the left hand to the right, as indicated by the position of phantom hand 60′. Unfortunately, doing so involves hand-eye coordination between the displayed cursor on television 20 and the user's hand position, as imaged by cameras 40-1, 40-2. The (x, y, z) coordinate system relied upon by system 10 is an absolute coordinate system that is defined relative to television set 20. This coordinate system means that the distance ΔX′ through which the user's hand must be moved to move the cursor a distance ΔX on the television display is not constant. Thus, if the user is, say, 8′ (2.5 m) away from the television set, distance ΔX′ will be substantially greater than if the user were, say, 4′ (1.25 m) away from the television set. In addition to this varying distance sensitivity, the user must keep an eye on the cursor position. In the example of FIG. 1, once the user moves the cursor to the desired up or down portion of double arrow 80, the user might then confirm this selection, perhaps by moving the hand in the direction of the television screen. Once the user has thus executed the desired correction to the television volume, system 10 can automatically remove both the cursor and double arrow from the television display. If the user later wishes to make some other adjustment, perhaps to change channels on television 20, the user will make some other gesture known to system 10, and the cursor and other relevant icon(s) or images will appear on the television screen.
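By way of illustration only, the dependence of hand displacement ΔX′ upon user distance can be sketched numerically. The sketch assumes a simple pinhole camera model, which is not stated in the description above, in which a hand displacement ΔX′ at distance Z produces an image displacement of f·ΔX′/Z pixels. The function name, the focal length of 600 pixels, and the 100-pixel image displacement taken to move the cursor by ΔX are hypothetical values chosen for this example.

```python
def hand_travel_for_cursor_move(image_shift_px: float,
                                user_distance_m: float,
                                focal_length_px: float) -> float:
    """Hand displacement (meters) needed to shift the hand image by
    image_shift_px pixels, under a pinhole model.

    Image shift = f * dX' / Z, so dX' = image_shift * Z / f.
    """
    return image_shift_px * user_distance_m / focal_length_px


if __name__ == "__main__":
    FOCAL_LENGTH_PX = 600.0   # hypothetical camera focal length
    IMAGE_SHIFT_PX = 100.0    # hypothetical shift that moves the cursor by dX

    near = hand_travel_for_cursor_move(IMAGE_SHIFT_PX, 1.25, FOCAL_LENGTH_PX)
    far = hand_travel_for_cursor_move(IMAGE_SHIFT_PX, 2.5, FOCAL_LENGTH_PX)
    # Under these assumptions the required hand travel doubles when the
    # user's distance from the television doubles.
    print(near, far, far / near)
```

Under these assumptions, the same cursor movement ΔX requires twice the hand travel at 2.5 m as at 1.25 m, which is the varying distance sensitivity noted above.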
While device-centric systems such as described in FIG. 1 can work, more or less, there is room for improvement. The necessity for hand-eye feedback between the user and what is presented on the television screen may not be desirable for all classes of users. Such feedback requires some user training, e.g., how much hand movement will cause how much screen movement at what distance away from the screen. This need for user training may arise because device-centric systems have variations as the user changes position relative to the device that can affect the user's feel for device control. For example, unless scaling is done correctly, an action that requires some subtle motion when the user is far from the device can require large motions when the user is close to the device. Unless addressed in some fashion, this scaled feedback characteristic of some prior art systems can limit the type of user gestures that can reliably be recognized and acted upon.
Further, systems such as described in FIG. 1 are device-centric in that the three-dimensional coordinate system used by the system is defined relative to the device, and not to the user. In some applications the device-centric nature of the system can result in ambiguous recognition of what device control action was intended by a given user gesture. For example, a user gesture intended to increase the channel number should not be misinterpreted as a user desire to increase the device volume setting, etc.
What is needed is a remote control method and system that does not require hand-eye feedback between the user and the device being controlled. Preferably such method and system would employ a user-centric relative coordinate system rather than an absolute device-centric coordinate system. Such method and system would free the user from undue concentration upon the device screen to implement remote control. Preferably such method and system should use three-dimensional rather than two-dimensional image sensing, be intuitive to the user, and not require substantial user training. Further such method and system should reliably recognize user gestures without ambiguous interpretations. Gestures should be user-friendly to perform and remember, and should be defined to be unambiguous with good detection discrimination characteristics. Preferably gestures should have no state, e.g., nothing to remember, and should permit transitioning to another gesture unambiguously.
The present invention provides such a remote control method and system.