The present invention relates generally to software-based processing techniques to improve performance and data obtained from range finder type image sensors, including three-dimensional range finder type image sensors, and sensing systems such as described in the above-referenced co-pending applications in which a sensor array detects reflected source emissions and can determine x,y,z coordinates of the target reflecting the source emissions.
Computer systems that receive and process input data are well known in the art. Typically such systems include a central processing unit (CPU), persistent read only memory (ROM), random access memory (RAM), at least one bus interconnecting the CPU and the memory, at least one input port to which a device is coupled to input data and commands, and typically an output port to which a monitor is coupled to display results. Traditional techniques for inputting data have included use of a keyboard, mouse, joystick, remote control device, electronic pen, touch panel or pad or display screen, switches and knobs, and more recently handwriting recognition, and voice recognition.
Computer systems and computer-type systems have recently found their way into a new generation of electronic devices including interactive TV, set-top boxes, electronic cash registers, synthetic music generators, handheld portable devices including so-called personal digital assistants (PDA), and wireless telephones. Conventional input methods and devices are not always appropriate or convenient when used with such systems.
For example, some portable computer systems have shrunk to the point where the entire system can fit in a user's hand or pocket. To combat the difficulty in viewing a tiny display, it is possible to use a commercially available virtual display accessory that clips onto an eyeglass frame worn by the user of the system. The user looks into the accessory, which may be a 1″ VGA display, and sees what appears to be a large display measuring perhaps 15″ diagonally.
Studies have shown that use of a keyboard and/or mouse-like input device is perhaps the most efficient technique for entering or editing data in a companion computer or computer-like system. Unfortunately it has been more difficult to combat the problems associated with a smaller size input device, as smaller sized input devices can substantially slow the rate with which data can be entered. For example, some PDA systems have a keyboard that measures about 3″×7″. Although data and commands may be entered into the PDA via the keyboard, the entry speed is reduced and the discomfort level is increased, relative to having used a full sized keyboard measuring perhaps 6″×12″. Other PDA systems simply eliminate the keyboard and provide a touch screen upon which the user writes alphanumeric characters with a stylus. Handwriting recognition software within the PDA then attempts to interpret and recognize alphanumeric characters drawn by the user with a stylus on a touch sensitive screen. Some PDAs can display an image of a keyboard on a touch sensitive screen and permit users to enter data by touching the images of various keys with a stylus. In other systems, the distance between the user and the computer system may preclude a convenient use of wire-coupled input devices, for example the distance between a user and a set-top box in a living room environment precludes use of a wire-coupled mouse to navigate.
Another method of data and command input to electronic devices is recognizing visual images of user actions and gestures that are then interpreted and converted to commands for an accompanying computer system. One such approach was described in U.S. Pat. No. 5,767,842 to Korth (1998) entitled “Method and Device for Optical Input of Commands or Data”. Korth proposed having a computer system user type on an imaginary or virtual keyboard, for example a keyboard-sized piece of paper bearing a template or a printed outline of keyboard keys. The template is used to guide the user's fingers in typing on the virtual keyboard keys. A conventional TV (two-dimensional) video camera focused upon the virtual keyboard was stated to somehow permit recognition of what virtual key (e.g., printed outline of a key) was being touched by the user's fingers at what time as the user “typed” upon the virtual keyboard.
But Korth's method is subject to inherent ambiguities arising from his reliance upon relative luminescence data, and indeed upon an adequate source of ambient lighting. While the video signal output by a conventional two-dimensional video camera is in a format that is appropriate for image recognition by a human eye, the signal output is not appropriate for computer recognition of viewed images. For example, in a Korth-type application, to track position of a user's fingers, computer-executable software must determine contour of each finger using changes in luminosity of pixels in the video camera output signal. Such tracking and contour determination is a difficult task to accomplish when the background color or lighting cannot be accurately controlled, and indeed may resemble the user's fingers. Further, each frame of video acquired by Korth, typically at least 100 pixels×100 pixels, only has a grey scale or color scale code (typically referred to as RGB). Limited as he is to such RGB value data, a microprocessor or signal processor in a Korth system at best might detect the contour of the fingers against the background image, if ambient lighting conditions are optimal.
The attendant problems are substantial as are the potential ambiguities in tracking the user's fingers. Ambiguities are inescapable with Korth's technique because traditional video cameras output two-dimensional image data, and do not provide unambiguous information about actual shape and distance of objects in a video scene. Indeed, from the vantage point of Korth's video camera, it would be very difficult to detect typing motions along the axis of the camera lens. Therefore, multiple cameras having different vantage points would be needed to adequately capture the complex keying motions. Also, as suggested by Korth's FIG. 1, it can be difficult merely to acquire an unobstructed view of each finger on a user's hands, e.g., acquiring an image of the right forefinger is precluded by the image-blocking presence of the right middle finger, and so forth. In short, even with good ambient lighting and a good vantage point for his camera, Korth's method still has many shortcomings, including ambiguity as to what row on a virtual keyboard a user's fingers are touching.
In an attempt to gain depth information, the Korth approach may be replicated using multiple two-dimensional video cameras, each aimed toward the subject of interest from a different viewing angle. Simple as this proposal sounds, it is not practical. The setup of the various cameras is cumbersome and potentially expensive as duplicate cameras are deployed. Each camera must be calibrated accurately relative to the object viewed, and relative to each other. To achieve adequate accuracy the stereo cameras would likely have to be placed at the top left and right positions relative to the keyboard. Yet even with this configuration, the cameras would be plagued by fingers obstructing fingers within the view of at least one of the cameras. Further, the computation required to create three-dimensional information from the two-dimensional video image information output by the various cameras contributes to the processing overhead of the computer system used to process the image data. Understandably, using multiple cameras would substantially complicate Korth's signal processing requirements. Finally, it can be rather difficult to achieve the necessary camera-to-object distance resolution required to detect and recognize fine object movements such as a user's fingers while engaged in typing motion.
In short, it may not be realistic to use a Korth approach to examine two-dimensional luminosity-based video images of a user's hands engaged in typing, and accurately determine from the images what finger touched what key (virtual or otherwise) at what time. This shortcoming remains even when the acquired two-dimensional video information processing is augmented with computerized image pattern recognition as suggested by Korth. It is also seen that realistically Korth's technique does not lend itself to portability. For example, the image acquisition system and indeed an ambient light source will essentially be on at all times, and will consume sufficient operating power to preclude meaningful battery operation. Even if Korth could reduce or power down his frame rate of data acquisition to save some power, the Korth system still requires a source of adequate ambient lighting.
Power considerations aside, Korth's two-dimensional imaging system does not lend itself to portability with small companion devices such as cell phones because Korth's video camera (or perhaps cameras) requires a vantage point above the keyboard. This requirement imposes constraints on the practical size of Korth's system, both while the system is operating and while being stored in transit.
There exist other uses for three-dimensional images, if suitable such images can be acquired. For example, it is known in the art to use multiple video cameras to create three-dimensional images of an object or scene, a technique that is common in many industrial and research applications. With multiple cameras, distance to a target point is estimated by software by measuring offset of the pixel images of the same point in two simultaneous frames obtained by two cameras, such that a larger offset indicates a shorter distance from target to the cameras. But successful data acquisition from multiple cameras requires synchronization among the cameras, e.g., using a synch box. Proper camera calibration is also required, including knowledge of the distance between cameras for input to a distance estimator algorithm. The use of multiple cameras increases system cost, especially where each camera may cost from $100 to $1,000 or more, depending upon the application. Distance measurement accuracy degrades using multiple cameras if the cameras are placed too close together, a configuration that may be demanded by mobile image acquiring systems. Further, the image processing software can encounter difficulty trying to match pixels from the same target in frames from two different cameras. Moving objects and background patterns can be especially troublesome. Understandably, extracting distance information from multiple cameras requires processing and memory overhead, which further contributes to workload of the application. As noted, prior art video cameras disadvantageously generally require sufficient ambient lighting to generate a clear image of the target.
Prior art systems that use other single-modal methods of input, e.g., using only speech recognition or only gesture recognition, frequently encounter performance problems with erroneous recognition of what is being input, especially when used in noisy or other less than ideal environments.
What is needed is an accurate method of determining three-dimensional distances, preferably acquired from a single camera that is operable without dependence upon ambient light. One such camera system was disclosed in applicants' referenced applications, although such camera system has uses beyond what is described in applicants' referenced patent application(s). There is a need, for use with such a camera system and with three-dimensional distance measurement systems in general, for measurement techniques that reduce z-measurement error. Further, such measurement techniques should exhibit improved x-y resolution and brightness values, preferably using a process that does not unduly tax the computational ability or power consumption requirements of the overall system used to acquire the images. Further, such software techniques used with such method and system should correct for geometric error, and enable RGB encoding.
There is also a need for a multi-modal interface such as voice recognition combined with gesture recognition that can reduce recognition errors present in single-modal interfaces, e.g., speech recognition, and can result in improved overall system performance.
The present invention provides software implementable techniques for improving the performance of such methods and systems, and is applicable to a wide range of three-dimensional image acquisition systems.
Applicants' referenced applications disclose systems to collect three-dimensional position data. One such system enables a user to input commands and data (collectively, referred to as data) from a passive virtual emulation of a manual input device to a companion computer system, which may be a PDA, a wireless telephone, or indeed any electronic system or appliance adapted to receive digital input signals. The system included a three-dimensional sensor imaging system that was functional even without ambient light to capture in real-time three-dimensional data as to placement of a user's fingers on a substrate bearing or displaying a template that is used to emulate an input device such as a keyboard, keypad, or digitized surface. The substrate preferably is passive and may be a foldable or rollable piece of paper or plastic containing printed images of keyboard keys, or simply indicia lines demarking where rows and columns for keyboard keys would be. The substrate may be defined as lying on a horizontal X-Z plane where the Z-axis defines template key rows, and the X-axis defines template key columns, and where the Y-axis denotes vertical height above the substrate. If desired, in lieu of a substrate keyboard, the invention can include a projector that uses light to project a grid or perhaps an image of a keyboard onto the work surface in front of the companion device. The projected pattern would serve as a guide for the user in “typing” on this surface. The projection device preferably would be included in or attachable to the companion device.
The disclosed three-dimensional sensor system determined substantially in real time what fingers of the user's hands “typed” upon what virtual key or virtual key position in what time order. Preferably the three-dimensional sensor system included a signal processing unit comprising a central processor unit (CPU) and associated read only memory (ROM) and random access memory (RAM). Stored in ROM is a software routine executed by the signal processing unit CPU such that three-dimensional positional information is received and converted substantially in real-time into key-scan data or other format data directly compatible as device input to the companion computer system. Preferably the three-dimensional sensor emits light of a specific wavelength, and detects return energy time-of-flight from various surface regions of the object being scanned, e.g., a user's hands. Applicants' referenced applications disclosed various power saving modes of operation, including low repetition rates of 1 to perhaps 10 pulses/second during times of non-use, during which times low resolution data could still be acquired. When the system determines that an object entered the imaging field of view, a CPU governing system operation commands entry into a normal operating mode in which a high pulse rate is employed and system functions are operated at full power.
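By way of illustration only, the time-of-flight relationship mentioned above can be sketched as follows. The sketch assumes that each pixel reports a round-trip time for the emitted light, so that the one-way distance is half the product of that time and the speed of light. The function name tof_to_distance_m is hypothetical and does not appear in the referenced applications.

```python
# Illustrative sketch only: converts a measured round-trip time of flight
# into a one-way distance. The function name is hypothetical.

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_to_distance_m(round_trip_time_s: float) -> float:
    """Return the one-way distance for a given round-trip time of flight."""
    if round_trip_time_s < 0:
        raise ValueError("time of flight cannot be negative")
    # The emitted light travels to the target and back, so divide by two.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


if __name__ == "__main__":
    # A 2 ns round trip corresponds to a target roughly 0.3 m away.
    print(tof_to_distance_m(2e-9))
```

The example shows why fine distance resolution is demanding: a difference of 1 cm in target distance changes the round-trip time by only about 67 picoseconds.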
In applicants' earlier disclosed system, three-dimensional data was used to implement various virtual input devices, including virtual keyboards. The user's fingers were imaged in three dimensions as the user “typed” on virtual keys. The disclosed sensor system output data to a companion computer system in a format functionally indistinguishable from data output by a conventional input device such as a keyboard, a mouse, etc. Software preferably executable by the signal processing unit CPU (or by the CPU in the companion computer system) processes the incoming three-dimensional information and recognizes the location of the user's hands and fingers in three-dimensional space relative to the image of a keyboard on the substrate or work surface (if no virtual keyboard is present).
As disclosed in the referenced application, the software routine preferably identified contours of the user's fingers in each frame by examining Z-axis discontinuities. When a finger “typed” a key, or “typed” in a region of a work surface where a key would be if a keyboard (real or virtual) were present, a physical interface between the user's finger and the virtual keyboard or work surface was detected. The software routine examined preferably optically acquired data to locate such an interface boundary in successive frames to compute Y-axis velocity of the finger. (In other embodiments, lower frequency energy such as ultrasound could instead be used.) When such vertical finger motion stopped or, depending upon the routine, when the finger made contact with the substrate, the virtual key being pressed was identified from the (Z, X) coordinates of the finger in question. An appropriate KEYDOWN event command could then be issued, and a similar analysis was performed for all fingers (including thumbs) to precisely determine the order in which different keys are contacted (e.g., are “pressed”). In this fashion, the software issued appropriate KEYUP, KEYDOWN, and scan code data commands to the companion computer system. Virtual “key” commands could also toggle the companion computer system from data input mode to graphics mode. Errors resulting from a drifting of the user's hands while typing, e.g., a displacement on the virtual keyboard, were correctable, and hysteresis was provided to reduce error from inadvertent user conduct not intended to result in “pressing” a target key. The measurement error was further reduced utilizing a lower Z-axis frame rate than used for tracking X-values and Y-values. Attempts were made to average Z-axis acquired data over several frames to reduce noise or jitter.
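The key-identification flow described above may be illustrated, under stated assumptions, by the following sketch. It tracks one fingertip through successive frames, issues a KEYDOWN when the fingertip is descending along the Y-axis and reaches the substrate, identifies the key from the (Z, X) coordinates, and uses two height thresholds to provide hysteresis before a KEYUP is issued. The key layout, key pitch, threshold values, and function names are all hypothetical and are not taken from the referenced application.

```python
# Illustrative sketch only. Layout, pitch, and thresholds are assumptions.
from typing import List, Optional, Tuple

KEY_ROWS = ["QWERTYUIOP", "ASDFGHJKL;", "ZXCVBNM,./"]  # hypothetical template
KEY_PITCH_MM = 19.0   # assumed spacing of key rows (Z) and columns (X)
PRESS_Y_MM = 2.0      # at or below this height the finger is "in contact"
RELEASE_Y_MM = 6.0    # the finger must rise above this height to release


def key_at(x_mm: float, z_mm: float) -> Optional[str]:
    """Map (X, Z) substrate coordinates to a key: Z selects row, X column."""
    col = int(x_mm // KEY_PITCH_MM)
    row = int(z_mm // KEY_PITCH_MM)
    if 0 <= row < len(KEY_ROWS) and 0 <= col < len(KEY_ROWS[row]):
        return KEY_ROWS[row][col]
    return None


def key_events(samples: List[Tuple[float, float, float]]) -> List[Tuple[str, str]]:
    """Turn per-frame (x, y, z) fingertip positions into key events."""
    events: List[Tuple[str, str]] = []
    pressed: Optional[str] = None
    prev_y: Optional[float] = None
    for x, y, z in samples:
        # Negative Y-axis velocity between frames means the finger descends.
        moving_down = prev_y is not None and y < prev_y
        if pressed is None and moving_down and y <= PRESS_Y_MM:
            key = key_at(x, z)
            if key is not None:
                pressed = key
                events.append(("KEYDOWN", key))
        elif pressed is not None and y >= RELEASE_Y_MM:
            # Separate release threshold: small bounces do not re-trigger.
            events.append(("KEYUP", pressed))
            pressed = None
        prev_y = y
    return events


if __name__ == "__main__":
    frames = [(25, 20, 25), (25, 10, 25), (25, 1, 25),
              (25, 1, 25), (25, 4, 25), (25, 8, 25)]
    print(key_events(frames))
```

Using a release threshold higher than the press threshold is one simple way to obtain the hysteresis described above, since a finger hovering between the two heights changes no state.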
The present invention provides further improvements to the acquisition and processing of data obtained with three-dimensional image systems, including systems as above-described. In general, the methods disclosed in the present invention are applicable to systems in which three-dimensional data is acquired with statistically independent measurements having no real correlation between data-acquiring sensors. The parent patent application with its array of independent pixels was one such system. The present invention improves measurement accuracy of data acquisition in such systems, in that such systems characteristically exhibit system noise having a relatively large random component. Collectively, software and techniques according to the present invention include over-sampling, the use of various averaging techniques including moving averages, averaging over pixels, and intra-frame averaging, using brightness values to reduce error, and correcting for geometric error including elliptical error. Advantageously, the present invention also permits encoding Z-distance in RGB.
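As a minimal sketch of one of the averaging techniques named above, the following example maintains a moving average of per-pixel Z values over the most recent frames. It assumes each frame is supplied as a flat list of Z measurements, one per pixel, in a fixed pixel order. The class name MovingAverageZ and the window parameter are hypothetical. For statistically independent noise of standard deviation σ, the mean of N samples has standard deviation σ/√N, which is the reason such averaging helps when the noise has a large random component.

```python
# Illustrative sketch only: per-pixel moving average of Z over recent frames.
# The class name and the flat-list frame format are assumptions.
from collections import deque
from typing import Deque, List


class MovingAverageZ:
    """Average each pixel's Z value over the last `window` frames."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # A bounded deque discards the oldest frame automatically.
        self._frames: Deque[List[float]] = deque(maxlen=window)

    def update(self, z_frame: List[float]) -> List[float]:
        """Add a new frame and return the per-pixel average so far."""
        self._frames.append(list(z_frame))
        count = len(self._frames)
        # zip(*frames) groups the samples belonging to the same pixel.
        return [sum(samples) / count for samples in zip(*self._frames)]


if __name__ == "__main__":
    smoother = MovingAverageZ(window=3)
    for frame in ([10.0, 20.0], [12.0, 20.0], [14.0, 20.0], [16.0, 20.0]):
        print(smoother.update(frame))
```

A longer window reduces random error further but responds more slowly to real motion of the target, so the window length is a trade-off rather than a fixed value.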
In summary, the techniques disclosed herein enhance the performance of such systems, including direct three-dimensional sensing cameras, using mathematical and heuristic techniques to reduce error, increase resolution, and optionally integrate data with the existing RGB data type. The improved data processing techniques may be practiced with a range of systems including virtual data input device systems, other hands-free interaction with computing systems, game machines and other electrical appliances, including use in the fields of security and identification measurement. In the various embodiments of the present invention, three-dimensional data measurements of objects in the field of view of a camera are acquired and processed at video speed substantially in real time. In one embodiment, reflective strips are disposed on objects within the viewed scene to enhance three-dimensional measurement performance.