1. Field of the Invention
This invention generally relates to machine vision sensing and, more particularly, to a system and method for user interface (UI) object tracking using color gradient measurements.
2. Description of the Related Art
As noted in Wikipedia, the combination of cameras and computers is unable to “see” in the same way as human beings. While people can rely on inference systems and assumptions, computing devices must “see” by examining individual pixels of images, processing the images, and attempting to develop conclusions with the assistance of knowledge bases and features such as pattern recognition engines. Although some machine vision algorithms have been developed to mimic human visual perception, a number of unique processing methods have been developed to process images and identify relevant image features in an effective and consistent manner. Machine vision and computer vision systems are capable of processing images consistently, but computer-based image processing systems are typically designed to perform single, repetitive tasks, and despite significant improvements in the field, no machine vision or computer vision system can yet match some capabilities of human vision in terms of image comprehension, tolerance to lighting variations and image degradation, and part variability.
A typical machine or computer vision solution may include several of the following components. Light sources, sometimes very specialized (e.g., LED illuminators, fluorescent lamps, or halogen lamps), are used to illuminate a field of view. One or more digital or analog cameras (black-and-white or color) are typically required, with suitable optics for acquiring images. A lens may be used to focus on the desired field of view. A camera interface makes the images available for processing. For analog cameras, this includes digitization of the images. A framegrabber is a digitizing device (within a smart camera or as a separate computer card) that converts the output of the camera to digital format (typically a two-dimensional array of numbers, called pixels, each corresponding to the luminous intensity level of the corresponding point in the field of view) and places the image in computer memory so that it may be processed by the machine vision software.
Data from analog or digital cameras typically requires modification for use in machine vision systems. By use of calibration methods known in the art, corrective values are determined to compensate for differences in response across an array of pixel level detectors.
Additional data is utilized to linearize detector response, to correct for lens distortion, or to construct transforms, tables, and other methods designed to gamut map camera device color representations to other colorspaces.
A processor (often a PC or embedded processor, such as a DSP) processes the digital images, executing software instructions that are part of an image recognition application stored in a computer-readable memory. In some cases, all of the above components are combined within a single device, called a smart camera. Input/output (I/O) communication links (e.g., a network connection or RS-232) report the results. In some aspects, a synchronizing sensor may be used to detect movement, to trigger image acquisition and processing.
The software typically takes several steps to process an image. Often the image is first manipulated to reduce noise or to convert many shades of gray to a simple combination of black and white (binarization). Following the initial simplification, the software may count, measure, and/or identify objects, dimensions or features in the image. Commercial and open source machine vision software packages typically include a number of different image processing techniques. Pixel counting counts the number of light or dark pixels. Thresholding converts an image with gray tones to simply black and white. Segmentation is used to locate and/or count parts. Blob discovery & manipulation inspects an image for discrete blobs of connected pixels (e.g., a black hole in a grey object) as image landmarks. Recognition-by-components is the extraction of geons from visual input. Robust pattern recognition locates an object that may be rotated, partially hidden by another object, or varying in size. Gauging measures object dimensions (e.g., inches). Edge detection finds object edges, and template matching finds, matches, and/or counts specific patterns.
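Three of the techniques listed above (thresholding, pixel counting, and blob discovery) can be illustrated with a minimal sketch. The function names, the cutoff value of 128, the use of 4-connectivity, and the sample image are assumptions chosen for illustration, and are not taken from any particular machine vision package.

```python
from collections import deque

def threshold(gray, cutoff=128):
    """Binarize: pixels at or above cutoff become 1 (light), others 0 (dark).
    The cutoff of 128 is an illustrative assumption."""
    return [[1 if v >= cutoff else 0 for v in row] for row in gray]

def count_pixels(binary, value=1):
    """Pixel counting: number of pixels equal to `value`."""
    return sum(row.count(value) for row in binary)

def count_blobs(binary, value=1):
    """Blob discovery: count 4-connected regions of `value` by flood fill."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != value or seen[r][c]:
                continue
            blobs += 1  # found an unvisited pixel: a new blob starts here
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and binary[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs

# Hypothetical 3x4 grayscale image (0 = black, 255 = white).
image = [
    [200, 210,  30,  20],
    [190,  40,  35, 220],
    [ 25,  30, 230, 240],
]
binary = threshold(image)
light_pixels = count_pixels(binary)
light_blobs = count_blobs(binary)
dark_blobs = count_blobs(binary, value=0)
```

Here the image is binarized first, as the text describes, and the counting and blob steps then operate on the simplified black-and-white result.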
There exists a clear distinction between machine vision and computer vision. Computer vision is more general in its solution of visual problems, whereas machine vision is an engineering discipline mainly concerned with industrial problems.
Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state but commonly originate from the face or hand. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches have been made using cameras and computer vision algorithms to interpret sign language. However, the identification and recognition of posture, gait, proxemics, and human behaviors is also the subject of gesture recognition techniques. Gesture recognition can be seen as a way of building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to keyboard and mouse.
Gesture recognition enables humans to interface with the machine (HMI) and interact naturally without any mechanical devices. Using the concept of gesture recognition, it is possible to point a finger at the computer screen so that the cursor will move accordingly. This could potentially make conventional input devices such as mice, keyboards, and even touch-screens redundant.
Gesture recognition is useful for processing information from humans which is not conveyed through speech or type. As well, there are various types of gestures which can be identified by computers. Just as speech recognition can transcribe speech to text, certain types of gesture recognition software can transcribe the symbols represented through sign language into text. By using proper sensors (accelerometers and gyros) worn on the body of a patient and by reading the values from those sensors, robots can assist in patient rehabilitation.
Pointing gestures have very specific meanings in all human cultures. The use of gesture recognition, to determine where a person is pointing, is useful for identifying the context of statements or instructions. This application is of particular interest in the field of robotics. Controlling a computer through facial gestures is a useful application of gesture recognition for users who may not physically be able to use a mouse or keyboard. Eye tracking in particular may be of use for controlling cursor motion or focusing on elements of a display. Strong gesture recognition could allow users to forgo the traditional keyboard and mouse setup and accomplish frequent or common tasks using hand or face gestures made to a camera.
Gestures can also be used to control interactions within video games to make the game player's experience more interactive or immersive. For systems where the act of finding or acquiring a physical controller could require too much time, gestures can be used as an alternative control mechanism. Controlling secondary devices in a car, or controlling a television set, are examples of such usage. In affective computing, gesture recognition is used in the process of identifying emotional expression through computer systems. Through the use of gesture recognition, “remote control with the wave of a hand” of various devices is possible. The signal must indicate not only the desired response, but also which device is to be controlled.
Depth-aware or time-of-flight cameras can be used to generate a depth map of what is being seen through the camera at a short range, and this data can be used to approximate a 3D representation. These approximations can be effective for detection of hand gestures due to their short range capabilities. Using two cameras whose positions relative to one another are known (stereo cameras), a 3D representation can be approximated from the output of the cameras.
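As an illustration of how a stereo pair yields depth, the following minimal sketch applies the standard pinhole triangulation relation Z = f * B / d. It assumes an idealized, rectified pair of identical cameras; the function name and the sample focal length, baseline, and disparity values are hypothetical and are not drawn from any system described here.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (meters) of a point seen by a rectified stereo pair.

    focal_px     -- focal length expressed in pixels (assumed equal for both cameras)
    baseline_m   -- known distance between the two camera centers, in meters
    disparity_px -- horizontal shift of the point between the two images, in pixels
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical values: 800 px focal length, 0.5 m baseline, 40 px disparity.
hand_depth_m = depth_from_disparity(800.0, 0.5, 40.0)
```

Because depth is inversely proportional to disparity, nearby objects such as a hand produce large, easily measured disparities, which is consistent with the short-range effectiveness noted above.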
There are many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition there are limitations on the equipment used and image noise. Images or video may not be under consistent lighting, or in the same location. Items in the background or distinct features of the users may make recognition more difficult. The variety of implementations for image-based gesture recognition may also limit the viability of the technology for general usage. For example, an algorithm calibrated for one camera may not work for a different camera. The amount of background noise also causes tracking and recognition difficulties, especially when occlusions (partial and full) occur. Furthermore, the distance from the camera, and the camera's resolution and quality, also cause variations in recognition accuracy. In order to capture human gestures by visual sensors, robust computer vision methods are also required, for example for hand tracking and hand posture recognition or for capturing movements of the head, facial expressions, or gaze direction.
“Gorilla arm” was a side-effect that destroyed vertically-oriented touch-screens as a mainstream input technology despite a promising start in the early 1980s. Designers of touch-menu systems failed to notice that humans aren't designed to hold their arms in front of their faces making small motions. After more than a very few selections, the arm begins to feel sore, cramped, and oversized, making the touch screen unsuitable for anything longer than short-term use.
FIG. 1 is a two-dimensional projection depicting three-dimensional color gamuts (prior art). The overall horseshoe shape is a projection of the entire range of possible chromaticities. That is, the projection represents an outer boundary of the range, or gamut, of all colors perceivable by the human visual system. The triangle and its interior represent the visual color gamut producible by a typical computer monitor, which creates color by additively mixing various amounts of red, green, and blue lights, where the intensities of these lights are controlled by red/green/blue (RGB) device signals. The monitor gamut does not fill the entire visual color space. The corners of the triangle are the primary colors for this monitor gamut. In the case of a cathode ray tube (CRT), they depend on the colors of the phosphors of the monitor. The oval shape drawn with dotted lines represents the gamut producible by a device such as a color printer that is controlled by cyan/magenta/yellow (CMY) or cyan/magenta/yellow/black (CMYK) device signals. In the case of a printer, the colors actually produced in response to these signals are dependent upon the colorant properties, the colorant application processes, the viewing illumination, and the print media. For a color output device, its gamut is a certain complete subset of colors that can be accurately represented by the device. The gamut conceptually consists of the set of human-perceivable colors produced by driving the device with all valid combinations of device signals.
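The relationship between a chromaticity and a triangular monitor gamut can be sketched as a point-in-triangle test. The example below assumes the sRGB primaries in CIE 1931 xy coordinates as a stand-in for the monitor triangle of FIG. 1; the function names and the saturated-green test point are hypothetical. The test covers only the two-dimensional projection and ignores the third (luminance) dimension of a real gamut.

```python
def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_gamut_triangle(xy, primaries):
    """True if chromaticity `xy` lies inside or on the triangle of primaries."""
    r, g, b = primaries
    d1 = _cross(r, g, xy)
    d2 = _cross(g, b, xy)
    d3 = _cross(b, r, xy)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # Inside when the point is on the same side of all three edges.
    return not (has_neg and has_pos)

# sRGB primaries (CIE 1931 xy), assumed here to represent the monitor gamut.
SRGB_PRIMARIES = ((0.64, 0.33), (0.30, 0.60), (0.15, 0.06))

white_d65 = (0.3127, 0.3290)       # D65 white point
saturated_green = (0.08, 0.80)     # hypothetical point near the spectral locus

white_inside = in_gamut_triangle(white_d65, SRGB_PRIMARIES)
green_inside = in_gamut_triangle(saturated_green, SRGB_PRIMARIES)
```

A chromaticity for which the test returns False lies inside the horseshoe but outside the triangle, that is, it is perceivable but not producible by this particular display.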
Human-perceivable colors that cannot be produced by some particular color output device are said to be out-of-gamut for that device. For example, the pure red of a particular type of CRT or LCD monitor, produced by setting the RGB device signals to (R=max, G=0, B=0), may be out-of-gamut for a particular type of printer, which may be controlled via CMYK device signals. The converse is also possible. That is, a printer might be able to produce some colors which a monitor cannot produce. While processing a digital image, the most convenient color model used is the RGB model. In practice, the human-perceivable color associated with each image RGB value is often (tacitly or explicitly) assumed to be the color produced by the user's display monitor, or the color obtained by applying formulae of a standardized specification such as sRGB.
A color space may be defined by a number of characteristics or attributes. For example, the gamut of a device may be specified by hue, saturation, and brightness. Thus, a full color gamut must be represented in three dimensions (3D) of attributes. When a device signal vector is presented to an output device, the device produces a CIE (Commission internationale de l'éclairage) color. CIE colors can be denoted by XYZ tristimulus coordinates, or by derived coordinates such as L*a*b* or the like. For example, electrophotographic (EP) color printers use CMYK colorants to produce color, and the device signal vectors are 4-tuples consisting of C, M, Y, K percentage amounts. Allowing these percentages to independently vary through their entire physical range (or within a ‘valid’ range, considering process limitations) results in a set of colors in CIE color space filling a volume called the color gamut of the device.
In computing, a color gradient specifies a range of position-dependent colors, generally as an alternative to specifying a single color. The colors produced by a gradient vary continuously with position, producing smooth color transitions. A linear color gradient is specified by two points, and a color at each point. The colors along the line through those points are calculated using linear interpolation, then extended perpendicular to that line. Similarly, a real-world chromatic light source may be constructed to project one or more beams in which the color transition within the beam varies continuously with position. Conventionally, “luminance” is used as a measure of intensity per unit area that is measurable by instruments. In color spaces derived from vision models, this measurement is transformed and scaled to behave more like human perception and becomes “lightness”. The term “brightness” is usually reserved for a perceptually scaled value.
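The linear color gradient described above can be sketched as follows. A position is projected onto the line through the two defining points, and the resulting parameter drives the linear interpolation; this projection is what extends the colors perpendicular to the line. The function name, the RGB tuple representation, and the clamping of the parameter to the segment between the two points are assumptions made for this illustration.

```python
def linear_gradient_color(p, p0, p1, c0, c1):
    """Color at position p for a linear gradient from c0 at p0 to c1 at p1."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        raise ValueError("gradient points must be distinct")
    # Project p onto the gradient line; t = 0 at p0 and t = 1 at p1.
    t = ((p[0] - p0[0]) * dx + (p[1] - p0[1]) * dy) / length_sq
    # Assumption: hold the end colors beyond the two points (clamp t).
    t = max(0.0, min(1.0, t))
    # Interpolate each color channel linearly.
    return tuple(a + t * (b - a) for a, b in zip(c0, c1))

RED, BLUE = (255, 0, 0), (0, 0, 255)
start, end = (0, 0), (10, 0)

midpoint_color = linear_gradient_color((5, 7), start, end, RED, BLUE)
same_column_color = linear_gradient_color((5, -3), start, end, RED, BLUE)
```

The two sample positions lie on the same perpendicular to the gradient line, so they receive the same color, which illustrates the perpendicular extension.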
Increasingly, the methods to control real world devices such as computers and computer-based information systems are attempting to utilize natural human communication modalities. However, there are accuracy limits and confounding problems which limit the success of these control approaches. In addition, hand gesture control provides little feedback to the user, other than the control response from the system dependent on the interface. If that system fails to respond, the user is uninformed as to the causality and may be forced to repeat and modify actions until the desired response is achieved.
It would be advantageous if an improved method existed for tracking objects such as human gestures, for the purpose of creating a more intuitive computer user interface.
It would be advantageous if the above-mentioned object tracking method could be improved by using feedback to provide the user with visual cues.