1. Field of Invention
The present invention is directed to the field of object recognition in digital images. More specifically, it is directed towards the field of stereo computer vision and the recognition of specific target objects (or classes of objects) and their relative positions/orientations in a pair of stereoscopic images.
2. Description of Related Art
In the field of computer vision, it is generally desirable that an image not only be captured, but that a computer be able to identify and label various features within the captured image. Basically, a goal of computer vision is for the computer to duplicate the abilities of human vision by electronically perceiving and understanding the contents of a captured image. This involves extracting symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory. Thus, the field of computer vision includes methods for acquiring, processing, analyzing, and gleaning an understanding of imaged objects, in order to form decisions.
Computer vision has been used and applied to a large range of fields. Applications range from machine vision in industrial systems (where for example, a computer may inspect products on an assembly line or aid in the assembly of the products themselves), to human recognition, to research into artificial intelligence in an attempt for the computer to comprehend the world around it.
Various approaches to identifying features within a captured image are known in the industry. Many early approaches centered on the concept of identifying shapes. For example, if a goal was to identify a specific item, such as a wrench or a type of wrench, then a library of the different types of acceptable wrenches (i.e. examples of “true” wrenches) would be created. The outline shapes of the true wrenches would be stored, and a search for the acceptable shapes would be conducted on a captured image. Shapes within a captured image might be identified by means of a segmentation process where the outline of foreground objects is differentiated from an image's background. This approach of shape searching was successful when one had an exhaustive library of acceptable shapes, the library was not overly large, the subject of the captured images did not deviate from the predefined true shapes, and the background was not overly complicated.
For complex searches, however, this approach is not effective. The limitations of this approach become readily apparent when the subject being sought within an image is not static, but is prone to change. For example, cable harnesses have definite characteristics, but may take many different shapes and arrangements due to their wiring lacking a rigid structure. As another example, a human face has definite characteristics, but does not have an easily definable number of shapes and/or appearances it may adopt. It is to be understood that the term appearance is herein used to refer to color and/or light differences across an object, as well as other surface/texture variances. The difficulty in understanding a human face becomes even more acute when one considers that a human face is prone to shape distortion and/or change in appearance within the normal course of human life due to changes in emotion, expression, age, etc. It is self-apparent that compiling an exhaustive library of human faces, or any non-rigid or amorphous object, and their many variations is a practical impossibility. Thus, statistical methods have been developed to address these difficulties.
Developments in image recognition of objects that change their shape and appearance are discussed in “Statistical Models of Appearance for Computer Vision”, by T. F. Cootes and C. J. Taylor (hereinafter Cootes et al.), Imaging Science and Biomedical Engineering, University of Manchester, Manchester M13 9PT, U.K. email: t.cootes@man.ac.uk, http://www.isbe.man.ac.uk, Mar. 8, 2004, which is hereby incorporated in its entirety by reference.
As Cootes et al. explain, in order for a machine to be able to understand what it “sees”, it must make use of models that describe and label the expected structure being imaged. In the past, model-based vision has been applied successfully to images of man-made, rigid objects having limited and known variations. Model-based vision, however, has proven more difficult in interpreting images of non-rigid objects having unknown variations, such as images of natural subjects, which tend to be complex and variable. A problem is the variability of the subject being examined. To be useful, a model needs to be specific, that is, it should represent only true examples of the modeled subject. The model, however, also needs to be general and represent any plausible example (i.e. any possible true example) of the class of object it represents.
Recent developments have shown that this apparent contradiction can be handled by statistical models that can capture specific patterns of variability in shape and appearance. It has further been shown that these statistical models can be used directly in image interpretation.
To facilitate the application of statistical models, subjects to be interpreted are typically separated into classes. This permits the statistical analysis to use prior knowledge of the characteristics of a particular class to facilitate its identification and labeling, and even to overcome confusion caused by structural complexity, noise, or missing data.
Additionally, in order to facilitate further processing of identified and labeled subjects within a captured image, it is beneficial for the identified subject to be transformed into (i.e. be fitted onto) a predefined, “model” shape with predefined locations for labeled items. For example, although the human face may take many shapes and sizes, it can be conformed to a standard shape and size. Once conformed to the standard shape and size, the transformed face can then be further processed to determine its expression, determine its gaze direction, identify the individual to whom the face belongs, etc.
A method that uses this type of alignment is the active shape model. With reference to FIG. 1, the active shape model uses a predefined model of a class of object, such as human face 1A in the present example, and a list of predefined deformation parameters, each having corresponding deformation constraints, to permit the predefined model to be stretched and moved to attempt to align it with a subject image 2. Alternatively, the list of predefined deformation parameters may be applied to subject image 2, and have it be moved and deformed to attempt to align it with the predefined model 1A. This alternate approach has the added benefit that once subject image 2 has been aligned with the predefined model 1A, it will also be fitted to the shape and size of the predefined model 1A, which facilitates the identifying of individual parts of the subject image 2 in accordance with labels on the predefined model 1A.
For illustrative purposes, FIG. 1 shows predefined model (i.e. model face) 1A being fitted to subject image (i.e. subject face) 2. The example of FIG. 1 is an exaggerated case for illustration purposes. It is to be understood that a typical model face 1A would have constraints regarding its permissible deformation points relative to other points within itself. For example, if aligning the model face meant moving its left eye up one inch and moving its right eye down one inch, then the resultant aligned image would likely not be a human face, and thus such a deformation would typically not be permissible.
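By way of illustration only, the notion of a permissible deformation can be sketched in the linear form used by Cootes et al., in which a shape is generated as a mean shape plus a weighted sum of deformation modes, and each weight is limited to a few standard deviations of its training distribution. The function names, the flat coordinate layout, and the default limit of three standard deviations are assumptions made for this sketch and are not taken from FIG. 1.

```python
import math


def clamp_parameters(b, variances, k=3.0):
    """Limit each deformation weight b_i to +/- k standard deviations.

    variances[i] is the variance of mode i observed over the training set
    (assumed to be known); k=3.0 is an illustrative default.
    """
    limited = []
    for b_i, var_i in zip(b, variances):
        limit = k * math.sqrt(var_i)
        limited.append(max(-limit, min(limit, b_i)))
    return limited


def generate_shape(mean_shape, modes, b):
    """Return mean_shape + sum_i b[i] * modes[i].

    Shapes and modes are flat lists [x0, y0, x1, y1, ...] of equal length.
    """
    shape = list(mean_shape)
    for mode, b_i in zip(modes, b):
        for idx, component in enumerate(mode):
            shape[idx] += b_i * component
    return shape
```

In this sketch, a requested deformation that would move the two eyes far in opposite directions corresponds to a weight outside its limit, so it is clamped back to the nearest permissible value before the shape is generated.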
In the example of FIG. 1, the model face 1A is first placed roughly within the proximity of predefined points of interest, and typically placed near the center of subject face 2, as illustrated in image 3. By comparing the amount of misalignment resulting from moving model face 1A in one direction or another, and the results of adjusting a size multiplier in any of several predefined directions, one can determine how to better align model face 1A, as illustrated in image 4. An objective would be to align as closely as possible predefined landmarks, such as the pupils, nostrils, mouth corners, etc., as illustrated in image 5. Eventually, after a sufficient number of such landmark points have been aligned, the subject image 2 is warped onto model image 1A resulting in a fitted image 6 with easily identifiable and labeled points of interest that can be further processed to achieve specific objectives.
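As a minimal sketch of the global part of such an alignment, the following computes the least-squares translation, rotation, and uniform scale that bring a set of model landmarks onto the corresponding subject landmarks, together with the remaining misalignment. It assumes that landmark correspondences are already known and it ignores the non-rigid deformation parameters; the closed-form solution shown here stands in for the iterative search described above, and the function names are hypothetical.

```python
def fit_similarity(model_pts, subject_pts):
    """Least-squares scale/rotation/translation mapping model_pts onto subject_pts.

    Points are (x, y) tuples, treated as complex numbers x + iy.
    Returns (a, model_mean, subject_mean), where the complex factor a
    encodes scale (abs(a)) and rotation (phase of a).
    """
    z = [complex(x, y) for x, y in model_pts]
    w = [complex(x, y) for x, y in subject_pts]
    n = len(z)
    z_mean = sum(z) / n
    w_mean = sum(w) / n
    zc = [p - z_mean for p in z]  # centered model landmarks
    wc = [p - w_mean for p in w]  # centered subject landmarks
    numerator = sum(p.conjugate() * q for p, q in zip(zc, wc))
    denominator = sum(abs(p) ** 2 for p in zc)
    return numerator / denominator, z_mean, w_mean


def apply_similarity(a, z_mean, w_mean, pts):
    """Map points through the fitted transform."""
    out = []
    for x, y in pts:
        p = a * (complex(x, y) - z_mean) + w_mean
        out.append((p.real, p.imag))
    return out


def misalignment(pts_a, pts_b):
    """Sum of squared distances between corresponding landmarks."""
    return sum((xa - xb) ** 2 + (ya - yb) ** 2
               for (xa, ya), (xb, yb) in zip(pts_a, pts_b))
```

For example, if the subject landmarks are the model landmarks scaled by two, rotated by ninety degrees, and shifted, `fit_similarity` recovers a factor of magnitude two and `misalignment` of the mapped model against the subject is essentially zero.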
This approach, however, does not take into account changes in appearance, i.e. shadow, color, or texture variations for example. A more holistic, or global, approach that jointly considers the object's shape and appearance is the Active Appearance Model (AAM). Although Cootes et al. appear to focus primarily on the gray-level (or shade) feature of appearance, they do describe a basic principle that AAM searches for the best alignment of a model face (including both model shape parameters and model appearance parameters) onto a subject face while simultaneously minimizing misalignments in shape and appearance. In other words, AAM applies knowledge of the expected shapes of structures, their spatial relationships, and their gray-level appearance (or more generally color value appearance, such as RGB values) to restrict an automated system to plausible interpretations. Ideally, AAM is able to generate realistic images of sought objects. An example would be a model face capable of generating convincing images of any individual, changing their expression and so on. AAM thus formulates interpretation as a matching problem: given an image to interpret, structures are located and labeled by adjusting the model's parameters in such a way that it generates an ‘imagined image’ that is as similar as possible to the real thing.
Although AAM is a useful approach, implementation of AAM still poses several challenges. As stated above, an AAM machine generates results from the application of statistical analysis of a library of true samples to define distinguishing parameters and the parameters' permissible distortions. By the nature of the statistical analysis, the results will permit alignment only with a fraction of all true samples. If the subject category is prone to a wide range of changes, such as a cable harness that can take any distortion when dropped onto an assembly line (such as a conveyor belt), the model may not be able to properly align itself to an input subject image with characteristics beyond the norm defined by the shape or appearance model.
Another limitation of an AAM machine is that construction of the model (or canonical) image (i.e. model face 1A in the example of FIG. 1) requires much human intervention to identify the distinguishing features of the specific object being sought.
For example with reference to FIG. 2, model face 1A may be constructed from a library of training images 1 (i.e. true face images). Typically, a user manually places “landmark” points on each training image to outline specific features characteristic to the class of object being represented. The landmark points are ideally selected in such a way that the landmark points outline distinguishable features within the class common to every training image. For instance, a common feature within a face class may be the eyes, and when building a model of the appearance of an eye in a face image, landmark points may be placed at the corners of the eye since these features would be easy to identify in each training image. In addition to the landmark points, however, an active appearance model (AAM) machine also makes use of appearance data (i.e. shade data and/or color data and/or texture data, etc.) at various patches of each training image to create a distribution range of acceptable appearances for corresponding patches within model face 1A. This appearance data constitutes additional features in the overall statistical analysis.
Thus, an AAM machine may be too complicated and computationally intensive for practical machine vision applications in industrial assembly lines where the object class is prone to great deformation, such as when the object class is one or more types of wire harnesses. Thus, machine vision applications typically rely on more automated methods of identifying characteristic features and object edges in a captured image. Additionally, if a machine is expected to interact with an object in an assembly line, such as if a robot is intended to pick up a specific type of wire harness from a bin of multiple wire harnesses and attach (i.e. plug) a specific end of the harness to a specific receptacle, the machine will need some sort of depth perception to properly manipulate the robot.
Thus, edge detection algorithms are part of many image manipulation operations. Edge detection is fundamental to image processing and computer vision, particularly in the areas of feature detection and feature extraction. Edge detection aims to identify points, i.e. pixels, that outline objects within an image. There are many edge detection algorithms, but generally they attempt to identify pixels at which discontinuities occur, i.e. where the image brightness changes sharply. In the ideal case, the result of applying an edge detector to an image leads to a set of connected curves that indicate the boundaries of objects, the boundaries of surface markings, and discontinuities in surface orientation. Once the boundaries have been identified, various image processing operations may be applied to the digital image.
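For illustration, one common way of locating pixels at which brightness changes sharply is to threshold the magnitude of a Sobel gradient. The sketch below is only one of the many edge detection algorithms referred to above; the use of the Sobel operator, the representation of the image as a list of rows of gray levels, and the threshold parameter are assumptions made for this example.

```python
def sobel_edges(image, threshold):
    """Mark pixels whose Sobel gradient magnitude is at least `threshold`.

    image: list of rows, each a list of gray levels (numbers).
    Returns a same-sized grid of booleans; border pixels are left False
    because the 3x3 window does not fit there.
    """
    height = len(image)
    width = len(image[0])
    edges = [[False] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal brightness change (right column minus left column).
            gx = ((image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1])
                  - (image[r - 1][c - 1] + 2 * image[r][c - 1] + image[r + 1][c - 1]))
            # Vertical brightness change (bottom row minus top row).
            gy = ((image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1])
                  - (image[r - 1][c - 1] + 2 * image[r - 1][c] + image[r - 1][c + 1]))
            edges[r][c] = (gx * gx + gy * gy) ** 0.5 >= threshold
    return edges
```

Applied to an image whose left half is dark and whose right half is bright, this marks the pixels on either side of the vertical boundary and leaves the uniform regions unmarked.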
For example, FIG. 3A shows a typical digital image, and FIG. 3B shows the results of applying edge detection to the image of FIG. 3A. Edge detection may be designed to identify thick or thin lines, or may be optimized to separately identify thick and thin lines. In the example of FIG. 3B, both thick and thin lines are separately identified, which permits them to be separately processed. This permits the processing of the digital image to be more specialized by adjusting the size of a pixel-processing window according to line thickness. As a result, application of specific image processing algorithms, such as a bilateral filter, may be optimized along the edge of objects according to line thickness to achieve a sharper final image, as shown in FIG. 3C.
Another use of edge detection is feature detection. As an example, if one has a library of identifying features of a specific object, then one may search an input digital image for those identifying features in an effort to determine if an example of the specific object is present in the input digital image. When this is extended to multiple digital images of a common scene taken from different view angles, it is possible to index, i.e. match or correlate, feature points from one image to the other. This permits the combined processing of the multiple digital images.
For example in FIG. 4, images 7A, 7B, 7C and 7D each provide partial, and overlapping, views of a building in a real-world scene, but none provide a full view of the entire building. However, by applying edge detection and indexing (i.e. identifying matching pairs of) feature points in the four partial images 7A, 7B, 7C and 7D that correlate to the same real feature point in the real-world scene, it is possible to stitch together the four partial images (i.e. applying an image stitching tool) to create one composite image 7E of the entire building. The four partial images 7A, 7B, 7C and 7D of FIG. 4 are taken from the same view angle, but this approach may be extended to the field of correspondence matching, where images of a common scene are taken from different view angles.
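The indexing of feature points between two images can be sketched as a nearest-neighbor search over feature descriptors. The example below assumes that each feature point has already been reduced to a numeric descriptor vector, and it uses a nearest-neighbor ratio test to discard ambiguous pairs; the descriptor representation, the ratio value of 0.8, and the function names are assumptions introduced for illustration rather than details of the images of FIG. 4.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_features(desc_a, desc_b, ratio=0.8):
    """Return index pairs (i, j) pairing descriptors of image A with image B.

    A pair is kept only when the best candidate in B is clearly closer
    than the second-best candidate (nearest-neighbor ratio test).
    """
    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted((squared_distance(da, db), j) for j, db in enumerate(desc_b))
        if not ranked:
            continue
        best_dist, best_j = ranked[0]
        # Distances are squared, so the ratio is squared as well.
        if len(ranked) == 1 or best_dist < (ratio ** 2) * ranked[1][0]:
            matches.append((i, best_j))
    return matches
```

A descriptor that lies equally close to two candidates in the other image is left unmatched, since it cannot be reliably attributed to a single real feature point in the scene.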
Images of a common scene taken from different view angles are the basis for stereo vision and depth perception. In this case, corresponding feature points in two images taken from different view angles (and/or different fields of vision) of the same subject (or scene) can be combined to create a perspective view of the scene. Thus, imaging a scene from two different view points (i.e. from two different fields of vision, FOV) creates stereo vision, which provides depth information about objects in the scene.
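As a simple illustration of how a matched pair of feature points yields depth, the sketch below assumes an idealized, rectified stereo pair: two identical pinhole cameras with parallel optical axes, separated horizontally by a known baseline. Under these assumptions, which are made for this example only, the depth of a point is the focal length times the baseline divided by the disparity between its horizontal positions in the two images. The parameter names are hypothetical.

```python
def depth_from_disparity(x_left, x_right, focal_length_px, baseline):
    """Depth of a scene point seen at column x_left in the left image
    and column x_right in the right image of a rectified stereo pair.

    focal_length_px is in pixels; the result is in the units of `baseline`.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline / disparity
```

For instance, with a focal length of 800 pixels and a baseline of 0.1 meter, a feature point found at column 420 in the left image and column 400 in the right image has a disparity of 20 pixels and lies about 4 meters from the cameras. Nearer points produce larger disparities.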
This ability would be particularly helpful in the field of robotics and automated assembly/construction. In these applications, a machine having stereo vision and the ability to discern (i.e. identify) target items would have the ability to independently retrieve the target item and use it in an assembly.
Implementing such vision capabilities, however, is still a challenge, even in a specialized assembly line where the number of possible target object variants is limited. The challenges become even more daunting when the target objects are amorphous, or otherwise prone to change in shape and/or appearance, such as in the case of wire harnesses.
It is an object of the present invention to provide a stereo vision capability suitable for discerning a target object in a perspective (3D) scene.
It is a further object of the present invention to provide such a stereo vision capability suitable for use with wire harnesses, and other amorphous objects.