1. Field of Invention
The present invention is generally directed to the field of robotic manipulation of objects. More specifically, it is directed towards robotic recognition and manipulation of cable harnesses.
2. Description of Related Art
In the field of automated, or robotic, manufacturing or assembly, the ability to identify assembly components, and to manipulate and attach them to other components, is very important. Often, this is achieved by use of assembly stations, where each assembly station is limited to one component having one known orientation and requiring simplified manipulation.
It would be advantageous, however, for a machine to be able to select a needed component from a supply of multiple components, identify any key assembly features of the component, and manipulate the selected component as needed for assembly. This would require that the machine have some capacity for computer vision, object recognition and manipulation.
Before discussing some details of computer vision, however, it is beneficial to first discuss how computer vision has been used in the field of robotic (or machine) vision. Two important aspects of robotic vision are the identifying of an object and the estimating of its pose, i.e., its 3-dimensional (i.e., 3D) orientation relative to a known reference point and/or plane.
Since most cameras take 2-dimensional (i.e., 2D) images, many approaches attempt to identify objects in a 2D image and infer some 3D information from the 2D image. For example, in “Class-specific grasping of 3D objects from a single 2D image”, by Chiu et al., The 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 18-22, 2010, Chiu et al. describe superimposing 2D panels in the form of simplified 2D shapes on the surface of objects in a 2D image. The 2D panels on each imaged object form a set defining the object in the 2D image. The generated 2D panels can then be compared with a library of panel sets that define different types of predefined 3D objects, such as a car. Each library panel set is compared from different view directions with the generated 2D panels of the imaged object in an effort to find a relatively close match. If a sufficiently close match is found, then in addition to having identified the object, one has the added benefit of having a good guess as to its orientation given the matched orientation of the 2D panel set of the predefined 3D object in the library.
A second example is found in “Human Tracking using 3D Surface Colour Distributions” by Roberts et al., Image and Vision Computing, 2006. In this example, Roberts et al. describe a system where simplified 2D shapes are superimposed on known rigid parts of a human body (such as the head, torso, arms, etc.) as shown in a 2D video image. The movements of the superimposed, simplified 2D shapes follow the movements of the moving human in the 2D video. By analyzing the movements of the 2D shapes, it is possible to discern the movement of the imaged human.
As is stated above, however, identifying a desired object in an image is only part of the solution, particularly when dealing with moving objects. In such cases, one further needs to discern information about the viewed object's pose, or orientation, and possible movement through space. Various approaches have been used to address this need.
For example, in “3D Pose Estimation for Planes”, by Xu et al., Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on Sep. 27, 2009-Oct. 4, 2009, Xu et al. describe using a plane outline on the surface of a target object in a non-stereo image, and estimating the plane's normal direction to estimate the object's pose orientation.
A second example is found in “Robust 3D Pose Estimation and Efficient 2D Region-Based Segmentation from a 3D Shape Prior”, by Dambreville et al., European Conference on Computer Vision (ECCV), 2008. Dambreville et al. describe segmenting a rigid, known, target object in a 2D image, and estimating its 3D pose by fitting, onto the segmented target object, the best-fitting 2D projection of known 3D poses of the known target object.
A third example is provided in “Spatio-temporal 3D Pose Estimation of Objects in Stereo Images” by Barrois et al., Proceedings of the 6th international conference on Computer vision systems, ICVS'08. Barrois et al. describe using a 3D object's normal velocity (defined by the object's main direction of movement) at one point in time to estimate its pose at another point in time along a movement path.
Returning to the subject of computer vision, it is generally desirable that an image not only be captured, but that a computer be able to identify and label various features within the captured image. Basically, a goal of computer vision is for the computer to duplicate the abilities of human vision by electronically perceiving and understanding the contents of a captured image. This involves extracting symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory. Thus, the field of computer vision includes methods for acquiring, processing, analyzing, and gleaning an understanding of imaged objects, in order to form decisions.
Various approaches to identifying features within a captured image are known in the industry. Many early approaches centered on the concept of identifying shapes. For example, if a goal was to identify a specific item, such as a wrench or a type of wrench, then a library of the different types of acceptable wrenches (i.e., examples of “true” wrenches) would be created. The outline shapes of the true wrenches would be stored, and a search for the acceptable shapes would be conducted on a captured image. Shapes within a captured image might be identified by means of a segmentation process where the outline of foreground objects is differentiated from an image's background. This approach of shape searching was successful when one had an exhaustive library of acceptable shapes, the library was not overly large, the subject of the captured images did not deviate from the predefined true shapes, and the background surrounding the target object was not overly complicated.
For complex searches, however, this approach is not effective. The limitations of this approach become readily apparent when the subject being sought within an image is not static, but is prone to change. For example, a human face has definite characteristics, and its distortion is limited, but it still does not have an easily definable number of shapes and/or appearances it may adopt. It is to be understood that the term appearance is herein used to refer to color and/or light differences across an object, as well as other surface/texture variances. Other objects may be prone to far more deformation than a human face. For example, cable harnesses have definite characteristics, but may take many different shapes and arrangements because their wiring has little, if any, rigid structure. Nonetheless, it is still helpful to look at some of the computer vision approaches used in face recognition, as some aspects in this field can be applied to computer vision, in general.
Although an exhaustive library of samples of a known rigid body may be compiled for identification purposes, it is self-evident that compiling an exhaustive library of human faces, or any non-rigid or amorphous object, and their many variations is a practical impossibility. Thus, statistical methods have been developed to address these difficulties.
Developments in image recognition of objects that change their shape and appearance are discussed in “Statistical Models of Appearance for Computer Vision”, by T. F. Cootes and C. J. Taylor (hereinafter Cootes et al.), Imaging Science and Biomedical Engineering, University of Manchester, Manchester M13 9PT, U.K. email: t.cootes@man.ac.uk, http://www.isbe.man.ac.uk, Mar. 8, 2004, which is hereby incorporated in its entirety by reference.
As Cootes et al. explain, in order for a machine to be able to understand what it “sees”, it must make use of models that describe and label the expected structure being imaged. In the past, model-based vision has been applied successfully to images of man-made, rigid objects having limited and known variations. Model-based vision, however, has proven more difficult in interpreting images of non-rigid objects having unknown variations, such as images of natural subjects, which tend to be complex and variable. A problem is the variability of the subject being examined. To be useful, a model needs to be specific, that is, it should be limited to representing true examples of the modeled subject. The model, however, also needs to be general and flexible enough to represent other plausible examples (i.e., other possible true examples not specifically available in a sample library) of the class of object it represents. It has been shown that this apparent contradiction can be handled by statistical models that can capture specific patterns of variability in shape and appearance. It has further been shown that these statistical models can be used directly in image interpretation.
To facilitate the application of statistical models, subjects to be interpreted are typically separated into classes (i.e., categories of objects). This permits the statistical analysis to use prior knowledge of the characteristics of a particular class of object to facilitate its identification and labeling, and even to overcome confusion caused by structural complexity, noise, or missing data.
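By way of illustration only, the balance between a specific model and a general model may be sketched as follows. This is a deliberately simplified stand-in and not the method of Cootes et al., who analyze the joint variation of all landmarks by principal component analysis; here each landmark coordinate is treated independently. The function names, the three-standard-deviation limit, and the sample shapes are assumptions introduced solely for this sketch.

```python
from statistics import mean, pstdev

def build_shape_model(training_shapes):
    """Per-coordinate mean and standard deviation over already-aligned training shapes.

    Each shape is a flat list [x0, y0, x1, y1, ...] of landmark coordinates.
    """
    columns = list(zip(*training_shapes))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def is_plausible(shape, model, max_sigma=3.0):
    """Accept a shape only if every coordinate lies within max_sigma deviations of its mean."""
    means, stdevs = model
    return all(abs(v - m) <= max_sigma * s for v, m, s in zip(shape, means, stdevs))

# Hypothetical library of "true" examples, two landmarks per shape.
training = [
    [0.0, 0.0, 10.0, 0.0],
    [0.2, 0.1, 10.4, -0.1],
    [-0.2, -0.1, 9.6, 0.1],
]
model = build_shape_model(training)

unseen_but_similar = [0.1, 0.0, 10.2, 0.0]   # not in the library, yet accepted (general)
distorted = [0.0, 5.0, 10.0, 0.0]            # first landmark far out of range, rejected (specific)
print(is_plausible(unseen_but_similar, model), is_plausible(distorted, model))
```

In this sketch the model accepts a shape that was never in the sample library while rejecting one whose first landmark is displaced well beyond the observed variation.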
Additionally, in order to facilitate further processing of identified and labeled subjects within a captured image, it is beneficial for the identified subject to be transformed into (i.e., be fitted onto) a predefined, “model” shape with predefined locations for labeled items. For example, although the human face may take many shapes and sizes, it can be conformed to a standard shape and size. Once conformed to the standard shape and size, the transformed face can then be further processed to determine its expression, determine its gaze direction, identify the individual to whom the face belongs, etc.
A method that uses this type of alignment is the active shape model. With reference to FIG. 1, the active shape model uses a predefined model of a class of object, such as human face 1A in the present example, and a list of predefined deformation parameters, each having corresponding deformation constraints, to permit the predefined model to be stretched and moved to attempt to align it with a subject image 2. Alternatively, the list of predefined deformation parameters may be applied to subject image 2, and have it be moved and deformed to attempt to align it with the predefined model 1A. This alternate approach has the added benefit that once subject image 2 has been aligned with the predefined model 1A, it will also be fitted to the shape and size of the predefined model 1A, which facilitates the identifying of individual parts of the subject image 2 in accordance with labels on the predefined model 1A.
For illustrative purposes, FIG. 1 shows predefined model (i.e., model face) 1A being fitted to subject image (i.e., subject face) 2. The example of FIG. 1 is an exaggerated case for illustration purposes. It is to be understood that a typical model face 1A would have constraints regarding its permissible deformation points relative to other points within itself. For example, if aligning the model face meant moving its left eye up one inch and moving its right eye down one inch, then the resultant aligned image would likely not be a human face, and thus such a deformation would typically not be permissible. It is to be understood, however, that this limitation would not apply to non-rigid objects that can take large amounts of deformation, such as cable harnesses.
In the example of FIG. 1, the model face 1A is first placed roughly within the proximity of predefined points of interest, and typically placed near the center of subject face 2, as illustrated in image 3. By comparing the amount of misalignment resulting from moving model face 1A in one direction or another, and the results of adjusting a size multiplier in any of several predefined directions, one can determine how to better align model face 1A, as illustrated in image 4. An objective would be to align as closely as possible predefined landmarks, such as the pupils, nostrils, mouth corners, etc., as illustrated in image 5. Eventually, after a sufficient number of such landmark points have been aligned, the subject image 2 is warped onto model image 1A resulting in a fitted image 6 with easily identifiable and labeled features of interest that can be further processed to achieve specific objectives.
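As a minimal sketch of the positioning and sizing steps described above, the following example moves and uniformly scales a set of model landmarks so that their centroid and overall size match those of a set of subject landmarks, and reports the remaining misalignment. It assumes corresponding landmarks are already known and ignores rotation and local deformation. The landmark coordinates and function names are hypothetical and are not taken from FIG. 1.

```python
from math import sqrt

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def rms_radius(points, center):
    """Root-mean-square distance of the points from a center, used as a size measure."""
    cx, cy = center
    return sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points))

def align_translation_scale(model_pts, subject_pts):
    """Move and uniformly scale model landmarks to match the subject's centroid and size."""
    mc, sc = centroid(model_pts), centroid(subject_pts)
    scale = rms_radius(subject_pts, sc) / rms_radius(model_pts, mc)
    return [(sc[0] + scale * (x - mc[0]), sc[1] + scale * (y - mc[1])) for x, y in model_pts]

def misalignment(a, b):
    """Sum of squared distances between corresponding landmarks."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2 for (ax, ay), (bx, by) in zip(a, b))

# Hypothetical landmarks: two eyes, nose tip, two mouth corners.
model_face = [(-1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (-0.5, -1.0), (0.5, -1.0)]
# Hypothetical subject: the same face, twice as large and located elsewhere in the image.
subject_face = [(2.0 * x + 30.0, 2.0 * y + 40.0) for x, y in model_face]

fitted = align_translation_scale(model_face, subject_face)
print(misalignment(model_face, subject_face), misalignment(fitted, subject_face))
```

Because the hypothetical subject differs from the model only in position and size, the misalignment after fitting is essentially zero; a real subject would leave a residual to be reduced by the constrained deformations discussed above.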
This approach, however, does not take into account changes in appearance, i.e., shadow, color, or texture variations for example. A more holistic, or global, approach that jointly considers the object's shape and appearance is the Active Appearance Model (AAM). Although Cootes et al. appear to focus primarily on the gray-level (or shade) feature of appearance, they do describe a basic principle that AAM searches for the best alignment of a model face (including both model shape parameters and model appearance parameters) onto a subject face while simultaneously minimizing misalignments in shape and appearance. In other words, AAM applies knowledge of the expected shapes of structures, their spatial relationships, and their gray-level appearance (or more generally color value appearance, such as RGB values) to restrict an automated system to plausible interpretations. Ideally, AAM is able to generate realistic images of sought objects. An example would be a model face capable of generating convincing images of an individual, such as by changing the individual's expression and so on. AAM thus formulates interpretation as a matching problem: given an image to interpret, structures are located and labeled by adjusting the model's parameters in such a way that it generates an ‘imagined image’ that is as similar as possible to the real thing.
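The matching formulation may be illustrated with a toy, one-dimensional example. The sketch below assumes a hypothetical appearance model consisting of a mean gray-level profile plus a single mode of variation, and finds the model parameter whose synthesized ("imagined") profile is closest, in the least-squares sense, to an observed profile. An actual AAM combines many shape and appearance parameters and is fitted iteratively; all values and names here are assumptions made for illustration.

```python
def fit_appearance(observed, mean_profile, mode):
    """Least-squares weight c such that mean_profile + c * mode best matches observed."""
    residual = [o - m for o, m in zip(observed, mean_profile)]
    c = sum(r * v for r, v in zip(residual, mode)) / sum(v * v for v in mode)
    synthesized = [m + c * v for m, v in zip(mean_profile, mode)]
    error = sum((o - s) ** 2 for o, s in zip(observed, synthesized))
    return c, synthesized, error

mean_profile = [100.0, 120.0, 140.0, 120.0, 100.0]  # hypothetical mean gray levels
mode = [1.0, 0.5, 0.0, -0.5, -1.0]                  # hypothetical mode of variation
observed = [110.0, 125.0, 140.0, 115.0, 90.0]       # constructed as mean + 10 * mode

c, synthesized, error = fit_appearance(observed, mean_profile, mode)
print(c, error)
```

Here the recovered parameter reproduces the observed profile exactly because the observation was constructed from the model; for real image data the remaining error indicates how well the model explains the image.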
Although AAM is a useful approach, implementation of AAM still poses several challenges. As stated above, an AAM machine generates results from the application of statistical analysis of a library of true samples to define distinguishing parameters and the parameters' permissible distortions. By the nature of the statistical analysis, the results will permit alignment only with a fraction of all true samples. If the subject category is prone to a wide range of changes, such as a cable harness that can take any distortion when dropped onto an assembly line (such as a conveyor belt) or when picked up, the model may not be able to properly align itself to an input subject image with characteristics beyond the norm defined by the shape or appearance model.
Another limitation of an AAM machine is that construction of the model (or canonical) image (i.e., model face 1A in the example of FIG. 1) requires much human intervention to identify the distinguishing features of the specific object being sought.
For example, with reference to FIG. 2, model face 1A may be constructed from a library of training images 1 (i.e., true face images). Typically, a user manually places “landmark” points on each training image to outline specific features characteristic to the class of object being represented. The landmark points are ideally selected in such a way that the landmark points outline distinguishable features within the class common to every training image. For instance, a common feature within a face class may be the eyes, and when building a model of the appearance of an eye in a face image, landmark points may be placed at the corners of the eye since these features would be easy to identify in each training image. In addition to the landmark points, however, an active appearance model (AAM) machine also makes use of appearance data (i.e., shade data and/or color data and/or texture data, etc.) at various patches of each training image to create a distribution range of acceptable appearances for corresponding patches within model face 1A. This appearance data constitutes additional features in the overall statistical analysis.
Thus, an AAM machine may be too complicated and computationally intensive for practical machine vision applications in industrial assembly lines where the object class is prone to great deformation, such as when the object class is one or more types of wire harnesses. Machine vision applications therefore typically rely on more automated methods of identifying characteristic features and object edges in a captured image. Additionally, if a machine is expected to interact with an object in an assembly line, such as if a robot is intended to pick up a specific type of wire harness from a bin of multiple wire harnesses and attach (i.e., plug) a specific end of the harness to a specific receptacle, the machine will need some sort of depth perception to properly manipulate the robot.
It is further noted that edge detection algorithms are part of many image manipulation operations. Edge detection is fundamental to image processing and computer vision, particularly in the areas of feature detection and feature extraction. Edge detection aims to identify points, i.e., pixels, that outline objects within an image. There are many edge detection algorithms, but generally they attempt to identify pixels at which discontinuities occur, i.e., where the image brightness changes sharply. In the ideal case, the result of applying an edge detector to an image leads to a set of connected curves that indicate the boundaries of objects, the boundaries of surface markings, and discontinuities in surface orientation. Once the boundaries have been identified, various image processing operations may be applied to the digital image.
For example, FIG. 3A shows a typical digital image, and FIG. 3B shows the results of applying edge detection to the image of FIG. 3A. Edge detection may be designed to identify thick or thin lines, or may be optimized to separately identify thick and thin lines. In the example of FIG. 3B, both thick and thin lines are separately identified, which permits them to be separately processed. This permits the processing of the digital image to be more specialized by adjusting the size of a pixel-processing window according to line thickness. As a result, application of a specific image processing algorithm, such as a bilateral filter, may be optimized along the edge of objects according to line thickness to achieve a sharper final image, as shown in FIG. 3C.
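One common way of locating sharp brightness changes is to compute the image gradient with the Sobel operator and keep the pixels whose gradient magnitude exceeds a threshold. The following minimal sketch applies that operator to a small synthetic image; it is offered as one example of an edge detector, not as the detector used to produce FIG. 3B, and the image contents, threshold, and function names are assumptions.

```python
def sobel_magnitude(image):
    """Gradient magnitude of a grayscale image given as a list of rows; border pixels stay 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal brightness change (right column minus left column, center-weighted).
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical brightness change (bottom row minus top row, center-weighted).
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

def edge_pixels(image, threshold):
    """Return (x, y) coordinates whose gradient magnitude meets the threshold."""
    magnitude = sobel_magnitude(image)
    return [(x, y) for y, row in enumerate(magnitude) for x, m in enumerate(row) if m >= threshold]

# Hypothetical 5-row by 6-column image: dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = edge_pixels(image, threshold=500)
print(edges)
```

In this synthetic image the only pixels reported lie in the two columns on either side of the dark-to-bright boundary, which is where the brightness discontinuity occurs.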
Another use of edge detection is feature detection. As an example, if one has a library of identifying features of a specific object, then one may search an input digital image for those identifying features in an effort to determine if an example of the specific object is present in the input digital image. When this is extended to multiple digital images of a common scene taken from different view angles, it is possible to index, i.e., match or correlate, feature points from one image to the other. This permits the combined processing of the multiple digital images.
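The indexing of feature points between two images may be sketched as a nearest-neighbor search over feature descriptors. The example below assumes each feature point has already been summarized as a short numeric descriptor, and it accepts a pairing only when the best candidate is clearly closer than the second-best candidate (a ratio test). The descriptors, the ratio value, and the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def match_features(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_b[j] is an unambiguous best match for desc_a[i].

    Assumes desc_b holds at least two descriptors.
    """
    matches = []
    for i, d in enumerate(desc_a):
        ranked = sorted(range(len(desc_b)), key=lambda j: squared_distance(d, desc_b[j]))
        best, second = ranked[0], ranked[1]
        # Distances are squared, so the ratio is squared as well.
        if squared_distance(d, desc_b[best]) <= ratio ** 2 * squared_distance(d, desc_b[second]):
            matches.append((i, best))
    return matches

# Hypothetical descriptors for feature points found in two views of one scene.
image_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
image_b = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

matches = match_features(image_a, image_b)
print(matches)
```

The third descriptor of the first image is equally close to two candidates in the second image, so it is left unmatched rather than being paired arbitrarily.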
For example, in FIG. 4, images 7A, 7B, 7C and 7D each provide partial, and overlapping, views of a building in a real-world scene, but none provides a full view of the entire building. However, by applying edge detection and indexing (i.e., identifying matching pairs of) feature points in the four partial images 7A, 7B, 7C and 7D that correlate to the same real feature point in the real-world scene, it is possible to stitch together the four partial images (i.e., apply an image stitching tool) to create one composite image 7E of the entire building. The four partial images 7A, 7B, 7C and 7D of FIG. 4 are taken from the same view angle, but this approach may be extended to the field of correspondence matching, where images of a common scene are taken from different view angles.
Images of a common scene taken from different view angles are the basis for stereo vision and depth perception. In this case, corresponding feature points in two images taken from different view angles (and/or different fields of vision) of the same subject (or scene) can be combined to create a perspective view of the scene. Thus, imaging a scene from two different view points (i.e., from two different fields of vision, FOV) creates stereo vision, which provides depth information about objects in the scene.
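For the simplest stereo arrangement, in which two identical cameras are side by side and their images are rectified so that corresponding points lie on the same image row, depth follows from the horizontal shift (disparity) of a matched point: Z = f * B / d, where f is the focal length in pixels, B is the distance between the cameras, and d is the disparity in pixels. The sketch below applies this relationship; the camera parameters and pixel coordinates are hypothetical values chosen for illustration.

```python
def disparity(x_left_px, x_right_px):
    """Horizontal shift of a matched point between the left and right images."""
    return x_left_px - x_right_px

def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical rig: 700-pixel focal length, cameras 0.1 m apart.
near = depth_from_disparity(700.0, 0.1, disparity(420.0, 350.0))  # disparity of 70 pixels
far = depth_from_disparity(700.0, 0.1, disparity(420.0, 406.0))   # disparity of 14 pixels
print(near, far)
```

A larger disparity corresponds to a closer point, which is why matched feature points from two view points suffice to recover relative depth in the scene.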
This ability would be particularly helpful in the field of robotics and automated assembly/construction. In these applications, a machine having stereo vision and the ability to discern (i.e., identify) target items would have the ability to independently retrieve the target item and use it in an assembly.
Implementing such vision capabilities, however, is still a challenge, even in a specialized assembly line where the number of possible target object variants is limited. The challenges become even more daunting when the target objects are amorphous, or otherwise prone to change in shape and/or appearance, such as in the case of wire harnesses.
It is an object of the present invention to provide a system for identifying and manipulating cable harnesses for use in robotic assembly lines.
It is a further object of the present invention to make use of 3D information for determining pose information of cable harnesses.
It is a further object of the present invention to provide a 3D visual system suitable for use in a robotic assembly line.