Machine vision has become an essential component in many modern manufacturing processes. One particular use for machine vision is to determine the alignment or pose of a particular component or surface so that the component or surface can be operated on by a tool or robotic manipulator that requires knowledge as to how to orient itself to engage and pick up the component, or perform work on the component or surface. For example, a parts picker that lifts bolts from a bin requires knowledge as to the location of the head of the bolt and the direction in which it is oriented to properly grasp the bolt and direct it to a given target bolt hole in a device under construction.
The power and usefulness of vision systems in manufacturing and other applications have increased in recent years due to significant increases in computing power. Capabilities that were unavailable only a few years ago are now available in relatively basic systems.
While current commercially available vision systems are extremely effective for determining alignment in a wide range of applications, they typically rely upon a two-dimensional, or “2D” (e.g., x, y), representation of the viewed subject. That is, the acquired pixels constituting an image of the subject are arranged in a two-dimensional array. Each pixel can be addressed by its (x, y) coordinates. The value of each pixel in the image is a grayscale value representing the amount of light striking the corresponding sensing element in the camera. Such two-dimensional representations are processed with respect to a model in an image field consisting of x and y coordinates. However, the alignment of many objects is not completely resolvable in only two dimensions due to their geometry and surface coloration/shading. In many cases, the geometrical complexity of the object, and/or the need to accurately align with respect to an element of the object that projects in a third dimension, may limit the effectiveness of algorithms and tools that are based on the acquisition of two-dimensional images. Additionally, many objects appear very different in a two-dimensional image after undergoing only small amounts of tilt relative to the camera and its associated image plane. Thus, an object whose alignment was clearly recognized by the system in one orientation may be less recognizable or unrecognizable to the system in a slightly different orientation.
The majority of imaging systems today acquire two-dimensional images of a three-dimensional (“3D”) scene or object. That is, a three-dimensional geometric shape is resolved by the system into a two-dimensional image. A significant amount of useful information about a 3D scene or 3D object is lost when that scene or object is projected onto a 2D image. That lost information is the distance of the various parts of the scene or object from the camera, and is typically termed “depth information” or “depth data”. The loss of this depth data may make it significantly more difficult to accurately and robustly determine the 3D pose of objects.
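The loss of depth data under projection can be illustrated with a minimal sketch. It assumes an idealized pinhole camera with the optical axis along z; the function name `project` and the focal length of 500 pixels are hypothetical and chosen only for illustration.

```python
def project(point, focal_px=500.0):
    """Project a camera-frame 3D point (x, y, z) onto the image plane.

    Assumes an ideal pinhole camera; focal_px is a hypothetical focal
    length expressed in pixels.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_px * x / z, focal_px * y / z)


# Two points on the same viewing ray, one twice as far from the camera.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)

# Both land on the same image location, (125.0, 250.0), so the 2D image
# alone cannot distinguish them: the z (depth) value has been lost.
print(project(near), project(far))
```

Any point along a given viewing ray maps to the same pixel, which is why a single 2D image cannot, by itself, recover the distance to the object.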
Currently, there are commercially available devices that allow acquisition of visual data in order to produce a 3D representation (depth data) of the above-described 3D scene or object. Such devices are herein termed “3D sensors”. A popular type of 3D sensor presently in use is a stereo camera head. Stereo camera heads generally comprise multiple 2D cameras arranged in a predetermined, typically fixed orientation with respect to each other. Each of the 2D cameras acquires a 2D image of the 3D scene or 3D object from a different vantage point with respect to the scene or object.
Several techniques can be employed by 3D sensors in determining the depth data. One technique, called Light Detection and Ranging or LIDAR, measures the time delay between transmission of a light pulse and receipt of the reflected light pulse. In alternate examples, structured light, or devices that employ a scanning laser, can also be used to generate depth data. A particular depth data-determination technique employs triangulation. This technique locates a feature in the scene or on the object in two or more of the images respectively acquired from each of the 2D cameras and, using the relative position of the feature in each of the images, performs triangulation to recover the depth information for that feature. In the particular example of a stereo camera head, the output of each 2D camera is a 2D array of pixels (an image), each with an associated intensity value. The 2D arrays from the cameras are combined using geometric algorithms to generate the corresponding z (depth) value for each pixel. The z values are typically stored in a depth image. Depth images are typically the same size in width and height as the acquired grayscale or color image, but their pixel values represent depth or distance from the camera. This depth or distance provides the z-component of a 3D representation of a scene or object. However, the process of computing z values consumes additional time when compared with the acquisition of only a two-dimensional image. From acquired depth images, found 3D points of the representation can be derived. These found 3D points can be used in subsequent processes.
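The depth computations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular 3D sensor: the stereo formula assumes two rectified, parallel pinhole cameras; the back-projection assumes a pinhole model with principal point (cx, cy); and a depth value of zero is treated as "no measurement". All function and parameter names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_distance(round_trip_s):
    """Distance from a pulse's round-trip time; the light travels out and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def stereo_depth(focal_px, baseline, disparity_px):
    """Depth of a feature seen by two rectified cameras: z = f * B / d.

    disparity_px is the horizontal shift of the feature between the two
    images; the result is in the same units as the baseline.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline / disparity_px


def depth_image_to_points(depth, focal_px, cx, cy):
    """Convert a depth image (rows of z values) into a list of 3D points.

    Pixels with z <= 0 are skipped (assumed here to mean "no data").
    """
    points = []
    for v, row in enumerate(depth):          # v: row index (image y)
        for u, z in enumerate(row):          # u: column index (image x)
            if z > 0:
                x = (u - cx) * z / focal_px
                y = (v - cy) * z / focal_px
                points.append((x, y, z))
    return points


if __name__ == "__main__":
    print(time_of_flight_distance(2e-8))
    print(stereo_depth(focal_px=500.0, baseline=0.1, disparity_px=25.0))
    print(depth_image_to_points([[0.0, 2.0], [4.0, 0.0]], 2.0, 0.0, 0.0))
```

The last function shows how "found 3D points" can be derived from a depth image: each valid pixel contributes one point whose x and y follow from the pixel position scaled by its depth.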
After acquiring a 3D representation of the scene or object, the remaining task in 3D alignment is to determine the best transform between a pre-existing 3D model of the scene or object and the acquired scene or object. The model can be provided by acquiring images of the scene or object at known alignments and/or can be provided synthetically, by entering the locations of various features as data points. The transform between the 3D model data and the 3D acquired data is the pose, and determining it is the goal of 3D alignment.
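To make the notion of a pose concrete, the sketch below represents a rigid pose as a 3x3 rotation matrix and a translation vector, applies it to model points, and scores how well the result matches the found points. It only evaluates a candidate pose for points whose correspondences are already known; it does not solve for the pose. The names `apply_pose` and `rms_error` are hypothetical.

```python
import math


def apply_pose(rotation, translation, point):
    """Map a model point into the acquired frame: p' = R * p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def rms_error(rotation, translation, model_points, found_points):
    """Root-mean-square distance between transformed model points and
    found points, assuming the two lists are in corresponding order."""
    if len(model_points) != len(found_points) or not model_points:
        raise ValueError("need equal-length, non-empty point lists")
    total = 0.0
    for m, f in zip(model_points, found_points):
        moved = apply_pose(rotation, translation, m)
        total += sum((a - b) ** 2 for a, b in zip(moved, f))
    return math.sqrt(total / len(model_points))


# Candidate pose: 90-degree rotation about z, then a shift of 1 along x.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)

model = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
found = [apply_pose(R, t, p) for p in model]   # synthetic "acquired" points

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(rms_error(R, t, model, found))                       # correct pose
print(rms_error(identity, (0.0, 0.0, 0.0), model, found))  # wrong pose
```

A lower error indicates a better alignment, so a score of this kind is one way a search over candidate poses could compare them.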
A “brute force” approach to computing the pose is to iterate over the possible point-to-point correspondences between the found 3D points and a set of model 3D points (provided by any acceptable technique), and then, for each set of correspondences, compute the pose that best aligns the found points to the model points. Typical 3D representations produced by current 3D sensors are most often in the form of 3D point clouds. These 3D point clouds often contain thousands to hundreds of thousands of 3D points. The large number of found 3D points, and the very large number of possible correspondences between found points and model points, renders a brute force approach to determining the pose intractable.
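The scale of the problem can be estimated with a short counting sketch. It assumes that each candidate pose is hypothesized from k one-to-one point correspondences (three non-collinear correspondences are enough to fix a rigid pose), and it counts how many such sets exist. The function name and the point counts used are hypothetical, chosen only to show the growth.

```python
import math


def correspondence_sets(num_found, num_model, k=3):
    """Number of ways to pick k model points and match each one to a
    distinct found point (one-to-one correspondences)."""
    return math.comb(num_model, k) * math.perm(num_found, k)


if __name__ == "__main__":
    for n in (10, 100, 1000, 100_000):
        # Same number of found and model points, three correspondences each.
        print(n, correspondence_sets(n, n))
```

Even with only three correspondences per hypothesis, the count grows roughly with the sixth power of the number of points, and a full one-to-one matching of n points grows as n factorial, which is why exhaustive enumeration is impractical for point clouds of realistic size.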
Achieving a rapid, robust, and accurate 3D alignment solution using a 3D sensor system is a technically challenging problem. In an industrial setting, the solution must be achieved accurately and quickly for each object being aligned. The availability of higher-power computing systems offers opportunities to address this problem. Thus, it is desirable to provide a system and method for 3D alignment of objects that is robust, efficient and reliable, and that accommodates the additional processing overhead encountered in 3D image acquisition and processing. This system and method should enable accurate alignment of a large variety of 3D objects, and should enable such alignment at speeds that accommodate the normal rate of operation on a manufacturing production line or other industrial environment.