Object recognition is part of many computer vision applications. In some cases, the object is assumed to be planar and the transformation of the object in the image is restricted to a certain class, for example, to similarity transformations or projective transformations. A multitude of matching approaches of various types that are able to solve this task is available in the literature. A survey of matching approaches is given by Brown (1992). In most cases, the model of the object is generated from an image of the object. Two examples of such approaches that fulfill the requirements for industrial applications, i.e., fast computation, high accuracy, and robustness to noise, object occlusions, clutter, and contrast changes, are presented in EP 1,193,642 and by Ulrich et al. (2003).
However, in many applications the object to be recognized is not planar but has a 3D shape and is imaged from an unknown viewpoint, because the object moves in 3D space in front of a fixed camera, the camera moves around a fixed object, or both the object and the camera move simultaneously. This complicates the object recognition task dramatically because the relative movement between camera and object results in different perspectives that cannot be expressed by 2D transformations. Furthermore, it is not sufficient to determine a 2D transformation; the full 3D pose of the object with respect to the camera has to be determined. The 3D pose is defined by the six parameters of the 3D rigid transformation (three translation and three rotation parameters), which describes the relative movement of the object with respect to the camera. Different techniques have been developed for visually recognizing a 3D object in one image. They can be grouped into feature-based techniques and view-based techniques. In addition, there are approaches that use more information than a single image to recognize 3D objects, e.g., two images (e.g., Sumi and Tomita, 1998) or one image in combination with a range image (e.g., US 2005/0286767). The latter approaches are not discussed here, because they differ too much from this invention.
Feature-based techniques are based on the determination of correspondences between distinct features of the 3D object and their projections in the 2D search image. If the 3D coordinates of the features are known, the 3D pose of the object can be computed directly from a sufficiently large set (e.g., four points) of those 2D-3D correspondences.
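The relation between the six pose parameters and a set of 2D-3D correspondences can be illustrated with a minimal sketch. It does not solve for the pose; it only applies a candidate pose to the 3D features, projects them, and measures how well the projections agree with the matched 2D points. The pinhole camera without lens distortion, the rotation convention R = Rz(gamma) Ry(beta) Rx(alpha), and all function names are assumptions made for illustration and are not taken from the cited references.

```python
import math


def rotation_matrix(alpha, beta, gamma):
    """Rotation R = Rz(gamma) * Ry(beta) * Rx(alpha), angles in radians.

    The order of the three rotations is an assumed convention.
    """
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def to_camera(pose, point):
    """Apply the rigid transformation (tx, ty, tz, alpha, beta, gamma) to a 3D point."""
    tx, ty, tz, alpha, beta, gamma = pose
    r = rotation_matrix(alpha, beta, gamma)
    x, y, z = point
    return (
        r[0][0] * x + r[0][1] * y + r[0][2] * z + tx,
        r[1][0] * x + r[1][1] * y + r[1][2] * z + ty,
        r[2][0] * x + r[2][1] * y + r[2][2] * z + tz,
    )


def project(point_cam, focal_length=1.0):
    """Pinhole projection of a point given in camera coordinates (z > 0 assumed)."""
    x, y, z = point_cam
    return (focal_length * x / z, focal_length * y / z)


def reprojection_error(pose, correspondences, focal_length=1.0):
    """Root-mean-square image distance between projected 3D features and their 2D matches.

    correspondences is a list of ((X, Y, Z), (u, v)) pairs.
    """
    total = 0.0
    for point_3d, (u, v) in correspondences:
        pu, pv = project(to_camera(pose, point_3d), focal_length)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(correspondences))
```

For noise-free correspondences generated with a known pose, the error is zero (up to rounding) at that pose and grows as the candidate pose departs from it. A pose estimation method would search for the six parameters that minimize such an error.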
In one form of the feature-based techniques, distinct manually selected features of the 3D object are searched in the 2D search image (e.g., U.S. Pat. No. 6,580,821, U.S. Pat. No. 6,816,755, CA 2555159). The features can be either artificial marks or natural features, e.g., corner points of the 3D object or points that have a characteristically textured neighborhood. Typically, templates are defined at the position of the features in one image of the object. In the search image, the features are searched with template matching. Several drawbacks are associated with these approaches: In general, it is difficult to robustly find the features in the image because changing viewpoints result in occluded and perspectively distorted features. Template matching methods cannot cope with this kind of distortion. Consequently, these approaches are only suited for a very limited range of viewpoint changes. In addition, approaches based on artificial marks are not flexible with regard to changing objects. It is often difficult to add the markers and to measure their 3D coordinates. Furthermore, many objects are not suited for adding markers to their surface.
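The template matching step can be sketched as follows. The sketch slides a template over a gray-value image and scores each position with the normalized cross-correlation (NCC). NCC is used here only as one common similarity measure; the cited patents may use a different one. Images are plain lists of rows, and the function names are hypothetical.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation of two equally sized gray-value patches."""
    a = [value for row in patch for value in row]
    b = [value for row in template for value in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    # A patch without any gray-value variation carries no structure to match.
    return numerator / denominator if denominator > 0 else 0.0


def match_template(image, template):
    """Return ((row, col), score) of the best template position in the image."""
    t_rows, t_cols = len(template), len(template[0])
    best_score, best_pos = -2.0, None  # NCC lies in [-1, 1]
    for r in range(len(image) - t_rows + 1):
        for c in range(len(image[0]) - t_cols + 1):
            patch = [row[c:c + t_cols] for row in image[r:r + t_rows]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

The search covers translations only. A feature that is rotated, scaled, or perspectively distorted by a viewpoint change no longer resembles the template pixel by pixel, which is the limitation described above.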
Another form of feature-based recognition techniques eliminates the restriction to a limited range of viewpoints by using features that are invariant under perspective transformations (e.g., US 2002/0181780, Beveridge and Riseman, 1995, David et al., 2003, Gavrila and Groen, 1991). For example, in Horaud (1987), linear structures are segmented in the 2D search image and intersected with each other to obtain intersection points. It is assumed that the intersection points in the image correspond to corner points of adjacent edges of the 3D model. To obtain the correct correspondences between the 3D corner points of the model and the extracted 2D intersection points, several methods are available in the literature (Hartley and Zisserman, 2000, US 2002/0181780). The advantage of these feature-based approaches is that the range of viewpoints is not restricted.
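The intersection of two segmented linear structures can be computed compactly in homogeneous coordinates. The following is a minimal sketch, assuming that each linear structure is given by two end points and is extended to an infinite line before intersecting; it is not the specific procedure of Horaud (1987), and the function names are hypothetical.

```python
def line_through(p, q):
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through two 2D points.

    This is the cross product of the homogeneous points (x1, y1, 1) and (x2, y2, 1).
    """
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def intersect(line1, line2, eps=1e-12):
    """Intersection point of two homogeneous lines, or None if they are parallel."""
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    # The cross product of the two lines is the homogeneous intersection point.
    x = b1 * c2 - c1 * b2
    y = c1 * a2 - a1 * c2
    w = a1 * b2 - b1 * a2
    if abs(w) < eps:
        return None
    return (x / w, y / w)
```

For example, the lines through (0, 0), (2, 2) and through (0, 2), (2, 0) intersect in (1, 1). Each such intersection point is a candidate for the image of a 3D corner point of the model.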
Furthermore, there are generic feature-based approaches that are able to detect one kind of 3D object without the need for a special 3D model of the object. One example is given in U.S. Pat. No. 5,666,441, where 3D rectangular objects are detected. First, linear structures are segmented in the image. Intersections of at least three of these linear structures are formed and grouped together in order to detect the 3D rectangular objects. Because no information about the size of the object is used, the pose of the object cannot be determined with this approach. Naturally, these kinds of feature-based approaches are not flexible with regard to changing objects. They can detect only those objects for which they were developed (3D rectangular objects in the above-cited example).
In general, feature-based recognition techniques suffer from the fact that the extraction of the features cannot be carried out robustly with respect to clutter and occlusions. Furthermore, the correct assignment of the extracted 2D features to the 3D features is an NP-complete combinatorial problem, which makes these techniques unsuitable for industrial applications, where fast recognition is essential.
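The size of the assignment problem can be made concrete with a short calculation. The sketch below counts the one-to-one assignments of m model features to n extracted image features, assuming a brute-force enumeration in which every model feature is visible; clutter and occlusions, which enlarge the search space further, are ignored. The function name is hypothetical.

```python
import math


def assignment_count(num_image_features, num_model_features):
    """Number of ordered one-to-one assignments of model features to image features.

    This equals n! / (n - m)! for n image features and m model features.
    """
    n, m = num_image_features, num_model_features
    if m > n:
        return 0
    return math.factorial(n) // math.factorial(n - m)
```

Assigning only four model features to twenty extracted image features already yields 20 * 19 * 18 * 17 = 116280 candidate assignments, and the count grows rapidly with the number of features.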
View-based recognition techniques are based on the comparison of the 2D search image with 2D projections of the object seen from various viewpoints. The desired 3D pose of the object is the pose that was used to create the 2D projection that is the most similar to the 2D search image.
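The principle can be sketched as a search for the best-matching stored view. In this minimal sketch, each view is a pair of a pose and a small binary projection, and the similarity is the fraction of agreeing pixels. Both the data layout and the similarity measure are assumptions chosen for brevity and are not taken from the cited references.

```python
def similarity(projection, image):
    """Fraction of pixels on which two equally sized binary images agree."""
    total = 0
    agree = 0
    for row_p, row_i in zip(projection, image):
        for a, b in zip(row_p, row_i):
            total += 1
            agree += int(a == b)
    return agree / total


def recognize(views, search_image):
    """Return (pose, score) of the stored view most similar to the search image.

    views is a list of (pose, projection) pairs, where pose holds the six
    parameters that were used to create the projection.
    """
    best_pose, best_projection = max(
        views, key=lambda view: similarity(view[1], search_image)
    )
    return best_pose, similarity(best_projection, search_image)
```

The search image is compared with every stored view, so the run time grows linearly with the number of views.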
In one form of the view-based recognition, a model of the 3D object is learned from multiple training images of the object taken from different viewpoints (e.g., U.S. Pat. No. 6,526,156). The 2D search image is then compared to each of the training images. The pose of the training image that most resembles the 2D search image is returned as the desired object pose. Unfortunately, the acquisition of the training images and their comparison with the 2D search image is very costly because of the very large number of training images that are necessary to cover a reasonably large range of allowed viewpoints. What is more, this form of view-based recognition is typically not invariant to illumination changes, especially for objects that show only little texture. These problems make this approach unsuitable for industrial applications.
In another form of the view-based recognition, the 2D projections are created by rendering a 3D model of the 3D object from different viewpoints (e.g., U.S. Pat. No. 6,956,569, US 2001/0020946, CA 2535828). Again, there is the problem of the very large number of 2D projections that is necessary to cover a reasonably large range of allowed viewpoints. To cope with this, pose clustering techniques have been introduced (e.g., Munkelt, 1996). But even then, the number of 2D projections that must be compared with the 2D search image remains too large, so that these view-based recognition techniques are not suited for industrial applications. Often the number of views is reduced by creating the views such that the camera is always directed to the center of the 3D object, but then objects that do not appear in the center of the image cannot be found because of the resulting projective distortions. Another unsolved problem of these view-based recognition techniques is the creation of the 2D projections such that they are suitable for the comparison with the 2D search image. Approaches that use a realistically rendered 2D projection (U.S. Pat. No. 6,956,569) are not invariant to illumination changes because the appearance of object edges varies with the illumination direction. This problem can be reduced, but not eliminated, by the use of texture (US 2001/0020946). Other approaches create a model by extracting feature points in the images of the different sampled viewpoints and train a classifier using a point descriptor (e.g., Lepetit, 2004). In the search image, feature points are likewise extracted and classified using the output of the point descriptor. Finally, the most likely 3D pose is returned. Unfortunately, this kind of approach relies strongly on a distinct texture on the object's surface, and hence is not suitable for most industrial applications.
Approaches that use only a wireframe projection of the 3D model face the problem that many of the projected edges are not visible in the search image, especially on slightly curved surfaces, which are typically approximated by planar triangles in the 3D model of the object. Often, the techniques that are used for the comparison of the 2D projections with the 2D search image are not robust against clutter and occlusions (Ulrich, 2003). Finally, the accuracy of the object pose determined by pure view-based approaches is limited by the step size with which the allowed range of viewpoints is sampled.
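The trade-off between the number of views and the attainable accuracy can be illustrated with a short calculation. The sketch assumes that three rotation angles are sampled on a regular grid with the same step, that the nearest sampled view is returned without any subsequent refinement, and that each angle range is a whole multiple of the step. The function names and the example values are hypothetical.

```python
def views_needed(angle_ranges_deg, step_deg):
    """Number of views on a regular grid over the given angle ranges (in whole degrees).

    Each range is sampled at both end points and at every step in between.
    """
    count = 1
    for angle_range in angle_ranges_deg:
        count *= angle_range // step_deg + 1
    return count


def worst_case_error_deg(step_deg):
    """Largest per-angle distance from a true angle to its nearest sampled angle."""
    return step_deg / 2
```

For three angles that each cover 90 degrees, a step of 5 degrees requires 19 * 19 * 19 = 6859 views and leaves an error of up to 2.5 degrees per angle, whereas a step of 1 degree requires 91 * 91 * 91 = 753571 views. Sampling the object distance as well multiplies these counts further.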