3D object recognition is part of many computer vision applications. Compared to image-based computer vision, which deals with planar intensity images, 3D computer vision deals with three-dimensional information and is especially important where non-planar objects and surfaces need to be inspected or manipulated. Many different methods and sensors have been developed for acquiring the 3D surface information of a scene. Many of those methods return a so-called range image, i.e. an image in which the value at each point represents the distance of the scene surface from the camera. If the sensor is calibrated and its internal parameters are known, the range image can be transformed into a 3D scene in which the X-, Y- and Z-coordinates of every point are known. Additionally, the information from multiple sensors can be fused into a 3D scene that cannot be expressed as a single range image. Contrary to prior art, the disclosed method is able to recognize free-form objects of any shape in arbitrary 3D scenes and does not require an approximate pose as a priori information.
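The transformation of a range image into 3D coordinates described above can be sketched as follows. This is only an illustrative sketch assuming a pinhole camera model; the focal lengths `fx`, `fy` and principal point `cx`, `cy` are hypothetical internal parameters, not quantities specified in the text.

```python
import numpy as np

def range_image_to_points(depth, fx, fy, cx, cy):
    """Back-project a range (depth) image into a set of 3D points,
    assuming a pinhole camera with the given internal parameters.
    Each pixel (u, v) with depth z maps to
    X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    # One 3D point per pixel, as an (N, 3) array.
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Fusing data from several calibrated sensors would then amount to transforming each such point set into a common world coordinate system and concatenating the results, yielding a scene that is no longer representable as one range image.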
Descriptor- or feature-based techniques are based on finding correspondences between 3D points in the scene and 3D points on the object by using surface descriptors. Surface descriptors express the surface around a point using a low-dimensional representation. Typically, the surface descriptors are calculated for all points on the surface of the object and stored in a database. For recognizing the object in a scene, the surface descriptors are calculated for points in the scene, and corresponding object points are searched using the pre-computed database. Once enough correspondences have been found, the pose of the object can be recovered. Extensive overviews of different surface descriptors are given in Campbell and Flynn (A Survey Of Free-Form Object Representation and Recognition Techniques, 2001, Computer Vision and Image Understanding, Vol. 81, Issue 2, pp. 166-210), Mamic and Bennamoun (Representation and recognition of 3D free-form objects, 2002, Digital Signal Processing, Vol. 12, Issue 1, pp. 47-76) and Mian et al. (Automatic Correspondence for 3D Modeling: An Extensive Review, 2005, International Journal of Shape Modeling, Vol. 11, Issue 2, p. 253).
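The offline/online split described above can be sketched as follows. This is a minimal illustration, not any particular published descriptor: the descriptors are treated as abstract feature vectors, and the database lookup is a brute-force nearest-neighbour search (real systems would use an indexing structure such as a k-d tree or hash table).

```python
import numpy as np

def build_database(model_descriptors):
    # Offline phase: compute and store one descriptor per model point.
    # Here the descriptors are assumed to be given as feature vectors.
    return np.asarray(model_descriptors, dtype=float)

def find_correspondences(scene_descriptors, database):
    # Online phase: for each scene descriptor, look up the index of the
    # model point whose descriptor is most similar (Euclidean distance).
    scene = np.asarray(scene_descriptors, dtype=float)
    d = np.linalg.norm(scene[:, None, :] - database[None, :, :], axis=2)
    return np.argmin(d, axis=1)
```

From a sufficient number of such scene-to-model point correspondences, a rigid 3D pose can then be recovered, e.g. by least-squares alignment combined with an outlier-rejection scheme.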
Several drawbacks are associated with approaches that rely on correspondence search with local descriptors: First, local descriptors cannot discriminate between similar surface parts on an object, such as larger planar patches. Such similar parts lead to equal or similar local descriptors and in turn to incorrect correspondences between the scene and the object. Increasing the radius of influence such that non-similar surface parts are included in the construction of the descriptor leads to sensitivity against missing parts of the surface, which are frequent in the case of occlusion or sensor problems. Second, present local descriptors are not fast enough for real-time applications and typically require processing times of several seconds. Third, local descriptors are sensitive to clutter, i.e. scene parts that do not belong to the object of interest. Finally, local descriptors require a dense representation of the 3D scene data, which is often not available.
Several approaches use so-called geometric primitives to detect an object in a scene. A geometric primitive is a simple geometric object, such as a plane, a cylinder or a sphere. Compared to free-form objects, geometric primitives are easier to detect in a scene due to their intrinsic symmetries. Several methods exist that detect primitives or objects composed of geometric primitives in scenes. In EP-A-2 047 403, the 3D object is partitioned into geometric primitives. Such geometric primitives are then searched for in the 3D scene, and the object is recognized by identifying primitives in the scene that are similar to primitives in the object. Other methods use a variant of the generalized Hough transform to detect geometric primitives in the scene, for example Katsoulas (Robust extraction of vertices in range images by constraining the Hough transform, 2003, Lecture Notes in Computer Science, Vol. 2652, pp. 360-369), Rabbani and Heuvel (Efficient Hough transform for automatic detection of cylinders in point clouds, 2005, Proceedings of the 11th Annual Conference of the Advanced School for Computing and Imaging (ASCI'05), pp. 60-65), and Zaharia and Preteux (Hough transform-based 3D mesh retrieval, 2001, Proceedings of the SPIE Conf. 4476 on Vision Geometry X, pp. 175-185). All methods that rely on geometric primitives have the disadvantage that they do not work with general free-form objects.
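The Hough-style voting underlying the cited primitive-detection methods can be illustrated with a deliberately simplified case: detecting the dominant plane when the plane normal is already known, so that each scene point votes only for the plane offset d in n·x = d. This toy sketch (bin width and normal are assumed, not taken from the cited methods) shows the accumulator idea; a full generalized Hough transform would additionally discretize the normal direction, or the full parameter space of cylinders or vertices.

```python
import numpy as np

def hough_plane_offset(points, normal, bin_width=0.05):
    # Each point votes for the offset d of a plane n . x = d with the
    # given (fixed) normal; the accumulator peak is the dominant plane.
    normal = normal / np.linalg.norm(normal)
    d = points @ normal                       # one vote per point
    bins = np.round(d / bin_width).astype(int)
    values, counts = np.unique(bins, return_counts=True)
    return values[np.argmax(counts)] * bin_width
```

The same voting scheme extends to richer primitives at the cost of a higher-dimensional accumulator, which is why such methods remain restricted to simple parametric shapes rather than general free-form objects.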
Several methods for recognizing 3D objects in range images have been developed, for example in EP-A-1 986 153. Such methods work on single range images, as returned by a variety of sensor systems. However, all range-image-based systems are limited to the 3D information acquired from a single range image and cannot cope with general 3D information from other 3D sensors or from the combination of different range images. Additionally, they are not fast enough for real-time systems, as they typically require a brute-force search of the parameter space.
Several methods for refining a known 3D pose of an object are known. Such methods require as input an approximate 3D pose of the object in the scene, and increase the accuracy of that pose. Several such methods have been developed, such as Iterative Closest Points (see for example Zhang (Iterative point matching for registration of free-form curves, 1994, International Journal of Computer Vision, Vol. 7, Issue 3, pp. 119-152), EP-A-2 026 279 and Fitzgibbon (Robust registration of 2D and 3D point sets, 2003, Image and Vision Computing, Vol. 21, Issue 13-14, pp. 1145-1153)). The major disadvantage of pose refinement is that the input pose needs to be close enough to the correct pose; otherwise, the methods fail to converge. Good approximations of the correct pose are, however, difficult to obtain for 3D scenes that in practical applications often contain clutter, occlusion and noise.
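The Iterative Closest Points scheme referred to above can be sketched in its simplest point-to-point form: alternate between matching each transformed model point to its nearest scene point and solving for the least-squares rigid transform (Kabsch/SVD). This is a generic textbook sketch, not the specific algorithm of any of the cited references; iteration count and convergence handling are simplified.

```python
import numpy as np

def best_rigid_transform(src, dst):
    # Least-squares rotation R and translation t with dst ~ R @ src + t,
    # computed via the SVD of the cross-covariance matrix (Kabsch).
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(src, dst, iterations=20):
    # Minimal point-to-point ICP: starting from the given approximate
    # pose (here, src already roughly aligned with dst), alternate
    # nearest-neighbour matching and rigid alignment.
    cur = src.copy()
    for _ in range(iterations):
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=2)
        matched = dst[np.argmin(d, axis=1)]
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur
```

The failure mode discussed above is visible directly in the sketch: if the initial pose is far off, the nearest-neighbour step produces wrong matches and the iteration converges to an incorrect local minimum.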
For the foregoing reasons, there is a need for a method that allows efficient recognition of arbitrary free-form 3D objects and recovery of their 3D pose in general 3D scenes.