Object recognition and localization is part of many machine vision applications. Knowing the precise location of the object of interest in the scene is crucial for any subsequent manipulation and inspection task. Many different techniques have been developed to find objects in intensity images or 3D scans of a scene, a task commonly referred to as matching. This document describes a matching approach that simultaneously uses both intensity images and 3D scans of a scene to find an object by optimizing the consistency between model and scene in both the 3D data and the intensity data.
Descriptor- or feature-based techniques find correspondences between points in the scene and points on the model by using descriptors. Such descriptors express the 3D surface or the intensities around a given point using a low-dimensional representation. They are typically computed off-line for all points of the model, or for a selected subset, and stored in a database. To recognize the object in a scene, descriptors are calculated for points in the scene and the corresponding model points are searched for in the pre-computed database. Once enough correspondences have been found, the pose of the object can be recovered. Extensive overviews of different 3D surface descriptors are given in Campbell and Flynn (A Survey Of Free-Form Object Representation and Recognition Techniques, 2001, Computer Vision and Image Understanding, Vol. 81, Issue 2, pp. 166-210), Mamic and Bennamoun (Representation and recognition of 3D free-form objects, 2002, Digital Signal Processing, Vol. 12, Issue 1, pp. 47-76) and Mian et al. (Automatic Correspondence for 3D Modeling: An Extensive Review, 2005, International Journal of Shape Modeling, Vol. 11, Issue 2, p. 253). Commonly used feature descriptors in intensity data include edges, as described in Canny (A Computational Approach To Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986), the SIFT keypoint descriptor by Lowe (Object recognition from local scale-invariant features, Proceedings of the International Conference on Computer Vision. 2. pp. 1150-1157; see also U.S. Pat. No. 6,711,293), and the SURF keypoint descriptor by Bay et al. (SURF: Speeded Up Robust Features, Computer Vision and Image Understanding (CVIU), 2008, Vol. 110, No. 3, pp. 346-359; see also U.S. Pat. No. 8,165,401). Many other feature point descriptors have been proposed in the literature.
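The off-line/on-line pipeline described above can be illustrated with a minimal sketch. The toy patch descriptor used here (a gray-value histogram of a square patch, augmented with its mean and standard deviation) and all function names are assumptions for illustration only; they do not correspond to any of the cited descriptors, and pose recovery from the correspondences is omitted:

```python
import numpy as np

def patch_descriptor(image, x, y, radius=4, bins=8):
    """Toy descriptor: gray-value histogram of a square patch, plus mean and std."""
    patch = image[y - radius:y + radius + 1, x - radius:x + radius + 1]
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    return np.concatenate([hist / max(hist.sum(), 1), [patch.mean(), patch.std()]])

def build_database(model_image, model_points):
    """Off-line phase: compute and store descriptors for selected model points."""
    return [(p, patch_descriptor(model_image, *p)) for p in model_points]

def find_correspondences(scene_image, scene_points, database):
    """On-line phase: match each scene descriptor to its nearest model descriptor."""
    matches = []
    for p in scene_points:
        d = patch_descriptor(scene_image, *p)
        model_point, _ = min(database, key=lambda entry: np.linalg.norm(entry[1] - d))
        matches.append((p, model_point))
    return matches
```

In practice the linear scan over the database would be replaced by an approximate nearest-neighbor index, and the resulting correspondences would be fed into a robust pose estimator.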
Methods that rely on feature descriptors often fail if the object of interest has little distinctive shape or intensity information, because the descriptors are then less discriminative.
Several approaches use so-called geometric primitives to detect an object in a scene. A geometric primitive is a simple geometric object, such as a plane, a cylinder, or a sphere. Compared to free-form objects, geometric primitives are easier to detect in a scene due to their intrinsic symmetries. Several methods detect primitives, or objects composed of geometric primitives, in scenes. In EP-A-2 047 403, the 3D object is partitioned into geometric primitives. These primitives are then searched for in the 3D scene, and the object is recognized by identifying primitives in the scene that are similar to primitives in the object. Other methods use a variant of the generalized Hough transform to detect geometric primitives in the scene, for example Katsoulas (Robust extraction of vertices in range images by constraining the hough transform, 2003, Lecture Notes in Computer Science, Vol. 2652, pp. 360-369), Rabbani and Heuvel (Efficient hough transform for automatic detection of cylinders in point clouds, 2005, Proceedings of the 11th Annual Conference of the Advanced School for Computing and Imaging (ASCI'05), pp. 60-65), and Zaharia and Preteux (Hough transform-based 3D mesh retrieval, 2001, Proceedings of the SPIE Conf. 4476 on Vision Geometry X, pp. 175-185). All methods that rely on geometric primitives share the disadvantage that they do not work with general free-form objects.
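To illustrate how a Hough-style voting scheme can detect a geometric primitive, the following sketch finds the dominant plane n·x = d in a point cloud by discretizing the normal direction and letting every point vote for a distance bin. The function name, grid resolution, and distance range are assumptions for illustration only and are not taken from the cited methods:

```python
import numpy as np
from itertools import product

def hough_plane(points, n_angles=20, d_bins=50, d_range=(-2.0, 2.0)):
    """Detect the dominant plane n.x = d by Hough voting.

    Candidate normals are sampled on a coarse angular grid; every point
    casts a vote for the distance bin of its projection onto each normal.
    """
    thetas = np.linspace(0.0, np.pi, n_angles)
    phis = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    best_votes, best_n, best_d = -1, None, None
    for theta, phi in product(thetas, phis):
        n = np.array([np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi),
                      np.cos(theta)])
        votes, edges = np.histogram(points @ n, bins=d_bins, range=d_range)
        i = votes.argmax()
        if votes[i] > best_votes:
            best_votes = votes[i]
            best_n = n
            best_d = 0.5 * (edges[i] + edges[i + 1])
    return best_n, best_d
```

The accuracy of the result is limited by the bin widths; practical implementations refine the winning cell, e.g. by a least-squares fit to the voting points.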
Several methods for recognizing 3D objects in range images have been developed, for example in EP-A-1 986 153. Such methods work on single range images, as returned by a variety of sensor systems. However, all range-image-based systems are limited to the 3D information acquired from a single range image and cannot cope with general 3D information from other 3D sensors or from the combination of multiple range images. Additionally, they are not fast enough for real-time systems, as they typically require a brute-force search in the parameter space.
Several methods for refining a known 3D pose of an object are known. Such methods require as input an approximate 3D pose of the object in the scene and increase the accuracy of that pose. Several such methods optimize the pose using 3D data only, such as Iterative Closest Points (see for example Zhang (Iterative point matching for registration of free-form curves, 1994, International Journal of Computer Vision, Vol. 7, Issue 3, pp. 119-152), EP-A-2 026 279 and Fitzgibbon (Robust registration of 2D and 3D point sets, 2003, Image and Vision Computing, Vol. 21, Issue 13-14, pp. 1145-1153)). Other methods for refining the pose of an object use only intensity data, such as Wiedemann et al. (Recognition and Tracking of 3D Objects, IEEE International Conference on Robotics and Automation 2009, 1191-1198). The major disadvantage of pose refinement is that the input pose needs to be close enough to the correct pose; otherwise, the methods fail to converge. Good approximations of the correct pose are, however, difficult to obtain for 3D scenes that in practical applications often contain clutter, occlusion, noise, and multiple object instances.
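The iterative-closest-points idea referenced above can be sketched as follows. This is a minimal point-to-point variant with brute-force correspondence search, given for illustration only; practical implementations use spatial data structures (e.g. k-d trees), outlier rejection, and robust weighting:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, via SVD."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(scene, model, iterations=50):
    """Point-to-point ICP: alternate closest-point matching and pose estimation."""
    R_total = np.eye(3)
    t_total = np.zeros(3)
    cur = model.copy()
    for _ in range(iterations):
        # brute-force nearest neighbors; O(n*m), fine for a small sketch
        dists = np.linalg.norm(cur[:, None, :] - scene[None, :, :], axis=2)
        matches = scene[dists.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
        R_total = R @ R_total          # accumulate the incremental pose
        t_total = R @ t_total + t
    return R_total, t_total
```

As noted above, convergence requires the initial pose to be close to the correct one; with a poor start, the closest-point correspondences are wrong and the iteration settles in a local minimum.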
For the foregoing reasons, there is a need for a method that allows efficient recognition of arbitrary free-form 3D objects and recovery of their 3D pose in scenes which were captured with one or more intensity images and one or more 3D sensors. For the purpose of this document, scenes captured with intensity images and 3D sensors will also be called multimodal scenes.