A common task in computer vision applications is to estimate the pose of objects from images acquired of a scene. Herein, pose is defined as the six-degree-of-freedom (6-DOF) location and orientation of an object, i.e., three translational and three rotational parameters. Pose estimation in scenes with clutter, e.g., unwanted objects and noise, and occlusions, e.g., due to multiple overlapping objects, can be quite challenging. Furthermore, pose estimation in 2D images and videos is sensitive to illumination, shadows, and lack of features, e.g., objects without texture.
Pose estimation from range images, in which each pixel includes an estimate of a distance to the objects, does not suffer from these limitations. Range images can be acquired with active light systems, such as laser range scanners, or active light stereo methods. Range images are often called range maps. Hereinafter, these two terms are synonymous.
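As an illustration of what a range image encodes, the following minimal sketch converts a range image into a set of 3D points. It assumes a pinhole camera model in which each pixel stores the depth along the optical axis; the function name and the intrinsic parameters `fx`, `fy`, `cx`, and `cy` are hypothetical and do not come from any particular scanner.

```python
import numpy as np

def range_image_to_points(depth, fx, fy, cx, cy):
    """Back-project a range image to an (N, 3) array of 3D points.

    Assumption: pinhole camera, depth measured along the optical axis.
    fx, fy (focal lengths in pixels) and cx, cy (principal point) are
    hypothetical intrinsics. Pixels with non-positive or non-finite
    depth are treated as missing measurements and skipped.
    """
    depth = np.asarray(depth, dtype=float)
    rows, cols = np.indices(depth.shape)
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (cols[valid] - cx) * z / fx   # horizontal offset scaled by depth
    y = (rows[valid] - cy) * z / fy   # vertical offset scaled by depth
    return np.column_stack([x, y, z])
```

If the sensor instead reports the radial distance along each viewing ray, the depth would first have to be converted before this back-projection applies.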
If a 3D model of the objects is available, then one can use model-based techniques, where the 3D model of the object is matched to the images or range images of the scene. Model-based pose estimation has been used in many applications such as object recognition, object tracking, robot navigation, and motion detection.
The main challenge in pose estimation is invariance to partial occlusions, cluttered scenes, and large pose variations. Methods for 2D images and videos generally do not overcome these problems due to their dependency on appearance and sensitivity to illumination, shadows, and scale. Among the most successful attempts are methods based on global appearance, and methods based on local 2D features. Unfortunately, those methods usually require a large number of training examples because they do not explicitly model local variations in the object structure.
Model-based surface matching techniques that use a 3D model have become popular due to the decreasing cost of 3D scanners. One method uses a viewpoint consistency constraint to establish correspondence between a group of viewpoint-independent image features and the object model, D. Lowe, “The viewpoint consistency constraint,” International Journal of Computer Vision, volume 1, pages 57-72, 1987. The most popular method for aligning 3D models based purely on geometry is the iterative closest point (ICP) method, which has recently been improved by using geometric descriptors, N. Gelfand, N. Mitra, L. Guibas, and H. Pottmann, “Robust global registration,” Proceedings of the Eurographics Symposium on Geometry Processing, 2005. However, those methods only address the problem of fine registration, where an initial pose estimate is required.
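To make the dependence on an initial pose estimate concrete, the following is a minimal sketch of basic point-to-point ICP. It is not the descriptor-based variant cited above; it assumes brute-force nearest-neighbor search and a closed-form rigid alignment via singular value decomposition, and the function names are hypothetical.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t approximates dst[i]."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant of -1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iterations=30, tol=1e-9):
    """Align src (N, 3) to dst (M, 3); returns the accumulated R and t."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = np.array(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    prev_err = np.inf
    for _ in range(iterations):
        # Brute-force closest point in dst for every current source point
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        nn = d2.argmin(axis=1)
        err = d2[np.arange(len(cur)), nn].mean()
        # Re-estimate the rigid motion from the tentative correspondences
        R, t = best_rigid_transform(cur, dst[nn])
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        if abs(prev_err - err) < tol:            # error stopped changing
            break
        prev_err = err
    return R_total, t_total
```

Because the correspondences are chosen purely by proximity, the iteration can only be expected to converge to the correct alignment when the starting pose is already close, which is why such methods are limited to fine registration.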
Geometric hashing is an efficient method for establishing multi-view correspondence and object pose because its matching time is insensitive to the number of views. However, building the hash table is time-consuming, and the matching process is sensitive to image resolution and surface sampling.
Another method matches 3D features, or shape descriptors, to range images using curvature features obtained by calculating principal curvatures, Dorai et al., “Cosmos—a representation scheme for 3D free-form objects,” PAMI, 19(10): 1115-1130, 1997. That method requires the surface to be smooth and twice differentiable, and thus is sensitive to noise. Moreover, occluded objects cannot be handled.
Another method uses “spin-image” surface signatures to map a surface to a histogram, A. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3D scenes,” PAMI, 21(5):433-449, 1999. That method yields good results with cluttered scenes and occluded objects. However, that method is time-consuming, sensitive to image resolution, and might lead to ambiguous matches.
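The mapping from a surface to a histogram can be sketched as follows. This is a simplified illustration rather than the cited implementation: it assumes hard binning instead of interpolated accumulation, and the function name and the parameters `bin_size` and `image_width` are hypothetical. For an oriented point with position `p` and unit normal `n`, every other surface point is reduced to two cylindrical coordinates, the signed height along the normal (beta) and the radial distance from the normal line (alpha), which are then accumulated into a 2D histogram.

```python
import numpy as np

def spin_image(points, p, n, bin_size, image_width):
    """Simplified spin image of `points` about the oriented point (p, n).

    Assumptions: hard binning (no interpolation), a square image of
    image_width x image_width bins, and beta centered on the middle row.
    """
    points = np.asarray(points, dtype=float)
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                   # height along the normal
    alpha = np.sqrt(np.maximum((d ** 2).sum(axis=1) - beta ** 2, 0.0))
    half_support = bin_size * image_width / 2.0
    i = np.floor((half_support - beta) / bin_size).astype(int)  # row index
    j = np.floor(alpha / bin_size).astype(int)                  # column index
    keep = (i >= 0) & (i < image_width) & (j >= 0) & (j < image_width)
    img = np.zeros((image_width, image_width))
    np.add.at(img, (i[keep], j[keep]), 1.0)        # accumulate point counts
    return img
```

Because alpha and beta do not change when the points are rotated about the normal, the resulting histogram is independent of that rotation. The dependence on `bin_size` also illustrates why such signatures are sensitive to image resolution.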
Another method constructs multidimensional table representations, referred to as tensors, from multiple unordered range images, and uses a hash-table-based voting scheme to match the tensors to objects in a scene. That method is used for object recognition and image segmentation, A. Mian, M. Bennamoun, and R. Owens, “Three-dimensional model-based object recognition and segmentation in cluttered scenes,” PAMI, 28(12): 1584-1601, 2006. However, that method requires fine geometry and has a runtime of several minutes, which is inadequate for real-time applications.