A frequent problem in computer vision applications is to determine poses of objects in 3D scenes from scene data acquired by 3D sensors based on structured light or time of flight. Pose estimation methods typically require identification and matching of scene measurements with a known model of the object.
Some methods are based on selecting relevant points in a 3D point cloud and using feature representations that can invariantly describe regions near the points. Those methods produce successful results when the shape of the object is detailed, and the scene measurements have a high resolution and little noise. However, under less ideal conditions, the accuracy of those methods decreases rapidly. Because the sensor images the scene from a single viewpoint, many surfaces can be hidden in the 3D measurements, which makes a detailed region representation unavailable. Noise and background clutter further affect the accuracy of those methods.
A set of pair features can be used for detection and pose estimation. Pairs of oriented points on a surface of an object are used in a voting framework for pose estimation, e.g., see U.S. Publication 20110273442. Even though the descriptor associated with the pair feature is not very discriminative, that method can produce accurate results, even when subject to moderate occlusion and background clutter, by accumulating measurements for a large number of pairs. That framework can benefit from a hashing and Hough voting scheme.
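As an illustration only, the following sketch shows one way a pair feature for two oriented points could be computed, quantized, and stored in a hash table. The four-component descriptor (the distance between the points and three angles involving the two normals and the displacement vector) is a commonly used form and is assumed here; it is not necessarily the descriptor of the cited publication. The function names `pair_feature`, `quantize`, and `build_hash_table`, and the step sizes, are hypothetical. The Hough voting stage is not shown.

```python
import math
from collections import defaultdict


def _angle(u, v):
    """Return the angle in radians between two 3D vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))


def pair_feature(p1, n1, p2, n2):
    """Assumed descriptor: (distance, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = tuple(b - a for a, b in zip(p1, p2))  # displacement from p1 to p2
    dist = math.sqrt(sum(x * x for x in d))
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))


def quantize(feature, dist_step, angle_step):
    """Map a continuous feature to an integer key suitable for hashing."""
    dist, a1, a2, a3 = feature
    return (
        math.floor(dist / dist_step),
        math.floor(a1 / angle_step),
        math.floor(a2 / angle_step),
        math.floor(a3 / angle_step),
    )


def build_hash_table(points, normals, dist_step, angle_step):
    """Store every ordered pair of model points under its quantized feature."""
    table = defaultdict(list)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            f = pair_feature(points[i], normals[i], points[j], normals[j])
            table[quantize(f, dist_step, angle_step)].append((i, j))
    return table
```

In a scheme of this kind, pairs sampled from the scene would be quantized the same way and looked up in the table, and each matching model pair would contribute a vote toward a candidate pose. Because many pairs vote, a weakly discriminative descriptor can still yield a clear peak.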
Other methods model 3D shapes globally by using 2D and 3D contours, shape templates, and feature histograms. In general, global methods require the object to be isolated because those methods are sensitive to occlusion. Also, changes in appearance due to pose variations necessitate the use of a large number of shape templates, which has the drawback of increased memory use and processing time. A learning-based keypoint detector that uses range data to decrease processing time is described in U.S. Publication 20100278384.