The pose of an object has six degrees of freedom (6-DoF), i.e., 3D translation and 3D rotation. The problem of pose estimation refers to finding the pose of an object with respect to a reference coordinate system, usually the coordinate system of a sensor. The pose can be acquired using measurements from sensors, e.g., 2D images or 3D point clouds, together with 3D models of the object. Pose estimation plays a major role in many robotics applications such as bin picking, grasping, localization, autonomous navigation, and 3D reconstruction.
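As an illustration of what a 6-DoF pose encodes, the following minimal sketch packs a rotation (three parameters, given here as an axis-angle vector) and a translation (three parameters) into a 4x4 rigid transform and applies it to points. The function names `pose_matrix` and `transform_points` and the axis-angle parameterization are assumptions chosen for the example; the text does not prescribe a particular representation.

```python
import numpy as np

def pose_matrix(rotation_vector, translation):
    """Build a 4x4 rigid transform from an axis-angle vector and a translation.

    The axis-angle vector points along the rotation axis and its length is
    the rotation angle in radians (Rodrigues' formula).
    """
    rotation_vector = np.asarray(rotation_vector, dtype=float)
    angle = np.linalg.norm(rotation_vector)
    if angle < 1e-12:
        R = np.eye(3)  # no rotation
    else:
        k = rotation_vector / angle
        # Skew-symmetric cross-product matrix of the unit axis k.
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = translation
    return T

def transform_points(T, points):
    """Map an (N, 3) array of model points into the reference (sensor) frame."""
    points = np.asarray(points, dtype=float)
    return points @ T[:3, :3].T + T[:3, 3]

# Usage: rotate 90 degrees about the z-axis, then shift one unit along x.
T = pose_matrix([0.0, 0.0, np.pi / 2], [1.0, 0.0, 0.0])
print(transform_points(T, [[1.0, 0.0, 0.0]]))
```

Estimating the pose then amounts to recovering the six parameters behind such a transform from sensor measurements.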
Until recently, pose estimation was primarily done using 2D images because cameras are cost-effective and allow fast image acquisition. The main problem with 2D images is matching the 2D features with their corresponding 3D features in the model. This becomes challenging under varying illumination conditions and different camera viewpoints, which cause changes in rotation and scale in the image space. Furthermore, some views of the object can theoretically lead to ambiguous poses. Several invariant feature descriptors are known for determining the correspondences between an input image and a database of images, where the 2D keypoints are matched with the 3D coordinates.
Many industrial parts are textureless, e.g., machined metal parts or molded plastic parts. Therefore, one has to rely heavily on observable edges in the images. When boundaries of an object are used, a set of edge templates of the object is often known a priori, and the templates are searched for in query edge maps. Several variants that incorporate edge orientation or hierarchical representation are known. Intensity-based edge detection often yields too many edge pixels, of which only a few are useful edges arising from depth discontinuities. A multi-flash camera can be used to directly estimate depth edges by casting shadows from multiple flash directions.
Generally, 3D data obtained with 3D sensors exhibit far fewer variations than 2D data. The main challenge is to solve the correspondence problem in the presence of sensor noise, occlusions, and clutter. The correspondence problem refers to finding a one-to-one matching between features in the data and features in the model. The features are usually constructed to characterize the size and shape of the object. Several 3D feature descriptors, which use the distributions of surface points and normals, and associated matching procedures are known. Those descriptors are generally invariant to rigid-body transformations, but sensitive to noise and occlusion. Furthermore, those features require dense point clouds, which may not be available.
Pose estimation is feasible with various kinds of correspondences between sensor 3D data and the model: 3 point correspondences, 2 line correspondences, and 6 points to 3 or more planes. Typically, those correspondences are used in a hypothesize-and-test framework such as RANdom SAmple Consensus (RANSAC) to determine the pose. Alternatively, the pose can be retrieved from the mode of the hypothesized pose distribution, using either a Hough voting scheme or clustering in the parameter space. Those approaches suffer from two problems when only 3D sensor data are available without images or other prior information. First, points, lines, and planes are not very discriminative individually, so matching them is combinatorial. Second, it is difficult to achieve fast computation without doing any prior processing on the model.
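The hypothesize-and-test idea for the 3-point case can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the putative point correspondences are assumed to be given (row i of `model_pts` is tentatively matched to row i of `scene_pts`), each hypothesis is computed with the well-known SVD-based least-squares alignment (Kabsch method), and the names `rigid_transform_from_points` and `ransac_pose` as well as the parameters `iterations`, `inlier_tol`, and `seed` are hypothetical.

```python
import numpy as np

def rigid_transform_from_points(model_pts, scene_pts):
    """Least-squares R, t such that scene ~= R @ model + t (Kabsch method)."""
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    mc = model_pts.mean(axis=0)
    sc = scene_pts.mean(axis=0)
    H = (model_pts - mc).T @ (scene_pts - sc)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = sc - R @ mc
    return R, t

def ransac_pose(model_pts, scene_pts, iterations=200, inlier_tol=0.01, seed=0):
    """Hypothesize a pose from 3 random correspondences; keep the best-supported one."""
    model_pts = np.asarray(model_pts, dtype=float)
    scene_pts = np.asarray(scene_pts, dtype=float)
    rng = np.random.default_rng(seed)
    best_R, best_t, best_inliers = None, None, -1
    for _ in range(iterations):
        # Hypothesize: a minimal sample of 3 point correspondences.
        idx = rng.choice(len(model_pts), size=3, replace=False)
        R, t = rigid_transform_from_points(model_pts[idx], scene_pts[idx])
        # Test: count correspondences that agree with the hypothesized pose.
        residuals = np.linalg.norm(model_pts @ R.T + t - scene_pts, axis=1)
        inliers = int((residuals < inlier_tol).sum())
        if inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, inliers
    return best_R, best_t, best_inliers
```

The sketch sidesteps the harder part of the problem: when the correspondences are not given, every triple of scene points must be tried against triples of model points, which is where the combinatorial cost noted above arises.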
A pair feature can be defined by a distance and relative orientations between two oriented points on the surface of an object. An object is represented by a set of oriented point pair features, which is stored in a hash table for fast retrieval. Pairs of points are sampled at random from the sensor data, and each such pair votes for a particular pose. The required pose is the one with the largest number of votes. A simpler pair feature consisting of the depth difference between a pixel and an offset pixel is used for human pose estimation with a random forest ensemble classifier.
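The oriented point pair feature and its hash table can be sketched as follows. The exact components of the feature are an assumption here: the sketch uses one common choice, a 4-tuple of the distance between the two points, the angle between each normal and the connecting vector, and the angle between the two normals. The function names and the quantization steps `dist_step` and `angle_step` are likewise hypothetical. The sketch stops at retrieval of matching model pairs and does not implement the pose voting itself.

```python
import math
from collections import defaultdict
from itertools import permutations

def _angle(u, v):
    """Angle in radians between two nonzero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def pair_feature(p1, n1, p2, n2):
    """Assumed feature: (distance, angle(n1, d), angle(n2, d), angle(n1, n2)).

    p1, p2 are distinct surface points, n1, n2 their normals, d = p2 - p1.
    All four components are unchanged by a rigid motion of the object.
    """
    d = tuple(b - a for a, b in zip(p1, p2))
    dist = math.sqrt(sum(c * c for c in d))
    return (dist, _angle(n1, d), _angle(n2, d), _angle(n1, n2))

def quantize(feature, dist_step=0.05, angle_step=math.radians(12)):
    """Discretize a feature so that similar pairs share one hash key."""
    dist, a1, a2, a3 = feature
    return (int(dist // dist_step), int(a1 // angle_step),
            int(a2 // angle_step), int(a3 // angle_step))

def build_model_table(points, normals):
    """Hash table from quantized feature to the model point pairs that produce it."""
    table = defaultdict(list)
    for i, j in permutations(range(len(points)), 2):
        feature = pair_feature(points[i], normals[i], points[j], normals[j])
        table[quantize(feature)].append((i, j))
    return table

def matching_model_pairs(table, p1, n1, p2, n2):
    """Look up the model pairs whose quantized feature matches a scene pair."""
    return table.get(quantize(pair_feature(p1, n1, p2, n2)), [])
```

In a voting scheme, each model pair returned for a sampled scene pair would contribute a vote for the pose that aligns the two pairs; building the table once from the model is the prior processing that makes retrieval fast.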