Using a database of 3D models of objects, it is desired to provide a method for detecting objects in a query 2.5D range image acquired by a scanner of a 3D scene. In the 2.5D range image, every scanned point (pixel) (x, y) on a surface of an object is associated with one depth value z, i.e., z is the distance from the scanner to the point.
Object Detection
As defined herein, object detection generally includes object shape matching, object recognition, and object registration.
Point Cloud
A point cloud is a set of vertices in a three-dimensional coordinate system. The vertices are usually defined by (x, y, z) coordinates, and typically represent the external surface of the object. The point clouds used herein are generated by a scanner. Scanners automatically measure distances to a large number of points on the surface of the object, and output the point cloud as a data file. The point cloud represents the set of points measured by the scanner. Point clouds are used for many purposes, including object detection as defined herein.
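The relationship between a 2.5D range image and a point cloud can be sketched as follows. This is only a minimal illustration under stated assumptions: the scanner is treated as orthographic, so the pixel indices are used directly as x and y, and pixels without a measurement are stored as None. The function name range_image_to_points is hypothetical, and a real scanner would need its own calibration model.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def range_image_to_points(depth: List[List[Optional[float]]]) -> List[Point]:
    """Convert a 2.5D range image into a list of (x, y, z) points.

    Assumption: orthographic projection, so the pixel column and row are
    used directly as x and y. Pixels with no measurement (None) are skipped.
    """
    points: List[Point] = []
    for y, row in enumerate(depth):
        for x, z in enumerate(row):
            if z is not None:
                points.append((float(x), float(y), float(z)))
    return points


# A 2 x 3 range image with two missing measurements.
depth_image = [
    [1.0, 1.2, None],
    [0.9, None, 1.5],
]
cloud = range_image_to_points(depth_image)
```

Each pixel contributes at most one point, which is why a range image describes only the surface facing the scanner.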
Prior art object detection methods generally assume the availability of a 3D surface mesh and complete 3D models, and therefore those methods cannot be readily extended to 2.5D range images. It is difficult to detect a 3D object in 2.5D range images for the following reasons.
Parts of objects can be obscured due to self-occlusion, or occlusion by other objects. A scanner can acquire at most a 180° view of a 360° 3D scene, i.e., at most half of the scene is visible in the range image.
Nearby objects can also act as background clutter interfering with the detection method. Viewpoint and scale changes exhibit high appearance variation and ambiguity. This variation sometimes goes well beyond inter-class changes, contributing to detection inaccuracy.
Range Images
Range scanners have a limited spatial resolution because the surface is only scanned at discrete points, and fine details in the objects are usually lost or blurred. For some scanners, the sampling resolution varies greatly along different axes, and re-sampling of a 3D point cloud is difficult and possibly leads to distortion of the surface topology.
High-speed range scanners introduce significant noise in the range measurement, causing parts of the scene to have incomplete observations.
Regardless of the above difficulties, the use of scanner-generated point clouds has become increasingly popular due to many advantages over traditional optical counterparts, such as conventional cameras. In particular, methods for 2.5D range images are generally illumination-invariant, because only geometric distances matter.
Feature Descriptor
The most popular object descriptors for object detection methods are feature-based, which require compact and effective 3D descriptors. The efficacy of those methods is based on several criteria including discriminative power, rotation invariance, insensitivity to noise, and computational efficiency.
Feature-based methods can be partitioned into the following categories depending on the size of the support regions: global descriptors, regional descriptors, and local descriptors. However, local descriptors are not useful for recognition and detection from discretely scanned points because estimates of local properties, such as surface normals or curvature, from a set of discrete sample points are very unstable.
Global Descriptors
An extended Gaussian image (EGI) is among the most popular global descriptors. EGI maps weighted surface normals to a Gaussian sphere, which forms a 2D image. The simplicity of this descriptor comes at the cost of a loss of local geometry information.
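The mapping of weighted normals onto a Gaussian sphere can be sketched as a 2D histogram over the direction of each normal. The code below is an illustrative sketch, not a definitive implementation: the function name extended_gaussian_image, the uniform binning in polar and azimuth angle, and the use of face areas as weights are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def extended_gaussian_image(normals: Sequence[Vec3], weights: Sequence[float],
                            n_polar: int = 4, n_azimuth: int = 8) -> List[List[float]]:
    """Accumulate weighted normals into a (polar x azimuth) image.

    Assumption: bins are uniform in angle; weights are, e.g., face areas.
    """
    egi = [[0.0] * n_azimuth for _ in range(n_polar)]
    for (nx, ny, nz), w in zip(normals, weights):
        length = math.sqrt(nx * nx + ny * ny + nz * nz)
        if length == 0.0:
            continue  # a zero vector has no direction on the sphere
        polar = math.acos(max(-1.0, min(1.0, nz / length)))  # 0 .. pi
        azimuth = math.atan2(ny, nx) % (2.0 * math.pi)       # 0 .. 2*pi
        i = min(int(polar / math.pi * n_polar), n_polar - 1)
        j = min(int(azimuth / (2.0 * math.pi) * n_azimuth), n_azimuth - 1)
        egi[i][j] += w
    return egi


# Six face normals of a unit cube, each face having area 1.0.
cube_normals = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
egi = extended_gaussian_image(cube_normals, [1.0] * 6, n_polar=3, n_azimuth=4)
```

Only the orientation of each surface element is recorded, not its position, which illustrates why local geometry information is lost.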
A shape distribution method randomly samples pair-wise distances of points and forms a histogram representing the overall shape. This descriptor is advantageous because it can be determined quickly, and does not require pose normalization, feature correspondence, or model fitting.
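A shape distribution of this kind can be sketched as a normalized histogram of distances between randomly sampled point pairs. The following is a minimal sketch; the function name shape_distribution, the number of pairs, the bin count, and the maximum distance are hypothetical parameters chosen for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def shape_distribution(points: Sequence[Point], n_pairs: int = 1000,
                       n_bins: int = 16, max_dist: float = 2.0,
                       seed: int = 0) -> List[float]:
    """Histogram of distances between randomly sampled pairs of points."""
    rng = random.Random(seed)
    hist = [0] * n_bins
    for _ in range(n_pairs):
        p, q = rng.sample(points, 2)  # two distinct points
        d = math.dist(p, q)
        # Distances at or beyond max_dist fall into the last bin.
        b = min(int(d / max_dist * n_bins), n_bins - 1)
        hist[b] += 1
    return [h / n_pairs for h in hist]


# Corners of a unit cube, and the same cube translated by (5, 5, 5).
cube = [(float(x), float(y), float(z)) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in cube]
hist_cube = shape_distribution(cube)
hist_shifted = shape_distribution(shifted)
```

Because only pair-wise distances enter the histogram, the translated cube yields the same descriptor, consistent with the absence of a pose normalization step.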
Other global shape features include superquadrics, spherical attribute images, and the COllaborative System based on MPEG-4 Objects and Streams (COSMO). Global shape descriptors are generally more discriminative because they use the entire model. On the other hand, these descriptors are very sensitive to clutter or occlusion.
Regional Descriptors
Among regional descriptors, a spin image is effective in many 3D applications. The spin image considers a cylindrical support region whose center is at the basis point p and whose north pole is oriented along the surface normal estimate at point p. The two cylindrical coordinates are the radial coordinate α, which is the perpendicular distance to the axis through p along the normal, and the elevation coordinate β, which is the signed perpendicular distance to the tangent plane going through the point p. The spin image is constructed by accumulating points within volumes indexed by (α, β). Other regional descriptors include surface splashes and super segments.
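The accumulation over (α, β) can be sketched as follows. This is an illustrative sketch rather than a definitive implementation: the function name spin_image, the bin counts, and the support limits max_alpha and max_beta are assumptions, and each point simply contributes a count of one to its bin.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def spin_image(p: Vec3, normal: Vec3, points: Sequence[Vec3],
               n_alpha: int = 4, n_beta: int = 4,
               max_alpha: float = 1.0, max_beta: float = 1.0) -> List[List[int]]:
    """Accumulate points into a 2D histogram indexed by (beta, alpha)."""
    norm = math.sqrt(sum(c * c for c in normal))
    n = tuple(c / norm for c in normal)
    image = [[0] * n_alpha for _ in range(n_beta)]
    for q in points:
        d = tuple(qc - pc for qc, pc in zip(q, p))
        # beta: signed distance to the tangent plane through p
        beta = sum(dc * nc for dc, nc in zip(d, n))
        # alpha: distance to the axis through p along the normal
        alpha = math.sqrt(max(sum(dc * dc for dc in d) - beta * beta, 0.0))
        if alpha >= max_alpha or abs(beta) >= max_beta:
            continue  # outside the cylindrical support region
        col = min(int(alpha / max_alpha * n_alpha), n_alpha - 1)
        row = min(int((beta + max_beta) / (2.0 * max_beta) * n_beta), n_beta - 1)
        image[row][col] += 1
    return image


img = spin_image(
    p=(0.0, 0.0, 0.0), normal=(0.0, 0.0, 1.0),
    points=[(0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.6), (3.0, 0.0, 0.0)],
)
```

In this example the first two points differ only by a rotation about the normal and therefore fall into the same bin, while the last point lies outside the support region and is ignored.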
A 3D shape context is similar to the spin image except that the support region is a sphere. The sphere is segmented into sub-volumes by partitioning the sphere evenly along the azimuth and elevation dimensions, and logarithmically in the radial dimension. The accumulation of weights for each sub-volume contributes one histogram bin. A degree of freedom in the azimuth direction is removed before performing feature matching. A spherical harmonic can be applied to the shape context to make it rotation-invariant. That method is called spherical shape context.
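The partitioning of the spherical support region can be sketched as a 3D histogram with uniform azimuth and elevation bins and logarithmically spaced radial bins. The code below is a minimal sketch under stated assumptions: the function name shape_context_3d and the parameters r_min and r_max are hypothetical, points closer than r_min or farther than r_max are ignored, and every point carries a unit weight.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def shape_context_3d(center: Vec3, points: Sequence[Vec3],
                     n_rad: int = 3, n_el: int = 4, n_az: int = 8,
                     r_min: float = 0.1, r_max: float = 1.0) -> List[List[List[int]]]:
    """Histogram indexed as hist[radial][elevation][azimuth]."""
    hist = [[[0] * n_az for _ in range(n_el)] for _ in range(n_rad)]
    log_span = math.log(r_max / r_min)
    for q in points:
        dx, dy, dz = (q[i] - center[i] for i in range(3))
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        if r < r_min or r >= r_max:
            continue  # assumption: points outside [r_min, r_max) are ignored
        # Logarithmic radial bin, uniform angular bins.
        k = min(int(math.log(r / r_min) / log_span * n_rad), n_rad - 1)
        el = math.acos(max(-1.0, min(1.0, dz / r)))   # 0 .. pi from the pole
        az = math.atan2(dy, dx) % (2.0 * math.pi)     # 0 .. 2*pi
        i = min(int(el / math.pi * n_el), n_el - 1)
        j = min(int(az / (2.0 * math.pi) * n_az), n_az - 1)
        hist[k][i][j] += 1
    return hist


sc = shape_context_3d(
    center=(0.0, 0.0, 0.0),
    points=[(0.2, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.5), (2.0, 0.0, 0.0)],
    n_rad=2, n_el=2, n_az=4,
)
```

Unlike the spin image sketch, the azimuth bin depends on the choice of reference axes, which is the degree of freedom that must be resolved before matching.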
A point signature represents local topologies by distances from 3D curves to a plane. Although less descriptive than the spin image or the shape context, this 1D descriptor is advantageous in the sense that it is quick to determine and easy to match. It does not require a normal estimate like the spin image, which can be erroneous when the point density is insufficient. It also does not vary with pose like the shape context. In addition, a combination of signatures across different scales can produce a more complete descriptor.
Given the numerous available 3D descriptors, it makes sense to select a descriptor whose features best fit an application. It is sometimes more efficient to combine different types of features and allow each feature to contribute at different stages in an application.
For example, spin images and EGI have been combined in a top-down and bottom-up manner. That method first classifies points as an object or background using spin images. Connected components of neighboring object points are then extracted. Constellation EGIs facilitate the fast alignment and matching of EGIs of connected components to a model database. This provides a good trade-off between efficiency and accuracy for detecting cars and other objects in a large dataset. Principal curvature and point signature have also been combined for 3D face recognition.
The arrangement of features along the detection and recognition cascade is dictated mostly by heuristic rules. For each query image, there can be hundreds of thousands of points. The huge amount of data requires efficient techniques for retrieving the best matches from the model database. One method uses principal component analysis (PCA) to determine a subspace of spin images.
Another method quantizes and clusters the feature space. That method uses k representative clusters to facilitate fast retrieval of d-dimensional features, where k is substantially smaller than d. That method can partially match objects by projecting a query histogram onto object subspaces. A coarse-to-fine approach can further reduce the amount of computation. Only a small subset of features is selected from the query image to compare with the models in the database. The selection can be random, based on local topologies such as curvatures or normal directions, or data driven. The matching qualities of features to the models dictate a short list of candidate positions. At the end of the coarse-to-fine chain, there are fewer candidate objects, and therefore more complex search and geometric constraints can be enforced.
Another method for feature retrieval uses geometric hashing. That method combines invariant coordinate representations with geometric coordinate hashing to prune a model database using simple geometric constraints. That method is polynomial in the number of feature points. A sublinear feature retrieval method uses locality sensitive hashing (LSH), which is a probabilistic nearest neighbor search. In that method, features are determined at salient points on surfaces. LSH hashes features into buckets based on the probability of collision, so that similar features hash to the same bucket.
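The idea that similar features collide in the same bucket can be sketched with one common LSH family, random hyperplane hashing. The text above does not state which hash family the referenced method uses, so this choice is an assumption made purely for illustration; the class name HyperplaneLSH and its parameters are likewise hypothetical.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

Key = Tuple[bool, ...]


class HyperplaneLSH:
    """Random hyperplane LSH: one bit per hyperplane (assumed hash family)."""

    def __init__(self, dim: int, n_bits: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets: DefaultDict[Key, List[str]] = defaultdict(list)

    def key(self, v: Sequence[float]) -> Key:
        # Each bit records on which side of a random hyperplane v lies.
        return tuple(sum(a * b for a, b in zip(plane, v)) >= 0.0
                     for plane in self.planes)

    def insert(self, label: str, v: Sequence[float]) -> None:
        self.buckets[self.key(v)].append(label)

    def query(self, v: Sequence[float]) -> List[str]:
        # Only the single matching bucket is inspected, not the whole database.
        return list(self.buckets.get(self.key(v), []))


index = HyperplaneLSH(dim=4, n_bits=8, seed=1)
feature = [0.2, 0.9, 0.1, 0.4]
index.insert("model_a", feature)
same_bucket = index.query([2.0 * c for c in feature])  # same direction
opposite = index.query([-c for c in feature])          # opposite direction
```

Features pointing in nearby directions agree on most bits and are likely to share a bucket, whereas the opposite direction flips every bit. A practical system would typically use several hash tables to reduce missed neighbors.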
Point Signature
A point signature (PS) is a shape descriptor based on a 3D space curve formed by an intersection between a sphere centered at a center point and the surface of the object. The PS is fast to determine, and easy to match with the models. However, the PS lacks sufficient discriminative power for reliable matching.
Other shape descriptors, such as the spin image, the shape context, and their spherical harmonics, are effective in many applications. In contrast to the over-simplification of the PS, those descriptors store a weight proportional to the number of points in a given volume. Those descriptors can be categorized as volume-based descriptors, which inevitably lead to high redundancy because range images are necessarily sparse. In addition, the spin image and the shape context require an estimate of the normal vector at local points, which can be error-prone if the spatial resolution is low.