A common task in computer vision applications is to estimate the pose of objects from images acquired of a scene. Herein, the pose is defined as the 6-DOF (six degrees of freedom) location and orientation of an object. Pose estimation in scenes with clutter, e.g., other objects and noise, and with occlusions, e.g., due to multiple overlapping objects, can be quite challenging. Furthermore, pose estimation from 2D images and videos is sensitive to illumination variation, shadows, and a lack of features, e.g., shiny objects without texture.
Pose estimation from range images, in which each pixel includes an estimate of the distance to the objects, does not suffer from these limitations. Range images can be acquired with active light systems, such as laser range scanners, or with active light stereo methods. Range images are often called range maps; hereinafter, the two terms are used synonymously.
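As a concrete illustration of the range-image representation, the following sketch back-projects a range map into a 3D point cloud under an assumed pinhole camera model; the function name and the intrinsic parameters fx, fy, cx, cy are illustrative and not part of any method described herein.

```python
import numpy as np

def range_to_points(depth, fx, fy, cx, cy):
    """Back-project a range image into a 3D point cloud using a pinhole
    camera model (fx, fy: focal lengths; cx, cy: principal point).
    depth[v, u] holds the distance along the optical axis at pixel (u, v)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)             # shape (h, w, 3)
```

For example, a constant-depth range map (a fronto-parallel wall) back-projects to a planar point cloud at that depth.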
If a 3D model of the objects is available, then one can use model-based techniques, where the 3D model of the object is matched to the images or range images of the scene. Model-based pose estimation has been used in many computer vision applications such as object recognition, object tracking, robot navigation, and motion detection.
The main challenge in pose estimation is achieving invariance to partial occlusions, cluttered scenes, and large pose variations. Methods for 2D images and videos generally do not overcome these problems because of their dependency on appearance and their sensitivity to illumination, shadows, and scale. Among the most successful attempts are methods based on global appearance and methods based on local 2D features. Unfortunately, those methods usually require a large number of training examples because they do not explicitly model local variations in the structure of the objects.
Model-based surface matching techniques, using a 3D model, have become popular due to the decreasing cost of 3D scanners. One method uses a viewpoint consistency constraint to establish correspondence between a group of viewpoint-independent image features and the object model, D. Lowe, “The viewpoint consistency constraint,” International Journal of Computer Vision, volume 1, pages 57-72, 1987. The most popular method for aligning 3D models based on geometry is the iterative closest point (ICP) method, which has been improved by using geometric descriptors, N. Gelfand, N. Mitra, L. Guibas, and H. Pottmann, “Robust global registration,” Proceedings of the Eurographics Symposium on Geometry Processing, 2005. However, those methods only address the problem of fine registration, where an initial pose estimate is required.
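The point-to-point ICP procedure mentioned above can be sketched as follows. This is a generic minimal variant, assuming a coarse initial alignment, with brute-force nearest-neighbor matching and a closed-form rigid alignment via SVD (the Kabsch method); it is not the descriptor-based refinement of Gelfand et al., and all names are illustrative.

```python
import numpy as np

def icp(src, dst, iters=20):
    """Minimal point-to-point ICP sketch: alternately match each source
    point to its nearest destination point, then solve for the rigid
    transform (R, t) in closed form via SVD.  src, dst: (N, 3), (M, 3)."""
    R, t = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # nearest-neighbor correspondences (brute force for clarity)
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=2)
        nn = dst[np.argmin(d, axis=1)]
        # closed-form rigid alignment of cur onto its matches (Kabsch)
        mc, mn = cur.mean(0), nn.mean(0)
        H = (cur - mc).T @ (nn - mn)
        U, _, Vt = np.linalg.svd(H)
        Ri = Vt.T @ U.T
        if np.linalg.det(Ri) < 0:       # guard against reflections
            Vt[-1] *= -1
            Ri = Vt.T @ U.T
        ti = mn - Ri @ mc
        cur = cur @ Ri.T + ti
        R, t = Ri @ R, Ri @ t + ti      # accumulate the total transform
    return R, t
```

Because each iteration only refines the current alignment locally, the procedure converges to the correct pose only when started near it, which is precisely the fine-registration limitation noted above.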
Geometric hashing is an efficient method for establishing multi-view correspondence and estimating object pose because the matching time is insensitive to the number of views. However, building the hash table is time consuming, and the matching process is sensitive to image resolution and surface sampling.
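The geometric-hashing scheme can be illustrated with a minimal 2D, similarity-invariant sketch. The quantization step and point sets are hypothetical, and a real system would sample many scene bases rather than one; the sketch does, however, show why preprocessing is expensive (all ordered point pairs are enumerated) while matching is fast (a constant number of hash look-ups per scene point).

```python
import numpy as np
from collections import defaultdict
from itertools import permutations

def build_hash_table(model, step=0.25):
    """For every ordered pair of model points chosen as a basis, express
    the remaining points in that basis frame, quantize the coordinates,
    and record the basis under the quantized key.  O(n^3) preprocessing."""
    table = defaultdict(list)
    for i, j in permutations(range(len(model)), 2):
        o, e = model[i], model[j]
        u = e - o                        # basis x-axis
        v = np.array([-u[1], u[0]])      # basis y-axis (perpendicular)
        s = u @ u                        # normalizer (scale invariance)
        for k, p in enumerate(model):
            if k in (i, j):
                continue
            q = p - o
            key = (round((q @ u) / s / step), round((q @ v) / s / step))
            table[key].append((i, j))
    return table

def match(table, scene, step=0.25):
    """Hash the scene points relative to one scene basis and vote for
    model bases; the top-voted basis yields a candidate correspondence."""
    votes = defaultdict(int)
    o, e = scene[0], scene[1]
    u = e - o
    v = np.array([-u[1], u[0]])
    s = u @ u
    for p in scene[2:]:
        q = p - o
        key = (round((q @ u) / s / step), round((q @ v) / s / step))
        for basis in table.get(key, []):
            votes[basis] += 1
    return votes
```

In this sketch the invariant coordinates are unchanged by rotation, uniform scaling, and translation, so a transformed copy of the model votes consistently for the corresponding model basis.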
Another method matches 3D features, or shape descriptors, to range images using curvature features obtained by calculating principal curvatures, Dorai et al., “COSMOS—A representation scheme for 3D free-form objects,” PAMI, 19(10), pp. 1115-1130, 1997. That method requires the surface to be smooth and twice differentiable and is therefore sensitive to noise. Moreover, that method cannot handle occluded objects.
Another method uses “spin-image” surface signatures to convert an image of a surface to a histogram, A. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3D scenes,” PAMI, 21(5), pp. 433-449, 1999. That method yields good results for cluttered scenes and occluded objects. However, that method is time consuming, is sensitive to image resolution, and can lead to ambiguous matches.
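The spin-image signature can be sketched as follows, assuming a basis point with a known unit normal; the bin count and support size are illustrative. Each surface point is mapped to its radial distance from the normal axis and its signed height along the normal, and these pairs are accumulated into a 2D histogram, which is invariant to rotations about the normal.

```python
import numpy as np

def spin_image(points, p, n, bins=8, size=2.0):
    """Spin-image sketch: for an oriented basis point p with unit normal n,
    map every surface point x to (alpha, beta), where beta = n . (x - p)
    is the height along the normal and alpha is the radial distance from
    the normal axis, then histogram the pairs.  points: (N, 3) array."""
    d = points - p
    beta = d @ n                                            # height
    alpha = np.sqrt(np.maximum(0.0, np.einsum('ij,ij->i', d, d) - beta**2))
    hist, _, _ = np.histogram2d(alpha, beta, bins=bins,
                                range=[[0, size], [-size, size]])
    return hist
```

For example, points sampled on a circle around the normal axis all share one (alpha, beta) pair, so the entire signature concentrates in a single histogram cell regardless of the object's rotation about that axis.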
Another method constructs multidimensional table representations, referred to as tensors, from multiple unordered range images, and uses a hash-table based voting scheme to match the tensors to objects in a scene. That method has been used for object recognition and image segmentation, A. Mian, M. Bennamoun, and R. Owens, “Three-dimensional model-based object recognition and segmentation in cluttered scenes,” PAMI, 28(12), pp. 1584-1601, 2006. However, that method requires fine geometry and has a runtime of several minutes, which is inadequate for real-time applications.
Shang et al., in “Discrete Pose Space Estimation to Improve ICP-based Tracking,” Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling, pp. 523-530, June 2005, use the bounded Hough transform (BHT) to determine an initial estimate of the pose before performing ICP. However, that method was developed for object tracking, not pose estimation.
A large class of methods uses a deformable (morphable) 3D model and minimizes a cost term such that projections of the model match the input images, M. Jones and T. Poggio, “Multidimensional Morphable Models: A Framework for Representing and Matching Object Classes,” International Journal of Computer Vision, vol. 29, no. 2, pp. 107-131, August 1998; V. Blanz and T. Vetter, “A Morphable Model for the Synthesis of 3D Faces,” Proc. ACM SIGGRAPH, pp. 187-194, August 1999. However, optimizing many parameters while projecting the 3D model is inefficient, and those methods also require an initial pose estimate.
Another method pre-computes range maps of the model and uses a tree structure and geometric probing, Greenspan et al., “Geometric Probing of Dense Range Data,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 495-508, April 2002. However, the time to process the model depends on the size of the object, and the method requires at least four seconds for reliable results, making it unsuitable for real-time applications.
Breitenstein et al., in “Real-Time Face Pose Estimation from Single Range Images,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2008, estimate the pose of a face by comparing pre-computed reference range images with an input range image. However, that method relies on a low-resolution signature to approximate the 3D face and searches the full pose space.
Recent advances in computer graphics enable the simulation of real-world behavior using models. For example, it is possible to generate 3D virtual models of objects and a variety of photorealistic images that can be used as input for computer vision applications. In addition, accurate collision detection and response simulation in 3D virtual scenes provide valuable information for motion analysis in computer vision research.
General Purpose GPU (GPGPU) Applications
Graphics processing units (GPUs) can accelerate graphics procedures that cannot be performed efficiently with a conventional CPU. Because computer graphics and image processing have similar computational requirements, there has been a substantial amount of work to apply the processing capabilities of GPUs to computer vision and image processing applications.
The main disadvantage of using GPUs in non-graphics applications is the limitation of the ‘scatter’ operation. In GPUs, the fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. Each parallel processing thread can only output a single pixel value, which is at most four floating-point numbers. In addition, off-screen rendering is inconvenient to use because users need to understand the internal structure of the graphics pipeline and the usage of textures.
On the other hand, new architectural platforms for GPGPU computing are available. For example, NVIDIA's Compute Unified Device Architecture (CUDA) enables the use of the entire video memory space as a linear memory with flexible gather (read) and scatter (write) operations. In addition, users can control the multithreading structure dynamically, which provides a more efficient assignment of the processing threads and memory.
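The gather/scatter distinction can be illustrated in array terms; this is a NumPy sketch, not GPU code. A gather reads from data-dependent locations, which fragment processors support via texture fetches, while a scatter writes to data-dependent locations, which CUDA-style kernels additionally allow.

```python
import numpy as np

src = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([2, 0, 3, 1])

# Gather: out[i] = src[idx[i]] -- each output reads an arbitrary input.
gather = src[idx]                 # -> [30, 10, 40, 20]

# Scatter: out[idx[i]] = src[i] -- each input writes an arbitrary output.
scatter = np.zeros(4)
scatter[idx] = src                # -> [20, 40, 10, 30]
```

A fragment shader can express the gather (each output pixel reads any inputs) but not the scatter (an output location cannot be chosen at run time), which is the limitation described above.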
The computational throughput of modern GPUs has increased greatly. For example, NVIDIA's G80 (GeForce 8800) has about 700 million transistors and performs at approximately 350 GFLOPS, while a 2.66 GHz Intel Core™2 Duo processor performs at approximately 50 GFLOPS. Furthermore, modern GPUs are equipped with high-level shading capabilities that facilitate the programmability of the functions of the GPU pipeline. Therefore, GPUs can be used not only for conventional 3D graphics applications, but also for other purposes. Two of the most promising off-the-shelf applications of GPGPUs are image processing and computer vision, because most of these applications require single instruction, multiple data (SIMD) style processing of image pixels and features, which is also common in computer graphics procedures.