The ability to accurately estimate the three-dimensional position and orientation of an object, based solely upon video images of the object, is of increasing interest in the field of computer vision. For example, interactive human interface applications require the ability to quickly and accurately track the pose of a user. Information regarding the user's body position must be at or near real-time, to adjust the display of the interface in a meaningful, timely manner. For instance, an application which displays a three-dimensional view of an object requires accurate tracking of the position and orientation of the user's head, in order to present the frames containing an image of the object from an appropriate perspective.
In general, previous approaches to pose tracking often relied on assumed models of shape, to track motion in three dimensions from intensity data, i.e., image brightness. Other approaches have employed depth data in conjunction with the image brightness information to estimate pose. Direct parametric motion has also been explored for both rigid and affine models. In this approach, it is preferable to utilize constraints in the analysis of the image data, to reduce the number of computations that are required to estimate the pose of a figure. A comprehensive description of brightness constraints that are implied by the rigid motion of an object was presented by Horn and Weldon, “Direct Methods for Recovering Motion”, International Journal of Computer Vision 2:51–76 (1988). Image stabilization and object tracking using an affine model with direct image intensity constraints is described in Bergen et al., “Hierarchical Model-Based Motion Estimation”, European Conference on Computer Vision, pages 237–252 (1992). This reference discloses the use of a coarse-to-fine algorithm to solve for large motions.
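By way of illustration only, the following sketch shows how a direct brightness constraint can be combined with an affine motion model. It assumes the standard brightness-constancy equation Ix*u + Iy*v + It = 0 and a six-parameter affine flow; the function name `fit_affine_flow`, the parameter ordering, and the synthetic data are assumptions made for this example and are not taken from the cited references.

```python
import numpy as np

def fit_affine_flow(x, y, Ix, Iy, It):
    """Least-squares affine motion from brightness-constancy constraints.

    Each pixel contributes one linear equation Ix*u + Iy*v + It = 0, where
    u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y (assumed parameterization).
    """
    # One row per pixel, one column per affine parameter a0..a5.
    A = np.stack([Ix, Ix * x, Ix * y, Iy, Iy * x, Iy * y], axis=1)
    params, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return params

# Synthetic check: build temporal derivatives from a known affine motion.
rng = np.random.default_rng(0)
n = 200
x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
Ix, Iy = rng.normal(size=n), rng.normal(size=n)
true_params = np.array([0.5, 0.01, -0.02, -0.3, 0.03, 0.0])
u = true_params[0] + true_params[1] * x + true_params[2] * y
v = true_params[3] + true_params[4] * x + true_params[5] * y
It = -(Ix * u + Iy * v)  # noise-free temporal derivative

estimated = fit_affine_flow(x, y, Ix, Iy, It)
```

Because every pixel contributes a constraint on the same six unknowns, the system is heavily over-determined, which is one reason a parametric model can reduce the computation needed relative to estimating a separate motion for each pixel.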
The application of affine models to track the motion of a user's head, as well as the use of non-rigid models to capture expression, is described in Black and Yacoob, “Tracking and Recognizing Rigid and Non-Rigid Facial Motions Using Local Parametric Models of Image Motion,” International Conference on Computer Vision (1995). This paper describes the use of a planar face-shape for tracking gross head motion, which limits the accuracy and range of motion that can be captured. A similar approach, using ellipsoidal shape models and perspective projection, is described in Basu et al., “Motion Regularization for Model-Based Head Tracking”, International Conference on Pattern Recognition (1996). The method described in this publication utilizes a pre-computed optic flow representation, instead of direct brightness constraints. It explicitly recovers rigid motion parameters, rather than an affine motion in the image plane. Rigid motion is represented using Euler angles, which can pose certain difficulties at singularities.
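The difficulty that Euler angles present at singularities can be illustrated with a short sketch. The Z-Y-X (yaw, pitch, roll) convention and the function name `euler_zyx_to_matrix` are assumptions chosen for this example; the cited publication may use a different convention. At a pitch of 90 degrees, the rotation depends only on the difference between roll and yaw, so distinct angle triples describe the same orientation.

```python
import numpy as np

def euler_zyx_to_matrix(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

# At pitch = 90 degrees, two different (yaw, roll) pairs with the same
# difference roll - yaw produce the same rotation matrix.
R_a = euler_zyx_to_matrix(0.3, np.pi / 2, 0.5)
R_b = euler_zyx_to_matrix(0.1, np.pi / 2, 0.3)
same_at_singularity = np.allclose(R_a, R_b)

# Away from the singularity, the same two (yaw, roll) pairs are distinct.
R_c = euler_zyx_to_matrix(0.3, 0.4, 0.5)
R_d = euler_zyx_to_matrix(0.1, 0.4, 0.3)
same_away_from_singularity = np.allclose(R_c, R_d)
```

Near such a configuration, an estimator that solves for the three angles loses one degree of freedom, and small changes in orientation can correspond to large changes in the angle parameters.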
The tracking of articulated-body motion presents additional complexities within the general field of pose estimation. A variety of different techniques have been proposed for this particular problem. Some approaches use constraints from widely separated views to disambiguate partially occluded motions, without computing depth values. Examples of these approaches are described in Yamamoto et al., “Incremental Tracking of Human Actions From Multiple Views”, Proc. IEEE CVPR, pages 2–7, Santa Barbara, Calif. (1998), and Gavrila and Davis, “3D Model-Based Tracking of Humans in Action: A Multi-View Approach”, Proc. CVPR, pages 73–80, San Francisco, Calif. (June 1996).
The use of a twist representation for rigid motion, which is more stable and efficient to compute, is described in Bregler and Malik, “Tracking People With Twists and Exponential Maps”, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, Calif. (June 1998). This approach is especially suited to the estimation of chained articulated motion. The estimation of twist parameters is expressed directly in terms of an image brightness constraint with a scaled orthographic projection model. It assumes a generic ellipsoidal model of object shape. To recover motion and depth, constraints from articulation and information from multiple widely-spaced camera views are used. This method is not able to estimate the rigid translation in depth of an unconnected object, given a single view.
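As a point of reference, a twist can be mapped to a rigid transformation through the exponential map. The sketch below uses the standard closed-form expression for the exponential on SE(3); the function names `skew` and `twist_to_transform`, and the ordering of the twist as a linear part followed by an angular part, are assumptions for this illustration and are not drawn from the cited paper.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ p equals cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def twist_to_transform(v, w):
    """Exponential map of the twist (v, w) to a 4x4 rigid transform.

    v is the linear part and w the angular part; the rotation angle is the
    norm of w. Uses the standard closed form for the SE(3) exponential.
    """
    v, w = np.asarray(v, float), np.asarray(w, float)
    theta = np.linalg.norm(w)
    K = skew(w)
    if theta < 1e-9:
        # Small-angle limit: first-order rotation, translation equals v.
        R, V = np.eye(3) + K, np.eye(3)
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * K + b * (K @ K)
        V = np.eye(3) + b * K + c * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

# A quarter turn about the z axis, with no linear component.
T_rot = twist_to_transform([0.0, 0.0, 0.0], [0.0, 0.0, np.pi / 2])
rotated_x_axis = T_rot[:3, :3] @ np.array([1.0, 0.0, 0.0])

# A pure translation (zero angular part).
T_trans = twist_to_transform([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

Unlike an Euler-angle triple, the six twist parameters have no singular configuration at which distinct parameter values collapse to the same small motion, which is consistent with the stability noted above.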
The techniques which exhibit the most robustness tend to fit the observed motion data to a parametric model before assigning specific pointwise correspondences between successive images. Typically, this approach results in non-linear constraint equations which must be solved using iterative gradient descent or relaxation methods, as described in Pentland and Horowitz, “Recovery of Non-Rigid Motion and Structure”, PAMI, 13(7), pp. 730–742 (July 1991), and Lin, “Tracking Articulated Objects in Real-Time Range Image Sequences”, Proc. IEEE ICCV, Volume 1, pages 648–653, Greece (September 1999). The papers by Bregler et al. and Yamamoto et al. provide notable exceptions to this general trend. Both result in systems with linear constraint equations that are created by combining articulated-body models with dense optical flow.
In the approach suggested by Yamamoto et al., the constraints between limbs are maintained by sequentially estimating the motion of each parent limb, adjusting the hypothesized position of a child limb, and then estimating the further motion of the child limb. This approach is conceptually simpler than the one suggested by Bregler et al., but results in fewer constraints on the motion of the parent limbs. In contrast, the method of Bregler et al. takes full advantage of the information provided by child limbs, to further constrain the estimated motions of the parents.
Both Yamamoto et al. and Bregler et al. use a first-order Taylor series approximation to the camera-body rotation matrix, to reduce the number of parameters that are used to represent this matrix. Furthermore, both use an articulated model to generate depth values that are needed to linearize the mapping from three-dimensional body motions to observed two-dimensional camera-plane motions.
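The effect of a first-order Taylor series approximation to a rotation matrix can be sketched as follows. The example assumes the common small-angle form R ≈ I + [w]x, in which the rotation is described by three parameters, and compares it with the exact rotation given by the Rodrigues formula; the helper names are hypothetical, and the cited papers may parameterize the approximation differently.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ p equals cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exact_rotation(w):
    """Exact rotation for the rotation vector w (Rodrigues formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)  # skew matrix of the unit rotation axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def linearized_rotation(w):
    """First-order approximation: identity plus the skew matrix of w."""
    return np.eye(3) + skew(w)

def approximation_error(angle):
    """Largest absolute matrix-entry error for a rotation about the z axis."""
    w = np.array([0.0, 0.0, angle])
    return np.abs(exact_rotation(w) - linearized_rotation(w)).max()

err_small = approximation_error(0.01)  # about 0.6 degrees
err_large = approximation_error(0.1)   # about 5.7 degrees
error_ratio = err_large / err_small    # close to 100: the error is second order
```

The approximation is linear in the three rotation parameters, which is what permits linear constraint equations, but its error grows with the square of the rotation angle, so it is suited to the small inter-frame motions that arise at video rates.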
The various techniques which employ depth information to estimate pose have typically utilized sparse depth data, e.g. representative sample points in an image. Recent imaging techniques now make it possible to obtain dense depth information, e.g. a depth value for all, or almost all, of the pixels in an image. Furthermore, this data can be obtained at video rates, so that it is real-time, or near real-time. It is an objective of the present invention to provide techniques for estimating pose which employ dense depth data.