1. Field of the Invention
Embodiments of the present invention relate to computer aided human body pose estimation.
2. Description of Related Art
Human body tracking (and more generally, human pose estimation) has many applications, including, but not limited to, visual surveillance, human-computer interaction (e.g., gesture driven interfaces) and automated tutoring, such as training a person to imitate a movement. One problem of human body tracking/human pose estimation is to estimate the joint angles of a human body at any time. As can be appreciated from FIGS. 1A and 1B, an articulated human body can be thought of as including at least 10 body parts, which may require approximately 30 parameters to describe the full body articulation (the left upper-arm cannot be seen in FIG. 1A, but is represented in FIG. 1B). In other words, there is a desire to estimate the human pose in a 30+ dimensional space. This is one of the most challenging problems in the field of computer vision because of occlusion, a high dimensional search space and high variability in appearance due to human shape and clothing.
There is a wide range of approaches to human pose estimation. These algorithms can be broadly divided into two categories: discriminative approaches and generative approaches. FIG. 2A is used to illustrate discriminative approaches, which estimate a pose by learning appearance based associative models to infer pose from image measurements. FIG. 2B is used to illustrate generative approaches, which estimate a pose by finding the pose that best explains the image observation. In FIG. 2B, the different models 212a . . . n correspond to different values of y, where for generative approaches, the desire is to find the y that maximizes P(x|y), where x is the input image features.
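The generative criterion described above, choosing the y that maximizes P(x|y), can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the forward model `render_features` is a toy planar limb chain, the likelihood is an isotropic Gaussian, and the search is over a small hand-picked candidate list rather than the full pose space.

```python
import math

def render_features(pose):
    """Hypothetical forward model: map a pose (joint angles, radians) to features.

    Toy planar chain of unit-length limbs; the features are the 2-D endpoint
    of each limb. A real system would render image features instead.
    """
    x = y = 0.0
    total_angle = 0.0
    features = []
    for angle in pose:
        total_angle += angle
        x += math.cos(total_angle)
        y += math.sin(total_angle)
        features.extend([x, y])
    return features

def log_likelihood(observed, pose, sigma=0.1):
    """log P(x | y) up to a constant, assuming isotropic Gaussian noise."""
    predicted = render_features(pose)
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return -squared_error / (2.0 * sigma ** 2)

def best_pose(observed, candidates):
    """Return the candidate pose y that maximizes P(x | y)."""
    return max(candidates, key=lambda pose: log_likelihood(observed, pose))

# Hypothetical candidate poses (playing the role of the different models for y).
candidates = [[0.0, 0.0], [0.5, -0.5], [1.0, 0.3]]
observed = render_features([0.5, -0.5])  # synthetic observation x
estimate = best_pose(observed, candidates)
```

With roughly 30 pose parameters, enumerating candidates in this way quickly becomes impractical, which is the search-cost concern raised for generative approaches.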
Discriminative approaches involve learning a mapping from observation space to pose space. Here, an observation is a set of features characterizing an image of a person. Such a mapping is generally ambiguous, because similar observations can arise from different poses, and hence requires multiple functions for mapping. The following papers, which are incorporated herein by reference, describe discriminative approaches to human pose estimation: Agarwal et al., entitled “3D Human Pose from Silhouettes by Relevance Vector Regression,” in Computer Vision and Pattern Recognition (CVPR) 2004; Shakhnarovich et al., entitled “Fast pose estimation with parameter sensitive hashing,” in IEEE International Conference on Computer Vision (ICCV) 2003; and Sminchisescu et al., entitled “Discriminative density propagation for 3d human motion estimation,” in CVPR 2005.
On the other hand, mapping a current pose to an observation is a well defined problem in computer graphics. By building a mapping from pose space to observation space, generative approaches can search the pose space to find the pose that best matches the current observations, e.g., using a likelihood model. The following papers, which are incorporated herein by reference, describe generative approaches to human pose estimation: Sigal et al., entitled “Measure Locally Reason Globally: Occlusion-sensitive Articulated Pose Estimation,” in CVPR 2006; Gupta et al., entitled “Constraint integration for efficient multi-view pose estimation with self-occlusions,” in IEEE Transactions on PAMI; and Sidenbladh et al., entitled “Stochastic tracking of 3d human figures using 2d image motion,” in ECCV 2000.
While discriminative approaches are generally faster than generative approaches, generative approaches tend to generalize better because they are less constrained by a training pose dataset. However, generative approaches tend to be computationally infeasible because the search is in a high dimensional subspace. Nevertheless, while searching or learning a prior model in a high dimensional space is infeasible, dimensionality reduction techniques can be used to embed the high-dimensional pose space into a low dimensional manifold. For example, Gaussian process latent variable models (GPLVM) provide a generative approach that has been used to model the pose-configuration space (Y) as a low dimensional manifold, with the search for the best configuration performed in the low-dimensional latent space (Z). GPLVM is discussed in a paper by Lawrence entitled “Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data,” Conference on Neural Information Processing Systems (NIPS) 2004, which is incorporated herein by reference.
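The benefit of searching in the latent space (Z) rather than the pose space (Y) can be sketched as follows. This is not a trained GPLVM: the smooth map `decode` from a 1-D latent coordinate to a 4-angle pose is a hypothetical stand-in for a learned mapping, and the mismatch score compares poses directly, standing in for an image likelihood.

```python
import math

def decode(z):
    """Hypothetical smooth map from a 1-D latent coordinate z to a 4-angle pose.

    Stand-in for a learned latent-to-pose mapping; loosely resembles a
    cyclic motion in which opposite limbs swing out of phase.
    """
    return [
        0.6 * math.sin(z),
        0.6 * math.sin(z + math.pi),
        0.3 * math.cos(z),
        0.3 * math.cos(z + math.pi),
    ]

def mismatch(pose_a, pose_b):
    """Sum of squared joint-angle differences (stand-in for -log likelihood)."""
    return sum((a - b) ** 2 for a, b in zip(pose_a, pose_b))

def search_latent(target_pose, steps=360):
    """Grid search over the 1-D latent space for the best-matching pose."""
    grid = [2.0 * math.pi * i / steps for i in range(steps)]
    return min(grid, key=lambda z: mismatch(decode(z), target_pose))

# Synthetic target generated from a known latent coordinate.
true_z = 2.0 * math.pi * 45 / 360
z_hat = search_latent(decode(true_z))
pose_hat = decode(z_hat)
```

Here 360 evaluations cover the 1-D latent space, whereas a grid of the same resolution over the four pose angles would need 360**4 evaluations; the gap widens further for a pose space of 30 or more dimensions.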
GPLVM, which at a high level is illustrated by FIG. 3, involves a smooth mapping from latent space (Z) to pose space (Y), and hence tries to keep two points that are far apart in pose space also far apart in latent space. GPLVM, however, fails to preserve local distances in pose space, where preserving local distances means keeping two points that are close in pose space close in latent space as well.
An extension to GPLVM, referred to as Back Constrained GPLVM (BC-GPLVM), or GP-LVM with back constraints, was presented in a paper by Lawrence et al. entitled “Local Distance Preservation in the GP-LVM through Back Constraints,” ICML 2006, which is incorporated herein by reference. By constraining the points in the pose space to be mapped smoothly to the points in latent space, BC-GPLVM preserves local distances in the pose space. BC-GPLVM is illustrated at a high-level by FIG. 3B.
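The idea of a back constraint, a smooth map from pose space to latent space, can be sketched as below. The kernel-weighted form of `back_map`, the RBF kernel, the training poses and the weights are all assumptions chosen for illustration; in a back-constrained model the weights would be learned rather than set by hand.

```python
import math

def rbf(a, b, gamma=1.0):
    """Radial basis function kernel between two poses."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def back_map(pose, train_poses, weights, gamma=1.0):
    """Smooth map from a pose to a 1-D latent coordinate.

    z = sum_m weights[m] * k(pose, train_poses[m]); because k is smooth in
    the pose, nearby poses receive nearby latent coordinates.
    """
    return sum(w * rbf(pose, y_m, gamma) for w, y_m in zip(weights, train_poses))

# Hypothetical training poses (2 joint angles each) and hand-set weights.
train_poses = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
weights = [-1.0, 0.0, 1.0]

z_ref = back_map([1.0, 0.0], train_poses, weights)
z_near = back_map([1.05, 0.0], train_poses, weights)  # pose close to the reference
z_far = back_map([2.0, 0.0], train_poses, weights)    # pose far from the reference
```

In this toy setup the pose near the reference lands closer to it in latent space than the distant pose does, which is the local distance preservation that the back constraint is meant to encourage.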
Both GPLVM and BC-GPLVM consider only the pose space when finding a latent space. It would be beneficial to improve upon the GPLVM and BC-GPLVM techniques mentioned above. Such improved techniques would preferably provide faster inference for human pose estimation, and better initialization for searches, as compared to GPLVM and BC-GPLVM.