Inferring human pose from a single image is a component of applications such as motion analysis and visual tracking, and is arguably one of the most difficult problems in computer vision. Recent approaches have yielded some favorable results. Descriptions of these approaches can be found in Efficient Matching of Pictorial Structures, P. Felzenszwalb and D. Huttenlocher, IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 2066–2073, 2000, and also in Proposal Maps Driven MCMC for Estimating Human Body Pose in Static Images, M. W. Lee and I. Cohen, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 334–341, 2004, and also in Recovering Human Body Configurations: Combining Segmentation and Recognition, G. Mori, X. Ren, A. Efros, and J. Malik, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 326–333, 2004, all of which are incorporated by reference herein in their entirety.
For convenience, these approaches may be categorized as deterministic or statistical. Deterministic methods apply deterministic optimization, in which the objective function is the matching error between the model and the image data or between the image data and the exemplar set. Descriptions of these concepts can be found in Felzenszwalb and Huttenlocher, which was referenced above, and in Estimating Anthropometry and Pose From a Single Uncalibrated Image, C. Barrón and I. Kakadiaris, Computer Vision and Image Understanding, 81(3):269–284, March 2001, and also in Fast Pose Estimation with Parameter-Sensitive Hashing, G. Shakhnarovich, P. Viola, and T. Darrell, Proc. IEEE International Conference on Computer Vision, volume 2, pages 750–757, 2003, both of which are incorporated by reference herein in their entirety. Alternatively, a statistical approach builds detectors for different body parts and ranks the assembled configurations based on human-coded criteria. A description of this can be found in G. Mori, et al., which was referenced above.
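By way of illustration only, the exemplar-based deterministic formulation described above can be sketched as a search for the exemplar that minimizes a matching error. The sketch below is not taken from any of the referenced works; the feature-vector representation, the squared-error objective, and the names matching_error and estimate_pose are assumptions chosen for brevity.

```python
import math


def matching_error(features, exemplar_features):
    """Sum of squared differences between two equal-length feature vectors.

    The squared-error objective is an illustrative assumption; any
    image-to-exemplar matching error could be substituted.
    """
    if len(features) != len(exemplar_features):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(features, exemplar_features))


def estimate_pose(features, exemplars):
    """Return (pose, error) for the exemplar with the smallest matching error.

    `exemplars` is a list of (pose_label, feature_vector) pairs
    (a hypothetical data layout). The search is exhaustive, so its cost
    grows linearly with the number of exemplars.
    """
    best_pose, best_error = None, math.inf
    for pose, exemplar_features in exemplars:
        error = matching_error(features, exemplar_features)
        if error < best_error:
            best_pose, best_error = pose, error
    return best_pose, best_error


if __name__ == "__main__":
    exemplars = [("standing", [0.0, 1.0]), ("sitting", [1.0, 0.0])]
    print(estimate_pose([0.1, 0.9], exemplars))
```

Because every exemplar is compared against the image features, the cost of this search scales with the size of the exemplar set.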
Despite some success, many challenging issues remain in achieving robust and efficient pose estimation. First, an optimization problem of high dimensionality must be solved, and, consequently, the computation is intractable unless certain assumptions are explicitly made. Such assumptions may concern the background, characteristics of the human subjects, clothing, distance, etc., in order to make the application domain manageable by the proposed algorithms. Accordingly, the application domains have generally been limited to uncluttered backgrounds or to human bodies of fixed scale. Descriptions of these concepts can be found in Barrón and Kakadiaris, Felzenszwalb and Huttenlocher, and Mori, et al., which were referenced above. Second, the set of exemplars must be sufficiently large to cover the parameter space necessary to achieve satisfactory estimation results. However, a large exemplar set also results in high computational complexity, as described in Shakhnarovich, et al., which was referenced above. Third, it is difficult to build robust body part detectors, except those for faces, due to the large appearance variation caused by clothing. A description of this can be found in Rapid Object Detection Using a Boosted Cascade of Simple Features, P. Viola and M. Jones, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001, which is incorporated by reference herein in its entirety, and in Mori, et al., which was referenced above.
A merit of the statistical formulation for posture estimation is that prior knowledge of human body parts (e.g., appearance, shape, edge and color) can be exploited and integrated into a rigorous probabilistic framework for efficient inference. Ioffe and Forsyth proposed an algorithm that sequentially draws samples of body parts and makes the best prediction by matching the assembled configurations with image observations. A description of this can be found in Finding People by Sampling, S. Ioffe and D. Forsyth, Proc. IEEE International Conference on Computer Vision, pages 1092–1097, 1999, which is incorporated by reference herein in its entirety. However, this approach is best applied to estimating human pose in images without clothing or cluttered background, since the method relies solely on edge cues. Sigal et al. applied a non-parametric belief propagation algorithm for inferring the 3-D human pose as the first step of a human tracking algorithm. Background subtraction and images from multiple views facilitated human pose estimation and tracking. Descriptions of these concepts can be found in Attractive People: Assembling Loose-Limbed Models Using Nonparametric Belief Propagation, L. Sigal, M. Isard, B. Sigelman, and M. Black, Advances in Neural Information Processing Systems 16, MIT Press, 2004, and in PAMPAS: Real-Valued Graphical Models for Computer Vision, M. Isard, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 613–620, 2003, which are incorporated by reference herein in their entirety.
Lee and Cohen applied the Data Driven Markov Chain Monte Carlo (DDMCMC) algorithm to estimate 3-D human pose from single images, wherein the MCMC algorithm traversed the pose parameter space. However, it is unclear how the detailed balance condition and convergence within the Markov chain were ensured. Most importantly, the problem of inferring 3-D body pose from single two-dimensional (2-D) images is intrinsically ill-posed as a consequence of depth ambiguity. Descriptions of these concepts can be found in Lee and Cohen, which was referenced above, and in Image Segmentation by Data-Driven Markov Chain Monte Carlo, Z. Tu and S.-C. Zhu, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):657–673, 2002, which is incorporated by reference herein in its entirety.
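For context, the detailed balance condition mentioned above requires that, for a target distribution pi and transition probabilities P, pi(i) P(i, j) equal pi(j) P(j, i) for every pair of states i and j. The following minimal sketch, which is a generic Metropolis-Hastings construction on a small discrete state space and not the method of Lee and Cohen, shows one way the condition can be checked numerically. The function names and the three-state example are assumptions made purely for illustration.

```python
def metropolis_hastings_matrix(target, proposal):
    """Build the Metropolis-Hastings transition matrix for a discrete target.

    target:   list of positive (possibly unnormalized) weights pi[i].
    proposal: row-stochastic matrix q[i][j]; it need not be symmetric.
    """
    n = len(target)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and proposal[i][j] > 0.0:
                # Hastings ratio corrects for an asymmetric proposal.
                ratio = (target[j] * proposal[j][i]) / (target[i] * proposal[i][j])
                P[i][j] = proposal[i][j] * min(1.0, ratio)
        # Rejected proposals leave the chain in state i.
        P[i][i] = 1.0 - sum(P[i])
    return P


def satisfies_detailed_balance(target, P, tol=1e-12):
    """Check pi[i] * P[i][j] == pi[j] * P[j][i] for all state pairs."""
    n = len(target)
    return all(
        abs(target[i] * P[i][j] - target[j] * P[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0]            # hypothetical target weights
    proposal = [[0.0, 0.5, 0.5],
                [0.3, 0.0, 0.7],
                [0.6, 0.4, 0.0]]        # hypothetical asymmetric proposal
    P = metropolis_hastings_matrix(target, proposal)
    print(satisfies_detailed_balance(target, P))         # with acceptance step
    print(satisfies_detailed_balance(target, proposal))  # proposals always accepted
```

In this example, the chain that always accepts proposed moves does not satisfy the condition for the chosen target, whereas the chain with the acceptance step does, which illustrates why the acceptance rule of a data-driven sampler must be specified for the condition to be verified.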
Based on the above, there is a need for an improved system and method for inferring human pose from single images that manages complexity, eliminates the need for inordinate assumptions, and provides reliable results.