1. Field of Disclosure
The disclosure generally relates to the field of tracking motion of a system, and more specifically, to pose estimation from visual input.
2. Description of the Related Art
Recovering human pose from visual observations is a challenging problem in the field of computer vision because of the complexity of the models that relate observations to pose. An effective solution to this problem has many applications in areas such as video coding, visual surveillance, human gesture recognition, biomechanics, video indexing and retrieval, character animation, and man-machine interaction. See D. Gavrila, “The visual analysis of human movement: a survey”, Computer Vision and Image Understanding, 73(1):82-98 (1999); see also L. Wang, W. Hu, and T. Tan, “Recent developments in human motion analysis”, Pattern Recog., 36(3):585-601 (2003); see also T. B. Moeslund, A. Hilton, and V. Kruger, “A survey of advances in vision-based human motion capture and analysis”, Computer Vision and Image Understanding, 104(2,3):90-126 (2006), all of which are incorporated by reference herein in their entirety.
One of the major difficulties in estimating pose from visual input involves recovering the large number of degrees of freedom in movements that are often subject to kinematic constraints such as joint limit avoidance and self-penetration avoidance between two body segments. Such difficulties are compounded by insufficient temporal or spatial resolution, ambiguities in the projection of human motion onto the image plane, and self-occlusions created by certain configurations. Other challenges include the effects of varying illumination, and therefore of varying appearance, variations in appearance due to the subject's attire, the required camera configuration, and the need for real-time performance in certain applications.
Traditionally, there are two categories of approaches to solving the pose estimation problem: model-based approaches and learning-based approaches. Model-based approaches rely on an explicitly known parametric human model, and recover pose either by inverting the kinematics from known image feature points on each body segment (See C. Barron and I. A. Kakadiaris, “Estimating anthropometry and pose from a single image”, Computer Vision and Pattern Recognition, 1:669-676 (2000); see also C. J. Taylor, “Reconstruction of articulated objects from point correspondences in a single uncalibrated image”, Computer Vision and Image Understanding, 80(3):349-363 (2000), both of which are incorporated by reference herein in their entirety), or by searching high-dimensional configuration spaces, which is typically formulated either deterministically as a nonlinear optimization problem (See J. M. Rehg and T. Kanade, “Model-based tracking of self-occluding articulated objects”, ICCV, pages 612-617 (1995), the content of which is incorporated by reference herein in its entirety), or probabilistically as a maximum likelihood problem (See H. Sidenbladh, M. J. Black, and D. J. Fleet, “Stochastic tracking of 3D human figures using 2D image motion”, ECCV, pages 702-718, (2000), the content of which is incorporated by reference herein in its entirety). The model-based approaches typically require good initialization and high-dimensional feature points, and are computationally intensive. In addition, because the model-based approaches generally do not enforce bodily constraints such as joint limits and self-penetration avoidance, they often generate erroneous estimation results.
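By way of illustration only, the deterministic formulation of a model-based approach can be sketched as follows. The sketch is not taken from any of the cited references. It assumes a hypothetical planar two-segment chain with unit segment lengths, assumes that the image positions of the elbow and wrist feature points are known, and substitutes plain gradient descent with a finite-difference gradient for whatever optimizer a particular system might use. All function names, constants, and values are illustrative assumptions.

```python
import math

L1, L2 = 1.0, 1.0  # hypothetical segment lengths (assumption)

def forward(theta):
    """Forward kinematics: joint angles -> (elbow, wrist) image points."""
    t1, t2 = theta
    elbow = (L1 * math.cos(t1), L1 * math.sin(t1))
    wrist = (elbow[0] + L2 * math.cos(t1 + t2),
             elbow[1] + L2 * math.sin(t1 + t2))
    return elbow, wrist

def cost(theta, observed):
    """Sum of squared distances between predicted and observed feature points."""
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
               for p, q in zip(forward(theta), observed))

def fit_pose(observed, init, step=0.05, iters=2000, eps=1e-6):
    """Minimize the cost by gradient descent, starting from the initial guess."""
    theta = list(init)
    for _ in range(iters):
        grad = []
        for i in range(len(theta)):
            hi = theta.copy()
            lo = theta.copy()
            hi[i] += eps
            lo[i] -= eps
            # Central finite-difference estimate of the partial derivative.
            grad.append((cost(hi, observed) - cost(lo, observed)) / (2 * eps))
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta

TRUE_POSE = (0.6, 0.8)           # synthetic ground truth, in radians
observed = forward(TRUE_POSE)    # synthetic, noise-free feature points
estimate = fit_pose(observed, init=(0.3, 0.5))
```

In this sketch the search starts near the true pose and applies no joint limit or self-penetration constraint, which mirrors the initialization and constraint limitations noted above.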
In contrast, learning-based approaches directly estimate body pose from observable image quantities. See A. Agarwal and B. Triggs, “Recovering 3d human pose from monocular images”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(1):44-58 (2006); see also G. Mori and J. Malik, “Recovering 3d human body configurations using shape contexts”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(7):1052-1062 (2006), both of which are incorporated by reference herein in their entirety. In example-based learning, inferring pose is typically formulated as a k-nearest neighbors search problem in which the input is matched to a database of training examples whose three-dimensional (3D) pose is known. The computational complexity of performing similarity search in high-dimensional spaces and on very large data sets has limited the applicability of these approaches. Although faster approximate similarity search algorithms have been developed based on Locality-Sensitive Hashing, computation speed remains a challenge with learning-based approaches. See G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estimation with parameter sensitive hashing”, ICCV, 2:750-757 (2003), the content of which is incorporated by reference herein in its entirety. Similar to the model-based approaches, the learning-based approaches also tend to be computationally intensive. In addition, in order for a pose to be properly recognized using a learning-based approach, a system must process (“learn”) the pose beforehand. Thus, generally only a small set of pre-programmed human poses can be recognized using the learning-based approaches.
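By way of illustration only, the k-nearest neighbors formulation described above can be sketched as follows. The sketch is not taken from any of the cited references. It assumes Euclidean distance between image feature vectors, assumes that the poses of the k nearest examples are simply averaged, and uses a hypothetical four-entry database; all names and values are illustrative assumptions.

```python
import heapq
import math

def knn_pose_estimate(query, database, k=3):
    """Average the known poses of the k examples nearest to the query features."""
    if not database:
        raise ValueError("database must not be empty")
    k = min(k, len(database))
    # Exhaustive scan: one distance computation per stored example.
    nearest = heapq.nsmallest(
        k, database, key=lambda example: math.dist(query, example[0]))
    dims = len(nearest[0][1])
    return [sum(pose[i] for _, pose in nearest) / k for i in range(dims)]

# Hypothetical training examples: (image feature vector, joint angles in degrees).
DATABASE = [
    ([0.0, 0.0], [0.0, 0.0]),
    ([1.0, 0.0], [10.0, 0.0]),
    ([0.0, 1.0], [0.0, 20.0]),
    ([5.0, 5.0], [90.0, 90.0]),
]

estimate = knn_pose_estimate([0.1, 0.1], DATABASE, k=3)
```

Each query in this sketch compares the input against every stored example, so the cost grows with both the number of examples and the dimension of the feature vectors, and a pose that is absent from the database cannot be returned except as a blend of stored poses.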
Hence, there is lacking, inter alia, a system and method for efficiently and accurately estimating human pose in real time.