Automatic recovery of 3D human pose from monocular image sequences is a challenging and important topic with numerous applications including video indexing, automotive safety, and surveillance. Although current methods are able to recover 3D pose for a single person in controlled environments, they are severely challenged by real-world scenarios such as crowded street scenes, in which multiple people must be tracked in cluttered surroundings using a monocular, potentially moving camera.
Probably the most important challenge in articulated 3D tracking is the inherent ambiguity of recovering 3D pose from monocular image evidence. This is particularly true for cluttered real-world scenes with multiple people who are often partially or even fully occluded for extended periods of time. Another important challenge, even for 2D pose recovery, is the complexity of human articulation and appearance. Additionally, the complex and dynamically changing backgrounds of realistic scenes complicate data association across multiple frames. While many of these challenges have been addressed individually, addressing all of them simultaneously using a monocular, potentially moving camera has not been achieved.
Due to the difficulties involved in reliable 3D pose estimation, this task has often been considered in controlled laboratory settings, with solutions frequently relying on background subtraction and simple image evidence, such as silhouettes or edge maps. In order to constrain the search in high-dimensional pose spaces, these approaches often use multiple calibrated cameras, complex dynamical motion priors, or detailed body models. Combining these techniques allows impressive results to be achieved, similar in performance to commercial marker-based motion capture systems. However, realistic street scenes do not satisfy many of the assumptions made by these systems. For such scenes, multiple synchronized video streams are difficult to obtain, the appearance of people is significantly more complex, and robust extraction of evidence is challenged by frequent full and partial occlusions, clutter, and camera motion. In order to address these challenges, a number of methods leverage recent advances in people detection and either use detection for pre-filtering and initialization, or integrate detection, tracking, and pose estimation within a single "tracking-by-detection" framework.
Estimation of 3D poses from 2D body part positions has been previously proposed. However, this approach was evaluated only under laboratory conditions for a single subject, and it remains unclear how well it generalizes to more complex settings with multiple people. There is also substantial work on predicting 3D poses directly from image features using regression, classification, or a search over a database of exemplars. These methods typically require a large database of training examples to achieve good performance and are challenged by the high variability in the appearance of people in realistic settings.
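The exemplar-search approach mentioned above can be illustrated with a minimal sketch: given a database pairing 2D keypoint configurations with known 3D poses, the 3D pose is looked up from the exemplar whose 2D keypoints best match the query. The function names, data layout, and pose values below are invented for illustration and are not taken from any of the cited systems.

```python
import math

def flat_distance(a, b):
    """Euclidean distance between two flat keypoint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_exemplar_3d(query_2d, database):
    """Return the 3D pose of the exemplar whose 2D keypoints are
    closest to the query (brute-force nearest-neighbor search).

    Hypothetical toy implementation: real systems use far larger
    databases and approximate search structures.
    """
    best = min(database, key=lambda ex: flat_distance(query_2d, ex["pose_2d"]))
    return best["pose_3d"]

# Tiny invented database: each exemplar has a 2-joint 2D pose
# (x, y per joint) and a corresponding 2-joint 3D pose (x, y, z).
database = [
    {"pose_2d": [0.0, 0.0, 1.0, 0.0],
     "pose_3d": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]},
    {"pose_2d": [0.0, 0.0, 0.0, 1.0],
     "pose_3d": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]},
]

# A noisy query close to the first exemplar retrieves its 3D pose.
query = [0.1, 0.0, 0.9, 0.1]
print(nearest_exemplar_3d(query, database))
```

The sketch also makes the stated limitation concrete: lookup quality is bounded by database coverage, so the high appearance and pose variability of realistic scenes demands very large exemplar sets.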
It is known from the article "People-tracking-by-detection and people-detection-by-tracking" by Andriluka, M., et al., Computer Vision and Pattern Recognition 2008 (CVPR 2008), ISBN 978-1-4244-2242-5, to provide an image processor having a 2D pose detector for estimating the pose of an object in an image, and a 2D tracker for applying 2D tracking-by-detection.