Detecting humans and estimating their poses from a single image is a fundamental problem for a range of applications, such as image retrieval and understanding. While humans can easily determine the locations and poses of people from the visual information contained in photographs, it is difficult to represent image data in a way that allows machines to successfully make this determination. The related problems of detecting humans and classifying their poses have conventionally been approached separately, with each problem presenting significant challenges to researchers.
Traditional research on human detection focuses on deriving an automatic procedure that locates the regions of a two-dimensional image that contain human bodies in an arbitrary pose. The human detection problem is hard because of the wide variability that images of humans exhibit. Given that it is impractical to explicitly model nuisance factors such as clothing, lighting conditions, viewpoint, body pose, and partial occlusions and/or self-occlusions, one can instead learn a descriptive model of human/non-human statistics. The problem then reduces to a binary classification task to which general statistical learning techniques can be directly applied. Consequently, the main focus of research on human detection has traditionally been on deriving a suitable representation, i.e., one that is as insensitive as possible to typical appearance variations, so that it provides good features to a standard classifier.
Numerous representation schemes have traditionally been exploited for human detection, e.g., Haar wavelets, edges, gradients and second derivatives, and regions from image segmentation. With these representations, algorithms such as template matching, support vector machines, AdaBoost, and grouping, to name a few, have been applied to the detection process. Examples of these techniques are set forth in Gavrila, D. M. and V. Philomin, Real-time Object Detection for Smart Vehicles, Proc. ICCV, pages 87-93, 1999; Ronfard, R., et al., Learning to Parse Pictures of People, Proc. ECCV, pages 700-714, 2002; Viola, P., et al., Detecting Pedestrians Using Patterns of Motion and Appearance, Proc. ICCV, pages 734-741, 2003; and Mori, G., et al., Recovering Human Body Configurations: Combining Segmentation and Recognition, Proc. CVPR, pages 326-333, 2004, which are all incorporated by reference herein in their entirety.
Recently, local descriptors based on histograms of gradient orientations have proven to be particularly successful for human detection tasks. The main idea is to use distributions of gradient orientations in order to be insensitive to color, brightness, and contrast changes and, to some extent, local deformations. However, conventional models still generally fail to account for more macroscopic variations due, for example, to changes in pose.
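By way of illustration only, the following minimal sketch shows why a distribution of gradient orientations is insensitive to brightness and contrast changes. It is not the descriptor of any reference cited herein. The function name `orientation_histogram`, the choice of nine bins, the central-difference gradients, the magnitude weighting, and the L1 normalization are all assumptions made for this example.

```python
import math

def orientation_histogram(image, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `image` is a list of rows of grayscale intensities. The result is
    L1-normalized, so it sums to 1 whenever any gradient is non-zero.
    (Hypothetical helper written for illustration.)
    """
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the image gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue  # flat region: no orientation to vote for
            # Unsigned orientation in [0, pi): opposite directions share a bin.
            angle = math.atan2(gy, gx) % math.pi
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[b] += mag
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist

# A small synthetic patch and a brighter, higher-contrast copy of it.
patch = [[x * x + 3 * y for x in range(6)] for y in range(6)]
adjusted = [[2 * v + 10 for v in row] for row in patch]

h1 = orientation_histogram(patch)
h2 = orientation_histogram(adjusted)
print(max(abs(a - b) for a, b in zip(h1, h2)) < 1e-9)  # prints True
```

Adding a constant to every pixel leaves the finite differences unchanged, and multiplying every pixel by a positive gain scales all gradient magnitudes equally, which the normalization removes. Nothing in such a histogram models large changes in body configuration, consistent with the limitation noted above.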
The problem of classifying human pose presents its own challenges. Humans are highly articulated objects with many degrees of freedom, which makes defining pose classes a difficult problem. Even with manual labeling, it is difficult to judge the distance between two poses or to cluster them. Most conventional approaches to pose estimation are based on body part detectors that either use edge, shape, color, and texture cues or are learned from training data. The optimal configuration of the part assembly is then computed using dynamic programming or by performing inference on a generative probabilistic model, using Data Driven Markov Chain Monte Carlo, Belief Propagation, or its non-Gaussian extensions, as described by Sigal, L., et al., Attractive People: Assembling Loose-Limbed Models Using Non-Parametric Belief Propagation, NIPS, pages 1539-1546, 2003, which is incorporated by reference herein in its entirety.
The approaches above focus on only one of the two problems, either detection or pose estimation. In human detection, since a simple yes/no answer is often desired, there is little or no advantage to introducing a complex model with latent variables associated with physical quantities. In pose estimation, on the other hand, the goal is to infer these quantities, and therefore a full generative model is a natural approach. Thus, human detection and pose estimation conventionally require computing two entirely different models and solving the problems in a completely independent manner. Further, using conventional techniques, the pose estimation problem cannot even be approached unless there is prior knowledge that the image contains a human. If solutions to both the problems of human detection and pose estimation are needed, conventional techniques are inefficient and incur significant computational cost.
What is needed is a method for efficiently performing human detection and pose classification from a single derived probabilistic model.