Robust recovery of 3D human pose from monocular images or videos is an actively growing field. Effective solutions would lead to breakthroughs in a wide range of applications, including visual surveillance, video indexing and retrieval, and human-computer interfaces. The problem is challenging due to both the internal complexity of the articulated human body and the external variations of the scene. The internal complexity stems from, among other factors, the large number of degrees of freedom of the human body, ambiguities of projection onto the image plane, varying body shape, and self-occlusion. The external variations include cluttered backgrounds and varying clothing, among others.
There are two general classes of approaches to human pose estimation: generative methods and discriminative methods. Generative methods recover the hidden states (the human pose) within an analysis-by-synthesis loop. They offer a natural and flexible way to represent the hidden states and the appearance of the human body, but their applicability is limited by the high computational cost of inferring the distribution over the hidden states and by the difficulty of constructing the observation models. These disadvantages have motivated the development of discriminative methods, which learn direct image-to-pose mappings by training on a dataset of labeled human poses. Compared with generative models, discriminative models, once trained, are much faster at test time, although in some cases their estimates are less precise than those of generative methods.
Among the image representations used by discriminative methods is the bag-of-words model. In the majority of works to date, however, the visual words are obtained by unsupervised clustering methods such as K-means. Visual words obtained this way capture the most common patterns in the entire training set and are good features for coarse-grained recognition tasks such as object detection and classification. Such representations, though, may lack the discriminative power needed to distinguish the subtle differences that matter in tasks such as pose estimation.
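To make the unsupervised construction concrete, the following is a minimal sketch of a K-means visual vocabulary and the resulting bag-of-words histogram. It is an illustration under stated assumptions rather than the pipeline of any particular work: the function names (`kmeans_vocabulary`, `bow_histogram`), the plain-list descriptor format, and the fixed iteration count are all hypothetical choices made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def kmeans_vocabulary(descriptors, num_words, num_iters=20, seed=0):
    """Plain K-means (Lloyd's algorithm) over local descriptors.

    Assumption: `descriptors` is a list of equal-length numeric sequences
    pooled from all training images, with no pose labels involved.
    """
    rng = random.Random(seed)
    # Initialise the centres with randomly chosen descriptors.
    vocabulary = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(num_iters):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, vocabulary)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                dim = len(members[0])
                vocabulary[k] = [sum(m[i] for m in members) / len(members)
                                 for i in range(dim)]
    return vocabulary


def bow_histogram(descriptors, vocabulary):
    """Normalised histogram of visual-word assignments for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

Because the centres are fitted without reference to pose labels, they follow the density of the descriptors, which is consistent with the observation above that such words favor common patterns over pose-specific detail.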
Generative methods construct observation likelihoods or cost functions that measure how well a body configuration aligns with the observation. Complex sampling or nonlinear optimization methods are then used to locate the likelihood peaks within an analysis-by-synthesis loop. In addition, models of state priors or image statistics are learned by supervised or unsupervised procedures to aid the pose estimation.
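The analysis-by-synthesis loop can be sketched on a toy problem. The example below assumes a hypothetical planar two-link limb whose pose is a pair of joint angles and whose observation is the 2D position of each joint. The cost is the squared distance between synthesized and observed joints, and inference is a simple stochastic search with shrinking proposals, standing in for the more elaborate samplers and optimizers used in practice. All names and constants (`LINK_LENGTHS`, `synthesize`, `analysis_by_synthesis`) are illustrative assumptions.

```python
import math
import random

LINK_LENGTHS = (1.0, 0.8)  # hypothetical limb segment lengths


def synthesize(pose):
    """Forward model: joint angles -> 2D positions of the two joints."""
    x = y = angle = 0.0
    joints = []
    for theta, length in zip(pose, LINK_LENGTHS):
        angle += theta  # each angle is relative to the parent segment
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        joints.append((x, y))
    return joints


def cost(pose, observed):
    """Squared distance between synthesized and observed joint positions."""
    return sum((px - ox) ** 2 + (py - oy) ** 2
               for (px, py), (ox, oy) in zip(synthesize(pose), observed))


def analysis_by_synthesis(observed, init_pose, num_iters=2000, step=0.5, seed=0):
    """Propose a pose, synthesize it, compare with the observation, keep it if better."""
    rng = random.Random(seed)
    best, best_cost = list(init_pose), cost(init_pose, observed)
    for i in range(num_iters):
        # Proposal spread shrinks linearly so late iterations refine locally.
        scale = step * (1.0 - i / num_iters) + 1e-3
        candidate = [t + rng.gauss(0.0, scale) for t in best]
        c = cost(candidate, observed)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost
```

Every iteration calls the forward model once, which illustrates why inference cost grows quickly as the number of degrees of freedom and the complexity of the observation model increase.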
Discriminative methods are usually fast, while the estimates of generative methods are often more precise. Researchers have therefore attempted to combine discriminative and generative methods in order to exploit the advantages of both. For example, a discriminative method based on a mixture of regressors can directly recover the model parameters, which are then used to initialize a generative model for more detailed estimation. In another approach, the discriminative model is tuned using samples from the generative model, and the generative model is optimized to produce inferences close to those predicted by the current discriminative model. Both the generative and the combined methods incur a high computational cost at inference.
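The first combination strategy, a discriminative estimate that initializes a generative refinement, can be sketched as follows. This is a toy illustration under stated assumptions and not the method of any cited work: the pose is a single joint angle, the "image feature" is the 2D endpoint of a limb, a nearest-neighbour lookup stands in for the mixture of regressors, and the refinement is a basic pattern search on the generative cost. All names and constants are hypothetical.

```python
import math

LIMB_LENGTH = 1.0  # hypothetical limb length


def render(theta):
    """Generative forward model: joint angle -> 2D endpoint of the limb."""
    return (LIMB_LENGTH * math.cos(theta), LIMB_LENGTH * math.sin(theta))


def generative_cost(theta, observed):
    """Squared distance between the rendered and the observed endpoint."""
    x, y = render(theta)
    return (x - observed[0]) ** 2 + (y - observed[1]) ** 2


def train_discriminative(num_samples=16):
    """Build (feature, pose) training pairs on a coarse grid of poses."""
    poses = [-math.pi + 2.0 * math.pi * i / num_samples
             for i in range(num_samples)]
    return [(render(p), p) for p in poses]


def predict_discriminative(model, observed):
    """Nearest-neighbour lookup: a simple stand-in for a mixture of regressors."""
    def feature_distance(pair):
        fx, fy = pair[0]
        return (fx - observed[0]) ** 2 + (fy - observed[1]) ** 2
    return min(model, key=feature_distance)[1]


def refine_generative(theta, observed, num_iters=50, step=0.2):
    """Pattern search on the generative cost, started from the coarse estimate."""
    best_cost = generative_cost(theta, observed)
    for _ in range(num_iters):
        improved = False
        for candidate in (theta - step, theta + step):
            c = generative_cost(candidate, observed)
            if c < best_cost:
                theta, best_cost, improved = candidate, c, True
        if not improved:
            step *= 0.5  # no better neighbour at this scale, so search finer
    return theta, best_cost
```

In this sketch the discriminative step is a single table lookup, while the refinement calls the forward model repeatedly, so the combined estimate inherits the inference cost of its generative stage.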