1. Technical Field
This disclosure relates to automatically recognizing one or more actions of a human.
2. Description of Related Art
Recognizing basic human actions (e.g. walking, sitting down, and waving hands) from a monocular view may be important for many applications, such as video surveillance, human-computer interaction, and video content retrieval.
Some research efforts focus on recovering human poses. See A. Agarwal, B. Triggs, “3D Human Pose from Silhouettes by Relevance Vector Regression”, CVPR, pp. 882-888, 2004; A. Elgammal, C. S. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning”, CVPR, pp. 681-688, 2004; and M. W. Lee, I. Cohen, “Proposal Maps Driven MCMC for Estimating Human Body Pose in Static Images”, CVPR, pp. 334-341, 2004. This may be a necessary step for view-invariant human action recognition. However, 3D pose reconstruction from a single viewpoint may be difficult. A large number of parameters may need to be estimated, and perspective projection may introduce ambiguity.
Alternatively, example based methods may store a database of example human figures with known 3D parameters. See D. Ramanan, D. A. Forsyth, “Automatic Annotation of Everyday Movements”, NIPS, 2003; and G. Shakhnarovich, P. Viola, T. Darrell, “Fast Pose Estimation with Parameter-Sensitive Hashing”, ICCV, pp. 750-757, 2003. A 3D pose may be estimated by searching for examples similar to the input image. Comparing against known examples may be easier than inferring unknown parameters. However, good coverage of a high dimensional parameter space may need a large number of examples, and the difficulty of obtaining enough examples may limit the accuracy of the recovered pose.
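The example based lookup described above may be sketched as a nearest-neighbor search over a stored database. The following is a minimal illustration only, not an implementation from the cited works; the feature vectors, pose labels, and Euclidean distance metric are all hypothetical choices.

```python
import math

# Hypothetical database of examples: each entry pairs a silhouette
# feature vector with its known pose parameters (here, a label).
DATABASE = [
    ([0.1, 0.9, 0.3], "standing"),
    ([0.8, 0.2, 0.5], "sitting"),
    ([0.4, 0.6, 0.7], "walking"),
]

def euclidean(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_pose(query_features):
    """Return the pose of the stored example closest to the query image's features."""
    return min(DATABASE, key=lambda ex: euclidean(ex[0], query_features))[1]
```

A query whose features lie near a stored example inherits that example's pose; this also makes concrete why coverage matters, since a query far from every stored example still returns the (possibly poor) nearest match.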
2D approaches to action recognition have also been proposed. See M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, “Actions as space-time shapes”, ICCV, pp. 1395-1402, 2005; A. F. Bobick, J. W. Davis, “The recognition of human movement using temporal templates”, PAMI 23(3), pp. 257-267, 2001; Y. Ke, R. Sukthankar, M. Hebert, “Efficient Visual Event Detection using Volumetric Features”, ICCV, pp. 166-173, 2005; I. Laptev, T. Lindeberg, “Space-time interest points”, ICCV, pp. 432-439, 2003; and A. Yilmaz, M. Shah, “Actions sketch: a novel action representation”, CVPR, pp. 984-989, 2005. These approaches may be roughly grouped as space-time shape based, see M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, “Actions as space-time shapes”, ICCV, pp. 1395-1402, 2005; and A. Yilmaz, M. Shah, “Actions sketch: a novel action representation”, CVPR, pp. 984-989, 2005; interest point based, see Y. Ke, R. Sukthankar, M. Hebert, “Efficient Visual Event Detection using Volumetric Features”, ICCV, pp. 166-173, 2005; and I. Laptev, T. Lindeberg, “Space-time interest points”, ICCV, pp. 432-439, 2003; and motion template based, see A. F. Bobick, J. W. Davis, “The recognition of human movement using temporal templates”, PAMI 23(3), pp. 257-267, 2001. They may work effectively under the assumption that the viewpoint is relatively fixed (e.g., a frontal or lateral view), and, in some cases, varies only slightly.
The lack of a view-invariant action representation may limit the applications of such 2D based approaches. To address this limitation, some approaches may resort to using multiple cameras. See A. F. Bobick, J. W. Davis, “The recognition of human movement using temporal templates”, PAMI 23(3), pp. 257-267, 2001.
A truly view-invariant approach may need knowledge of 3D human poses, which can be robustly recovered from multiple views. See R. Kehl, M. Bray, L. J. Van Gool, “Full Body Tracking from Multiple Views Using Stochastic Sampling”, CVPR, pp. 129-136, 2005; and D. Weinland, R. Ronfard, and E. Boyer, “Free Viewpoint Action Recognition using Motion History Volumes”, CVIU, 103(2-3), pp. 249-257, 2006. A more challenging problem can be to recover 3D poses from a single view. Some methods may learn a direct mapping from the image feature space (e.g. silhouette) to the parameter space (e.g. 3-D pose) using techniques such as regression, see A. Agarwal, B. Triggs, “3D Human Pose from Silhouettes by Relevance Vector Regression”, CVPR, pp. 882-888, 2004, or manifold embedding, see A. Elgammal, C. S. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning”, CVPR, pp. 681-688, 2004. However, such a mapping may be multi-valued, and it may be difficult for direct mapping to maintain multiple hypotheses over time. Other approaches may directly search the parameter space for an optimal solution. See M. W. Lee, I. Cohen, “Proposal Maps Driven MCMC for Estimating Human Body Pose in Static Images”, CVPR, pp. 334-341, 2004. Due to the high dimensionality, such approaches may use sampling based techniques and look for image evidence to guide the sampler, but the computational complexity may still be very high.
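The direct-mapping idea above may be sketched as fitting a regressor from silhouette features to pose parameters. The sketch below uses ordinary least squares for illustration (the cited works use more sophisticated regressors); all training data, feature dimensions, and pose values are hypothetical.

```python
import numpy as np

# Hypothetical training set: silhouette descriptors X paired with
# pose parameters Y (here, a single joint angle per example).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])
Y = np.array([[10.0], [20.0], [30.0], [15.0]])

# Learn a linear mapping W so that pose ~= features @ W.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict_pose(features):
    """Map a silhouette feature vector directly to pose parameters."""
    return np.asarray(features) @ W
```

Note that a single regressor of this form returns one pose per silhouette, which illustrates the multi-valued difficulty noted above: when two distinct 3D poses yield the same silhouette, a direct mapping cannot represent both hypotheses.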
Some approaches to the pose tracking or the action recognition task use graph models to exploit temporal constraints, such as Hidden Markov Models, see D. Ramanan, D. A. Forsyth, “Automatic Annotation of Everyday Movements”, NIPS, 2003; and Conditional Random Fields, see C. Sminchisescu, A. Kanaujia, Z. Li, D. Metaxas, “Conditional models for contextual human motion recognition”, ICCV, pp. 1808-1815, 2005.
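How such graph models exploit temporal constraints may be illustrated with Viterbi decoding over a small Hidden Markov Model. The states, transition probabilities, and per-frame likelihoods below are purely illustrative and not taken from the cited works.

```python
# Two hypothetical pose states with sticky transitions, modeling the
# temporal constraint that poses change smoothly between frames.
STATES = ["stand", "walk"]
TRANS = {"stand": {"stand": 0.8, "walk": 0.2},
         "walk":  {"stand": 0.2, "walk": 0.8}}
START = {"stand": 0.5, "walk": 0.5}

def viterbi(emissions):
    """Return the most likely state sequence.

    emissions: list of dicts, one per frame, mapping state -> likelihood
    of the observed image given that state.
    """
    probs = {s: START[s] * emissions[0][s] for s in STATES}
    paths = {s: [s] for s in STATES}
    for e in emissions[1:]:
        new_probs, new_paths = {}, {}
        for s in STATES:
            # Best predecessor under the transition model.
            p, prev = max((probs[q] * TRANS[q][s] * e[s], q) for q in STATES)
            new_probs[s] = p
            new_paths[s] = paths[prev] + [s]
        probs, paths = new_probs, new_paths
    return paths[max(STATES, key=lambda s: probs[s])]
```

In this sketch a frame whose per-frame likelihood weakly favors the wrong state is corrected by the transition model, which is the benefit temporal constraints provide over frame-by-frame classification.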