1. Field
Intelligent Video Surveillance (IVS) systems may be used to detect events of interest in video feeds in real-time or offline (e.g., by reviewing previously recorded and stored video). Typically, this task may be accomplished by detecting, tracking and/or analyzing targets of interest. This disclosure relates to video surveillance, such as video surveillance methods and systems and video verification methods and systems. Video surveillance systems, devices and methods are disclosed that may analyze video images to provide more models of detected human objects within the video, including modeling of the shape and pose of the detected human objects. Accessories of the human objects in the video image may be detected and modeled.
2. Background
With advancements in computer vision technology and the emergence of matured technologies for detection and tracking of human targets from a significant stand-off point, there is a greater need for cognitive video analytics with the ability to infer subtle attributes of humans and analyze human behavior. Initial work on marker-less motion capture focused on accurate 3D pose estimation from single and multi-view imagery. A comprehensive survey of existing state-of-the-art techniques in vision-based motion capture is provided by T. B. Moeslund, A. Hilton and V. Kruger in “A Survey of Advances in Vision-Based Human Motion Capture and Analysis,” (Computer Vision and Image Understanding, 104(2-3):90-126, 2006). Bregler and Malik in “Twist Based Acquisition and Tracking of Animal and Human Kinematics,” (International Journal of Computer Vision, 56(3):179-194, 2004) proposed a representation for articulated human models using twists that has been widely employed in a number of single and multiple camera based motion capture systems. Compared to earlier approaches that modeled human shapes with cylindrical or superquadric parts, current methods model 3D human shapes more accurately using SCAPE body models (see, e.g., A. O. Balan and M. J. Black “The Naked Truth: Estimating Body Shape Under Clothing” (ECCV (2), pages 15-29, 2008)) or the CAESAR dataset (see, e.g., B. Allen, B. Curless, and Z. Popovic “The Space of Human Body Shapes: Reconstruction and Parameterization from Range Scans,” (ACM SIGGRAPH, 2003)). A number of recent multi-camera based systems proposed by Balan and Sigal employed SCAPE data to model variability in 3D human shapes due to anthropometry and pose. They have used these shape models to estimate human body shape under loose clothing and also to track efficiently across multiple frames. Guan et al. in “Estimating Human Shape and Pose from a Single Image,” (ICCV, pages 1381-1388, IEEE, 2009) used a SCAPE-based shape model to perform height-constrained estimation of body shape. However, these approaches lack an articulated skeleton underlying the human body shape. The 3D shape deformation of the body surface is captured by tracking the 3D mesh surfaces directly. Deforming the 3D mesh while maintaining surface smoothness is not only computationally demanding but also ill-constrained, occasionally causing poor surface deformation due to noisy silhouettes (or visual hull).
Other approaches include:
L. Mundermann, S. Corazza and T. P. Andriacchi “Accurately Measuring Human Movement Using Articulated ICP with Soft-Joint Constraints and a Repository of Articulated Models” (CVPR, IEEE Computer Society, 2007).
J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn and H. P. Seidel “Motion Capture Using Joint Skeleton Tracking and Surface Estimation” (IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1746-1753, 2009).
C. Stoll, J. Gall, E. de Aguiar, S. Thrun, and C. Theobalt “Video-Based Reconstruction of Animatable Human Characters” (ACM Trans. Graph., 29(6):139, 2010).
J. Gall, A. Yao and L. J. V. Gool “2d Action Recognition Serves 3d Human Pose Estimation” (ECCV (3), pages 425-438, 2010).
G. Pons-Moll, A. Baak, T. Helten, M. Muller, H. P. Seidel and B. Rosenhahn “Multisensor-Fusion for 3d Full-Body Human Motion Capture” (CVPR, pages 663-670, 2010).
Y. Chen, T. K. Kim and R. Cipolla “Inferring 3d Shapes and Deformations from Single Views” (ECCV (3), pages 300-313, 2010).
Some of these approaches develop a model with an underlying skeleton. However, detailed 3D human shape estimation from multi-view imagery is still a difficult problem that does not have a satisfactory solution. The articles referenced in this disclosure are all incorporated by reference in their entirety.
The embodiments described here address some of these problems of existing systems.