The present invention relates to fast articulated motion tracking.
One of the fundamental problems in computer vision is estimating the 3D motion of humans. Motion capture is an essential part of a wide range of modern industries, from sports science and biomechanics to animation for games and movies. The state of the art in industrial applications is still marker-based optical capture systems, which enable accurate capture at the cost of requiring a complex setup of cameras and markers. Marker-less methods, which are able to track the motion of characters without interfering with the scene geometry, rely on pose detection based on image features. These approaches, however, assume that the poses have been previously observed during training.
Most related to the present invention is a real-time tracking system called Pfinder proposed by Wren et al. [C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. TPAMI, 19:780-785, 1997]. It models the human by 2D Gaussians in the image domain and represents the appearance of each blob by an uncorrelated Gaussian in the color space. The background is modeled by Gaussians in the color space for each image pixel. Pose estimation is finally formulated as 2D blob detection, i.e., each image pixel is assigned to the background or to one of the human blobs. The final 2D pose is obtained by iterative morphological growing operations and 2D Markov priors. The approach has been extended to the multi-view case in [S. Yonemoto, D. Arita, and R. Taniguchi. Real-time human motion analysis and ik-based human figure control. In Workshop on Human Motion, pages 149-154, 2000] where the blobs are detected in each image and the 3D position of the blobs is then reconstructed using inverse kinematics.
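By way of illustration only, the per-pixel assignment described above may be sketched as follows. The sketch is not the implementation of Wren et al.; it assumes that each blob carries a 2D Gaussian over pixel position and an uncorrelated (diagonal-covariance) Gaussian over color, that the spatial Gaussian is likewise diagonal for simplicity, and that the background is described by one color Gaussian per pixel. The function names and the dictionary layout of a blob are hypothetical and chosen for this example.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def classify_pixel(pos, color, blobs, bg_mean, bg_var):
    """Assign one pixel to the background or to the most likely blob.

    pos             : (x, y) pixel coordinates
    color           : pixel color as a tuple of channel values
    blobs           : list of dicts with keys pos_mean, pos_var,
                      col_mean, col_var (hypothetical layout)
    bg_mean, bg_var : color Gaussian of the background at this pixel

    Returns the string "background" or the integer index of a blob.
    """
    best_label = "background"
    best_score = diag_gauss_loglik(color, bg_mean, bg_var)
    for k, blob in enumerate(blobs):
        # Blob score: spatial term plus color term (assumed independent).
        score = (
            diag_gauss_loglik(pos, blob["pos_mean"], blob["pos_var"])
            + diag_gauss_loglik(color, blob["col_mean"], blob["col_var"])
        )
        if score > best_score:
            best_label, best_score = k, score
    return best_label
```

Applying such a function to every pixel yields a label image. The morphological growing operations and 2D Markov priors mentioned above are not covered by this sketch.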
Other approaches to human pose estimation without silhouette information combine segmentation with a shape prior and pose estimation. Graph-cut segmentation has been used, as has level-set segmentation together with motion features or an analysis-by-synthesis approach. Handheld video cameras have also been used, with a structure-from-motion approach to calibrate the moving cameras. While these approaches iterate over segmentation and pose estimation, the energy functional commonly used for level-set segmentation can also be integrated directly into the pose estimation scheme to speed up the computation. This approach, however, does not achieve real-time performance and requires 15 seconds per frame for a multi-view sequence recorded with 4 cameras at a resolution of 656×490 pixels.
Implicit surfaces have been used for 3D surface reconstruction. More particularly, a human model comprising a skeleton, implicit surfaces to simulate muscles and fat tissue, and a polygonal surface for the skin has been used for multi-view shape reconstruction from dynamic 3D point clouds and silhouette data. The implicit surfaces are modeled by Gaussians. Because the estimation of the shape and pose parameters is performed by several processes, the whole approach is very time-consuming and not suitable for real-time applications. To increase reliability, motion priors can be used to improve model fitting. An implicit surface model of a human has also been matched to 3D points with known normals.
It is therefore an object of the invention to provide a method for acquiring a model and for tracking the pose of an actor in a digital video that is fast and efficient.