The problem of video-based figure tracking has received a great deal of recent attention and is the subject of growing commercial interest. The essential idea is to employ a kinematic model of a figure to estimate its motion from a sequence of video images, given a known starting pose of the figure. The result is a 3-D description of figure motion that can be employed in a number of advanced video and multi-media applications. These applications include video editing, motion capture for computer graphics animation, content-based video retrieval, athletic performance analysis, surveillance, and advanced user interfaces. In the most interesting applications, the figures are people.
One way to track the moving body of a human figure, i.e., the head, torso, and limbs, is to model the body as an articulated object composed of rigid parts, called links, that are connected by movable joints. The kinematics of an articulated object provide the most fundamental constraint on its motion. There has been a significant amount of research on the use of 3-D kinematic models for the visual tracking of people.
Kinematic models describe the possible motions, or degrees of freedom (DOF), of the articulated object. In the case of the human figure, kinematic models capture basic skeletal constraints, such as the fact that the knee acts as a hinge, allowing only one degree of rotational freedom between the lower and upper leg.
The goal of figure tracking is to estimate the 3-D motion of the figure using one or more sequences of video images. Typically, the motion is represented as a trajectory of the figure through a state space, where the trajectory is described by kinematic parameters such as joint angles and link lengths. In applications where it is possible to use multiple video cameras to image a moving figure from a wide variety of viewpoints, good results for 3-D tracking can be obtained. In fact, this approach, in combination with retro-reflective targets placed on the figure, is the basis for the optical motion capture industry.
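As a rough sketch of this state-space representation (illustrative only; the particular joints, links, and the use of Python/NumPy are assumptions, not part of any specific tracker), a trajectory might be stored as follows:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FigureState:
    """One point on the figure's trajectory through state space."""
    joint_angles: np.ndarray   # e.g., shoulder, elbow, hip, knee angles, in radians
    link_lengths: np.ndarray   # e.g., upper arm, forearm, thigh, shin lengths, in meters

# A tracked motion is a time-ordered sequence of states, one per video frame.
trajectory = [
    FigureState(joint_angles=np.zeros(4),
                link_lengths=np.array([0.30, 0.25, 0.45, 0.42]))
    for _ in range(100)
]
```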
However, there are many interesting applications in video editing and retrieval where only a single camera view of a moving figure is available, for example, movies and television broadcasts. When traditional 3-D kinematic model-based tracking techniques are applied in these situations, the performance of the tracker is often poor. This is due to the presence of kinematic singularities. Singularities arise when instantaneous changes in the state of the figure produce no appreciable change in the image measurements. For example, when the arm is rotated towards the camera, it may not appear to be moving very much. This situation causes standard gradient-based estimation schemes to fail.
Furthermore, in many video processing applications, there may not be a great deal of prior information available about the moving figures. In particular, dimensions such as the lengths of the arms and legs may not be known. Thus, it would be desirable to infer as much as possible about the 3-D kinematics from the video imagery itself.
The prior art can be divided into 2-D and 3-D methods. In 2-D tracking methods, the figure is represented as a collection of templates, or pixel regions, that follow a simple 2-D motion model, such as affine image flow. These methods suffer from two problems that make them unsuitable for many applications. First, there is no obvious method for extracting 3-D information from the output of these methods, as they have no straightforward connection to the 3-D kinematics of the object. As a result, it is not possible, for example, to synthesize the tracked motion as it would be imaged from a novel camera viewpoint, or to use motion of the templates to animate a different shaped object in 3-D. Second, these methods typically represent the image motion with more parameters than a kinematically-motivated model would require, making them more susceptible to noise problems.
There has been a great deal of work on 3-D human body tracking using 3-D kinematic models. Most of these 3-D models employ gradient-based estimation schemes, and, therefore, are vulnerable to the effects of kinematic singularities. Methods that do not use gradient techniques usually employ an ad-hoc generate-and-test strategy to search through state space. The high dimensionality of the state space for an articulated figure makes these methods dramatically slower than gradient-based techniques that use the local error surface gradient to quickly identify good search directions. As a result, generate-and-test strategies are not a compelling option for practical applications, for example, applications that demand results in real time.
Gradient-based 3-D tracking methods exhibit poor performance in the vicinity of kinematic singularities. This effect can be illustrated using a simple one-link object 100 depicted in FIG. 1a. There, the link 100 has one DOF due to a joint 101 that movably fixes it to an arbitrary base. The joint 101 has an axis of rotation perpendicular to the plane of FIG. 1a. The joint 101 allows the object 100 to rotate by the angle $\theta$ in the plane of the figure.
Consider a point feature 102 at the distal end of the link 100. As the angle $\theta$ varies, the feature 102 will trace out a circle in the image plane, and any instantaneous changes in state will produce an immediate change in the position of the feature 102. Another way to state this is that the velocity vector for the feature 102, $V_\theta$, is never parallel to the viewing direction, which in this case is perpendicular to the page.
In FIG. 1b, the object 100 has an additional DOF. The extra DOF is provided by a mechanism that allows the plane in which the point feature 102 travels to "tilt" relative to the plane of the page. The Cartesian position (x, y) of the point feature 102 is a function of the two state variables $\theta$ and $\phi$, given by:

$$x = \cos(\phi)\sin(\theta), \qquad y = \cos(\theta).$$
This is simply a spherical coordinate system of unit radius with the camera viewpoint along the z axis.
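The following minimal Python sketch (the function name and sample values are illustrative assumptions) evaluates this forward-kinematics map for the two-DOF object of FIG. 1b:

```python
import numpy as np

def project_point(phi, theta):
    """Orthographic image position of point feature 102 for state q = (phi, theta)."""
    # Unit-radius spherical parameterization, viewed along the z axis.
    return np.array([np.cos(phi) * np.sin(theta), np.cos(theta)])

# Sweep theta with the plane of motion tilted by phi = 0.3 radians.
for theta in np.linspace(0.0, np.pi / 2, 5):
    print(project_point(0.3, theta))
```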
The partial derivative (velocity) of any point feature position with respect to the state $q = [\phi\ \ \theta]^T$, also called the "Jacobian," can be expressed as:

$$J = \begin{bmatrix} \dfrac{\partial x}{\partial \phi} & \dfrac{\partial x}{\partial \theta} \\[6pt] \dfrac{\partial y}{\partial \phi} & \dfrac{\partial y}{\partial \theta} \end{bmatrix} = \begin{bmatrix} -\sin(\phi)\sin(\theta) & \cos(\phi)\cos(\theta) \\ 0 & -\sin(\theta) \end{bmatrix}.$$
Singularities arise when the Jacobian matrix J loses rank. In this case, rank is lost when either $\sin(\phi)$ or $\sin(\theta)$ is equal to zero. In both cases, $J_{sing}\,dq = 0$ for state changes $dq = [1\ 0]^T$, implying that changes in $\phi$ cannot be recovered from point feature measurements in these configurations.
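A short numerical check of this rank loss, continuing the sketch above (the configuration values are illustrative):

```python
import numpy as np

def jacobian(phi, theta):
    """Jacobian of the image position (x, y) with respect to the state q = (phi, theta)."""
    return np.array([[-np.sin(phi) * np.sin(theta), np.cos(phi) * np.cos(theta)],
                     [0.0,                          -np.sin(theta)]])

# Near sin(phi) = 0 the Jacobian loses rank, and the pure-phi direction dq = [1, 0]^T
# is mapped to zero image motion, so changes in phi are unobservable.
J = jacobian(phi=0.0, theta=0.7)
print(np.linalg.matrix_rank(J))       # 1  (rank deficient)
print(J @ np.array([1.0, 0.0]))       # [0. 0.]  (no image motion for a change in phi)
```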
Singularities impact visual tracking by their effect on state estimation using error minimization. Consider tracking the object 100 of FIG. 1b using the well-known Levenberg-Marquardt update step:

$$q_k = q_{k-1} + dq_k = q_{k-1} - (J^T J + \Lambda)^{-1} J^T R,$$
where $\Lambda$ is a diagonal stabilizing matrix and R is the vector of image measurement residuals. See Dennis et al., "Numerical Methods for Unconstrained Optimization and Nonlinear Equations," Prentice-Hall, Englewood Cliffs, N.J., 1983, for details.
At the singularity $\sin(\phi) = 0$, the update step for all trajectories has the form $dq = [0\ C]^T$ for some scalar C, implying that no updates to $\phi$ will occur regardless of the measured motion of the point feature 102. This singularity occurs, for example, when the link rotates through a plane parallel to the image plane, resulting in a point velocity $V_\phi$ which is parallel to the camera or viewing axis.
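A minimal Python sketch of this update step, assuming a scalar damping value lam for the diagonal of $\Lambda$ and an arbitrary residual R (both illustrative), shows that the $\phi$ component of the step vanishes at the singularity:

```python
import numpy as np

def jacobian(phi, theta):
    # Same Jacobian as in the earlier sketch.
    return np.array([[-np.sin(phi) * np.sin(theta), np.cos(phi) * np.cos(theta)],
                     [0.0,                          -np.sin(theta)]])

def lm_update(q, residual, lam=1e-3):
    """One damped step: q - (J^T J + Lambda)^{-1} J^T R, with Lambda = lam * I assumed."""
    J = jacobian(*q)
    return q - np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ residual)

# At the singularity sin(phi) = 0 the phi component of the step is always zero:
# the estimator cannot correct phi, whatever image motion is measured.
q = np.array([0.0, 0.7])                               # state [phi, theta]
print(lm_update(q, residual=np.array([0.05, -0.02])))  # first (phi) component stays 0.0
```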
FIG. 2 graphically illustrates the practical implications of singularities on tracker performance. In FIG. 2, the x-axis plots iterations, and the y-axis plots the angle $\phi$ in radians. The stair-stepped solid line 201 corresponds to discrete steps in $\phi$ of a simulation of the two-DOF object 100 of FIG. 1b. The dotted line 202 shows the state estimates produced by the update equation as a function of the number of iterations of the solver.
The increased "damping" in the estimator, shown by the dotted line 202, as the trajectory approaches the point when .phi.=0 is symptomatic of tracking near singularities. In this example, the singular state was never reached. In fact, at point 204, the tracker makes a serious error and continues in a downward direction opposite the true motion as a consequence of the usual reflective ambiguity under orthographic projection. This is shown by the dashed line 203. A correct tracker would follow the upward portion of the solid line 201.
In addition to singularity problems, tracking with 3-D kinematic models also requires the 3-D geometry of the object to be known in advance, particularly the lengths of the links. In order to track a particular person, the figure model must first be tuned so that the arms, legs, and torso have the correct dimensions. This can be non-trivial in practice, due to the difficulty of measuring the exact locations of the joint centers in the images.
In one prior method, a two stage tracking technique is used to track hand gestures. See Shimada et al. in "3-D Hand Pose Estimation and Shape Model Refinement from a Monocular Image Sequence," Intl. Conf. on Virtual Systems and Multimedia, pp. 423-428, Gifu, Japan, Sep. 18, 1996, and Shimada et al. in "Hand Gesture Recognition Using Computer Vision Based on Model-Matching Method," Sixth Intl. Conf. on Human-Computer Interaction, Yokohama, Japan, Jul. 9, 1995.
In their first stage, hands are tracked using a crude 3-D estimate of hand motion that is obtained by matching to extracted silhouettes. In their second stage, model parameters are adapted using an Extended Kalman Filter (EKF).
The first stage of their method is based on adaptive sampling of the state space and requires a full 3-D model. This limits the method to situations where complete 3-D kinematic models are available. Furthermore, the adaptive sampling is dependent on the dimensions of the links, and requires separate models for hands of varying sizes.
The second stage adapts a previously specified 3-D kinematic model to a particular individual. This requires fairly close agreement between the original model and the subject, or else the EKF may fail to converge.
Another method is described by Ju et al. in "Cardboard People: A Parameterized Model of Articulated Image Motion," Intl. Conf. Automatic Face and Gesture Recognition, pp. 38-44, Killington, VT, 1996. There, each link is tracked with a separate template model, and adjacent templates are joined through point constraints. The method is not explicitly connected to any 3-D kinematic model, and, consequently, does not support 3-D reconstruction. In addition, the method requires a fairly large number of parameters, which may degrade performance because the extra parameters make the estimates more susceptible to noise.
Therefore, there is a need for a method that can extract a 3-D kinematic model for an articulated figure from a monocular sequence of 2-D images. The 3-D model could then be used to reconstruct the figure in 3-D for any arbitrary camera or viewing angle.