Technologies for analyzing the motion of the human figure play a key role in a broad range of applications, including computer graphics, user-interfaces, surveillance, and video editing. These applications cover three sub-tasks in motion analysis: synthesis, classification, and motion tracking.
A motion of the figure can be represented as a trajectory in a state space which is defined by the kinematic degrees of freedom of the figure. Each point in state space represents a single configuration or pose of the figure. A motion such as a plié in ballet is described by a trajectory along which the joint angles of the legs and arms change continuously.
A key issue in human motion analysis and synthesis is modeling the dynamics of the figure. While the kinematics of the figure define the state space, the dynamics define which state trajectories are possible (or probable).
In video tracking applications, a model of the figure kinematics is fit to an input video sequence, resulting in an estimated motion trajectory. Each point in the measured trajectory corresponds to a certain pose of the figure in a single video frame. Tracking is difficult because figure motion produces complex visual effects in a video sequence. While the skeleton itself can be approximated as a collection of articulated rigid links, its motion can only be measured indirectly through its effect on skin and clothing. Cloth and skin wrinkle and bulge as the figure moves, and changes in lighting and self-shadowing further complicate appearance modeling. In addition, self-occlusions of the figure, clutter in the background of the video, and the independent motion of the camera further complicate the task of estimating figure motion.
A dynamic model can be a powerful cue in figure tracking, as it reduces the total space of possible configurations of the figure down to the set of trajectories that are consistent with the dynamics. In the simplest case, the dynamics can reflect the inertia of the figure and capture the fact that when an arm is swinging in an upward motion, it is more likely to continue swinging upward than, for example, to suddenly move down. This constraint can eliminate many incorrect poses of the figure in cases where the video data is ambiguous.
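The inertia constraint described above can be illustrated with a minimal sketch. It assumes a single joint angle in degrees and a constant-velocity extrapolation; the function names and the numbers are hypothetical and chosen only for illustration.

```python
def predict_constant_velocity(prev_angle, curr_angle):
    """Extrapolate the next joint angle, assuming the velocity stays constant."""
    velocity = curr_angle - prev_angle
    return curr_angle + velocity


def rank_candidates(prev_angle, curr_angle, candidates):
    """Order candidate angles by their distance from the extrapolated angle."""
    predicted = predict_constant_velocity(prev_angle, curr_angle)
    return sorted(candidates, key=lambda angle: abs(angle - predicted))


# An arm swinging upward: the shoulder angle went from 10 to 20 degrees.
# Suppose the image supports two poses equally well: 30 (still rising)
# and 10 (suddenly dropped). The dynamics prefer the rising pose.
ranked = rank_candidates(10.0, 20.0, [10.0, 30.0])
print(ranked)  # [30.0, 10.0]
```

A tracker would typically combine such a dynamic score with an image-based score rather than use it alone.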
Even more effective tracking is possible when highly specific dynamic models are available for certain classes of motions. For example, the set of gestures that make up American Sign Language comprises only a small subset of the space of dynamically-feasible motions. A dynamic model that is tuned to this subset of gestures could provide even stronger constraints for visual tracking.
Tracking technology can play a critical role in applications such as video editing. Tracking can be used to build “high level” descriptions of video content based on the analysis of object motion. The ability to reliably track the motion of the figure, as well as the motion of the camera and other objects, is a key step in identifying the pixels in each frame that belong to a given object. Once this segmentation has been accomplished, an editing system can support high-level operations, such as removing people from, or adding people to, an existing video clip. Such simple-to-use but potentially very powerful editing tools could be particularly interesting in the consumer market, given the increasing popularity of digital video cameras that can be easily interfaced to PCs. Tracking technology also has many other applications to surveillance systems and user-interfaces.
Prior Approaches
Although the use of kinematic models in figure tracking is now commonplace, dynamic models have received relatively little attention. Most work on tracking employs one of two types of dynamic models: analytic or learned. Analytic models are specified by a human designer. They are typically second order differential equations relating joint torque, mass, and acceleration. Learned models, on the other hand, are constructed automatically from examples of human motion data.
Analytic Dynamic Models
The prior art includes a range of hand-specified analytic dynamical models. On one end of the spectrum are simple generic dynamic models based, for example, on constant velocity assumptions. Complex, highly specific models occupy the other end.
A number of proposed figure trackers use a generic dynamic model based on a simple smoothness prior such as a constant velocity Kalman filter. See, for example, Ioannis A. Kakadiaris and Dimitris Metaxas, “Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection,” Computer Vision and Pattern Recognition, pages 81-87, San Francisco, Calif., Jun. 18-20, 1996. Such models fail to capture subtle differences in dynamics across different motion types, such as walking or running. It is unlikely that these models can provide a strong constraint on complex human motion such as dance.
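For concreteness, a constant velocity Kalman filter for one joint angle can be sketched as follows. This is a generic textbook form and not the filter used in the cited work; the state layout (angle, angular velocity), the diagonal process noise q, the measurement noise r, and the function name kalman_step are all assumptions made for this illustration.

```python
def kalman_step(state, cov, measured_angle, dt=1.0, q=0.01, r=1.0):
    """One predict/update cycle of a constant-velocity Kalman filter.

    state: (angle, velocity); cov: 2x2 covariance as nested tuples.
    Only the angle is measured. q and r are assumed noise variances.
    """
    angle, velocity = state
    (p00, p01), (p10, p11) = cov

    # Predict: angle advances by velocity * dt, velocity is unchanged.
    angle_pred = angle + dt * velocity
    a = p00 + dt * p10
    b = p01 + dt * p11
    c00 = a + dt * b + q          # predicted covariance F P F^T + q I
    c01 = b
    c10 = p10 + dt * p11
    c11 = p11 + q

    # Update: correct the prediction with the measured angle.
    innovation = measured_angle - angle_pred
    s = c00 + r                   # innovation variance
    k0, k1 = c00 / s, c10 / s     # Kalman gain
    new_state = (angle_pred + k0 * innovation, velocity + k1 * innovation)
    new_cov = (((1 - k0) * c00, (1 - k0) * c01),
               (c10 - k1 * c00, c11 - k1 * c01))
    return new_state, new_cov


# Track a joint whose angle rises by 2 units per frame (noise-free here).
state, cov = (0.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for t in range(1, 51):
    state, cov = kalman_step(state, cov, measured_angle=2.0 * t)
print(state)  # estimated (angle, velocity) after 50 frames
```

The same transition model is applied regardless of the type of motion, which is why such a filter acts only as a smoothness prior.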
The field of biomechanics is a source of more complex and realistic models of human dynamics. From the biomechanics point of view, the dynamics of the figure are the result of its mass distribution, joint torques produced by the motor control system, and reaction forces resulting from contact with the environment, e.g., the floor. Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in complex, specialized models of human motion. For example, entire books have been written on the subject of walking. See, for example, Inman, Ralston and Todd, “Human Walking,” Williams and Wilkins, 1981.
The biomechanical approach has two drawbacks for analysis and synthesis applications. First, the dynamics of the figure are quite complex, involving a large number of masses and applied torques, along with reaction forces which are difficult to measure. In principle, all of these factors must be modeled or estimated in order to produce physically-valid dynamics. Second, in some applications we may only be interested in a small set of motions, such as a vocabulary of gestures. In the biomechanical approach, it may be difficult to reduce the complexity of the model to exploit this restricted focus. Nonetheless, these models have been applied to tracking and synthesis applications.
Wren and Pentland, “Dynamic models of human motion”, Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, pages 22-27, Nara, Japan, 1998, explored visual tracking using a biomechanically-derived dynamic model of the upper body. The unknown joint torques were estimated along with the state of the arms and head in an input estimation framework. A Hidden Markov Model (HMM) was trained to represent plausible sequences of input torques. Due to the simplicity of their experimental domain, there was no need to model reaction forces between the figure and its environment.
This solution suffers from the limitations of the biomechanical approach outlined above. In particular, describing the entire body would require a significant increase in the complexity of the model. Even more problematic is the treatment of the reaction forces, such as those exerted by the floor on the soles of the feet during walking or running.
Biomechanically-derived dynamic models have also been applied to the problem of synthesizing athletic motion, such as bike racing or sprinting, for computer graphics animations. See, for example, Hodgins, Wooten, Brogan and O'Brien, “Animating human athletics,” Computer Graphics (Proc. SIGGRAPH '95), pages 71-78, 1995. In this application, there is, in addition to the usual problems of complex dynamic modeling, the need to design control programs that produce the joint torques that drive the figure model. In this approach, it is difficult to capture more subtle aspects of human motion without some form of automated assistance. The motions that result tend to appear very regular and robotic, lacking both the randomness and fluidity associated with natural human motion.
Learned Dynamic Models
Four earlier works have addressed the problem of learning complex dynamic models from data within a state space framework. The approaches are all based on building statistical models of motion trajectories whose parameters are learned from a corpus of sample motions.
Brand, “Pattern discovery via entropy minimization,” Technical Report TR98-21, Mitsubishi Electric Research Lab, 1998, available at http://www.merl.com/reports/TR98-21/index.html, proposes an HMM-based framework for dynamics learning and applies it to synthesize realistic facial animations from a training corpus. The main component of this work is the use of an entropic prior to cope with sparse input data.
Brand's approach has two potential disadvantages. First, Brand assumes that the resulting dynamic model is time invariant; each state space neighborhood has a unique distribution over state transitions. Second, the use of entropic priors results in fairly “deterministic” models learned from a moderate corpus of training data. In contrast, the diversity of human motion applications requires complex models learned from a large corpus of data. In this situation, it is unlikely that a time invariant model will suffice, since different state space trajectories can originate from the same starting point, depending upon the class of motion being performed.
Ghahramani and Roweis, “Learning nonlinear stochastic dynamics using the generalized EM algorithm,” NIPS '99, Snowbird, Utah, 1999, use a Kalman smoother in conjunction with the generalized EM algorithm to learn a class of nonlinear dynamic models from input-output data. The requirement for computational tractability restricts the class of non-linearities in the model to a sum of Gaussian and affine kernels. Even though this approach attempts to explicitly model the non-linearity of the state transitions, it still suffers from the same time invariant restriction as the first approach.
Briegel and Tresp, “A monte carlo generalized EM-type algorithm for state and parameter estimation in nonlinear state space models,” Machines that Learn Workshop, Snowbird, Utah, 1998, along with Blake, North and Isard, “Learning multi-class dynamics,” NIPS '98, 1998, have addressed the use of nonparametric probability density models to perform dynamics learning. Blake's approach in particular has the ability to learn multiclass dynamics, meaning that the system can switch between multiple learned models. This may make it possible to learn time-varying models, unlike much of the other prior art.
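The idea of switching between multiple dynamic models can be shown with a toy sketch. It is not the method of any cited work: the two classes, their one-dimensional update rules, and the names MODELS and simulate are hypothetical, and in a learned system both the per-class models and the class sequence would be estimated from data.

```python
# Hypothetical per-class dynamics: each maps the current state to the next.
MODELS = {
    "walk": lambda x: x + 1.0,
    "run": lambda x: x + 3.0,
}


def simulate(start, class_sequence):
    """Roll the state forward, applying the model named at each time step."""
    trajectory = [start]
    for name in class_sequence:
        trajectory.append(MODELS[name](trajectory[-1]))
    return trajectory


# The same starting state yields different trajectories depending on
# which class of motion is active at each step.
print(simulate(0.0, ["walk", "walk", "run", "run"]))  # [0.0, 1.0, 2.0, 5.0, 8.0]
print(simulate(0.0, ["run", "run", "walk", "walk"]))  # [0.0, 3.0, 6.0, 7.0, 8.0]
```

Because the active model changes over time, the overall dynamics are time-varying even though each individual model is fixed.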
However, the use of a nonparametric model can be inefficient in domains where linear Gaussian models are a powerful building block. Nonparametric methods are particularly expensive when applied to large state spaces, since their cost grows exponentially with the state space dimension. Complexities in the motion of the figure and its appearance suggest that a fairly large state space will be required for good performance.
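The growth with dimension can be made concrete with simple arithmetic. The sketch below assumes a histogram-style density estimate with a fixed number of bins per dimension, which is only one example of a nonparametric representation; the function name and the bin count are hypothetical.

```python
def histogram_cells(bins_per_dim, state_dim):
    """Number of cells in a grid with bins_per_dim bins along each dimension."""
    return bins_per_dim ** state_dim


# With 10 bins per dimension, the cell count grows by a factor of 10
# for every added degree of freedom.
for dim in (2, 10, 30):
    print(dim, histogram_cells(10, dim))
```

Even at a coarse resolution, a state space with tens of degrees of freedom requires far more cells than any training corpus could populate.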
A final piece of relevant prior art in the learning domain is the work of Yacoob and Davis, “Learned temporal models of image motion,” Computer Vision and Pattern Recognition, pages 446-453, 1998, in learning temporal models of motion in images. Unlike the more common state space models, this approach concentrates directly on the image space by representing any motion as a flow field in some particular flow field space. The basis of that space is learned from a corpus of examples. Hence, different bases capture distinct motion types. One drawback of this approach is that it only captures motion of a fairly fixed (and known) duration. For example, a prototypical walk of only one particular speed can be learned. Another disadvantage is that the models that result are highly viewpoint-specific, since they depend implicitly on the camera position. Furthermore, the approach is primarily suited for analysis rather than synthesis of motion sequences.
Motion Capture for Motion Synthesis
A final category of prior art which is relevant to this invention is the use of motion capture to synthesize human motion with realistic dynamics. Motion capture is by far the most successful commercial technique for creating computer graphics animations of people. In this method, the motion of human actors is captured in digital form, using a special suit with either optical or magnetic sensors or targets. This captured motion is edited and used to animate graphical characters.
The motion capture approach has two important limitations. First, the need to wear special clothing in order to track the figure limits the application of this technology to motion which can be staged in a studio setting. This rules out the live, real-time capture of events such as the Olympics, dance performances, or sporting events in which some of the finest examples of human motion actually occur.
The second limitation of current motion capture techniques is that they result in a single prototype of human motion which can only be manipulated in a limited way without destroying its realism. Using this approach, for example, it is not possible to synthesize multiple examples of the same type of motion which differ in a random fashion. The result of motion capture in practice is typically a kind of “wooden,” fairly inexpressive motion that is most suited for animating background characters. That is precisely how this technology is currently used in Hollywood movie productions.
There is a clear need for more powerful tracking techniques that can recover human motion under less restrictive conditions. Similarly, there is a need for more powerful generative models of human motion that are both realistic and capable of generating sample motions with natural amounts of “randomness.”