Technologies for analyzing the motion of the human figure play a key role in a broad range of applications, including computer graphics, user interfaces, surveillance, and video editing.
A motion of the figure can be represented as a trajectory in a state space which is defined by the kinematic degrees of freedom of the figure. Each point in state space represents a single configuration or pose of the figure. A motion such as a plié in ballet is described by a trajectory along which the joint angles of the legs and arms change continuously.
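The state-space view above can be illustrated with a minimal sketch. The two-joint figure, the function name `plie_trajectory`, the joint limits, and the sinusoidal bend profile are all assumptions made for illustration and do not come from any cited work. Each tuple is one pose (one point in state space), and the ordered list of tuples is the trajectory.

```python
import math

# Hypothetical two-joint figure: state = (knee angle, hip angle) in radians.
def plie_trajectory(num_frames=5, max_knee=math.pi / 2, max_hip=math.pi / 4):
    """Return a list of poses; each pose is one point in state space."""
    trajectory = []
    for k in range(num_frames):
        # Phase runs from 0 to 1; sin(pi * phase) bends the joints, then straightens them.
        phase = k / (num_frames - 1)
        bend = math.sin(math.pi * phase)
        trajectory.append((max_knee * bend, max_hip * bend))
    return trajectory

trajectory = plie_trajectory()
```

A real figure model would have many more degrees of freedom, but the structure is the same: a sequence of joint-angle vectors that changes continuously from frame to frame.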
A key issue in human motion analysis and synthesis is modeling the dynamics of the figure. While the kinematics of the figure define the state space, the dynamics define which state trajectories are possible (or probable).
Since the key problem in synthesizing figure motion for animation is to achieve realistic dynamics, the importance of dynamic modeling is obvious. The challenge in animation is to produce motion with natural dynamics that satisfy constraints placed by the animator. Some constraints result from basic physical realities such as the noninterpenetration of objects. Others are artistic in nature, such as a desired head pose during a dance move. The key problem in synthesis is to find a trajectory in the set of dynamically realistic trajectories that satisfies the desired constraints.
Prior Approaches
Most previous work on synthesizing figure motion employs one of two types of dynamic models: analytic and learned. Analytic models are specified by a human designer. They are typically second order differential equations relating joint torque, mass, and acceleration. Learned models, on the other hand, are constructed automatically from examples of human motion data.
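As a rough illustration of what an analytic model of this kind looks like, the sketch below integrates a second order differential equation for a single link treated as a point-mass pendulum, m·l²·θ'' = τ − m·g·l·sin(θ). The single-link simplification, the function name `simulate_link`, the parameter values, and the semi-implicit Euler integrator are assumptions chosen for brevity. They are not taken from any of the models cited in this section.

```python
import math

def simulate_link(torque, mass=1.0, length=1.0, theta0=0.0, omega0=0.0,
                  dt=0.001, steps=1000, gravity=9.81):
    """Integrate m*l^2 * theta'' = torque - m*g*l*sin(theta) for one link.

    The link is modeled as a point mass at distance `length` from the joint
    (an illustrative simplification). Returns the final (angle, angular velocity).
    """
    inertia = mass * length ** 2
    theta, omega = theta0, omega0
    for _ in range(steps):
        # Angular acceleration from the applied joint torque and gravity.
        alpha = (torque - mass * gravity * length * math.sin(theta)) / inertia
        omega += alpha * dt   # semi-implicit Euler: update velocity first,
        theta += omega * dt   # then advance the angle with the new velocity
    return theta, omega
```

Even for this single link, the designer must supply mass, length, and the applied torque. A full-body analytic model multiplies these quantities across many coupled links.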
Analytic Dynamic Models
The prior art includes a range of hand-specified analytic dynamical models. On one end of the spectrum are simple generic dynamic models based, for example, on constant velocity assumptions. Complex, highly specific models occupy the other end.
A number of proposed figure trackers use a generic dynamic model based on a simple smoothness prior such as a constant velocity Kalman filter. See, for example, Ioannis A. Kakadiaris and Dimitris Metaxas, “Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection,” Computer Vision and Pattern Recognition, pages 81–87, San Francisco, Calif., Jun. 18–20, 1996. Such models fail to capture subtle differences in dynamics across different motion types, such as walking or running. It is unlikely that these models can provide a strong constraint on complex human motion such as dance.
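For concreteness, a constant velocity Kalman filter for a single joint angle can be sketched as follows. This is a generic textbook formulation, not the tracker of the cited reference. The function name `kalman_cv_step`, the scalar angle measurement, the diagonal process noise q·I, and the default noise values are assumptions made for illustration.

```python
def kalman_cv_step(x, P, z, dt=1.0, q=1e-3, r=1e-2):
    """One predict/update cycle of a constant velocity Kalman filter.

    x: (angle, velocity) estimate; P: 2x2 covariance as nested tuples;
    z: measured angle. Assumes F = [[1, dt], [0, 1]], H = [1, 0],
    process noise Q = q * I, and measurement noise variance r.
    """
    angle, vel = x
    (p00, p01), (p10, p11) = P

    # Predict: the angle advances by velocity * dt; the velocity is held constant.
    angle_pred = angle + dt * vel
    vel_pred = vel
    # Predicted covariance F P F^T + Q, written out element by element.
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update: blend the prediction with the measured angle.
    s = a00 + r                    # innovation variance
    k0, k1 = a00 / s, a10 / s      # Kalman gain
    resid = z - angle_pred         # innovation
    x_new = (angle_pred + k0 * resid, vel_pred + k1 * resid)
    # Updated covariance (I - K H) * predicted covariance.
    P_new = (((1 - k0) * a00, (1 - k0) * a01),
             (a10 - k1 * a00, a11 - k1 * a01))
    return x_new, P_new
```

The only motion knowledge built into this filter is that velocity changes slowly, which is why the same model is applied to walking, running, and every other motion type alike.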
The field of biomechanics is a source of more complex and realistic models of human dynamics. From the biomechanics point of view, the dynamics of the figure are the result of its mass distribution, joint torques produced by the motor control system, and reaction forces resulting from contact with the environment, e.g., the floor. Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in complex, specialized models of human motion. For example, entire books have been written on the subject of walking. See, for example, Inman, Ralston and Todd, “Human Walking,” Williams and Wilkins, 1981.
The biomechanical approach has two drawbacks for analysis and synthesis applications. First, the dynamics of the figure are quite complex, involving a large number of masses and applied torques, along with reaction forces which are difficult to measure. In principle, all of these factors must be modeled or estimated in order to produce physically-valid dynamics. Second, in some applications we may only be interested in a small set of motions, such as a vocabulary of gestures. In the biomechanical approach, it may be difficult to reduce the complexity of the model to exploit this restricted focus. Nonetheless, these models have been applied to tracking and synthesis applications.
Wren and Pentland, “Dynamic models of human motion”, Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, pages 22–27, Nara, Japan, 1998, explored visual tracking using a biomechanically-derived dynamic model of the upper body. The unknown joint torques were estimated along with the state of the arms and head in an input estimation framework. A Hidden Markov Model (HMM) was trained to represent plausible sequences of input torques. Due to the simplicity of their experimental domain, there was no need to model reaction forces between the figure and its environment.
This solution suffers from the limitations of the biomechanical approach outlined above. In particular, describing the entire body would require a significant increase in the complexity of the model. Even more problematic is the treatment of the reaction forces, such as those exerted by the floor on the soles of the feet during walking or running.
Biomechanically-derived dynamic models have also been applied to the problem of synthesizing athletic motion, such as bike racing or sprinting, for computer graphics animations. See, for example, Hodgins, Wooten, Brogan and O'Brien, “Animating human athletics,” Computer Graphics (Proc. SIGGRAPH '95), pages 71–78, 1995. In this application, there is, in addition to the usual problems of complex dynamic modeling, the need to design control programs that produce the joint torques that drive the figure model. In this approach, it is difficult to capture more subtle aspects of human motion without some form of automated assistance. The motions that result tend to appear very regular and robotic, lacking both the randomness and fluidity associated with natural human motion.
Learned Dynamic Models
Approaches in this category synthesize figure motion using dynamic models whose parameters are learned from a corpus of sample motions.
In Brand, “Pattern Discovery via Entropy Minimization,” Technical Report TR98–21, Mitsubishi Electric Research Lab, 1998, an HMM-based framework for dynamics learning is proposed and applied to synthesis of realistic facial animations from a training corpus. The main component of this work is the use of an entropic prior to cope with sparse input data.
Brand's approach has two potential disadvantages. First, it assumes that the resulting dynamic model is time invariant; each state space neighborhood has a unique distribution over state transitions. Second, the use of entropic priors results in fairly “deterministic” models learned from a moderate corpus of training data. In contrast, the diversity of human motion applications requires complex models learned from a large corpus of data. In this situation, it is unlikely that a time invariant model will suffice, since different state space trajectories can originate from the same starting point depending upon the class of motion being performed.
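The last point can be illustrated with a toy sketch. The pose labels, the transition probabilities, and the function name `sample_path` are hypothetical; the code represents neither Brand's method nor any particular learned model. It shows only that two classes of motion can share a starting pose ("stand") while requiring different transition distributions out of that pose, which a single time invariant table cannot express.

```python
import random

# Hypothetical per-class transition tables; both classes share the "stand" pose.
WALK = {"stand": {"stand": 0.2, "step": 0.8},
        "step": {"stand": 0.3, "step": 0.7}}
JUMP = {"stand": {"stand": 0.1, "crouch": 0.9},
        "crouch": {"stand": 1.0}}

def sample_path(transitions, start, length, rng):
    """Sample a pose sequence from a first-order Markov transition table."""
    path = [start]
    for _ in range(length - 1):
        options = transitions[path[-1]]
        # Draw the next pose according to the current pose's distribution.
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(nxt)
    return path

rng = random.Random(0)
walk_path = sample_path(WALK, "stand", 10, rng)
jump_path = sample_path(JUMP, "stand", 10, rng)
```

Both sampled paths begin at the same pose, yet they diverge because the distribution over the next pose depends on which class of motion is being performed.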
Motion Capture for Motion Synthesis
A final category of prior art which is relevant to this invention is the use of motion capture to synthesize human motion with realistic dynamics. Motion capture is by far the most successful commercial technique for creating computer graphics animations of people. In this method, the motion of human actors is captured in digital form using a special suit with either optical or magnetic sensors or targets. This captured motion is edited and used to animate graphical characters.
The motion capture approach has two important limitations. First, the need to wear special clothing in order to track the figure limits the application of this technology to motion which can be staged in a studio setting. This rules out the live, real-time capture of events such as the Olympics, dance performances, or sporting events in which some of the finest examples of human motion actually occur.
The second limitation of current motion capture techniques is that they result in a single prototype of human motion which can only be manipulated in a limited way without destroying its realism. Using this approach, for example, it is not possible to synthesize multiple examples of the same type of motion which differ in a random fashion. The result of motion capture in practice is typically a kind of “wooden”, fairly inexpressive motion that is most suited for animating background characters. That is precisely how this technology is currently used in Hollywood movie productions.
There is a clear need for more powerful tracking techniques that can recover human motion under less restrictive conditions. Similarly, there is a need for more powerful generative models of human motion that are both realistic and capable of generating sample motions with natural amounts of “randomness.”