Technologies for analyzing the motion of the human figure play a key role in a broad range of applications, including computer graphics, user-interfaces, surveillance, and video editing. These applications cover three sub-tasks in motion analysis: synthesis, classification, and motion tracking.
A motion of the figure can be represented as a trajectory in a state space which is defined by the kinematic degrees of freedom of the figure. Each point in state space represents a single configuration or pose of the figure. A motion such as a plié in ballet is described by a trajectory along which the joint angles of the legs and arms change continuously.
A key issue in human motion analysis and synthesis is modeling the dynamics of the figure. While the kinematics of the figure define the state space, the dynamics define which state trajectories are possible (or probable).
In video tracking applications, a model of the figure kinematics is fit to an input video sequence, resulting in an estimated motion trajectory. Each point in the measured trajectory corresponds to a certain pose of the figure in a single video frame. Tracking is difficult because figure motion produces complex visual effects in a video sequence. While the skeleton itself can be approximated as a collection of articulated rigid links, its motion can only be measured indirectly through its effect on skin and clothing. Cloth and skin wrinkle and bulge as the figure moves, and changes in lighting and self-shadowing further complicate appearance modeling. In addition, self-occlusions of the figure, clutter in the background of the video, and the independent motion of the camera further complicate the task of estimating figure motion.
A dynamic model can be a powerful cue in figure tracking, as it reduces the total space of possible configurations of the figure down to the set of trajectories that are consistent with the dynamics. In the simplest case, the dynamics can reflect the inertia of the figure and capture the fact that when an arm is swinging in an upward motion, it is more likely to continue swinging upward than, for example, to suddenly move down. This constraint can eliminate many incorrect poses of the figure in cases where the video data is ambiguous.
Even more effective tracking is possible when highly specific dynamic models are available for certain classes of motions. For example, the set of gestures that make up American Sign Language comprises only a small subset of the space of dynamically-feasible motions. A dynamic model that is tuned to this subset of gestures could provide even stronger constraints for visual tracking.
Tracking technology can play a critical role in applications such as video editing. Tracking can be used to build “high level” descriptions of video content based on the analysis of object motion. The ability to reliably track the motion of the figure, as well as the motion of the camera and other objects, is a key step in identifying the pixels in each frame that belong to a given object. Once this segmentation has been accomplished, an editing system can support high-level operations, such as removing people from, or adding people to, an existing video clip. Such simple-to-use but potentially very powerful editing tools could be particularly interesting in the consumer market, given the increasing popularity of digital video cameras that can be easily interfaced to PCs. Tracking technology also has many other applications to surveillance systems and user-interfaces.
Although the use of kinematic models in figure tracking is now commonplace, dynamic models have received relatively little attention. Most work on tracking employs one of two types of dynamic models: analytic or learned. Analytic models are specified by a human designer. They are typically second order differential equations relating joint torque, mass, and acceleration. Learned models, on the other hand, are constructed automatically from examples of human motion data.
The prior art includes a range of hand-specified analytic dynamical models. On one end of the spectrum are simple generic dynamic models based, for example, on constant velocity assumptions. Complex, highly specific models occupy the other end.
A number of proposed figure trackers use a generic dynamic model based on a simple smoothness prior such as a constant velocity Kalman filter. See, for example, Ioannis A. Kakadiaris and Dimitris Metaxas, “Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection,” Computer Vision and Pattern Recognition, pages 81-87, San Francisco, Calif., Jun. 18-20, 1996. Such models fail to capture subtle differences in dynamics across different motion types, such as walking or running. It is unlikely that these models can provide a strong constraint on complex human motion such as dance.
The field of biomechanics is a source of more complex and realistic models of human dynamics. From the biomechanics point of view, the dynamics of the figure are the result of its mass distribution, joint torques produced by the motor control system, and reaction forces resulting from contact with the environment, e.g., the floor. Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in complex, specialized models of human motion. For example, entire books have been written on the subject of walking. See, for example, Inman, Ralston and Todd, “Human Walking,” Williams and Wilkins, 1981.
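As an illustration of the generic smoothness prior mentioned above, the following is a minimal sketch of one predict/update cycle of a constant velocity Kalman filter for a single scalar coordinate (for example, one joint angle). It is not taken from the cited work; the function name `cv_kalman_step`, the diagonal process noise `q`, and the measurement noise `r` are assumptions chosen for the example.

```python
def cv_kalman_step(x, P, z, dt=1.0, q=1e-3, r=0.1):
    """One predict/update cycle of a constant velocity Kalman filter.

    x: [position, velocity] estimate, P: 2x2 covariance (nested lists),
    z: measured position. q and r are assumed noise variances.
    """
    pos, vel = x
    # Predict with F = [[1, dt], [0, 1]]: position advances, velocity persists.
    pos_pred = pos + dt * vel
    vel_pred = vel
    (a, b), (c, d) = P
    # P' = F P F^T + q * I, written out element by element.
    P00 = a + dt * (b + c) + dt * dt * d + q
    P01 = b + dt * d
    P10 = c + dt * d
    P11 = d + q
    # Update with H = [1, 0]: only the position is observed.
    s = P00 + r                      # innovation variance
    k0, k1 = P00 / s, P10 / s        # Kalman gain
    resid = z - pos_pred             # innovation
    x_new = [pos_pred + k0 * resid, vel_pred + k1 * resid]
    # P = (I - K H) P'
    P_new = [[(1 - k0) * P00, (1 - k0) * P01],
             [P10 - k1 * P00, P11 - k1 * P01]]
    return x_new, P_new
```

Because the same matrices are applied at every time step regardless of the motion being performed, a filter of this kind encodes only inertia, which is why it cannot distinguish walking from running.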
The biomechanical approach has two drawbacks for analysis and synthesis applications. First, the dynamics of the figure are quite complex, involving a large number of masses and applied torques, along with reaction forces which are difficult to measure. In principle, all of these factors must be modeled or estimated in order to produce physically-valid dynamics. Second, in some applications we may only be interested in a small set of motions, such as a vocabulary of gestures. In the biomechanical approach, it may be difficult to reduce the complexity of the model to exploit this restricted focus. Nonetheless, these models have been applied to tracking and synthesis applications.
Wren and Pentland, “Dynamic models of human motion,” Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, pages 22-27, Nara, Japan, 1998, explored visual tracking using a biomechanically-derived dynamic model of the upper body. The unknown joint torques were estimated along with the state of the arms and head in an input estimation framework. A Hidden Markov Model (HMM) was trained to represent plausible sequences of input torques. Due to the simplicity of their experimental domain, there was no need to model reaction forces between the figure and its environment.
This solution suffers from the limitations of the biomechanical approach outlined above. In particular, describing the entire body would require a significant increase in the complexity of the model. Even more problematic is the treatment of the reaction forces, such as those exerted by the floor on the soles of the feet during walking or running.
Biomechanically-derived dynamic models have also been applied to the problem of synthesizing athletic motion, such as bike racing or sprinting, for computer graphics animations. See, for example, Hodgins, Wooten, Brogan and O'Brien, “Animating human athletics,” Computer Graphics (Proc. SIGGRAPH '95), pages 71-78, 1995. In this application, there is, in addition to the usual problems of complex dynamic modeling, the need to design control programs that produce the joint torques that drive the figure model. In this approach, it is difficult to capture more subtle aspects of human motion without some form of automated assistance. The motions that result tend to appear very regular and robotic, lacking both the randomness and fluidity associated with natural human motion.
Four earlier works have addressed the problem of learning complex dynamic models from data within a state space framework. The approaches are all based on building statistical models of motion trajectories whose parameters are learned from a corpus of sample motions.
Brand, “Pattern discovery via entropy minimization,” Technical Report TR98-21, Mitsubishi Electric Research Lab, 1998, available at http://www.merl.com/reports/TR98-21/index.html, proposes an HMM-based framework for dynamics learning and applies it to synthesize realistic facial animations from a training corpus. The main component of this work is the use of an entropic prior to cope with sparse input data.
Brand's approach has two potential disadvantages. First, Brand assumes that the resulting dynamic model is time invariant; each state space neighborhood has a unique distribution over state transitions. Second, the use of entropic priors results in fairly “deterministic” models learned from a moderate corpus of training data. In contrast, the diversity of human motion applications requires complex models learned from a large corpus of data. In this situation, it is unlikely that a time invariant model will suffice, since different state space trajectories can originate from the same starting point, depending upon the class of motion being performed.
Ghahramani and Roweis, “Learning nonlinear stochastic dynamics using the generalized EM algorithm,” NIPS '99, Snowbird, Utah, 1999, use a Kalman smoother in conjunction with the generalized EM algorithm to learn a class of nonlinear dynamic models from input-output data. The requirement for computational tractability restricts the class of non-linearities in the model to a sum of Gaussian and affine kernels. Even though this approach attempts to explicitly model the non-linearity of the state transitions, it still suffers from the same time invariant restriction as the first approach.
Briegel and Tresp, “A monte carlo generalized EM-type algorithm for state and parameter estimation in nonlinear state space models,” Machines that Learn Workshop, Snowbird, Utah, 1998, along with Blake, North and Isard, “Learning multi-class dynamics,” NIPS '98, 1998, have addressed the use of nonparametric probability density models to perform dynamics learning. Blake's approach in particular has the ability to learn multiclass dynamics, meaning that the system can switch between multiple learned models. This may make it possible to learn time-varying models, unlike much of the other prior art.
However, the use of a nonparametric model can be inefficient in domains where linear Gaussian models are a powerful building block. Nonparametric methods are particularly expensive when applied to large state spaces, since they are exponential in the state space dimension. Complexities in the motion of the figure and its appearance suggest that a fairly large state space will be required for good performance.
A final piece of relevant prior art in the learning domain is the work of Yacoob and Davis, “Learned temporal models of image motion,” Computer Vision and Pattern Recognition, pages 446-453, 1998, in learning temporal models of motion in images. Unlike the more common state space models, this approach concentrates directly on the image space by representing any motion as a flow field in some particular flow field space. The basis of that space is learned from a corpus of examples. Hence, different bases capture distinct motion types. One drawback of this approach is that it only captures motion of a fairly fixed (and known) duration. For example, a prototypical walk of only one particular speed can be learned. Another disadvantage is that the models that result are highly viewpoint-specific, since they depend implicitly on the camera position. Furthermore, the approach is primarily suited for analysis rather than synthesis of motion sequences.
A final category of prior art which is relevant to this invention is the use of motion capture to synthesize human motion with realistic dynamics. Motion capture is by far the most successful commercial technique for creating computer graphics animations of people. In this method, the motion of human actors is captured in digital form, using a special suit with either optical or magnetic sensors or targets. This captured motion is edited and used to animate graphical characters.
The motion capture approach has two important limitations. First, the need to wear special clothing in order to track the figure limits the application of this technology to motion which can be staged in a studio setting. This rules out the live, real-time capture of events such as the Olympics, dance performances, or sporting events in which some of the finest examples of human motion actually occur.
The second limitation of current motion capture techniques is that they result in a single prototype of human motion which can only be manipulated in a limited way without destroying its realism. Using this approach, for example, it is not possible to synthesize multiple examples of the same type of motion which differ in a random fashion. The result of motion capture in practice is typically a kind of “wooden,” fairly inexpressive motion that is most suited for animating background characters. That is precisely how this technology is currently used in Hollywood movie productions.
There is a clear need for more powerful tracking techniques that can recover human motion under less restrictive conditions. Similarly, there is a need for more powerful generative models of human motion that are both realistic and capable of generating sample motions with natural amounts of “randomness.”
We describe a novel approach to learning dynamic models from a training corpus of observed state space trajectories. In cases where sufficient training data is available, the learning approach provides flexibility and generality. A wide range of learning algorithms can be cast in the framework of Dynamic Bayesian Networks (DBNs). DBNs generalize two well-known signal modeling tools: Kalman filters for continuous state linear dynamic systems (LDS), and Hidden Markov Models (HMMs) for classification of discrete state sequences. See, for example, Anderson and Moore, “Optimal filtering,” Prentice-Hall, Inc., Englewood Cliffs, N.J., 1979, and Rabiner and Juang, “Fundamentals of Speech Recognition,” Prentice Hall, Englewood Cliffs, N.J., 1993.
We focus on a subclass of DBN models called Switching Linear Dynamic Systems (SLDSs) as described in, for example, Bar-Shalom and Li, “Estimation and tracking: principles, techniques, and software,” YBS, Storrs, Conn., 1998; Shumway and Stoffer, “Dynamic linear models with switching,” Journal of the American Statistical Association, 86(415):763-769, September 1991; Kim, “Dynamic linear models with Markov-switching,” Journal of Econometrics, 60:1-22, 1994; Ghahramani and Hinton, “Switching state-space models,” submitted for publication, 1989; and Pavlovic, Frey and Huang, “Time-series classification using mixed-state dynamic Bayesian networks,” Computer Vision and Pattern Recognition, pages 609-615, June 1999.
Intuitively, these models attempt to describe a complex nonlinear dynamic system with a succession of linear models that are indexed by a switching variable. While other approaches, such as learning weighted combinations of linear models, are possible, the switching approach has an appealing simplicity and is naturally suited to the case where the dynamics are time-varying.
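To make the role of the switching variable concrete, the following minimal sketch samples a trajectory from a scalar SLDS: a Markov chain selects which linear model is active at each step, and the active model advances the continuous state. The function name `simulate_slds` and the parameter names `A`, `Q`, and `Pi` are assumptions chosen for the example, and the scalar state is a simplification of the vector-valued case.

```python
import random

def simulate_slds(A, Q, Pi, x0, s0, T, rng):
    """Sample T steps from a scalar switching linear dynamic system.

    A[s]:  dynamics coefficient of the linear model indexed by switching state s
    Q[s]:  process noise variance of that model
    Pi[i][j]: probability of switching from state i to state j
    rng:   a random.Random instance (passed in so results are reproducible)
    """
    switches, xs = [s0], [x0]
    for _ in range(1, T):
        # The switching variable evolves as a first-order Markov chain.
        s = rng.choices(range(len(A)), weights=Pi[switches[-1]])[0]
        # The selected linear model drives the continuous state.
        x = A[s] * xs[-1] + rng.gauss(0.0, Q[s] ** 0.5)
        switches.append(s)
        xs.append(x)
    return switches, xs

# Example: model 0 holds the state nearly constant, model 1 decays it.
switches, xs = simulate_slds(
    A=[1.0, 0.5], Q=[0.01, 0.01],
    Pi=[[0.9, 0.1], [0.1, 0.9]],
    x0=8.0, s0=0, T=50, rng=random.Random(0),
)
```

Each regime is linear, yet the overall trajectory is nonlinear and time-varying because the active model changes with the switching state.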
Therefore, there is a need for inference and learning methods for fully coupled SLDS models that can estimate a complete set of model parameters for a switching model given a training set of time-series data.
Described herein is a new class of approximate learning methods for switching linear dynamic system (SLDS) models. These models consist of a set of linear dynamic system (LDS) models and a switching variable that indexes the active model. This new class has three advantages over dynamics learning methods known in the prior art:
New approximate inference techniques lead to tractable learning even when the set of LDS models is fully coupled.
The resulting models can represent time-varying dynamics, making them suitable for a wide range of applications.
All of the model parameters are learned from data, including the plant and noise parameters for the LDS models and Markov model parameters for the switching variable.
In addition, this method can be applied to the problem of learning dynamic models for human motion from data. It has three advantages over analytic dynamic models known in the prior art:
Models can be constructed without a laborious manual process of specifying mass and force distributions. Moreover, it may be easier to tailor a model to a specific class of motion, as long as a sufficient number of samples are available.
The same learning approach can be applied to a wide range of human motions from dancing to facial expressions.
When training data is obtained from analysis of video measurements, the spatial and temporal resolution of the video camera determine the level of detail at which dynamical effects can be observed. Learning techniques can only model structure which is present in the training data. Thus, a learning approach is well-suited to building models at the correct level of resolution for video processing and synthesis.
A wide range of learning algorithms can be cast in the framework of Dynamic Bayesian Networks (DBNs). DBNs generalize two well-known signal modeling tools: Kalman filters for continuous state linear dynamic systems (LDS) and Hidden Markov Models (HMMs) for discrete state sequences. Kalman filters are described in Anderson et al., “Optimal Filtering,” Prentice-Hall, Inc., Englewood Cliffs, N.J., 1979. Hidden Markov Models are reviewed in Jelinek, “Statistical methods for speech recognition,” MIT Press, Cambridge, Mass., 1998.
Dynamic models learned from sequences of training data can be used to predict future values in a new sequence given the current values. They can be used to synthesize new data sequences that have the same characteristics as the training data. They can be used to classify sequences into different types, depending upon the conditions that produced the data.
Accordingly, a method for tracking a target in a sequence of measurements includes modeling the target with a switching linear dynamic system (SLDS) having a plurality of dynamic models. Each dynamic model is associated with a switching state such that a model is selected when its associated switching state is true. A set of continuous state estimates is determined for a given measurement, and for each possible switching state. A state transition record is then determined by determining and recording, for a given measurement and for each possible switching state, an optimal previous switching state, based on the measurement sequence, where the optimal previous switching state optimizes a transition probability based on the set of continuous state estimates. A measurement model of the target is fitted to the measurement sequence. The measurement model is the description of the influence of the state on the measurement. It couples what is observed to the estimated target. Finally, a trajectory of the target is estimated from the measurement model fitting, the state transition record and parameters of the SLDS, where the estimated trajectory is a sequence of continuous state estimates of the target which correspond to the measurement sequence.
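The steps above can be sketched, under simplifying assumptions, as an approximate Viterbi recursion over a scalar SLDS. The sketch assumes a scalar continuous state, a direct measurement model (measurement equals state plus noise of variance `r`), and a transition cost that uses the measurement likelihood; it returns filtered rather than smoothed state estimates. The function name `approx_viterbi_slds` and all parameter names are hypothetical and are not taken from the claims.

```python
import math

def approx_viterbi_slds(ys, a, q, r, Pi, pi0, x0=0.0, p0=1.0):
    """Approximate Viterbi decoding for a scalar SLDS.

    ys: measurements; a[s], q[s]: dynamics coefficient and process noise
    variance of model s; r: measurement noise variance;
    Pi[j][i]: probability of switching from j to i; pi0: initial distribution.
    Returns (switching state sequence, continuous state estimates).
    """
    S = len(a)

    def nll(innov, var):
        # Negative log-likelihood of a Gaussian innovation.
        return 0.5 * (innov * innov / var + math.log(2 * math.pi * var))

    # Initialization: one continuous state estimate per switching state.
    k = p0 / (p0 + r)
    cost = [-math.log(pi0[i]) + nll(ys[0] - x0, p0 + r) for i in range(S)]
    xf = [x0 + k * (ys[0] - x0)] * S
    pf = [(1 - k) * p0] * S
    record, x_hist = [], [xf]          # state transition record, estimates

    for y in ys[1:]:
        new_cost, new_x, new_p, psi = [], [], [], []
        for i in range(S):             # each possible current switching state
            best = None
            for j in range(S):         # each candidate previous switching state
                x_pred = a[i] * xf[j]                  # predict with model i
                p_pred = a[i] * a[i] * pf[j] + q[i]
                innov, var = y - x_pred, p_pred + r
                c = cost[j] - math.log(Pi[j][i]) + nll(innov, var)
                if best is None or c < best[0]:
                    g = p_pred / var                   # Kalman gain
                    best = (c, j, x_pred + g * innov, (1 - g) * p_pred)
            new_cost.append(best[0])
            psi.append(best[1])        # optimal previous switching state
            new_x.append(best[2])
            new_p.append(best[3])
        cost, xf, pf = new_cost, new_x, new_p
        record.append(psi)
        x_hist.append(xf)

    # Backtrack through the transition record from the best final state.
    s = min(range(S), key=lambda i: cost[i])
    seq = [s]
    for psi in reversed(record):
        s = psi[s]
        seq.append(s)
    seq.reverse()
    traj = [x_hist[t][seq[t]] for t in range(len(ys))]
    return seq, traj
```

Keeping only the best previous switching state for each current state holds the number of continuous state estimates at one per switching state per measurement, rather than letting it grow exponentially with the length of the sequence.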
In at least one embodiment of the present invention, the set of continuous state estimates is obtained through Viterbi prediction.
The optimal previous switching state can be an optimal prior switching state, and in one embodiment, the transition probability is dependent only upon Markov process probabilities.
Alternatively, the optimal previous switching state can be an optimal posterior switching state.
In one embodiment, the set of continuous state estimates is obtained by combining Viterbi predictions with samples drawn at random according to a continuous state sampling density. Furthermore, the set of continuous state estimates can be obtained by combining just a subset of Viterbi predictions with samples drawn at random according to a continuous state sampling density.
In one embodiment, the continuous state sampling density is given by a Viterbi mixture density.
In one embodiment, the set of continuous state estimates is updated based on the given measurement, and the optimal previous switching state optimizes a posterior transition probability over the updated set of state estimates.
The samples from a continuous state sampling density can be updated, for example, by a gradient descent procedure.
Alternatively, the samples can be updated by linearizing around sample positions and applying an Iterated Extended Kalman Filter.
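As a rough illustration of the linearization-based update, the following sketch applies the standard Iterated Extended Kalman Filter measurement update to a single scalar sample. The function name `iekf_update`, the measurement function `h`, and its derivative `dh` are assumptions supplied by the caller for the example; an image-based measurement model would replace them in a tracking system.

```python
def iekf_update(x0, p0, z, r, h, dh, iters=10):
    """Iterated Extended Kalman Filter measurement update (scalar case).

    x0, p0: prior mean and variance of the sample
    z, r:   measurement and its noise variance
    h, dh:  nonlinear measurement function and its derivative
    """
    x = x0
    for _ in range(iters):
        H = dh(x)                          # relinearize around current iterate
        K = p0 * H / (H * p0 * H + r)      # gain computed from the prior variance
        # Residual corrected for the shift between x0 and the linearization point.
        x = x0 + K * (z - h(x) - H * (x0 - x))
    H = dh(x)
    K = p0 * H / (H * p0 * H + r)
    p = (1.0 - K * H) * p0                 # posterior variance at the final iterate
    return x, p

# Example with an assumed quadratic measurement model z = x**2 + noise.
x_post, p_post = iekf_update(1.0, 1.0, 4.0, 0.01, lambda x: x * x, lambda x: 2.0 * x)
```

Each iteration is a Gauss-Newton step on the posterior objective, so the repeated relinearization plays a role similar to the gradient descent procedure mentioned above.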
In one embodiment, the measurement sequence comprises an image sequence. The transition probability is responsive to the comparison between an image feature model and the given image measurement.
The image feature model can be, for example, a template model or a contour model.
In one embodiment, the SLDS model can model, for example, the motion of a human figure, where the SLDS parameters may have been learned, for example, from training data containing figure motion.
Alternatively, the SLDS model can model the motions of a human face, where the SLDS parameters may have been learned, for example, from training data containing facial motion.
In yet another alternative, the SLDS model models the evolution of acoustic features in a speech waveform. The SLDS model can, for example, describe the dynamics of formants in a frequency-domain representation of speech.
In yet another embodiment, the SLDS model describes the evolution of financial data.