Fitting dynamic models to sequences of data is a key technology in many applications. In econometrics and forecasting applications, data sequences may represent product inventory in a distribution channel over time, or the price of a stock or other financial instrument over time. In human motion applications, data sequences could represent the pose of the human body over time. For example, a motion such as a ballet plié can be described as a sequence of smooth changes in the angles of the arms and legs and pose of the torso.
Dynamic models can also be applied to modeling of spatial data sequences, such as genes or constrained kinematic chains.
The most basic example of learning the parameters of a dynamic model from data is the system identification problem for a single linear dynamic model. System identification is described in Ljung et al., "Theory and Practice of Recursive Identification," MIT Press, 1983. Unfortunately, a single linear model is incapable of representing a broad range of interesting data sequences.
A switching linear dynamic system (SLDS) model consists of a set of linear dynamic models and a switching variable that determines which model is in effect at any given point in time. A "fully connected" SLDS further assumes that there are temporal dependencies between the values of the switching variable at different times, as well as the states of different linear dynamic models. SLDS models are attractive because they can describe complex dynamics using simple linear models as building blocks. Given an SLDS model, the inference problem is to estimate the sequence of model states that best explains a measurement data sequence. Unfortunately, exact inference in SLDS models is computationally intractable, due to the large number of possible combinations of linear models over time.
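As an illustration only (not drawn from any reference cited here), the generative structure of an SLDS can be sketched for the scalar case: the switching state evolves as a Markov chain, and whichever linear model it selects drives the continuous state. The function and parameter names below are hypothetical.

```python
import random

def simulate_slds(a, q, c, r_noise, trans, n, seed=0):
    """Sample from a toy scalar SLDS.

    x_t = a[s_t] * x_{t-1} + v_t   (process noise std dev q)
    y_t = c * x_t + w_t            (measurement noise std dev r_noise)
    trans[i][j] is the probability of switching from model i to model j.
    """
    rng = random.Random(seed)
    x, s = 0.0, 0
    states, measurements = [], []
    for _ in range(n):
        # the switching variable follows a Markov chain
        s = rng.choices(range(len(a)), weights=trans[s])[0]
        # the selected linear model is in effect at this time step
        x = a[s] * x + rng.gauss(0.0, q)
        y = c * x + rng.gauss(0.0, r_noise)
        states.append(s)
        measurements.append(y)
    return states, measurements
```

The inference problem discussed above is the inverse of this sketch: given only the measurements, recover the switching-state sequence.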
Consider the case where there are r linear models in an SLDS model, and assume that the goal is to infer which model best explains each element in a measurement data sequence of length n. For the first element there are r possibilities. For two sequential elements there are r^2 possibilities. There are r^n total possibilities for the entire data sequence of n elements. It is infeasible to examine each of these possibilities to determine the exact, optimal solution.
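The combinatorial growth is easy to check numerically; this small helper (hypothetical, for illustration only) counts the candidate switching sequences:

```python
def num_assignments(r: int, n: int) -> int:
    """Count switching-state sequences of length n over r linear models:
    r choices per time step over n steps gives r**n combinations."""
    return r ** n

# Even modest problems are infeasible to enumerate exactly:
print(num_assignments(2, 10))   # 1024
print(num_assignments(4, 100))  # about 1.6e60 sequences
```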
One prior method for inference using fully connected SLDS models is described in Bar-Shalom et al., "Estimation and Tracking: Principles, Techniques, and Software," Artech House, Inc., 1993, and in Kim, "Dynamic Linear Models With Markov-Switching," Journal of Econometrics, volume 60, pages 1-22, 1994. In this method, approximate inference is achieved by truncating or collapsing the number of discrete components in the evolving model. The models are used to detect different motion regimes while tracking a maneuvering target.
Additional approximate smoothing of the switching states is described in Kim. Neither of these two references tackles smoothing of the linear dynamic system states.
In most applications, it is not practical to build an SLDS model by hand and it is therefore desirable to learn the parameters of these models from training data. A method for SLDS learning is described in Shumway et al., "Dynamic Linear Models with Switching," Journal of the American Statistical Association, 86(415), pages 763-769, September 1991. It assumes that the SLDS is not fully connected, i.e., the switching variable has no temporal dependencies. It also assumes that a prior distribution for the switching variable is known for each time instant. These assumptions do not hold for a broad range of practical applications.
Krishnamurthy et al., "Finite-dimensional Filters for Passive Tracking of Markov Jump Linear Systems," Automatica, 34(6), pages 765-770, 1998, assumes that the switching variable follows a Markov process model and that observations of the switching variable are available for each time instant. However, for a broad range of practical applications, these observations are not available.
In another method, the switching variable determines which linear model is coupled to the measurement at each time instant. See Ghahramani et al., "Variational Learning for Switching State-Space Models," which will appear in the journal Neural Computation. This method can produce decoupled linear models which reach steady-state before the data series is adequately modeled. It stands in contrast to methods with fully coupled SLDS in which all models are coupled through a single state space.
Pavlovic, Frey and Huang, "Time-series Classification Using Mixed-State Dynamic Bayesian Networks," Proc. of Computer Vision and Pattern Recognition, pages 609-615, June 1999, consider a single linear dynamical model whose input is modeled as a discrete Markov process. This model explains all measurement variability as a consequence of the changes in input, which may not be true in general.
Blake et al., "Learning Multi-Class Dynamics," Advances in Neural Information Processing Systems (NIPS '98), pages 389-395, 1998, proposes particle filters as an alternative to using linear models as the building blocks in a switching framework. The use of a nonparametric, particle-based model can be inefficient in domains where linear models are a powerful building block. Nonparametric methods are particularly expensive when applied to large state spaces, since they are exponential in the state space dimension.
In Brand, "Pattern Discovery via Entropy Minimization," Technical Report TR98-21, Mitsubishi Electric Research Lab, 1998, a Hidden Markov Model with an entropic prior is proposed for dynamics learning from sparse input data. The method is applied to the synthesis of facial animation. The dynamic models produced by this method are time invariant. Each state space neighborhood has a unique distribution over state transitions. In addition, the use of entropic priors results in fairly deterministic models learned from a moderate corpus of training data. In many applications time-invariant models are unlikely to succeed, since different state space trajectories can originate from the same starting point depending upon the class of motion being performed.
In Ghahramani et al., "Learning Nonlinear Stochastic Dynamics Using the Generalized EM Algorithm," Advances in Neural Information Processing Systems (NIPS '99), pages 599-605, 1999, a Kalman smoother is used in conjunction with the generalized EM algorithm to learn a class of nonlinear dynamic models from input-output data. This approach also results in time-invariant models.
Another method addresses the problem of learning temporal models of motion in images. Unlike the more common state space models, this approach concentrates directly on the image space by representing any motion as a flow field in some particular flow field space. The basis of that space is learned from a corpus of examples. Hence, different bases capture distinct motion types. See Yacoob et al., "Learned Temporal Models of Image Motion," Proceedings of Computer Vision and Pattern Recognition, pages 446-453, 1998.
One drawback of this method is that it only captures motion of a fairly fixed (and known) duration. For example, a prototypical walk of only one particular speed can be learned. Another disadvantage is that the models that result are highly viewpoint-specific, since they depend implicitly on the camera position. Furthermore, the approach is primarily suited for analysis rather than synthesis of motion sequences.
Technologies for analyzing the motion of the human figure play a key role in a broad range of applications, including computer graphics, user interfaces, surveillance, and video editing. A motion of the figure can be represented as a trajectory in a state space which is defined by the kinematic degrees of freedom of the figure. Each point in state space represents a single configuration or pose of the figure. A motion such as a plié in ballet is described by a trajectory along which the joint angles of the legs and arms change continuously.
A key issue in human motion analysis and synthesis is modeling the dynamics of the figure. While the kinematics of the figure define the state space, the dynamics define which state trajectories are possible (or probable) in that state space. Prior methods for representing human dynamics have been based on analytic dynamic models. Analytic models are specified by a human designer. They are typically second order differential equations relating joint torque, mass, and acceleration.
The field of biomechanics is a source of complex and realistic analytic models of human dynamics. From the biomechanics point of view, the dynamics of the figure are the result of its mass distribution, joint torques produced by the motor control system, and reaction forces resulting from contact with the environment (e.g., the floor). Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in complex, specialized models of human motion. For example, detailed walking models are described in Inman, Ralston and Todd, "Human Walking," Williams and Wilkins, 1981.
The biomechanical approach has two drawbacks. First, the dynamics of the figure are quite complex, involving a large number of masses and applied torques, along with reaction forces which are difficult to measure. In principle, all of these factors must be modeled or estimated in order to produce physically-valid dynamics. Second, in some applications we may only be interested in a small set of motions, such as a vocabulary of gestures. In the biomechanical approach, it may be difficult to reduce the complexity of the model to exploit this restricted focus. Nonetheless, biomechanical models have been applied to human motion analysis.
A prior method for visual tracking uses a biomechanically-derived dynamic model of the upper body. See Wren and Pentland, "Dynamic Models of Human Motion," Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, pages 22-27, Nara, Japan, 1998. The unknown joint torques are estimated along with the state of the arms and head in an input estimation framework. A Hidden Markov Model is trained to represent plausible sequences of input torques. This prior art does not address the problem of modeling reaction forces between the figure and its environment. An example is the reaction force exerted by the floor on the soles of the feet during walking or running.
In view of the above discussion, therefore, there is a need for inference and learning methods for fully coupled SLDS models that can estimate a complete set of model parameters for a switching model given a training set of time-series data.
Described herein is a new class of approximate learning methods for switching linear dynamic system (SLDS) models. These models consist of a set of linear dynamic system (LDS) models and a switching variable that indexes the active model. This new class has at least three advantages over dynamics learning methods known in the prior art:
New approximate inference techniques lead to tractable learning even when the set of LDS models is fully coupled.
The resulting models can represent time-varying dynamics, making them suitable for a wide range of applications.
All of the model parameters are learned from data, including the plant and noise parameters for the LDS models and Markov model parameters for the switching variable.
In addition, this method can be applied to the problem of learning dynamic models for human motion from data. It has at least three advantages over existing analytic dynamic models.
First, models can be constructed without a laborious manual process of specifying mass and force distributions. Moreover, it may be easier to tailor a model to a specific class of motion, as long as a sufficient number of samples are available.
Second, the same learning approach can be applied to a wide range of human motions from dancing to facial expressions.
Third, when training data is obtained from analysis of video measurements, the spatial and temporal resolution of the video camera determine the level of detail at which dynamical effects can be observed. Learning techniques can only model structure which is present in the training data. Thus, a learning approach is well-suited to building models at the correct level of resolution for video processing and synthesis.
A wide range of learning algorithms can be cast in the framework of Dynamic Bayesian Networks (DBNs). DBNs generalize two well-known signal modeling tools: Kalman filters for continuous state linear dynamic systems (LDS) and Hidden Markov Models (HMMs) for discrete state sequences. Kalman filters are described in Anderson et al., "Optimal Filtering," Prentice-Hall, Inc., Englewood Cliffs, N.J., 1979. Hidden Markov Models are reviewed in Jelinek, "Statistical Methods for Speech Recognition," MIT Press, Cambridge, Mass., 1998.
Dynamic models learned from sequences of training data can be used to predict future values in a new sequence given the current values. They can be used to synthesize new data sequences that have the same characteristics as the training data. They can be used to classify sequences into different types, depending upon the conditions that produced the data.
We focus on a subclass of DBN models called Switching Linear Dynamic Systems. Intuitively, these models attempt to describe a complex nonlinear dynamic system with a succession of linear models that are indexed by a switching variable. The switching approach has an appealing simplicity and is naturally suited to the case where the dynamics are time-varying.
We present a method for approximate inference in fully coupled SLDS models. Exponentially hard exact inference is replaced with approximate inference of reduced complexity.
A first preferred embodiment uses Viterbi inference jointly in the switching and linear dynamic system states. A second preferred embodiment uses variational inference jointly in the switching and linear dynamic system states. A third preferred embodiment uses general pseudo Bayesian inference jointly in the switching and linear dynamic system states.
Parameters of a fully connected SLDS model are learned from data. Model parameters are estimated using a generalized expectation-maximization (EM) algorithm. The exact expectation/inference (E) step is replaced with one of the three approximate inference embodiments.
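The overall structure of this learning procedure can be sketched as follows. This is a schematic outline with hypothetical function names, where e_step stands in for whichever approximate inference embodiment is used:

```python
def generalized_em(data, params, e_step, m_step, iters=20):
    """Generalized EM skeleton: alternate approximate inference with
    closed-form parameter re-estimation."""
    for _ in range(iters):
        stats = e_step(data, params)   # approximate E step (Viterbi,
                                       # variational, or GPB inference)
        params = m_step(data, stats)   # M step: re-estimate model parameters
    return params
```

As a sanity check, plugging in a trivial E step (residuals) and M step (sample mean) recovers the mean of the data, illustrating the fixed-point structure of the iteration.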
The learning method can be used to model the dynamics of human motion. The joint angles of the limbs and pose of the torso are represented as state variables in a switching linear dynamic model. The switching variable identifies a distinct motion regime within a particular type of human motion.
Motion regimes learned from figure motion data correspond to classes of human activity such as running, walking, etc. Inference produces a single sequence of switching modes which best describes a motion trajectory in the figure state space. This sequence segments the figure motion trajectory into motion regimes learned from data.
Accordingly, a method for determining, from a set of possible switching states and responsive to a sequence of measurements, a corresponding sequence of switching states for a system having a plurality of dynamic models, associates each model with a switching state such that a model is selected when its associated switching state is true. A state transition record is determined by determining and recording, for each possible switching state, an optimal prior switching state, based on the measurement sequence, where the optimal state optimizes a transition probability. For a final measurement, a most-probable final switching state is determined. Finally, the sequence of switching states is determined by backtracking, from the most-probable final switching state, through the state transition record.
Preferably, the switching states comprise a Markov chain.
In at least one embodiment, the transition probability is based on the likelihood of the measurement sequence, which may be determined with a Kalman filter, and the probability of transition between the switching states.
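For a single scalar linear model, the measurement likelihood referred to above can be computed with a standard Kalman filter; the sketch below is a minimal scalar version with hypothetical parameter names (a: state transition, c: measurement gain, q and r_noise: process and measurement noise variances).

```python
import math

def kalman_loglik(ys, a, c, q, r_noise, x0=0.0, p0=1.0):
    """Running log-likelihood of measurements ys under the scalar model
    x_t = a*x_{t-1} + v_t (var q), y_t = c*x_t + w_t (var r_noise)."""
    x, p, ll = x0, p0, 0.0
    for y in ys:
        x, p = a * x, a * a * p + q                 # predict
        s = c * c * p + r_noise                     # innovation variance
        ll += -0.5 * (math.log(2 * math.pi * s) + (y - c * x) ** 2 / s)
        k = p * c / s                               # Kalman gain
        x, p = x + k * (y - c * x), (1 - k * c) * p  # update
    return ll
```

In a switching framework, one such likelihood per linear model, combined with the switching transition probabilities, supplies the transition scores used by the inference step.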
In at least one embodiment, the sequence is a time sequence. Each measurement corresponds to a distinct time.
In addition, at least one embodiment includes providing training data and learning parameters of the dynamic models in response to the determined sequence of switching states which result from the training data.
Future values in a new sequence may be predicted based on dynamic models learned from training sequences and the current values.
The sequence can comprise, for example, economic data, image or video data, audio data or spatial data. Audio data can include, for example, human speech, and more specifically, phonemes.
While the above embodiments are based on Viterbi techniques, other embodiments of the present invention are based on variational techniques. For example, a method for determining, from a set of possible switching states and responsive to a sequence of measurements, a corresponding sequence of switching states for a system having a plurality of dynamic models, includes defining a switching linear dynamic system (SLDS) having a plurality of dynamic models. Each dynamic model is associated with a switching state such that a dynamic model is selected when its associated switching state is true. The switching state at a particular instance is determined by a switching model, such as a hidden Markov model (HMM). The dynamic models are decoupled from the switching model, and parameters of the decoupled dynamic model are determined responsive to a switching state probability estimate. A state of a decoupled dynamic model corresponding to a measurement at the particular instance is estimated, responsive to one or more training sequences. Parameters of the decoupled switching model, which can include both input and output parameters, are then determined, responsive to the dynamic state estimate. A probability is estimated for each possible switching state of the decoupled switching model. The sequence of switching states is determined based on the estimated switching state probabilities.
Parameters of the dynamic models can be learned responsive to the determined sequence of switching states.
The most basic example of dynamics learning is system identification for a single linear model. This is a well-understood problem. See, for example, Ljung and Söderström, "Theory and Practice of Recursive Identification," MIT Press, 1983. SLDS models and their equivalents have been studied in statistics, time-series modeling, and target tracking since the early 1970's. However, the complete learning framework we describe has never appeared in the literature.
Bar-Shalom and Li, "Estimation and Tracking: Principles, Techniques, and Software," YBS, Storrs, Conn., 1998, and Kim, "Dynamic Linear Models with Markov-Switching," Journal of Econometrics, 60:1-22, 1994, have developed a number of approximate pseudo-Bayesian inference techniques based on mixture component truncation or collapsing in SLDSs. They did not address the issue of learning system parameters. Shumway and Stoffer, "Dynamic Linear Models with Switching," Journal of the American Statistical Association, 86(415):763-769, September 1991, presented a systematic view of inference and learning in SLDS while assuming known prior switching state distributions at each time instant, Pr(s_t = i) = π_t(i), and no temporal dependency between switching states. Krishnamurthy and Evans, "Finite-dimensional Filters for Passive Tracking of Markov Jump Linear Systems," Automatica, 34(6):765-770, 1998, imposed Markov dynamics on the switching model. However, they assumed that noisy measurements of the switching states are available.
Ghahramani and Hinton, "Variational Learning for Switching State-Space Models," submitted for publication in Neural Computation, 1998, to be published April 2000, and incorporated herein by reference in its entirety, introduced a DBN framework for learning and approximate inference in one class of SLDS models. Their underlying model differs from ours in assuming the presence of S independent, white noise-driven LDSs whose measurements are selected by the Markov switching process. Their assumption may lead to, among other things, measurements of processes that are, after some sufficient time, all in steady states and not changing. On the other hand, our model avoids this pitfall and may yield, if necessary, a dynamic set of measurements.
A switching model framework for particle filters is described in Isard and Blake, "A Mixed-State CONDENSATION Tracker with Automatic Model-Switching," Proceedings of International Conference on Computer Vision, pages 107-112, Bombay, India, 1998, and applied to dynamics learning in Blake, North and Isard, "Learning Multi-Class Dynamics," NIPS '98, 1998.
Manifold learning, as described by Bregler and Omohundro, "Nonlinear Manifold Learning for Visual Speech Recognition," Proceedings of International Conference on Computer Vision, pages 494-499, Cambridge, Mass., June 1995, is another approach to constraining the set of allowable trajectories within a high dimensional state space.
The learning framework of the present invention has at least three advantages over the alternative of manually deriving physically realistic dynamic models:
The first advantage is convenience. Models can be constructed without a laborious manual process of specifying mass and force distributions. Moreover, it may be easier to tailor a model to a specific class of motion, as long as a sufficient number of samples are available.
A second advantage is resolution. The spatial and temporal resolution of a video source determine the level of detail at which dynamic effects can be observed. Since a learning technique can only model structure which is present in the visual signal, it is naturally tuned to building models at the correct level of resolution.
A third advantage is generality. The same learning approach can be applied to a wide range of human motions from dancing to facial expressions.
A primary advantage of our learning framework over previous learning techniques is the ability to learn time-varying models. Moreover, our framework is comprehensive in that all of the model parameters can be learned from data.