The classification of data sequences is a key technology in a broad range of applications. In econometrics and forecasting applications, it can be crucial to classify data sequences that represent, for example, the inventory of a product in a distribution channel over time, or the price of a stock or other financial instrument over time. Different classes may, for instance, signify various important stock market states. In human motion applications, data sequences which represent the pose of the human body or one of its parts over time can be classified into categories. For example, a running motion can be distinguished from walking. Movements of hands can be automatically interpreted as individual signs in American Sign Language and then used to interface to a computer.
In classification applications, the goal is to assign class labels to observed data sequences or trajectories. Dynamic models can be useful in classification because they provide an efficient coding of the set of all possible trajectories. For example, suppose that two distinct gestures can be described with separate linear dynamical models. For each model, it is straightforward to compute an error signal, called the innovation, which measures the extent to which the model predicts the observations. This innovation can be used directly for classification. The parameters of the linear dynamic models therefore provide a very compact representation of the class of trajectories that make up a particular gesture.
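By way of illustration only, innovation-based classification of this kind can be sketched as follows. The sketch assumes a scalar state observed in additive Gaussian noise and tracked by a scalar Kalman filter. The function names, the noise variances q and r, the prior values x0 and p0, and the gesture labels are hypothetical and are not taken from any cited method. Each candidate model is scored by the accumulated negative log-likelihood of its innovations, and the sequence is assigned to the model with the lowest score.

```python
import math


def innovation_score(ys, a, q=0.01, r=0.01, x0=0.0, p0=1.0):
    """Accumulated negative log-likelihood of the innovations of one model.

    Assumed scalar model (illustrative only):
        x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q)
        y_t = x_t + v_t,          v_t ~ N(0, r)
    """
    x, p = x0, p0
    score = 0.0
    for y in ys:
        # Predict the state and its variance one step ahead.
        x_pred = a * x
        p_pred = a * a * p + q
        # Innovation: the part of the measurement the model failed to predict.
        e = y - x_pred
        s = p_pred + r  # innovation variance
        score += 0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        # Kalman update of the state estimate.
        k = p_pred / s
        x = x_pred + k * e
        p = (1.0 - k) * p_pred
    return score


def classify(ys, models):
    """Return the label of the model whose innovations are least surprising."""
    return min(models, key=lambda label: innovation_score(ys, models[label]))


# Hypothetical usage: two gestures, each described by one dynamics coefficient.
models = {"gesture_A": 0.9, "gesture_B": -0.5}
decaying = [0.9 ** t for t in range(1, 20)]
print(classify(decaying, models))
```

In this sketch the only parameter stored per gesture is a single coefficient, which reflects the compactness of the representation noted above.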
Sets of dynamic models can be used to model qualitatively different regimes of a trajectory associated with one temporal event. For instance, a hand gesture can be segmented into three motion regimes or phases: preparation, stroke and retraction. Each regime can then be associated with a different linear model. The sequence of regimes can be governed by a model switching process.
Switching linear dynamic system (SLDS) models consist of a set of linear dynamic models and a switching variable that determines which model is in effect at any given point in time. In addition, fully connected SLDS models assume that there are temporal dependencies between the switching variables as well as the states of different linear dynamic models. SLDS models are attractive because they can describe complex dynamics using simple linear models as building blocks. Given an SLDS model, the inference problem is to estimate the sequence of model states that best explains an input data sequence. Unfortunately, exact inference in SLDS models is computationally intractable, due to the large number of possible combinations of linear models over time.
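For purposes of illustration, the generative structure of a fully coupled SLDS can be sketched as below. This is a minimal sketch under stated assumptions: a single scalar state shared by all models, a first-order Markov switching variable with a uniform initial distribution, and Gaussian noise. All names and parameters (sample_slds, a, q, r, trans) are hypothetical. The helper num_switching_sequences merely counts the candidate switching sequences, which grow exponentially with sequence length and make exact inference impractical.

```python
import random


def sample_slds(a, q, r, trans, num_steps, seed=0):
    """Draw one sequence from a scalar SLDS (illustrative assumptions only).

    a[i], q[i] : dynamics coefficient and process noise variance of model i
    r          : measurement noise variance
    trans[j][i]: probability of switching from state j to state i
    """
    rng = random.Random(seed)
    num_models = len(a)
    s = rng.randrange(num_models)  # assumed uniform initial switching state
    x = 0.0                        # continuous state shared by all models
    switches, states, measurements = [], [], []
    for _ in range(num_steps):
        # The active linear model advances the shared continuous state.
        x = a[s] * x + rng.gauss(0.0, q[s] ** 0.5)
        # The measurement is the state corrupted by noise.
        y = x + rng.gauss(0.0, r ** 0.5)
        switches.append(s)
        states.append(x)
        measurements.append(y)
        # The switching variable evolves as a Markov chain.
        s = rng.choices(range(num_models), weights=trans[s])[0]
    return switches, states, measurements


def num_switching_sequences(num_models, num_steps):
    """Number of distinct switching sequences exact inference must consider."""
    return num_models ** num_steps


switches, states, measurements = sample_slds(
    a=[0.9, -0.9], q=[0.01, 0.01], r=0.01,
    trans=[[0.9, 0.1], [0.1, 0.9]], num_steps=50)
print(num_switching_sequences(2, 50))
```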
One prior method for inference using fully connected SLDS models is described in “Estimation and Tracking: Principles, Techniques, and Software” by Bar-Shalom et al., Artech House, Inc. 1993. In this method, approximate inference is achieved by truncating or collapsing the number of discrete components in the evolving model. The models were used to detect different motion regimes while tracking a maneuvering target.
In another prior method the switching variable determines which linear model is coupled to the measurement at each time instant. See Ghahramani et al., “Variational Learning for Switching State-Space Models,” which will appear in the journal Neural Computation. This method can produce decoupled linear models which reach steady-state before the data series is adequately modeled. It stands in contrast to the prior methods with fully coupled SLDS in which all models are coupled through a single state space. The method was used to segment regimes of no breathing and gasping breathing in data collected from patients with sleep apnea.
A different prior method in “Time-series Classification Using Mixed-State Dynamic Bayesian Networks,” by Pavlovic et al., Proc. of Computer Vision and Pattern Recognition, pages 609-615, June, 1999, considers a single linear dynamical model whose input is modeled as a discrete Markov process. This model explains all measurement variability as a consequence of the changes in input, which may not be true in general. The model was applied to classification of computer mouse-drawn symbols.
Another prior method proposes particle filters as an alternative to using linear models as the building blocks in a switching framework. See Blake et al., “Learning Multi-Class Dynamics,” Advances in Neural Information Processing Systems (NIPS '98), pages 389-395, 1998. The use of a nonparametric, particle-based model can be inefficient in domains where linear models are a powerful building block. Nonparametric methods are particularly expensive when applied to large state spaces, since they are exponential in the state space dimension.
In another prior method, a Hidden Markov Model with an entropic prior is proposed for dynamics learning from sparse input data. See Brand, “Pattern discovery via entropy minimization,” Technical Report TR98-21, Mitsubishi Electric Research Lab, 1998. The method is applied to the synthesis of facial animation and, to a certain extent, the segmentation of facial expressions from voice data, e.g., Brand, “Voice puppetry,” Proceeding of SIGGRAPH99, 1999. The dynamic models produced by this method are time invariant. Each state space neighborhood has a unique distribution over state transitions. In addition, the use of entropic priors results in fairly deterministic models learned from a moderate corpus of training data. In many applications, time-invariant models are unlikely to succeed, since different state space trajectories can originate from the same starting point, depending upon the class of motion being performed.
Technologies for analyzing the motion of the human figure play a key role in a broad range of applications, including computer graphics, user-interfaces, surveillance, and video editing. A motion of the figure can be represented as a trajectory in a state space which is defined by the kinematic degrees of freedom of the figure. Each point in state space represents a single configuration or pose of the figure. A motion such as a plié in ballet is described by a trajectory along which the joint angles of the legs and arms change continuously.
A key issue in human motion analysis and synthesis is modeling the dynamics of the figure. While the kinematics of the figure define the state space, the dynamics define which state trajectories are possible (or probable) in that state space. Prior methods for representing human dynamics have been based on analytic dynamic models. Analytic models are specified by a human designer. They are typically second order differential equations relating joint torque, mass, and acceleration.
The field of biomechanics is a source of complex and realistic analytic models of human dynamics. From the biomechanics point of view, the dynamics of the figure are the result of its mass distribution, joint torques produced by the motor control system, and reaction forces resulting from contact with the environment, e.g., the floor. Research efforts in biomechanics, rehabilitation, and sports medicine have resulted in complex, specialized models of human motion. For example, detailed walking models are described in Inman et al., “Human Walking,” Williams and Wilkins, 1981.
The biomechanical approach has two drawbacks. First, the dynamics of the figure are quite complex, involving a large number of masses and applied torques, along with reaction forces which are difficult to measure. In principle, all of these factors must be modeled or estimated in order to produce physically valid dynamics. Second, in some applications, we may only be interested in a small set of motions, such as a vocabulary of gestures. In the biomechanical approach it may be difficult to reduce the complexity of the model to exploit this restricted focus. Nonetheless, biomechanical models have been applied to human motion analysis.
A prior method for visual tracking uses a biomechanically-derived dynamic model of the upper body. See Wren et al., “Dynamic models of human motion,” Proceeding of the Third International Conference on Automatic Face and Gesture Recognition, pages 22-27, Nara, Japan, 1998. The unknown joint torques are estimated along with the state of the arms and head in an input estimation framework. A Hidden Markov Model is trained to represent plausible sequences of input torques. This prior art does not address the problem of modeling reaction forces between the figure and its environment. An example is the reaction force exerted by the floor on the soles of the feet during walking or running.
Therefore, there is a need for classification methods for fully coupled SLDS models that can segment data sequences into regimes.
We present a method for classification of data sequences modeled as fully coupled switching linear dynamic models (SLDSs). The method uses approximate Viterbi inference to find a most likely sequence of switching states.
The classification method can be used to classify motion regimes learned from figure motion data corresponding to classes of human activity such as running, walking, etc. Inference produces a single sequence of switching modes which best describes a motion trajectory in the figure state space. This sequence segments the figure motion trajectory into motion regimes learned from data.
Described herein is a new class of approximate learning methods for switching linear dynamic system (SLDS) models. These models consist of a set of linear dynamic system (LDS) models and a switching variable that indexes the active model. This new class has three advantages over dynamics learning methods known in the prior art:
New approximate inference techniques lead to tractable learning even when the set of LDS models is fully coupled.
The resulting models can represent time-varying dynamics, making them suitable for a wide range of applications.
All of the model parameters are learned from data, including the plant and noise parameters for the LDS models and Markov model parameters for the switching variable.
In addition, this method can be applied to the problem of learning dynamic models for human motion from data. It has three advantages over analytic dynamic models known in the prior art:
Models can be constructed without a laborious manual process of specifying mass and force distributions. Moreover, it may be easier to tailor a model to a specific class of motion, as long as a sufficient number of samples are available.
The same learning approach can be applied to a wide range of human motions from dancing to facial expressions.
When training data is obtained from analysis of video measurements, the spatial and temporal resolution of the video camera determine the level of detail at which dynamical effects can be observed. Learning techniques can only model structure which is present in the training data.
Thus, a learning approach is well-suited to building models at the correct level of resolution for video processing and synthesis.
A wide range of learning algorithms can be cast in the framework of Dynamic Bayesian Networks (DBNs). DBNs generalize two well-known signal modeling tools: Kalman filters for continuous state linear dynamic systems (LDS) and Hidden Markov Models (HMMs) for discrete state sequences. Kalman filters are described in Anderson et al., “Optimal Filtering,” Prentice-Hall, Inc., Englewood Cliffs, N.J., 1979. Hidden Markov Models are reviewed in Jelinek, “Statistical methods for speech recognition,” MIT Press, Cambridge, Mass., 1998.
Dynamic models learned from sequences of training data can be used to predict future values in a new sequence given the current values. They can be used to synthesize new data sequences that have the same characteristics as the training data. They can be used to classify sequences into different types, depending upon the conditions that produced the data.
We focus on a subclass of DBN models called Switching Linear Dynamic Systems (SLDSs). Intuitively, these models attempt to describe a complex nonlinear dynamic system with a succession of linear models that are indexed by a switching variable. The switching approach has an appealing simplicity and is naturally suited to the case where the dynamics are time-varying.
We present a method for approximate inference in fully coupled switching linear dynamic models (SLDSs). Exponentially hard exact inference is replaced with approximate inference of reduced complexity.
The first preferred embodiment uses Viterbi inference jointly in the switching and linear dynamic system states.
The second preferred embodiment uses variational inference jointly in the switching and linear dynamic system states.
Parameters of a fully connected SLDS model are learned from data. Model parameters are estimated using a generalized expectation-maximization (EM) algorithm. The exact expectation/inference (E) step is replaced with one of the three approximate inference embodiments.
The learning method can be used to model the dynamics of human motion. The joint angles of the limbs and pose of the torso are represented as state variables in a switching linear dynamic model. The switching variable identifies a distinct motion regime within a particular type of human motion.
Motion regimes learned from figure motion data correspond to classes of human activity such as running, walking, etc. Inference produces a single sequence of switching modes which best describes a motion trajectory in the figure state space. This sequence segments the figure motion trajectory into motion regimes learned from data.
Accordingly, a method for classifying portions of an input measurement sequence into a plurality of regimes includes associating each of a plurality of dynamic models with a switching state such that a model is selected when its associated switching state is true. A state transition record is determined by determining and recording, for a given measurement of the sequence and for each switching state, an optimal prior switching state, based on the input sequence, where the optimal prior switching state optimizes a transition probability. An optimal final switching state is determined for a final measurement. A switching state sequence is determined by backtracking through the state transition record from the optimal final switching state. Finally, portions of the input sequence are classified into different regimes, responsive to the switching state sequence.
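The steps recited above can be sketched, for illustration only, as follows. This is a simplified sketch, not the claimed implementation. It assumes a scalar state shared by all models, Gaussian noise, and a per-state Kalman filter whose estimate is inherited from the best predecessor state. All names (approximate_viterbi, state_transition_record, a, q, r, log_trans, log_init, x0, p0) are hypothetical. Costs are negative log-probabilities, so optimizing the transition probability corresponds to minimizing the cost.

```python
import math


def approximate_viterbi(ys, a, q, r, log_trans, log_init, x0=0.0, p0=1.0):
    """Most likely switching state sequence for a scalar SLDS (sketch only).

    a[i], q[i]     : dynamics coefficient and process noise variance of model i
    r              : measurement noise variance
    log_trans[j][i]: log probability of switching from state j to state i
    log_init[i]    : log probability of starting in state i
    """
    num_states = len(a)

    def kalman_step(x, p, i, y):
        # One predict/update step under model i; returns cost and new estimate.
        x_pred = a[i] * x
        p_pred = a[i] * a[i] * p + q[i]
        e = y - x_pred                    # innovation
        s = p_pred + r                    # innovation variance
        cost = 0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        k = p_pred / s
        return cost, (x_pred + k * e, (1.0 - k) * p_pred)

    # Initialization with the first measurement.
    costs, estimates = [], []
    for i in range(num_states):
        c, est = kalman_step(x0, p0, i, ys[0])
        costs.append(c - log_init[i])
        estimates.append(est)

    # For each measurement and each switching state, record the optimal
    # prior switching state.
    state_transition_record = []
    for y in ys[1:]:
        new_costs, new_estimates, best_prior = [], [], []
        for i in range(num_states):
            best = None
            for j in range(num_states):
                c, est = kalman_step(estimates[j][0], estimates[j][1], i, y)
                total = costs[j] + c - log_trans[j][i]
                if best is None or total < best[0]:
                    best = (total, j, est)
            new_costs.append(best[0])
            best_prior.append(best[1])
            new_estimates.append(best[2])
        costs, estimates = new_costs, new_estimates
        state_transition_record.append(best_prior)

    # Optimal final switching state, then backtrack through the record.
    state = min(range(num_states), key=lambda i: costs[i])
    path = [state]
    for best_prior in reversed(state_transition_record):
        state = best_prior[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical usage: a constant regime followed by a sign-alternating regime.
ys = [1.0] * 10 + [(-1.0) ** (t + 1) for t in range(10)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5), math.log(0.5)]
path = approximate_viterbi(ys, [1.0, -1.0], [0.01, 0.01], 0.01, log_trans, log_init)
print(path)
```

Because only one filtered estimate is retained per switching state at each step, the cost of this sketch grows linearly with sequence length rather than exponentially, at the price of being approximate.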
In one embodiment of the present invention, classifying depends upon conditions existing at the time the sequence was created.
Regimes can be, for example, motion regimes, and in particular, regimes of human motion. Human motion includes, but is not limited to, walking, jogging, running, jumping, sitting, climbing, and ascending and descending a staircase.
An embodiment of the present invention is capable of identifying an individual based on observed dynamics of the individual's motion in an image sequence. This can be useful, for example, in identifying a criminal suspect from the image sequence.
Another embodiment classifies sequences into motions to conduct surveillance. For example, certain activities, such as opening a door or dropping a package, can be identified by classification.
In one embodiment, one or more constraints can be imposed on the classification.
Each motion can be an individual sign of a sign language.
In yet another embodiment, classification of a motion serves as input to a computer user interface.
In yet another embodiment, sets of dynamic models are used to model qualitatively different regimes of a trajectory associated with a single temporal event.
Yet another embodiment selects key frames from an input sequence in response to the classification, and performs video compression by transmitting key frames at a low sampling rate.
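One possible way of deriving key frames from a classification result is sketched below, purely as an illustration. The sketch assumes that a switching state label is already available for every frame and that the first frame of each regime is taken as its key frame. That selection rule and the function names regime_segments and select_key_frames are hypothetical and are not specified by the embodiment.

```python
from itertools import groupby


def regime_segments(switching_states):
    """Split a per-frame switching state sequence into contiguous regimes.

    Returns a list of (state, start_index, end_index_exclusive) tuples.
    """
    segments = []
    start = 0
    for state, run in groupby(switching_states):
        length = len(list(run))
        segments.append((state, start, start + length))
        start += length
    return segments


def select_key_frames(switching_states):
    """Assumed rule: the first frame of each regime serves as a key frame."""
    return [start for _, start, _ in regime_segments(switching_states)]


# Hypothetical usage with a six-frame sequence containing three regimes.
labels = [0, 0, 0, 1, 1, 0]
print(regime_segments(labels))
print(select_key_frames(labels))
```

Under this assumed rule, the number of transmitted key frames depends on the number of regime changes rather than on the frame rate of the input sequence.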
While the above embodiments are based on Viterbi techniques, other embodiments of the present invention are based on variational techniques. For example, a method for classifying portions of an input sequence of measurements into a plurality of regimes, given a set of possible switching states, includes associating each of a plurality of dynamic models with a switching state such that a dynamic model is selected when its associated switching state is true, where the switching state at a particular instance is determined by a switching model. The dynamic model is then decoupled from the switching model. Parameters of the decoupled dynamic model are determined responsive to a switching state probability estimate. A state of the decoupled dynamic model corresponding to a measurement at the particular instance is estimated, responsive to the input sequence. Parameters of the decoupled switching model are then determined responsive to the dynamic state estimate. A probability is estimated for each possible switching state of the decoupled switching model. A switching state sequence is determined based on the estimated switching state probabilities. Finally, portions of the input sequence are classified into different regimes, responsive to the determined switching state sequence.