A conventional video recognition system automatically detects (in software) an occurrence of a particular event of interest in a large corpus of video data. The events may happen infrequently, over a short period of time, and may comprise a small fraction of the corpus of video data.
Each event may vary in appearance and dynamic characteristics, causing recognition failures. Recognition may also fail because of changes in the relative position, speed, size, etc. of the objects involved in the event. Two conventional approaches address these types of failures: a rule-based method and a probabilistic method.
The rule-based method relies on direct models of events and cannot easily incorporate uncertainty reasoning. This results in a lack of robustness to variation in appearance and dynamic characteristics.
The probabilistic method performs uncertainty reasoning, but event models must be learned from training examples. This typically requires many training examples, covering a large range of variability, to establish parameters of the model. Often this training data is not available, particularly for the unusual events that are typically of most interest.
A user may create an event model for an event of interest by specifying objects involved in the event, roles of those objects, semantic spatial-dynamic relations between the objects, and temporal constraints of the interaction between objects. The spatial relations may be encoded in a binarized vector representation. The temporal constraints and uncertainty may be expressed using a Hidden Markov Model (HMM) framework.
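As an illustration of a binarized vector of spatial relations, the following is a minimal sketch, not the encoding of any particular system. The relation vocabulary (near, left_of, above, approaching), the TrackedObject fields, and the near_dist threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    """Hypothetical tracked object: image position and velocity."""
    x: float
    y: float
    vx: float
    vy: float


# Hypothetical relation vocabulary; one bit per relation, in this order.
RELATIONS = ("near", "left_of", "above", "approaching")


def binarize_relations(a, b, near_dist=50.0):
    """Return a 0/1 vector describing how object a relates to object b."""
    dx, dy = b.x - a.x, b.y - a.y
    dist_sq = dx * dx + dy * dy
    # Positive when the relative velocity of a (with respect to b)
    # points along the direction from a to b, i.e. the gap is closing.
    closing = (a.vx - b.vx) * dx + (a.vy - b.vy) * dy
    return [
        int(dist_sq <= near_dist ** 2),  # near
        int(a.x < b.x),                  # left_of
        int(a.y < b.y),                  # above (image y grows downward)
        int(closing > 0),                # approaching
    ]


person = TrackedObject(x=10.0, y=20.0, vx=1.0, vy=0.0)
car = TrackedObject(x=40.0, y=20.0, vx=0.0, vy=0.0)
relation_vector = binarize_relations(person, car)
```

Each frame of video then yields one such vector per object pair, and the sequence of vectors can serve as the discrete observation sequence for a temporal model.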
A Hidden Markov Model is a doubly stochastic process consisting of a state transition model, {a_ij : 1 ≤ i, j ≤ N}, where N is the number of states, and a set of observation probability density functions (pdfs). In recognition, the objective is to recover the most likely sequence of hidden states, given a sequence of feature observations {o_t : 1 ≤ t ≤ T}. The observation densities b_j(o), which depend on the state j the process is in at time t, can be continuous or discrete.
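Recovering the most likely hidden state sequence is commonly done with the Viterbi algorithm, which the text does not name but which is the standard technique for this objective. The sketch below assumes discrete observations and an initial state distribution pi; the function name and the example parameter values are illustrative only.

```python
import math


def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete-observation HMM.

    pi[i]   : probability of starting in state i (assumed input)
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability of observing symbol k while in state j
    obs     : list of observed symbol indices
    """
    def log(p):
        # Work in the log domain to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    n = len(pi)
    delta = [log(pi[j]) + log(B[j][obs[0]]) for j in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        delta, ptr = [], []
        for j in range(n):
            # Best predecessor state for landing in state j at this step.
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
        backpointers.append(ptr)

    # Trace the best path backward from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Illustrative two-state model with two observation symbols.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.0, 1.0]]
B = [[0.9, 0.1], [0.2, 0.8]]
best_path = viterbi(pi, A, B, [0, 0, 1, 1])
```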
This representation may decouple the underlying states of interest from the observation models, allowing uncertainty and variation to be incorporated. A left-right HMM may be used to represent the temporal constraints in time-series data such as video data.
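In a left-right HMM, a state may only persist or advance to a later state, so a_ij = 0 whenever j < i. The following sketch builds such a transition matrix under the additional assumption, made here only for simplicity, that each state advances to its immediate successor; the stay_prob parameter is hypothetical.

```python
def left_right_transitions(n_states, stay_prob=0.6):
    """Build an n_states x n_states left-right transition matrix.

    Each state either stays put (stay_prob) or advances to the next
    state; the final state is absorbing. No backward transitions.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay_prob
        A[i][i + 1] = 1.0 - stay_prob
    A[-1][-1] = 1.0
    return A


A_lr = left_right_transitions(3, stay_prob=0.5)
```

Because the states can only be visited in order, each state can be read as one stage of an event, which is what lets the model express temporal ordering constraints.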
Typical applications of HMMs for recognition involve modeling the trajectories of some observable objects, often using Gaussian distributions or mixtures of Gaussian distributions. Given enough examples of each category to be recognized, the parameters of the HMM, such as very detailed distributions of temporal trajectories, may be learned. However, without adequate training data, a model may generalize poorly to unseen data.
Furthermore, the optimal number of states is typically determined experimentally. It may then be difficult to attach semantic meanings to the resulting states.
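For reference, a continuous observation density b_j(o) of the kind mentioned above can be sketched as a one-dimensional Gaussian or a mixture of Gaussians. The function names and the scalar observation are assumptions made for this example; trajectory features are typically multi-dimensional.

```python
import math


def gaussian_pdf(o, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at o."""
    return math.exp(-((o - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(o, weights, means, variances):
    """Density of a Gaussian mixture; weights are assumed to sum to 1."""
    return sum(
        w * gaussian_pdf(o, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Illustrative per-state density: two components with hypothetical parameters.
density = mixture_pdf(0.0, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```

Every mean, variance, and mixture weight is a parameter that must be estimated per state, which is why such models tend to need many training examples.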