1. Technical Field
The invention is related to a system for learning layers of “flexible sprites” from a video sequence, and in particular, to a system and method for automatic decomposition of a video sequence for learning probabilistic 2-dimensional appearance maps and masks of moving occluding objects of dynamic geometry in the video sequence.
2. Related Art
Automatic modeling and analysis of video images using a layered representation has been addressed by several conventional schemes. In general, the basic idea is to isolate or identify a particular object or objects within a sequence of images, and then to decompose that video sequence into a number of layers, with each layer representing either an object or a background image over the entire video sequence. Such layered objects are commonly referred to as “sprites.” However, learning “sprites” from a video sequence is a difficult task because the video sequence typically contains an unknown number of objects of unknown shapes and sizes, and those objects must be distinguished from the background and from other sprites in the presence of sensor noise, lighting noise, and significant amounts of deformation.
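By way of illustration only, the layered representation described above can be sketched as back-to-front compositing of appearance maps through masks. The sketch below is a minimal example under stated assumptions: frames are reduced to one-dimensional lists of intensities, masks take values between 0 and 1, and the names `composite`, `far_sprite`, and `near_sprite` are hypothetical and are not drawn from any conventional scheme discussed herein.

```python
def composite(background, sprites):
    """Composite sprites over a background, from back to front.

    Each sprite is an (appearance, mask) pair of equal-length lists.
    A mask value of 1 means the sprite fully occludes what lies behind it;
    a mask value of 0 means the sprite is absent at that pixel.
    """
    frame = list(background)
    for appearance, mask in sprites:
        frame = [m * a + (1.0 - m) * f
                 for a, m, f in zip(appearance, mask, frame)]
    return frame


# Assumed toy data: a 5-pixel frame with two overlapping sprites.
background = [0.2, 0.2, 0.2, 0.2, 0.2]
far_sprite = ([0.9] * 5, [0, 1, 1, 0, 0])   # occupies pixels 1-2
near_sprite = ([0.5] * 5, [0, 0, 1, 1, 0])  # occupies pixels 2-3, occludes far_sprite at pixel 2

frame = composite(background, [far_sprite, near_sprite])
```

Learning sprites amounts to the inverse problem: given only frames such as `frame`, recover the appearance maps, the masks, and the layer ordering, none of which is observed directly.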
In addition, many conventional schemes for identifying sprites or objects within a video sequence make use of specialized models for identifying particular types of objects, such as, for example, a car, a truck, a human head, a ball, an airplane, etc. Models designed for identifying one particular type of sprite within a video sequence are typically ineffective for identifying other types of sprites.
For example, one conventional scheme which is typical of the art addresses object tracking through a video sequence by using a dynamic layer representation to estimate layered objects against a background in the video sequence. This scheme uses a parametric shape prior as the basis for computing the segmentation of objects from the video sequence. However, this scheme is generally limited by the simplicity of the parametric shape priors used to identify layers of objects within the video sequence. In particular, this scheme makes use of simple parametric shape priors that are useful for identifying objects of generally simple geometry, such as moving vehicles against a fixed background. Unfortunately, this scheme is generally unsuited for segmenting or identifying more complex objects or sprites because it does not incorporate more complicated segmentation priors for identification of objects such as moving articulated human bodies. Further, the scheme is also limited in that it needs to make use of parametric shape priors which are at least generally representative of the type of objects to be segmented or identified within the video sequence.
In addition, in analyzing or processing data, the task of clustering raw data such as video or image frames and speech spectrograms is often complicated by the presence of random, but well-understood, transformations in the data. Examples of these transformations include object motion and camera motion in video sequences and pitch modulation in spectrograms.
A variety of conventional, yet sophisticated, techniques for pattern analysis and pattern classification have been used in an attempt to address this problem. However, such conventional techniques have mostly assumed that the data is already normalized (e.g., that the patterns are centered in the images) or nearly normalized. Linear approximations to the transformation manifold have been used to significantly improve the performance of feedforward discriminative classifiers such as nearest neighbors and multilayer perceptrons.
Linear generative models (factor analyzers, mixtures of factor analyzers) have also been modified using linear approximations to the transformation manifold to build in some degree of transformation invariance. A multi-resolution approach has been used to extend the usefulness of linear approximations, but this approach is susceptible to local minima; for example, a pie may be confused for a face at low resolution. For significant levels of transformation, linear approximations are far from exact, and better results have been obtained by explicitly considering transformed versions of the input. This approach has been used to design “convolutional neural networks” that are invariant to translations of parts of the input.
Further, it has been shown with respect to “transformed mixtures of Gaussians” and “transformed hidden Markov models” that an expectation-maximization (EM) algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and to learn a mixture model of the normalized data. The only inputs to such an algorithm are the data, a list of possible transformations, and the number of clusters to find. However, conventional methods for performing such computations typically involve an exhaustive computation of the posterior probabilities over transformations, which makes scaling up to large feature vectors and large sets of transformations intractable.
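To make the scaling problem concrete, the following is a minimal sketch of an exhaustive posterior computation over clusters and transformations. It is illustrative only and assumes one-dimensional data, cyclic shifts as the only transformations, uniform priors, and a single shared isotropic variance; the function names `shift`, `log_gauss`, and `exhaustive_posterior` are hypothetical and do not reproduce any particular conventional implementation.

```python
import math


def shift(vec, k):
    """Cyclically shift a list to the right by k positions."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian with scalar variance var."""
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (len(x) * math.log(2 * math.pi * var) + sq / var)


def exhaustive_posterior(x, means, var=0.1):
    """Posterior over (cluster, shift) pairs under uniform priors.

    Every cluster mean is evaluated under every shift, so the cost per
    data point is O(C * T * N) for C clusters, T shifts, and N features.
    """
    scores = {}
    for c, mean in enumerate(means):
        for k in range(len(x)):  # all N cyclic shifts
            scores[(c, k)] = log_gauss(x, shift(mean, k), var)
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {key: math.exp(s - top) for key, s in scores.items()}
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


# Assumed toy data: two cluster means and one observation that equals
# the first mean shifted right by two positions.
means = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
x = [0.0, 0.0, 1.0, 0.0]
post = exhaustive_posterior(x, means)
best = max(post, key=post.get)  # most probable (cluster, shift) pair
```

Even in this toy setting the posterior table has one entry per cluster per shift. For two-dimensional images, where the number of translations alone is on the order of the number of pixels, the table and the work needed to fill it grow roughly with the square of the image size for each cluster.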
Therefore, what is needed is a system and method for automatically and dynamically decomposing a video sequence into a number of layers, with each layer representing either an object or a background image, over each frame of the video sequence. Such a system and method should be capable of identifying sprites or objects of any geometry, including those with dynamic or changing geometries through a sequence of images, without the need to use object-specific models. Further, such a system and method should be capable of reliably identifying sprites having unknown shapes and sizes, which must be distinguished from the background and from other sprites in the presence of sensor noise, lighting noise, and significant amounts of deformation. Finally, such a system and method should be capable of processing large data sets in real time, or near-real time.