1. Technical Field
The invention is related to a system and method for machine learning, and in particular, to a system and method for fast on-line learning of generative models.
2. Related Art
In video scene analysis, machine learning algorithms, such as the transformed hidden Markov model, capture three typical causes of variability in video: scene/object class, appearance variability within the class, and image motion.
A substantial amount of work has been performed using transformed mixtures of Gaussians (TMG) [1, 2], and their temporal extensions, transformed hidden Markov models (THMM) [3], for video analysis. The TMG algorithm performs joint normalization and clustering of the data (e.g., clustering data by a given class). Transformation-invariant clustering models, such as the aforementioned, are suitable for video clustering and indexing, because they account for the variability in appearance and transformation in the objects and scenes.
Further, it has been shown with respect to TMG and THMM that an expectation-maximization (EM) algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and to learn a mixture model of the normalized data. Typically, the only inputs to such an algorithm are the data, a list of possible transformations, and the number of clusters or classes to find. However, conventional methods for performing such computations typically involve an exhaustive computation of the posterior probabilities over transformations, which makes processing of large sets of transformations intractable.
In general, as is well known to those skilled in the art, an EM algorithm is used to approximate a probability function. EM is typically used to compute maximum likelihood estimates given incomplete samples. In the expectation step (the “E-Step”), the model parameters are assumed to be correct, and for each input image, probabilistic inference is used to fill in the values of the unobserved or hidden variables, e.g., the class, transformation, and appearance. The model typically used in an EM algorithm includes the classes (means and variances) and the probability of each class. In the maximization step (the “M-Step”), the model parameters assumed in the E-Step are adjusted to increase the joint probability of the observations and the filled-in unobserved variables. These two steps are then repeated until the model parameters converge.
As discussed above, a frequently mentioned drawback of the transformation-invariant clustering methods is the computational burden of searching over all transformations. In order to normalize for translations of an object over the cluttered background in video sequences, a large number of possible translational shifts should be considered. For example, there are M×N possible integer shifts in an M×N pixel image. Since the computation time is proportional to the number of pixels and the number of transformations, O(M²N²) operations are used for inference, for each component in the Gaussian mixture. It typically takes one hour per iteration of the batch EM algorithm to cluster a 40-second long 44×28 pixel sequence into 5 clusters.
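The quadratic growth can be made concrete with a short calculation for the 44×28 frame size mentioned above. This is only a count of pixel-level operations per frame and per mixture component, ignoring constant factors; the helper name `exhaustive_shift_cost` is hypothetical.

```python
def exhaustive_shift_cost(M, N):
    """Count candidate shifts and pixel operations for one mixture
    component when every integer shift is evaluated exhaustively."""
    n_pixels = M * N
    n_shifts = M * N  # one candidate per integer (dy, dx) offset
    return n_shifts, n_pixels * n_shifts

n_shifts, n_ops = exhaustive_shift_cost(44, 28)
print(n_shifts, n_ops)  # 1232 1517824
```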
The temporal extension of the TMG, the transformed hidden Markov model (THMM), uses a hidden Markov chain to capture temporal coherence of the video frames. The size of the state space of such an HMM is CMN, where C is the number of components in the Gaussian mixture and MN is the number of translations considered. In [2], a forward-backward algorithm is used to estimate the transition probabilities and the parameters of a THMM, but use of this forward-backward algorithm adds computational time to the TMG, because the transition matrix of the transformations is large. The forward-backward algorithm is also numerically unstable, due to the large number of state-space sequences ((CMN)^T for a C-class model for T frames, each having M×N pixels) and the high dimensionality of the data. Only a few state-space paths carry a significant probability mass, and the observation likelihood has a very high dynamic range due to the number of pixels modeled in each sample. This makes the forward-backward algorithm sensitive to machine precision issues, even when the computation is done in the log domain.
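A rough sketch of these sizes, and of the dynamic-range problem, is given below. It assumes the 5-class, 44×28 example from the preceding paragraph; the frame count passed as T and the sample log-likelihood values are made up for illustration. Both function names are hypothetical, and the max-subtraction shown is a standard log-domain normalization, not a technique taken from the cited work.

```python
import math
import numpy as np

def thmm_state_space(C, M, N, T):
    """Sizes implied by a C-class model over M x N shifts and T frames."""
    states = C * M * N
    return {
        "states": states,
        "transition_entries": states ** 2,        # full transition matrix
        "log10_paths": T * math.log10(states),    # log10 of (CMN)^T
    }

def normalize_log_posterior(log_lik):
    """Convert unnormalized log-likelihoods to a posterior by
    subtracting the maximum before exponentiating."""
    shifted = log_lik - np.max(log_lik)
    post = np.exp(shifted)
    return post / post.sum()

sizes = thmm_state_space(C=5, M=44, N=28, T=100)  # T=100 is an assumption
print(sizes["states"], sizes["transition_entries"])  # 6160 37945600

# Hypothetical per-state log-likelihoods for a single frame: direct
# exponentiation underflows to zero, while the shifted form does not.
log_lik = np.array([-5000.0, -5010.0, -5100.0])
print(np.exp(log_lik).sum())  # 0.0
post = normalize_log_posterior(log_lik)
```

In this toy case nearly all posterior mass lands on a single state, which mirrors the observation that only a few state-space paths carry significant probability.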
To tackle the computational burden of shift-invariant models, in past work [4], it was proposed to reduce all computationally expensive operations to image correlations in the E-Step and convolutions with the probability maps in the M-Step, which made the computation efficient in the Fourier domain. There, the complexity of repeatedly evaluating the likelihood at each stage through I iterations of EM is of the order of O(CIMN log(MN)), thousands of times faster than the technique in [2]. The issues present in the temporal model, THMM, however, still remained.
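The Fourier-domain idea can be sketched as follows. The sketch assumes cyclic (wrap-around) shifts and computes only the image-template correlation term for every shift; it is not the full E-Step of [4], which also involves variances and class priors. The function names are hypothetical. The brute-force version is included only to check the FFT version against it.

```python
import numpy as np

def shift_scores_bruteforce(image, template):
    """Correlation of image with the template at every cyclic shift,
    computed directly: O(M^2 N^2) multiply-adds."""
    M, N = image.shape
    scores = np.empty((M, N))
    for dy in range(M):
        for dx in range(N):
            shifted = np.roll(template, shift=(dy, dx), axis=(0, 1))
            scores[dy, dx] = np.sum(image * shifted)
    return scores

def shift_scores_fft(image, template):
    """Same scores via the correlation theorem: O(MN log(MN))."""
    F_img = np.fft.fft2(image)
    F_tmp = np.fft.fft2(template)
    return np.real(np.fft.ifft2(F_img * np.conj(F_tmp)))

rng = np.random.default_rng(0)
template = rng.normal(size=(8, 6))
image = np.roll(template, shift=(3, 2), axis=(0, 1))  # known shift
scores = shift_scores_fft(image, template)
best = np.unravel_index(np.argmax(scores), scores.shape)
# best recovers the shift (3, 2) applied above
```

Both functions return an M×N map with one score per candidate shift, so the FFT version can replace the double loop wherever only this correlation term is needed.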
Therefore, what is needed is a model structure and associated system and method of learning generative models that runs in real-time.