Tracking multiple objects in video is a key problem in many applications, including video surveillance, human-computer interaction, and video conferencing. Multiple-object tracking has also become a challenging research topic in computer vision. The difficulties include cluttered backgrounds, an unknown number of objects, and complicated interactions between objects within a scene. Most tracking algorithms compute the a posteriori probability of object states and adopt a probabilistic hidden Markov model (HMM) to track objects through a video sequence.
As shown in FIG. 1, the states of an object at different time instances, $x_t \in X$, $t = 1, 2, \ldots, n$, form a Markov chain. State $x_t$ belongs to the parameter space $X$, which may contain parameters such as the position, scale factor, or deformation parameters of the object. At each time instance $t$, conditioned on $x_t$, the observation $z_t$ is independent of the other object states and observations. Observations can be observed variables representing objects in an image, or the image itself. This model is summarized as:

$$P(x_1, x_2, \ldots, x_n; z_1, z_2, \ldots, z_n) = P(x_1)\,P(z_1 \mid x_1) \prod_{t=2}^{n} \left[ P(x_t \mid x_{t-1})\,P(z_t \mid x_t) \right] \qquad (1)$$
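To make the factorization in Equation (1) concrete, the following minimal Python sketch evaluates the joint probability of a state sequence and an observation sequence for a small discrete HMM. The dictionary-based tables and the names `joint_probability`, `prior`, `trans`, and `emit` are illustrative assumptions, not part of the described method; in tracking, the state space is normally continuous.

```python
from typing import Dict, Sequence

# Illustrative discrete HMM tables (hypothetical; not from the source):
Prior = Dict[int, float]                  # prior[x] = P(x_1 = x)
Transition = Dict[int, Dict[int, float]]  # trans[a][b] = P(x_t = b | x_{t-1} = a)
Emission = Dict[int, Dict[int, float]]    # emit[x][z] = P(z_t = z | x_t = x)


def joint_probability(xs: Sequence[int], zs: Sequence[int],
                      prior: Prior, trans: Transition, emit: Emission) -> float:
    """Evaluate Equation (1) for state sequence xs and observation sequence zs."""
    if not xs or len(xs) != len(zs):
        raise ValueError("xs and zs must be non-empty and of equal length")
    # Leading factor P(x_1) P(z_1 | x_1)
    p = prior[xs[0]] * emit[xs[0]][zs[0]]
    # Product over t = 2..n of P(x_t | x_{t-1}) P(z_t | x_t)
    for t in range(1, len(xs)):
        p *= trans[xs[t - 1]][xs[t]] * emit[xs[t]][zs[t]]
    return p


# Example: two hidden states and two observation symbols.
prior = {0: 0.6, 1: 0.4}
trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
emit = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
print(joint_probability([0, 0, 1], [0, 0, 1], prior, trans, emit))
```

With these illustrative numbers the product is 0.6 × 0.9 × 0.7 × 0.9 × 0.3 × 0.8 ≈ 0.0816. Summing `joint_probability` over every state and observation sequence of a fixed length gives 1, a quick check that the factorization defines a proper joint distribution.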
The object tracking problem can be posed as computing the a posteriori distribution $P(x_t \mid Z_t)$ given the observations $Z_t = \{z_1, z_2, \ldots, z_t\}$. When a single object is tracked, the maximum a posteriori (MAP) solution is desired. When both the object dynamics $P(x_t \mid x_{t-1})$ and the observation likelihood $P(z_t \mid x_t)$ are Gaussian (with means that depend linearly on the state), $P(x_t \mid Z_t)$ is also Gaussian; because the mode and the mean of a Gaussian coincide, the MAP solution is $E(x_t \mid Z_t)$.
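For a one-dimensional state, the Gaussian claim can be checked by completing the square. Assume, for illustration only, a Gaussian predicted density $P(x_t \mid Z_{t-1})$ with mean $\bar{\mu}_t$ and variance $\bar{\sigma}_t^2$, and an observation model in which $z_t$ equals $x_t$ plus Gaussian noise of variance $r$ (the symbols $\bar{\mu}_t$, $\bar{\sigma}_t^2$, $r$, $K_t$ are not from the source). The result is the familiar scalar Kalman measurement update:

```latex
\begin{aligned}
P(x_t \mid Z_{t-1}) &= \mathcal{N}(x_t;\, \bar{\mu}_t,\, \bar{\sigma}_t^2), \qquad
P(z_t \mid x_t) = \mathcal{N}(z_t;\, x_t,\, r) \\
P(x_t \mid Z_t) &\propto \exp\!\Big(-\frac{(x_t - \bar{\mu}_t)^2}{2\bar{\sigma}_t^2}
                     - \frac{(z_t - x_t)^2}{2r}\Big)
                 = \mathcal{N}(x_t;\, \mu_t,\, \sigma_t^2) \\
K_t &= \frac{\bar{\sigma}_t^2}{\bar{\sigma}_t^2 + r}, \qquad
\mu_t = \bar{\mu}_t + K_t\,(z_t - \bar{\mu}_t), \qquad
\sigma_t^2 = (1 - K_t)\,\bar{\sigma}_t^2 \\
\arg\max_{x_t} P(x_t \mid Z_t) &= \mu_t = E(x_t \mid Z_t)
\end{aligned}
```

The exponent is quadratic in $x_t$, so the posterior is Gaussian, and its maximizer equals its mean; this is why the MAP estimate and $E(x_t \mid Z_t)$ coincide in the linear Gaussian case.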
To compute $P(x_t \mid Z_t)$ for an HMM, the forward algorithm can be applied. The forward algorithm computes $P(x_t \mid Z_t)$ from $P(x_{t-1} \mid Z_{t-1})$ inductively and is formulated as

$$P(x_t \mid Z_t) \propto P(z_t \mid x_t)\,P(x_t \mid Z_{t-1}) = P(z_t \mid x_t) \int P(x_t \mid x_{t-1})\,P(x_{t-1} \mid Z_{t-1})\,dx_{t-1} \qquad (2)$$
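A minimal sketch of one step of Equation (2), assuming the state space has been discretized into grid cells so that the integral becomes a sum. The function name `forward_step` and the list-based representation are illustrative choices, not part of the source.

```python
from typing import List, Sequence


def forward_step(prev_posterior: Sequence[float],
                 transition: Sequence[Sequence[float]],
                 likelihood: Sequence[float]) -> List[float]:
    """One step of Equation (2) on a discretized (grid) state space.

    prev_posterior[j] = P(x_{t-1} = j | Z_{t-1})
    transition[j][i]  = P(x_t = i | x_{t-1} = j)
    likelihood[i]     = P(z_t | x_t = i)
    """
    n = len(prev_posterior)
    # Prediction: the integral in Equation (2) becomes a sum over grid cells.
    predicted = [sum(transition[j][i] * prev_posterior[j] for j in range(n))
                 for i in range(n)]
    # Update: multiply by the likelihood, then normalize (the proportionality).
    unnormalized = [likelihood[i] * predicted[i] for i in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("the observation has zero likelihood in every cell")
    return [u / total for u in unnormalized]


# Example: three cells; an object stays put with probability 0.8 or moves right.
transition = [[0.8, 0.2, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]]
posterior = forward_step([0.5, 0.5, 0.0], transition, [0.1, 0.6, 0.3])
print(posterior)
```

Here the prediction is [0.4, 0.5, 0.1]; multiplying by the likelihood and normalizing gives roughly [0.108, 0.811, 0.081]. Calling `forward_step` repeatedly, feeding each output back in as `prev_posterior`, is the inductive recursion described above.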
Using this recursion, the well-known Kalman filter computes $E(x_t \mid Z_t)$ for a linear Gaussian process. When either $P(x_t \mid x_{t-1})$ or $P(z_t \mid x_t)$ has no analytic form, sampling techniques must be applied to implement the forward algorithm. Where $P(x_t \mid x_{t-1})$ is Gaussian and $P(z_t \mid x_t)$ is non-Gaussian, prior art object tracking algorithms have used the CONDENSATION algorithm. The CONDENSATION algorithm represents $P(x_t \mid Z_t)$ with a large set of samples and propagates this distribution through time by combining the likelihood function $P(z_t \mid x_t)$ with the dynamics $P(x_t \mid x_{t-1})$. Alternatively, importance sampling, a variance-reduction method within the Monte Carlo approach, may be applied to reduce the number of samples required. The CONDENSATION algorithm converges to a single dominant peak of the posterior distribution; in its stated form, it does not generalize to multi-object tracking.
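The sampling approach can be sketched as a generic select-predict-measure loop in the spirit of CONDENSATION. This is a minimal one-dimensional illustration, not the published implementation: the random-walk dynamics, the Gaussian-shaped `example_likelihood`, and all parameter values are assumptions chosen for the example.

```python
import math
import random
from typing import Callable, List, Tuple


def condensation_step(particles: List[float], weights: List[float],
                      likelihood: Callable[[float], float],
                      dynamics_std: float,
                      rng: random.Random) -> Tuple[List[float], List[float]]:
    """One select-predict-measure iteration over a weighted sample set."""
    n = len(particles)
    # Select: resample n particles in proportion to their current weights.
    selected = rng.choices(particles, weights=weights, k=n)
    # Predict: diffuse each particle with Gaussian dynamics P(x_t | x_{t-1}),
    # modelled here (by assumption) as a random walk.
    predicted = [x + rng.gauss(0.0, dynamics_std) for x in selected]
    # Measure: weight each particle by the likelihood P(z_t | x_t).
    new_weights = [likelihood(x) for x in predicted]
    total = sum(new_weights)
    if total == 0.0:
        raise ValueError("every particle has zero likelihood")
    return predicted, [w / total for w in new_weights]


def example_likelihood(x: float) -> float:
    # Hypothetical single-peaked likelihood centred on x = 2.0.
    return math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)


rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
for _ in range(20):
    particles, weights = condensation_step(particles, weights,
                                           example_likelihood, 0.1, rng)
estimate = sum(w * x for w, x in zip(weights, particles))
print(round(estimate, 2))
```

The weighted mean of the samples approximates $E(x_t \mid Z_t)$ and settles near the likelihood peak. The sample count stays fixed at 500 throughout, which matters for the multi-object behaviour discussed below.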
When multiple objects are involved and the number of objects is known and fixed, an analytic-form tracker can use a Gaussian mixture model with Kalman filtering for posterior density estimation. When the number of objects may change at any time, semi-analytic methods such as multiple-hypothesis tracking can be used. However, the complexity of multiple-hypothesis tracking grows exponentially with time, so a pruning technique is necessary in practice.
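The exponential growth and the need for pruning can be illustrated with a deliberately simplified toy model. This is not a full multiple-hypothesis tracker (there is no gating, track initiation, or measurement model). It assumes, hypothetically, that every hypothesis can be extended by the same three association choices with fixed probabilities in each frame, so the unpruned count after $T$ frames is $3^T$, while pruning keeps only the best few.

```python
import heapq
import math
from typing import List, Tuple

# A hypothesis is (cumulative log-score, tuple of per-frame association choices).
Hypothesis = Tuple[float, Tuple[int, ...]]


def expand_and_prune(hypotheses: List[Hypothesis],
                     choice_log_scores: List[float],
                     max_keep: int) -> List[Hypothesis]:
    """Extend every hypothesis by every choice, then keep the max_keep best."""
    expanded = [(score + s, seq + (c,))
                for score, seq in hypotheses
                for c, s in enumerate(choice_log_scores)]
    return heapq.nlargest(max_keep, expanded)


# Hypothetical per-frame association choices with fixed probabilities.
choice_log_scores = [math.log(0.7), math.log(0.2), math.log(0.1)]

unpruned: List[Hypothesis] = [(0.0, ())]
for _ in range(6):
    unpruned = expand_and_prune(unpruned, choice_log_scores, max_keep=10**9)
print(len(unpruned))

pruned: List[Hypothesis] = [(0.0, ())]
for _ in range(10):
    pruned = expand_and_prune(pruned, choice_log_scores, max_keep=5)
print(len(pruned), pruned[0][1])
```

Without pruning, six frames already produce 729 hypotheses. With a beam of five, the cost per frame stays bounded, at the price of possibly discarding a hypothesis that would later have become the best one.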
If $x_t$ is the parameter vector of a single object, propagating the distribution $P(x_t \mid Z_t)$ by sampling also has difficulty tracking multiple objects. According to Equation (2), $P(x_t \mid Z_t)$ is essentially the product of the successive likelihood functions $P(z_t \mid x_t)$, each shifted and diffused by the dynamics $P(x_t \mid x_{t-1})$. When the likelihood of one object is consistently larger than that of another object (which happens quite often in practice), the ratio between the a posteriori probabilities of the two objects increases exponentially with $t$. If a fixed number of samples is used, as in a CONDENSATION-like algorithm, then once $t$ exceeds a certain value only the object with the dominant likelihood is likely to be tracked. This phenomenon is illustrated in FIGS. 2A and 2B: when $t = 20$, more than 3000 samples are needed to obtain a single sample from the object with the smaller likelihood. In this example, for convenience of computation, no blurring is considered, which is equivalent to $P(x_t \mid x_{t-1}) = \delta(x_t - x_{t-1})$. It is also assumed that the object is not moving, which is likewise the case when the algorithm is applied repeatedly to a single video frame. FIG. 2A depicts $P(x_1 \mid Z_1)$ with modes 202 and 204, and FIG. 2B depicts $P(x_{20} \mid Z_{20}) \propto P(x_1 \mid Z_1)^{20}$ with modes 206 and 208. Using a smoother $P(x_t \mid x_{t-1})$ reduces the ratio between the peak values of the modes; however, the ratio still increases exponentially with the number of iterations.
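The arithmetic behind this example can be sketched as follows. The sketch assumes delta dynamics, a uniform initial prior, and two non-overlapping modes of identical shape whose peak likelihoods differ by a fixed factor in every frame. The per-frame ratio used for FIG. 2A is not stated, so the value 1.5 below is purely hypothetical, as are the function names.

```python
def posterior_mode_ratio(likelihood_ratio: float, t: int) -> float:
    """Posterior-mass ratio of the dominant to the weaker mode after t frames.

    Assumes delta dynamics, a uniform initial prior, and two non-overlapping
    modes of identical shape whose peak likelihoods differ by likelihood_ratio
    in every frame, so the posterior is proportional to the likelihood raised
    to the t-th power.
    """
    return likelihood_ratio ** t


def samples_per_weak_sample(likelihood_ratio: float, t: int) -> float:
    """Expected number of posterior samples needed per sample on the weaker mode."""
    r = posterior_mode_ratio(likelihood_ratio, t)
    # The weaker mode carries a fraction 1 / (1 + r) of the posterior mass.
    return 1.0 + r


for t in (1, 5, 10, 20):
    print(t, round(samples_per_weak_sample(1.5, t)))
```

Even a modest per-frame ratio of 1.5 gives $1.5^{20} \approx 3325$, so roughly 3,300 samples are needed per sample on the weaker mode at $t = 20$. This is the same order as the figure quoted above and shows how quickly a fixed sample budget is overwhelmed.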
A conclusion from the above analysis is that, as long as $P(x_t \mid Z_t)$ is approximated with samples whose total number grows less than exponentially with the number of iterations, the tracker will converge to the object having the maximum likelihood. When the likelihood function is biased so that one mode always has a higher likelihood value, the tracker can track only that object from frame to frame. When the forward algorithm is applied to the same image many times, the object parameter with the maximum likelihood is found. In summary, the CONDENSATION algorithm converges to the dominant mode of the distribution (208 in FIG. 2B) and suppresses the rest (206 in FIG. 2B).
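This collapse onto the dominant mode can be reproduced with a small simulation. Under the same assumptions as the static example (delta dynamics, the same image processed repeatedly), each sample is labelled only by the mode it sits on, weighted by that mode's fixed likelihood, and resampled with a fixed sample count. The function name, the sample count of 200, and the likelihood values 1.0 and 1.5 are hypothetical.

```python
import random
from typing import List


def run_static_resampling(n_samples: int, weak_likelihood: float,
                          strong_likelihood: float, iterations: int,
                          seed: int = 0) -> List[int]:
    """Repeatedly apply a fixed-size sampling forward step to one image.

    Each sample is labelled by its mode (0 = weaker, 1 = dominant). Dynamics
    are delta functions, so samples never move between modes on their own.
    Returns the number of samples on the weaker mode after each iteration.
    """
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # start evenly split
    weak_counts = []
    for _ in range(iterations):
        # Weight every sample by the (fixed) likelihood of its mode.
        weights = [strong_likelihood if lab else weak_likelihood
                   for lab in labels]
        # Resample the same number of samples in proportion to the weights.
        labels = rng.choices(labels, weights=weights, k=n_samples)
        weak_counts.append(labels.count(0))
    return weak_counts


counts = run_static_resampling(200, 1.0, 1.5, 100)
print(counts[:15])
print(counts[-1])
```

Once the weaker mode loses its last sample it can never recover, because resampling only copies existing samples. This is the suppression of mode 206 in favour of mode 208 described above.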
Therefore, a need exists in the art for a method and apparatus for tracking multiple objects within a sequence of video frames.