Many computer vision and video surveillance applications seek to identify moving objects, for example, pedestrians or vehicles, in different environments. Generally, something in a scene is interesting when it is substantially different from a background model of a stationary scene acquired by a camera. The simplest background model assumes that the background scene is truly static over time, and that objects in the scene move at speeds that are consistent with the objects.
Over time, the intensity value of an individual pixel in a static background usually follows a normal distribution. Therefore, the uninteresting variability in the scene can be modeled adequately by a unimodal, zero-mean, ‘white’, Gaussian noise process. Hence, a reasonable model to represent such a statistical distribution is a single Gaussian model, C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” PAMI, 19(7), pp. 780-785, July 1997.
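For illustration only, the single Gaussian model described above can be sketched for one pixel as follows. This is a minimal sketch and not the Pfinder implementation; the class name, the learning rate, the variance floor, and the 2.5 standard deviation threshold are all assumptions chosen for the example.

```python
import math


class SingleGaussianPixel:
    """Running Gaussian model of one pixel's intensity (illustrative sketch)."""

    def __init__(self, initial_value, initial_variance=25.0,
                 learning_rate=0.05, threshold=2.5):
        # All default values here are assumptions for the example.
        self.mean = float(initial_value)
        self.variance = float(initial_variance)
        self.learning_rate = learning_rate
        self.threshold = threshold

    def is_foreground(self, value):
        # A value is "interesting" when it lies far from the background mean,
        # measured in standard deviations.
        return abs(value - self.mean) > self.threshold * math.sqrt(self.variance)

    def update(self, value):
        """Classify one observation, then adapt the model if it is background."""
        foreground = self.is_foreground(value)
        if not foreground:
            diff = value - self.mean
            rate = self.learning_rate
            self.mean += rate * diff
            # Exponentially weighted variance, floored so it cannot collapse to zero.
            self.variance = max((1.0 - rate) * self.variance + rate * diff * diff, 1.0)
        return foreground


model = SingleGaussianPixel(100)
for value in [101, 99, 100, 102, 98]:
    model.update(value)        # small fluctuations are absorbed as background
print(model.update(200))       # a large jump is flagged as foreground
```

Because the model has a single mode, any pixel whose background alternates between two distinct intensities would be flagged repeatedly, which is the limitation the following paragraphs address.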
Often, a single Gaussian model is inadequate to accurately model the temporal changes of a pixel intensity value in a dynamic background, such as a background with changing shadows due to changes in lighting conditions. Therefore, more complex systems include mechanisms for rejecting lighting changes as uninteresting, such as variability caused by cast shadows, Ismail Haritaoglu, David Harwood, and Larry S. Davis, “W4: Who? When? Where? What?” Proceedings of FG'98, IEEE, April 1998.
The use of multiple models to describe dynamic backgrounds at the pixel level was a breakthrough in background modeling. Specifically, methods employing a mixture of Gaussian distributions have become a popular basis for a large number of related applications in recent years.
A mixture of three Gaussian components can be used to model visual properties of each pixel, N. Friedman and S. Russell, “Image segmentation in video sequences: A probabilistic approach,” Thirteenth Conference on Uncertainty in Artificial Intelligence, August 1997. That model also uses an expectation-maximization (EM) process to learn the Gaussian Mixture Model (GMM) over time. In a traffic surveillance application, the intensity value of each pixel is restricted to three hypotheses: road, shadow, and vehicles. Unfortunately, that simple assumption significantly degrades the ability of the GMM to model arbitrary distributions for individual pixels. Moreover, that method is computationally expensive.
Another method allows the scene to be non-static, Chris Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” Computer Vision and Pattern Recognition, volume 2, June 1999. Each pixel is modeled as a mixture of Gaussian distributions with a variable number of Gaussian components. That method represents the background as a multi-modal process, where each mode is a static model plus a zero-mean, white, Gaussian noise process. The models can be updated in real-time using approximations. That video surveillance system has been proven robust for day and night cycles, and for scene changes over long periods of time. However, for backgrounds that exhibit very rapid variations, such as ripples on water, ocean waves, or moving grass and trees, that model can result in a distribution with a large variance over a long video sequence. Thus, the sensitivity for detecting foreground objects is reduced significantly.
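A per-pixel mixture update of the kind described above might be sketched as follows. This is a simplified illustration, not the cited method itself: it uses a single fixed learning rate for the weights, means, and variances, and the class name, the component limit, the matching threshold, and the background weight ratio are assumptions made for the example.

```python
import math


class PixelMixture:
    """Simplified adaptive Gaussian mixture for one pixel (illustrative sketch)."""

    def __init__(self, max_components=3, learning_rate=0.05, match_sigmas=2.5,
                 initial_variance=36.0, background_ratio=0.7):
        # All default values are assumptions for the example.
        self.max_components = max_components
        self.learning_rate = learning_rate
        self.match_sigmas = match_sigmas
        self.initial_variance = initial_variance
        self.background_ratio = background_ratio
        self.components = []  # each entry is [weight, mean, variance]

    def update(self, value):
        """Update the mixture with one observation; return True if foreground."""
        rate = self.learning_rate
        current = None
        # Find the first component (in rank order) that explains the value.
        for comp in self.components:
            if abs(value - comp[1]) <= self.match_sigmas * math.sqrt(comp[2]):
                current = comp
                break
        # Decay all weights and reinforce the matched component.
        for comp in self.components:
            comp[0] = (1.0 - rate) * comp[0] + (rate if comp is current else 0.0)
        if current is not None:
            diff = value - current[1]
            current[1] += rate * diff
            current[2] = max((1.0 - rate) * current[2] + rate * diff * diff, 1.0)
        else:
            # No match: start a new low-weight component, replacing the
            # lowest-weight one if the mixture is already full.
            current = [rate, float(value), self.initial_variance]
            if len(self.components) < self.max_components:
                self.components.append(current)
            else:
                weakest = min(range(len(self.components)),
                              key=lambda i: self.components[i][0])
                self.components[weakest] = current
        total = sum(comp[0] for comp in self.components)
        for comp in self.components:
            comp[0] /= total
        # Rank by weight / standard deviation: persistent, tight modes come first.
        self.components.sort(key=lambda c: c[0] / math.sqrt(c[2]), reverse=True)
        # The top-ranked components whose cumulative weight first exceeds the
        # ratio are treated as background. Note that the very first observation
        # is therefore always treated as background.
        cumulative = 0.0
        for comp in self.components:
            if comp is current:
                return False
            cumulative += comp[0]
            if cumulative > self.background_ratio:
                break
        return True


mixture = PixelMixture()
for i in range(200):
    mixture.update(50 if i % 2 == 0 else 120)   # background flickers between two modes
print(mixture.update(120), mixture.update(200))
```

In this sketch both flicker intensities end up as background modes, while a value far from either mode is reported as foreground. If the background modes were broad, as with waves or foliage, the matched variances would grow and widen the matching gate, which illustrates the loss of sensitivity noted above.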
A similar competitive multi-modal background process is described by F. Porikli and O. Tuzel, “Human body tracking by adaptive background models and mean-shift analysis,” in Conference on Computer Vision Systems, Workshop on PETS, IEEE, April 2003, incorporated herein by reference.
To address such challenging situations, non-parametric techniques have been developed. Those techniques use statistics of the pixel values to estimate background properties at each pixel, based on multiple recently acquired samples. Those techniques can adapt to rapid background changes, A. Elgammal, D. Harwood, L. S. Davis, “Non-parametric model for background subtraction,” ECCV 2000, June 2000. That method uses a Gaussian kernel function for density estimation. The model represents a history of recent sample values over a long video sequence.
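The non-parametric idea can be illustrated with a kernel density estimate over a sliding window of recent samples for one pixel. The following is a minimal sketch under stated assumptions: the class name, the window length, the fixed kernel bandwidth, and the density threshold are hypothetical, and every sample is added to the history regardless of its classification, which is a simplification.

```python
import math
from collections import deque


class KernelDensityPixel:
    """Gaussian kernel density estimate over recent samples (illustrative sketch)."""

    def __init__(self, history_size=50, bandwidth=5.0, density_threshold=1e-3):
        # All default values are assumptions for the example.
        self.samples = deque(maxlen=history_size)  # oldest samples drop out
        self.bandwidth = bandwidth
        self.density_threshold = density_threshold

    def density(self, value):
        """Estimated probability density of `value` under the sample history."""
        if not self.samples:
            return 0.0
        norm = 1.0 / (self.bandwidth * math.sqrt(2.0 * math.pi))
        total = sum(math.exp(-0.5 * ((value - s) / self.bandwidth) ** 2)
                    for s in self.samples)
        return norm * total / len(self.samples)

    def is_foreground(self, value):
        # A value that is improbable under the recent history is foreground.
        return self.density(value) < self.density_threshold

    def observe(self, value):
        """Classify a value, then add it to the history unconditionally."""
        foreground = self.is_foreground(value)
        self.samples.append(value)
        return foreground


pixel = KernelDensityPixel()
for i in range(50):
    pixel.observe(40 + (i % 3) * 30)   # background cycles through 40, 70, 100
print(pixel.is_foreground(70), pixel.is_foreground(200))
```

Because the estimate is built directly from samples, it can follow a multi-modal or rapidly changing background without choosing a number of modes in advance, at the cost of storing and evaluating the full history for every pixel.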
Other similar techniques emphasize a variable size kernel for the purpose of adaptive density estimation. A kernel corresponds to a search region in the data space. As another feature, an optical flow can be used, Anurag Mittal, Nikos Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” CVPR 2004, Volume 2, pp. 302-309, June, 2004.
Other techniques that deal with effective background modeling can be categorized as predictive methods. Predictive methods treat pixel intensity changes as a time series and use a temporal model to predict a next pixel value, based on past observations. The deviation between the predicted value and the actual observation can be used to adjust the parameters of the predictive model.
Other methods use filters. For example, a Kalman filter can model the dynamic properties of each pixel, Dieter Koller, Joseph Weber, and Jitendra Malik, “Robust multiple car tracking with occlusion reasoning,” ECCV'94, May 1994. A simple version of the Kalman filter, e.g., the Wiener filter, can make probabilistic predictions based on a recent history of pixel intensity values.
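As an illustration of the predictive, filter-based approach, a scalar Kalman filter for one pixel might be sketched as follows. This is not the implementation of the cited work; it assumes a random-walk model for the background intensity, and the class name, the process and measurement variances, and the three standard deviation gate are assumptions made for the example.

```python
class ScalarKalmanPixel:
    """Scalar Kalman filter tracking one pixel's background intensity (sketch)."""

    def __init__(self, initial_value, process_variance=1.0,
                 measurement_variance=16.0, gate_sigmas=3.0):
        # All default values are assumptions for the example.
        self.estimate = float(initial_value)
        self.error_variance = float(measurement_variance)
        self.process_variance = process_variance
        self.measurement_variance = measurement_variance
        self.gate_sigmas = gate_sigmas

    def step(self, value):
        """Predict, compare with the observation, and correct; True if foreground."""
        # Predict: under a random-walk model the next value equals the current
        # estimate, with uncertainty grown by the process variance.
        predicted = self.estimate
        predicted_variance = self.error_variance + self.process_variance
        # The deviation between prediction and observation drives the update.
        innovation = value - predicted
        innovation_variance = predicted_variance + self.measurement_variance
        foreground = innovation ** 2 > self.gate_sigmas ** 2 * innovation_variance
        if foreground:
            # Skip the correction so a foreground object does not pull the
            # background estimate; the uncertainty keeps growing, so a
            # persistent change is eventually accepted.
            self.error_variance = predicted_variance
        else:
            gain = predicted_variance / innovation_variance
            self.estimate = predicted + gain * innovation
            self.error_variance = (1.0 - gain) * predicted_variance
        return foreground


pixel = ScalarKalmanPixel(100)
for value in [101, 103, 104, 106, 107, 109]:
    pixel.step(value)          # slow drift is tracked as background
print(pixel.step(200))         # an abrupt change exceeds the gate
```

The gain balances trust in the prediction against trust in the new observation, so gradual illumination drift is followed while abrupt changes are rejected.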
An autoregressive model captures properties of dynamic scenes for the purpose of simulating similar textures, G. Doretto, A. Chiuso, S. Soatto, Y. N. Wu, “Dynamic textures,” IJCV 51(2), pp. 91-109, 2003. That method was improved to address the modeling of dynamic backgrounds and to perform foreground detection in video surveillance, Antoine Monnet, Anurag Mittal, Nikos Paragios, Visvanathan Ramesh, “Background modeling and subtraction of dynamic scenes,” ICCV'03, p. 1305, October, 2003, and Jing Zhong and Stan Sclaroff, “Segmenting foreground objects from a dynamic textured background via a robust Kalman Filter,” ICCV'03, pp. 44-50, 2003. Although good results have been obtained for some challenging sample videos, the computational cost of using such an autoregressive model is high.
In general, conventional background modeling suffers from two major disadvantages. First, the computational complexity of those models is inherently high. This is a particular problem in a large scale video surveillance system, where a large number of videos are acquired concurrently, and where it is desired to track objects in the videos in real-time. Conventional systems require costly network, storage, and processing resources.
Therefore, it is desired to provide a system and method for concurrently tracking multiple objects in a large number of videos with reduced network, storage, and processing resources.
In addition, conventional methods assume that objects move at a speed that is consistent with the objects, and that the objects to be tracked have a substantial amount of overlap in successive frames. Thus, the conventional methods expect the location of an object in one frame to be substantially co-located with the object in a next frame.
Therefore, it is desired to track objects that move at speeds that are inconsistent with the objects.