Many computer vision and video surveillance applications seek to identify moving objects, for example, pedestrians, vehicles, or events of interest in different environments. Typically, the detection of unusual motion is performed first. Motion detection distinguishes moving ‘foreground’ objects from an otherwise normally static ‘background.’ This stage is often referred to as ‘foreground detection’ or ‘background subtraction’. A number of known techniques use different types of background models and update those models at the pixel level.
Over time, the intensity value of an individual pixel in a static background usually follows a normal distribution. Hence, a reasonable model to represent such a statistical distribution is a single Gaussian model, C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” PAMI, 19(7), pp. 780-785, July 1997.
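As an illustration of the single Gaussian idea, the following minimal sketch keeps a running mean and variance for one pixel and flags a sample as foreground when it lies too far from the mean. It is not the procedure of the cited reference. The class name, the learning rate `alpha`, the threshold `k`, and the variance floor `min_var` are assumptions chosen only for the example.

```python
import math


class SingleGaussianPixel:
    """Running single-Gaussian background model for one pixel (illustrative only)."""

    def __init__(self, initial_value, initial_var=25.0, alpha=0.05, k=2.5, min_var=1.0):
        self.mean = float(initial_value)
        self.var = float(initial_var)
        self.alpha = alpha      # assumed learning rate
        self.k = k              # assumed threshold, in standard deviations
        self.min_var = min_var  # assumed floor that keeps the variance from collapsing

    def is_foreground(self, value):
        # A sample far from the mean, relative to the standard deviation,
        # is unlikely under the background distribution.
        return abs(value - self.mean) > self.k * math.sqrt(self.var)

    def update(self, value):
        """Classify one sample, then adapt the model only if it was background."""
        foreground = self.is_foreground(value)
        if not foreground:
            diff = value - self.mean
            self.mean += self.alpha * diff
            self.var = max((1.0 - self.alpha) * self.var + self.alpha * diff * diff,
                           self.min_var)
        return foreground
```

In this sketch, one such object would be kept per pixel, and `update` would be called once per frame with the pixel intensity.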
Often, a single Gaussian model is inadequate to accurately model the temporal changes of a pixel intensity value in a dynamic background, such as a background with changing shadows due to changes in lighting conditions. The use of multiple models to describe dynamic backgrounds at the pixel level was a breakthrough in background modeling. Specifically, methods employing a mixture of Gaussian distributions have become a popular basis for a large number of related techniques in recent years.
A mixture of three Gaussian components can be used to model visual properties of each pixel, N. Friedman and S. Russell, “Image segmentation in video sequences: A probabilistic approach,” Thirteenth Conference on Uncertainty in Artificial Intelligence, August 1997. That model also uses an expectation-maximization (EM) process to learn the Gaussian Mixture Model (GMM) over time. In a target traffic surveillance application, the intensity value of each pixel is restricted to three hypotheses: road, shadow, and vehicles. Unfortunately, that simple assumption significantly degrades the ability of the GMM to model arbitrary distributions for individual pixels. Moreover, that method is computationally expensive.
Another technique models each pixel as a mixture of Gaussian distributions with a variable number of Gaussian components, W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee, “Using adaptive tracking to classify and monitor activities in a site,” CVPR'98, 1998. Those models can be updated in real-time using approximations. That video surveillance system has been proven robust for day and night cycles, and for scene changes over long periods of time. However, for backgrounds that exhibit very rapid variations, such as ripples on water, ocean waves, or moving grass and trees, that model can result in a distribution with a large variance over a long video sequence. Thus, the sensitivity for detecting foreground objects is reduced significantly.
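A simplified online update in the spirit of such an adaptive mixture can be sketched as follows. This is an approximation for illustration, not the exact procedure of the cited reference. The class name, the single learning rate `alpha` used for weights, means, and variances, the matching rule, and the `background_weight` threshold are all assumptions.

```python
import math


class PixelMixture:
    """Adaptive mixture of Gaussians for one pixel (simplified, illustrative only)."""

    def __init__(self, max_components=3, alpha=0.05, match_sigmas=2.5,
                 init_var=36.0, min_var=1.0, background_weight=0.7):
        self.max_components = max_components
        self.alpha = alpha                          # assumed learning rate
        self.match_sigmas = match_sigmas            # assumed match distance
        self.init_var = init_var
        self.min_var = min_var
        self.background_weight = background_weight  # assumed background share
        self.components = []                        # each entry: [weight, mean, variance]

    def update(self, value):
        """Update with one intensity sample; return True if it looks like foreground."""
        # Find the first (highest ranked) component that the sample matches.
        matched = None
        for comp in self.components:
            if abs(value - comp[1]) <= self.match_sigmas * math.sqrt(comp[2]):
                matched = comp
                break
        # Raise the weight of the matched component and decay the others.
        for comp in self.components:
            hit = 1.0 if comp is matched else 0.0
            comp[0] = (1.0 - self.alpha) * comp[0] + self.alpha * hit
        if matched is not None:
            diff = value - matched[1]
            matched[1] += self.alpha * diff
            matched[2] = max((1.0 - self.alpha) * matched[2] + self.alpha * diff * diff,
                             self.min_var)
        else:
            # No match: start a new low-weight component, replacing the weakest if full.
            new_comp = [self.alpha, float(value), self.init_var]
            if len(self.components) < self.max_components:
                self.components.append(new_comp)
            else:
                weakest = min(range(len(self.components)),
                              key=lambda i: self.components[i][0])
                self.components[weakest] = new_comp
        total = sum(comp[0] for comp in self.components)
        for comp in self.components:
            comp[0] /= total
        # Rank by weight / sigma; the top components up to the weight threshold
        # are treated as background.
        self.components.sort(key=lambda c: c[0] / math.sqrt(c[2]), reverse=True)
        background, cumulative = [], 0.0
        for comp in self.components:
            background.append(comp)
            cumulative += comp[0]
            if cumulative > self.background_weight:
                break
        return matched is None or all(comp is not matched for comp in background)
```

With a pixel that alternates between two intensities, the sketch ends up holding two background components, which a single Gaussian cannot represent without a very large variance.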
To address such challenging situations, non-parametric techniques have been developed. Those techniques use kernel densities to estimate background properties at each pixel, based on multiple recently acquired samples. Those techniques can adapt to rapid background changes, A. Elgammal, D. Harwood, L. S. Davis, “Non-parametric model for background subtraction,” ECCV 2000, June 2000. That method uses a normal kernel function for density estimation. The model represents a history of recent sample values over a long video sequence.
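The kernel density idea can be sketched for a single pixel as follows, using a normal kernel over a fixed-length history of recent samples. The class name, the history length, the fixed `bandwidth`, and the density `threshold` are assumptions made for the example and are not taken from the cited reference.

```python
import math
from collections import deque


class KdePixelModel:
    """Kernel density background model for one pixel (illustrative only)."""

    def __init__(self, history=50, bandwidth=5.0, threshold=1e-3):
        self.samples = deque(maxlen=history)  # oldest samples are dropped automatically
        self.bandwidth = bandwidth            # assumed fixed kernel bandwidth
        self.threshold = threshold            # assumed minimum background density

    def add_sample(self, value):
        self.samples.append(float(value))

    def density(self, value):
        """Estimate the probability density of `value` with a normal kernel."""
        if not self.samples:
            return 0.0
        h = self.bandwidth
        norm = 1.0 / (len(self.samples) * h * math.sqrt(2.0 * math.pi))
        return norm * sum(math.exp(-0.5 * ((value - s) / h) ** 2)
                          for s in self.samples)

    def is_foreground(self, value):
        # A sample that is improbable under the recent history is foreground.
        return self.density(value) < self.threshold
```

Because the estimate is built directly from recent samples, a pixel that cycles rapidly among several intensities keeps a high density at each of them, without assuming any particular parametric shape.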
Another similar technique emphasizes a variable bandwidth kernel for the purpose of adaptive density estimation. As another feature, an optical flow can be used, Anurag Mittal, Nikos Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” CVPR 2004, Volume 2, pp. 302-309, June, 2004.
Other techniques that deal with effective background modeling can be categorized as predictive methods. Predictive methods treat pixel intensity changes as a time series and use a temporal model to predict a next pixel value, based on past observations. The deviation between the predicted value and the actual observation can be used to adjust the parameters of the predictive model.
Other methods use filters. For example, a Kalman filter can model the dynamic properties of each pixel, Dieter Koller, Joseph Weber, and Jitendra Malik, “Robust multiple car tracking with occlusion reasoning,” ECCV'94, May 1994. A simple version of the Kalman filter, e.g., the Wiener filter, can make probabilistic predictions based on a recent history of pixel intensity values.
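As one instance of the predictive approach, the following minimal sketch applies a scalar Kalman filter with a constant-level state model to a single pixel. The deviation between the prediction and the observation (the innovation) is used both to classify the sample and to correct the estimate. It is not the filter of the cited reference. The class name, the noise variances, and the `gate` threshold are assumptions for illustration.

```python
class ScalarKalmanPixel:
    """Scalar Kalman filter tracking one pixel's background level (illustrative only)."""

    def __init__(self, initial_value, process_var=1.0, measurement_var=16.0, gate=3.0):
        self.x = float(initial_value)  # estimated background intensity
        self.p = measurement_var       # variance of that estimate
        self.q = process_var           # assumed process noise variance
        self.r = measurement_var       # assumed measurement noise variance
        self.gate = gate               # assumed threshold, in innovation std. deviations

    def step(self, z):
        """Process one observation; return True if it looks like foreground."""
        # Predict: the background level is assumed to persist, with growing uncertainty.
        x_pred = self.x
        p_pred = self.p + self.q
        # Compare the observation with the prediction.
        innovation = z - x_pred
        s = p_pred + self.r  # innovation variance
        foreground = innovation * innovation > self.gate ** 2 * s
        if not foreground:
            # Correct the estimate in proportion to the Kalman gain.
            k = p_pred / s
            self.x = x_pred + k * innovation
            self.p = (1.0 - k) * p_pred
        return foreground
```

In this sketch, foreground samples leave the estimate unchanged, so a slowly drifting background is followed while an abrupt change is reported.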
An autoregressive model captures properties of dynamic scenes for the purpose of simulating similar textures, G. Doretto, A. Chiuso, S. Soatto, Y. N. Wu, “Dynamic textures,” IJCV 51(2), pp. 91-109, 2003. That method was improved to address the modeling of dynamic backgrounds and to perform foreground detection in video surveillance, Antoine Monnet, Anurag Mittal, Nikos Paragios, Visvanathan Ramesh, “Background modeling and subtraction of dynamic scenes,” ICCV'03, p. 1305, October, 2003, and Jing Zhong and Stan Sclaroff, “Segmenting foreground objects from a dynamic textured background via a robust Kalman Filter,” ICCV'03, pp. 44-50, 2003. Although good results have been obtained for some challenging sample videos, the computation cost of using such an autoregressive model is high.
In general, pixel-level background modeling suffers from two major disadvantages. First, the computational complexity of those models is inherently high. Every pixel must be processed in each video frame. In many challenging dynamic scenes, a number of different frequency components demand a model with many Gaussian distributions or a highly complicated predictive model to precisely capture the recurrent patterns of motion at a single pixel over time. The performance trade-off between detection accuracy and computation cost is always a hard decision in choosing a pixel-level background model.
Second, the intensity value at individual pixels is very easily affected by noise. In essence, what is lacking in pixel-level models is some higher-level information, which is more robust and can be derived from regions in the frame or even from the entire frame.
One method attempts to guide the pixel-level mixture-of-Gaussians model by incorporating feedback from high-level modules, M. Harville, “A framework for high-level feedback to adaptive, per-pixel, Mixture-of-Gaussian background models,” ECCV'02, vol. 3, pp. 543-560, May 2002. However, the basis of that framework is still a pixel-level background model.
Therefore, there is a need for a background modeling method that considers high-level information in a video.