One fundamental task in many computer vision applications is the segmentation of foreground and background regions in a sequence of frames, i.e., a video, acquired of a scene. The segmentation is useful for higher-level operations, such as tracking moving objects.
One way to detect a region of “moving” pixels in a video is to first acquire a reference frame of a static scene. Subsequent frames acquired of the scene are then subtracted from the reference frame, on a pixel-by-pixel basis, to produce difference images. The intensity values in the difference images can be thresholded to detect the regions of moving pixels that are likely associated with moving objects in the scene.
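The pixel-wise differencing and thresholding described above can be sketched as follows. This is a minimal illustration in Python with NumPy; the function name and threshold value are illustrative, not part of any cited method.

```python
import numpy as np

def difference_mask(reference, frame, threshold=25):
    """Segment 'moving' pixels by subtracting a frame from a static
    reference frame and thresholding the absolute difference.

    reference, frame: 2-D uint8 grayscale arrays of equal shape.
    Returns a boolean mask that is True where the pixel likely moved.
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    return diff > threshold

# Toy example: a static reference and a frame where one pixel changed.
ref = np.zeros((4, 4), dtype=np.uint8)
cur = ref.copy()
cur[1, 2] = 200
mask = difference_mask(ref, cur)
print(mask.sum())  # 1 pixel flagged as moving
```

As the surrounding text notes, the threshold must be tuned per scene, and the approach breaks down as soon as the background itself changes.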
Although this task appears fairly simple, this approach rarely works in real-world applications. Usually, the background is never truly static. Instead, the background varies over time due to lighting changes; movement in the background, for example, clouds, leaves in trees, and waves on water; and camera noise. Moreover, in many applications, it is desirable to model different appearances of the background, for example, differences due to sunlight producing slowly moving shadows in the background that are not necessarily associated with moving foreground objects.
To overcome these problems, adaptive background models and filters have been used. For example, a Kalman filter can model dynamic properties of each pixel, Dieter Koller, Joseph Weber, and Jitendra Malik, “Robust multiple car tracking with occlusion reasoning,” ECCV'94, May 1994. A simple version of the Kalman filter, e.g., the Wiener filter, can make probabilistic predictions based on a recent history of pixel intensity values, K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: Principles and practice of background maintenance,” Proc. 7th Intl. Conf. on Computer Vision, pp. 255-261, 1999.
An alternative method models probability distributions of pixel intensity values, C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” PAMI, 19(7), pp. 780-785, July 1997. That method essentially ignores the order in which observations are made. Usually, each pixel is modeled with a normal distribution N(μ, σ²), which varies over time. Noise is assumed to come from a zero-mean normal distribution N(0, σ²). Hence, a reasonable model to represent such a statistical distribution is a single Gaussian function. The parameters of the model are updated according to an adaptive filter. That model performs adequately when the background of the scene is uni-modal. However, this is usually not the case in real world applications.
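The single-Gaussian-per-pixel idea with an adaptive update can be sketched as follows. This is a simplified illustration, not the Pfinder implementation; the learning rate alpha, the threshold k, and the initial variance are assumed values, and a simple exponential-forgetting filter stands in for whatever adaptive filter a given system uses.

```python
import numpy as np

class SingleGaussianModel:
    """Per-pixel N(mu, sigma^2) background model updated by an
    exponential-forgetting adaptive filter (illustrative sketch).

    alpha    -- assumed learning rate of the adaptive filter
    k        -- foreground threshold in standard deviations
    init_var -- assumed initial per-pixel variance
    """

    def __init__(self, first_frame, alpha=0.05, k=2.5, init_var=100.0):
        self.mu = first_frame.astype(np.float64)
        self.var = np.full_like(self.mu, init_var)
        self.alpha, self.k = alpha, k

    def update(self, frame):
        frame = frame.astype(np.float64)
        d = frame - self.mu
        # A pixel is foreground if it deviates more than k std. devs.
        foreground = d ** 2 > (self.k ** 2) * self.var
        # Blend the observation into the running mean and variance.
        self.mu += self.alpha * d
        self.var = (1 - self.alpha) * self.var + self.alpha * d ** 2
        return foreground

# Usage: feed frames in sequence; each call returns a foreground mask.
frame0 = np.zeros((4, 4), dtype=np.uint8)
frame1 = frame0.copy()
frame1[2, 3] = 200
model = SingleGaussianModel(frame0)
fg = model.update(frame1)
```

As the text notes, such a uni-modal model cannot represent a background whose intensity distribution has several modes.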
Often, a single Gaussian model is inadequate to accurately model the temporal changes of pixel intensity values in a dynamic background, such as a background with changing shadows due to changing lighting conditions. Therefore, more complex systems include mechanisms for rejecting lighting changes, such as intensity variability caused by shadows, Ismail Haritaoglu, David Harwood, and Larry S. Davis, “W4: Who? When? Where? What?” Proceedings of FG'98, IEEE, April 1998.
The use of multiple models to describe dynamic backgrounds at the pixel level was a breakthrough in background modeling. Specifically, methods employing a mixture of Gaussian distributions have become popular for a large number of related computer vision applications.
A mixture of three Gaussian components can be used to model visual properties of each pixel, N. Friedman and S. Russell, “Image segmentation in video sequences: A probabilistic approach,” Thirteenth Conference on Uncertainty in Artificial Intelligence, August 1997. That model also uses an expectation-maximization (EM) process to adapt the Gaussian mixture model (GMM) over time. In a traffic surveillance application, the intensity value of each pixel is restricted to three hypotheses: road, shadow, and vehicles. Unfortunately, that simple assumption significantly degrades the ability of the GMM to model arbitrary distributions for individual pixels. Moreover, that method is computationally demanding.
Another method allows the scene to be non-static, Chris Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” Computer Vision and Pattern Recognition, volume 2, June 1999. In that method, each pixel is modeled as a mixture of Gaussian distributions with a variable number of Gaussian components. Those methods represent the background as a multi-modal process, where each mode is a static model plus a zero-mean, white, Gaussian noise process. The models can be updated in real time using approximations. That video surveillance system is adequate for day and night cycles, and for scene changes over long periods of time.
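The online per-pixel mixture update in the spirit of such methods can be sketched as follows. This is a simplified single-channel sketch, not the published Stauffer-Grimson procedure; the learning rate, match threshold, single-match policy, and replacement rule are assumptions made for illustration.

```python
import numpy as np

def gmm_update(weights, means, variances, x,
               alpha=0.01, match_k=2.5, init_var=225.0):
    """One online update of a per-pixel Gaussian mixture (sketch).

    A new observation x either matches an existing component (within
    match_k standard deviations), which is then adapted, or it replaces
    the least probable component. Returns the updated parameters and
    whether x matched an existing component (i.e., is background-like).
    """
    d2 = (x - means) ** 2
    matches = d2 < (match_k ** 2) * variances
    if matches.any():
        m = int(np.argmax(matches))        # first matching component
        d0 = x - means[m]
        means[m] += alpha * d0
        variances[m] = (1 - alpha) * variances[m] + alpha * d0 ** 2
        weights = (1 - alpha) * weights
        weights[m] += alpha
    else:
        # No component explains x: replace the least probable one.
        m = int(np.argmin(weights))
        means[m], variances[m], weights[m] = x, init_var, alpha
    weights = weights / weights.sum()      # renormalize the mixture
    return weights, means, variances, bool(matches.any())

# Usage with three components; x = 102 matches the component near 100.
w = np.array([0.6, 0.3, 0.1])
m = np.array([100.0, 50.0, 0.0])
v = np.full(3, 100.0)
w, m, v, matched = gmm_update(w, m, v, 102.0)
```

Keeping only a handful of components per pixel is what makes this feasible in real time, at the cost of modeling accuracy.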
That method can be extended by using a feature vector that includes depth information acquired from a pair of stereo cameras, M. Harville, G. Gordon, and J. Woodfill, “Foreground segmentation using adaptive mixture models in color and depth,” in IEEE Workshop on Detection and Recognition of Events in Video, pp. 3-11, 2001.
Gradient information can also be used to achieve a more accurate background segmentation, S. Jabri, Z. Duric, H. Wechsler, and A. Rosenfeld, “Location of people in video images using adaptive fusion of color and edge information,” Proc. 15th Int'l Conf. on Pattern Recognition, volume 4, pp. 627-630, 2000, and O. Javed, K. Shafique, and M. Shah, “A hierarchical approach to robust background subtraction using color and gradient information,” IEEE Workshop on Motion and Video Computing, 2002.
Although a mixture of Gaussian models can converge to any arbitrary distribution provided there are a sufficient number of components, doing so normally requires a large number of components, which is not computationally feasible for real-time applications. Generally, three to five components are used per pixel.
To address such challenging situations, non-parametric techniques have been developed. Those techniques use kernel densities to estimate background properties at each pixel, based on recently acquired samples. Those techniques can adapt to rapid background changes, A. Elgammal, D. Harwood, and L. S. Davis, “Non-parametric model for background subtraction,” ECCV 2000, June 2000. That method uses a normal kernel function for density estimation. The model represents a history of recent sample values over a long video sequence.
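The kernel-density idea can be sketched as follows: the probability of an observed intensity under the background model is estimated by placing a normal kernel on each recent sample. This is an illustrative sketch only; the bandwidth sigma is an assumed fixed value, whereas practical systems estimate it from the data.

```python
import numpy as np

def kde_background_prob(samples, x, sigma=10.0):
    """Non-parametric estimate of P(x | background) for one pixel,
    using a normal kernel over the pixel's recent intensity samples
    (sketch of the kernel-density idea; sigma is an assumed bandwidth).
    """
    samples = np.asarray(samples, dtype=np.float64)
    z = (x - samples) / sigma
    # Average of normal kernels centered on each recent sample.
    kernels = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernels.mean()

# A pixel whose recent history hovers around intensity 100:
history = [98, 101, 99, 100, 102, 97, 103]
p_bg = kde_background_prob(history, 100)  # high density near the history
p_fg = kde_background_prob(history, 200)  # far from history: near zero
print(p_bg > p_fg)
```

Because every recent sample contributes a kernel evaluation at every pixel, the cost grows with the history length, which is the computational drawback noted below.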
Other similar techniques emphasize a variable bandwidth kernel for the purpose of adaptive density estimation. An optical flow can also be used, Anurag Mittal, Nikos Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” CVPR 2004, Volume 2, pp. 302-309, June, 2004.
Although non-parametric models seem like a reasonable choice for background modeling, they are computationally demanding and cannot be used for real-time applications.
Another method represents the scene as discrete states. The states correspond to environmental conditions in the scene. The method switches among the states according to observations. Hidden Markov models (HMMs) are very suitable for this purpose. A three-state HMM is used by Rittscher et al., “A probabilistic background model for tracking,” Proc. European Conf. on Computer Vision, volume II, pp. 336-350, 2000. Another method learns a topology from the observations, B. Stenger, V. Ramesh, N. Paragios, F. Coetzee, and J. Buhmann, “Topology free hidden Markov models: Application to background modeling,” Proc. 8th Intl. Conf. on Computer Vision, pp. 294-301, 2001.
Therefore, it is desired to provide a method for modeling a dynamic scene. Furthermore, it is desired to use the model to track objects in videos acquired at very low frame rates.