Many computer vision and video surveillance applications seek to identify moving foreground objects, for example, pedestrians, vehicles, or events of interest in different scenes. Generally, something is interesting in a scene when it is substantially different from a background model of a stationary scene acquired by a camera. The simplest background model assumes that the scene is truly static over time.
Over time, the intensity value of an individual pixel in a static background usually follows a normal distribution. Therefore, the uninteresting variability in the scene can be modeled adequately by a unimodal, zero-mean, ‘white’, Gaussian noise process. Hence, a reasonable model to represent such a statistical distribution is a single Gaussian model, C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” PAMI, 19(7), pp. 780-785, July 1997.
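A single Gaussian background model of this kind can be sketched as follows. This is a minimal illustration, not the Pfinder implementation: the function names, the threshold `k`, and the variance floor `min_std` are assumptions introduced here.

```python
import numpy as np

def fit_single_gaussian(frames):
    """Per-pixel mean and standard deviation over a frame stack of shape (T, H, W)."""
    frames = np.asarray(frames, dtype=np.float64)
    return frames.mean(axis=0), frames.std(axis=0)

def foreground_mask(frame, mean, std, k=3.0, min_std=1e-3):
    """Flag pixels that deviate from the mean by more than k standard deviations."""
    std = np.maximum(std, min_std)  # hypothetical floor to avoid a zero threshold
    return np.abs(np.asarray(frame, dtype=np.float64) - mean) > k * std

# Synthetic static scene: constant intensity 100 plus white Gaussian noise.
rng = np.random.default_rng(0)
background = 100.0 + rng.normal(0.0, 2.0, size=(200, 4, 4))
mean, std = fit_single_gaussian(background)

# A new frame with one pixel that differs substantially from the model.
test_frame = np.full((4, 4), 100.0)
test_frame[1, 2] = 160.0
mask = foreground_mask(test_frame, mean, std)
```

Only the pixel at row 1, column 2 exceeds the threshold in this synthetic example.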
Often, a single Gaussian model is inadequate to accurately model the temporal changes of a pixel intensity value in a dynamic background, such as a background with changing shadows due to changes in lighting conditions. Therefore, more complex systems include mechanisms for rejecting lighting changes as uninteresting, such as variability caused by cast shadows, Ismail Haritaoglu, David Harwood, and Larry S. Davis, “W4: Who? When? Where? What?” Proceedings of FG'98, IEEE, April 1998.
The use of multiple models to describe dynamic backgrounds at the pixel level was a breakthrough in scene modeling. Specifically, methods employing a mixture of Gaussian distributions have become a popular basis for a large number of related applications in recent years.
A mixture of three Gaussian components can be used to model visual properties of each pixel, N. Friedman and S. Russell, “Image segmentation in video sequences: A probabilistic approach,” Thirteenth Conference on Uncertainty in Artificial Intelligence, August 1997. That model also uses an expectation-maximization (EM) process to learn the Gaussian Mixture Model (GMM) over time. In a target traffic surveillance application, the intensity value of each pixel is restricted to three hypotheses: road, shadow, and vehicles. Unfortunately, that simple assumption significantly degrades the ability of the GMM to model arbitrary distributions for individual pixels. Moreover, that method is computationally expensive.
Another method allows the scene to be non-static, Chris Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” Computer Vision and Pattern Recognition, volume 2, June 1999. Each pixel is modeled as a mixture of Gaussian distributions with a variable number of Gaussian components. That method represents the background as a multi-modal process, where each mode is a static model plus a zero-mean, white, Gaussian noise process. The models can be updated in real-time using approximations. That video surveillance system has been proven robust for day and night cycles, and for scene changes over long periods of time.
However, for scenes that exhibit very rapid variations, such as ripples on water, ocean waves, or moving grass and trees, that model can result in a distribution with a large variance over a long video sequence. Thus, the sensitivity for detecting foreground objects is reduced significantly.
To address such challenging situations, non-parametric techniques have been developed. Those techniques use kernel densities to estimate properties of each pixel based on multiple recently acquired samples and can adapt to rapid changes in the background of a scene, A. Elgammal, D. Harwood, L. S. Davis, “Non-parametric model for background subtraction,” ECCV 2000, June 2000. That method uses a normal kernel function for density estimation. The model represents a history of recent sample values over a long video sequence.
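The kernel density idea can be illustrated with a minimal sketch for a single pixel. The function name, the fixed `bandwidth`, and the sample values are assumptions made for illustration; the cited method selects its own bandwidth and works on real video samples.

```python
import numpy as np

def kde_background_probability(samples, value, bandwidth):
    """Estimate the density at `value` from recent samples using a normal kernel."""
    samples = np.asarray(samples, dtype=np.float64)
    z = (value - samples) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean()

# Recent history of one pixel that alternates between two background appearances.
recent = np.array([50, 52, 51, 120, 122, 49, 121, 50], dtype=float)

p_bg = kde_background_probability(recent, 51.0, bandwidth=2.0)  # near one mode
p_fg = kde_background_probability(recent, 85.0, bandwidth=2.0)  # between the modes
```

A value close to either mode receives a much higher density than a value far from all recent samples, so thresholding the density separates background from candidate foreground without committing to a parametric form.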
Another similar technique emphasizes a variable bandwidth kernel for the purpose of adaptive density estimation. As another feature, an optical flow can be used, Anurag Mittal, Nikos Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” CVPR 2004, Volume 2, pp. 302-309, June, 2004.
Other techniques that deal with effective scene modeling can be categorized as predictive methods. Predictive methods treat pixel intensity changes as a time series and use a temporal model to predict a next pixel value, based on past observations. The deviation between the predicted value and the actual observation can be used to adjust the parameters of the predictive model.
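As a hedged illustration of the predictive approach, the sketch below uses a normalized least-mean-squares (NLMS) linear predictor, which is one possible temporal model and is not taken from any of the cited methods. The names `nlms_prediction_errors`, `order`, `step`, and `eps` are assumptions, and the trace is assumed to have its mean removed.

```python
import numpy as np

def nlms_prediction_errors(trace, order=4, step=0.5, eps=1e-6):
    """Predict each sample from the previous `order` samples and adapt the weights.

    Returns the prediction error (actual minus predicted) at each time step.
    """
    trace = np.asarray(trace, dtype=np.float64)
    weights = np.zeros(order)
    errors = np.zeros(len(trace))
    for t in range(order, len(trace)):
        past = trace[t - order:t][::-1]        # most recent sample first
        prediction = weights @ past
        errors[t] = trace[t] - prediction      # deviation from the observation
        # The deviation adjusts the parameters of the predictive model.
        weights += step * errors[t] * past / (eps + past @ past)
    return errors

# Mean-removed intensity trace of a pixel with a periodic appearance change.
t = np.arange(400)
trace = 20.0 * np.sin(2.0 * np.pi * t / 20.0)
errors = nlms_prediction_errors(trace)

early = np.mean(np.abs(errors[4:24]))
late = np.mean(np.abs(errors[-20:]))
```

Once the predictor has adapted, a large prediction error at a pixel indicates an observation that the temporal model cannot explain.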
Other methods use filters. For example, a Kalman filter can model the dynamic properties of each pixel, Dieter Koller, Joseph Weber, and Jitendra Malik, “Robust multiple car tracking with occlusion reasoning,” ECCV'94, May 1994. A simple version of the Kalman filter, e.g., the Wiener filter, can make probabilistic predictions based on a recent history of pixel intensity values.
An autoregressive model captures properties of dynamic scenes for the purpose of simulating similar textures, G. Doretto, A. Chiuso, S. Soatto, Y. N. Wu, “Dynamic textures,” IJCV 51(2), pp. 91-109, 2003. That method was improved to address the modeling of dynamic backgrounds and to perform foreground detection in video surveillance, Antoine Monnet, Anurag Mittal, Nikos Paragios, Visvanathan Ramesh, “Background modeling and subtraction of dynamic scenes,” ICCV'03, p. 1305, October, 2003; and Jing Zhong and Stan Sclaroff, “Segmenting foreground objects from a dynamic textured background via a robust Kalman Filter,” ICCV'03, pp. 44-50, 2003. Although good results have been obtained for some challenging sample videos, the computation cost of using such an autoregressive model is high.
In general, conventional scene modeling suffers from two major disadvantages. First, the computational complexity of those models is inherently high. Every pixel must be processed in each video frame. In many challenging dynamic scenes, a number of different frequency components demand a model with many Gaussian distributions or a highly complicated predictive model to precisely capture the recurrent patterns of motion at a single pixel over time. The performance trade-off between detection accuracy and computation cost is always a hard decision in choosing a pixel-level scene model.
Secondly, the intensity value at individual pixels is very easily affected by noise. In essence, what is lacking in pixel-level models is some higher-level information, which is more robust and can be derived from regions in the frame or even from the entire frame.
One method attempts to guide the pixel-level mixture-of-Gaussians model by incorporating feedback from high-level modules, M. Harville, “A framework for high-level feedback to adaptive, per-pixel, Mixture-of-Gaussian background models,” ECCV'02, vol. 3, pp. 543-560, May 2002. However, the basis of that framework is still a pixel-level model.
Most of the above referenced techniques share a common assumption of a white Gaussian process. They assume that the observation process has independent increments, Henry Stark and John W. Woods, “Probability, Random Processes, and Estimation Theory for Engineers,” Prentice Hall, 2nd edition, 1994.
Cyclostationarity
The independent increments assumption means that two samples drawn from the same pixel location are independent. The samples can be drawn from the same probability distribution, but the samples are independent samples from that distribution. A segmentation process, e.g., background subtraction, determines whether samples are drawn from the background distribution, or from some other, more interesting ‘foreground’ distribution. By assuming independent increments, the techniques rely completely on the appearance of the scene.
Consider the case of a tree blowing in the wind. The multi-modal model of Stauffer et al. would model the appearance of the sky, leaves, and branches separately. As the tree moves, an individual pixel can image any of these. The independent increments assumption says that these different appearances can manifest in any order. However, the tree moves with a characteristic frequency response that is related to the physical composition of the tree. That characteristic response should constrain the ways that the various appearances are modeled.
Specifically, given two samples from an observation process, X[k] and X[l], the independent increments assumption states that the autocorrelation function Rx[k, l] is zero when k≠l:
$$R_x[k,l] \triangleq E\big[X[k]\,X^{*}[l]\big] \tag{1}$$

$$\phantom{R_x[k,l]} = \sigma^{2}\,\delta[k-l], \tag{2}$$

where σ² = E[X[k]X*[k]] is the sample covariance, and δ[k−l] is a discrete-time impulse function. This function is correct when the process is stationary and white, such as a static scene observed with white noise.
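The impulse-shaped autocorrelation of a stationary white process can be checked numerically. The sketch below is illustrative only: the estimator, the synthetic noise, and the names `autocorrelation`, `sigma`, and `max_lag` are assumptions, and the time average stands in for the expectation under the stationarity assumption.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Biased sample estimate of R_x[k, k+lag] for lag = 0..max_lag.

    Assumes the process is stationary, so averaging over k is meaningful.
    """
    x = np.asarray(x)
    n = len(x)
    return np.array([
        np.sum(x[:n - lag] * np.conj(x[lag:])) / n
        for lag in range(max_lag + 1)
    ])

# Zero-mean white Gaussian noise, as observed at a pixel in a static scene.
rng = np.random.default_rng(0)
sigma = 2.0
white = rng.normal(0.0, sigma, size=20000)

r = autocorrelation(white, max_lag=5)
# r[0] approximates sigma**2; r[1:] are close to zero.
```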
For a situation where the observations are driven by some physical, dynamic process, the dynamic process leaves a spectral imprint on the observation covariance. If the process is simply periodic, then one expects to see very similar observations occurring with a period of T samples. In contrast to the above model, one has:

$$R_x[k, k+T] \neq 0.$$

This process is cyclostationary when the above relationship is true for all time periods.
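A small numerical sketch makes the contrast concrete. The signals are synthetic and the names `lag_correlation`, `T`, `periodic`, and `white` are assumptions chosen for illustration; both signals have the same total power, but only the periodic one correlates with itself at a lag of T samples.

```python
import numpy as np

def lag_correlation(x, lag):
    """Time-averaged estimate of E[X[k] X[k+lag]] for a real-valued trace."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    return np.sum(x[:n - lag] * x[lag:]) / n

rng = np.random.default_rng(1)
T = 25
k = np.arange(5000)

# Periodic observation (power 4.5) plus unit-variance white noise.
periodic = 3.0 * np.sin(2.0 * np.pi * k / T) + rng.normal(0.0, 1.0, size=k.size)
# White noise with the same total power (4.5 + 1.0).
white = rng.normal(0.0, np.sqrt(5.5), size=k.size)

r_periodic = lag_correlation(periodic, T)  # clearly nonzero
r_white = lag_correlation(white, T)        # close to zero
```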
More generally, wide-sense cyclostationarity is defined as:

$$\mu[k] = \mu[k+T] \quad \forall k, \text{ and} \tag{3}$$

$$K_x[k,l] = K_x[k+T,\,l+T] \quad \forall k,l, \tag{4}$$

where Kx[k, l] is an autocovariance function for processes that are not zero-mean, see Stark et al., above. These types of processes can be more complex than the simply periodic.
As shown in FIG. 1, these processes are characterized by significant structure in their autocorrelation function, as expressed by a self-similarity matrix 100. The matrix in FIG. 1 is derived from one particular pixel location in a sequence of frames of waves lapping on a beach.
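One common way to form a self-similarity matrix from a scalar pixel trace is sketched below. The text does not state how the matrix of FIG. 1 was computed, so the absolute-difference measure and the function name `self_similarity` are assumptions made purely for illustration.

```python
import numpy as np

def self_similarity(trace):
    """Pairwise absolute difference between samples of one pixel trace.

    Entry (k, l) is small when the observations at times k and l are similar.
    """
    trace = np.asarray(trace, dtype=np.float64)
    return np.abs(trace[:, None] - trace[None, :])

# Noise-free periodic trace with a period of T samples.
T = 16
k = np.arange(128)
trace = 100.0 + 10.0 * np.sin(2.0 * np.pi * k / T)

S = self_similarity(trace)
# Entries at offsets that are multiples of T from the main diagonal are near zero,
# which produces the banded structure typical of periodic processes.
```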
FIG. 2 shows a sample trace 200 from the same pixel. The process is said to be harmonizable when the autocorrelation of the process can be reduced to the form Rx[k−l]. That is, the autocorrelation is completely defined by the time difference between the samples.
It is possible to estimate the spectral signature of harmonizable, cyclostationary processes in a compact, parametric representation utilizing a Fourier transform, Dominique Dehay and H. L. Hurd, “Representation and estimation for periodically and almost periodically correlated random processes,” in W. A. Gardner, editor, Cyclostationarity in Communications and Signal Processing, IEEE Press, 1993.
FIG. 3 shows an example Fourier transform 300 of the same pixel used for FIGS. 1 and 2.
In the case of evenly sampled, discrete observation processes as used in computer vision applications, a fast Fourier transform (FFT) can be used.
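A minimal sketch of extracting a spectral signature from one pixel trace with an FFT follows. The function name `spectral_signature`, the mean removal, and the synthetic trace are assumptions for illustration only.

```python
import numpy as np

def spectral_signature(trace):
    """Magnitude spectrum of a mean-removed, evenly sampled pixel trace."""
    trace = np.asarray(trace, dtype=np.float64)
    return np.abs(np.fft.rfft(trace - trace.mean()))

# Synthetic trace: periodic intensity change with period T samples plus noise.
rng = np.random.default_rng(2)
n, T = 256, 16
k = np.arange(n)
trace = 120.0 + 10.0 * np.sin(2.0 * np.pi * k / T) + rng.normal(0.0, 1.0, size=n)

spectrum = spectral_signature(trace)
peak_bin = int(np.argmax(spectrum))  # the periodic component falls in bin n // T
```

With n samples, a component of period T samples appears at frequency bin n / T, so the dominant motion frequency at a pixel can be read directly from the spectrum.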
It is desired to construct a scene model that represents these spectral signatures of a scene. Furthermore, it is desired to detect changes in the scene that are inconsistent with these spectral signatures. By leveraging these dynamic constraints, it should be possible to achieve higher specificity than a prior art background segmentation process that ignores these constraints. With such a scene model, it should be possible to locate low-contrast objects embedded in high-variance, dynamic scenes that are largely inaccessible to conventional techniques.
Spectral Similarity
Spectral fingerprints can be used as a classification feature. However, prior art spectral methods have only been used to classify stationary foreground objects, Ross Cutler and Larry S. Davis, “Robust real-time periodic motion detection, analysis, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp. 781-796, August 2000; Fang Liu and Rosalind W. Picard, “Finding periodicity in space and time,” International Conference on Computer Vision, Narosa Publishing House, 1998; and Yang Ran, Isaac Weiss, Qinfen Zheng, and Larry S. Davis, “An efficient and robust human classification algorithm using finite frequencies probing,” Conference on Computer Vision and Pattern Recognition Workshop, IEEE, June 2004.
That is, the objects are either stationary in the video or the objects have been extracted from the scene and stabilized by some other process, typically one of the background segmentation schemes discussed above combined with some kind of tracker framework.
Some prior art representations for temporal textures in videos permit searching for specific activities. Those representations needed to be compact for storage in databases and concise for quick indexing. As a result, those representations summarize the spectral content as a single number, for example, a ratio of harmonic power to non-harmonic power in the signal. This involves extracting specific features from the signal in the Fourier domain.
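To make the single-number summary concrete, the sketch below computes one possible ratio of harmonic to non-harmonic power. The exact definition used by the prior art is not given in the text, so the choice of the strongest bin as the fundamental, the limit `max_harmonics`, and the function name are all assumptions.

```python
import numpy as np

def harmonic_power_ratio(trace, max_harmonics=3):
    """Ratio of power at the fundamental and its first harmonics to all other power."""
    trace = np.asarray(trace, dtype=np.float64)
    power = np.abs(np.fft.rfft(trace - trace.mean())) ** 2
    fundamental = int(np.argmax(power[1:])) + 1       # strongest non-DC bin
    bins = fundamental * np.arange(1, max_harmonics + 1)
    bins = bins[bins < len(power)]                    # keep harmonics inside the spectrum
    harmonic = power[bins].sum()
    non_harmonic = power[1:].sum() - harmonic
    return harmonic / max(non_harmonic, 1e-12)

rng = np.random.default_rng(3)
n, T = 256, 16
k = np.arange(n)
periodic = 10.0 * np.sin(2.0 * np.pi * k / T) + rng.normal(0.0, 1.0, size=n)
white = rng.normal(0.0, 1.0, size=n)

ratio_periodic = harmonic_power_ratio(periodic)
ratio_white = harmonic_power_ratio(white)
```

The periodic trace yields a much larger ratio than white noise, but the summary discards the shape of the spectrum, which is the limitation the following paragraph addresses.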
However, it is desired to make no prior assumptions about what features are interesting in the frequency domain. That is, it is desired to use the Fourier signal directly.
One surveillance method uses spectral fingerprints obtained by an analysis of the full process autocorrelation function. For example, that method can detect pedestrians and pedestrians with ‘sprung’ masses, e.g., backpacks, satchels, and the like. However, the word ‘detect’ is somewhat misleading. That method classifies objects as pedestrian and non-pedestrian after the objects are extracted using conventional segmentation techniques. As a result, that method makes an independent increments assumption about the scene dynamics, while exploiting rich descriptions of foreground object dynamics.
Another system uses a priori models, Zongyi Liu and S. Sarkar, “Challenges in segmentation of human forms in outdoor video,” Perceptual Organization in Computer Vision, IEEE, June 2004. They model a particular foreground process in a video that is deemed a priori to be interesting, e.g., the periodicity in pedestrian motion. However, they also assume that the foreground object has already been segmented from the background. The periodicity is only used to classify a particular motion after the foreground object has been segmented.
It is desired to construct a model of an observed scene in situ, without having any preconceived knowledge of what the underlying process is. Such a model would be sensitive to anything that is sufficiently different in the scene.