Foreground Detection
In many computer vision applications, e.g., surveillance, tracking, and recognition applications, a necessary first step is to detect foreground objects in a scene. Typically, this is done by a background subtraction method 100, as shown in FIG. 1A. A sequence of input images 110 acquired of a scene is processed 120 to generate a background image 130. The background image is then subtracted 140 from the images 110 to yield foreground images 150.
Prior art foreground detection methods either make strict assumptions about the composition of the scene, fail to handle abrupt illumination changes, or are computationally complex and time consuming.
Prior art foreground detection methods can generally be classified as single-layer or multi-layer. Single-layer methods develop a single dynamical model based on the past observations.
The simplest way to construct a single-layer background image is to measure the mean or variance of the pixel intensities over the sequence. The intensities can be measured on a per (RGB) channel basis, to characterize color variations. The measured statistics can then be used as thresholds to detect foreground regions. However, such an averaging operation usually produces ‘ghost’ regions that are neither the true background nor the true foreground.
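For illustration only, a minimal sketch of such a single-layer model in Python with NumPy follows. The function names, the deviation factor k, and the floor min_std are assumptions introduced for this example and are not taken from the cited art.

```python
import numpy as np

def mean_variance_background(frames):
    """Per-pixel, per-channel mean and standard deviation of a (T, H, W, C) stack."""
    stack = np.asarray(frames, dtype=np.float64)
    return stack.mean(axis=0), stack.std(axis=0)

def foreground_mask(frame, mean, std, k=2.5, min_std=1.0):
    """Flag a pixel as foreground if any channel deviates more than k sigma from the mean.

    k and min_std are illustrative choices; min_std avoids a zero threshold
    for pixels that never changed in the training frames.
    """
    sigma = np.maximum(std, min_std)
    deviation = np.abs(np.asarray(frame, dtype=np.float64) - mean)
    return (deviation > k * sigma).any(axis=-1)
```

If a moving object is present in the frames used to compute the mean, its intensities are averaged into the background, which is one way the ghost regions described above can arise.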
One method models statistical distributions of background pixels with a Gaussian function and alpha-blending, C. R. Wren, A. Azarbayejani, T. J. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” PAMI, 19(7), pp. 780-785, July 1997. A background image B_t is updated according to a current image I_t using a preset blending weight α, such that B_t = (1 − α)B_{t−1} + αI_t.
The blending weight α adjusts how fast the background blends with the current image. That averaging method is very sensitive to the selection of the blending weight. Depending on the value of α, either foreground objects are subsumed into the background (α too large), or illumination changes are not accommodated (α too small).
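The sensitivity to the blending weight can be seen in a short sketch of the update in Python with NumPy. The function names and the synthetic test sequence are assumptions made for this example; only the update rule itself comes from the description above.

```python
import numpy as np

def blend_background(background, frame, alpha):
    """One alpha-blending update: B_t = (1 - alpha) * B_{t-1} + alpha * I_t."""
    return (1.0 - alpha) * background + alpha * frame

def run_blending(frames, alpha):
    """Apply the update over a frame sequence, starting from the first frame."""
    frames = np.asarray(frames, dtype=np.float64)
    background = frames[0].copy()
    for frame in frames[1:]:
        background = blend_background(background, frame, alpha)
    return background

if __name__ == "__main__":
    # Synthetic case: a dark scene (0) in which a bright object (100)
    # enters and stays still for 10 frames.
    frames = [np.zeros((2, 2))] + [np.full((2, 2), 100.0)] * 10
    for alpha in (0.01, 0.5):
        print(alpha, run_blending(frames, alpha)[0, 0])
```

In this synthetic case, the larger weight lets the stationary object be absorbed almost completely into the background within ten frames, while the smaller weight leaves the background nearly unchanged and would respond equally slowly to a genuine illumination change.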
An alternative ‘voting’ method selects intensity values based on frequency of occurrence. That ‘voting’ approach has advantages over averaging. It does not blur the background and it accommodates sudden illumination changes. The major drawback of the voting approach is its computational complexity. Quantization can be applied to decrease the number of candidate values and the number of operations. However, quantization decreases the ability to separate the foreground from the background.
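One possible reading of the voting approach, sketched in Python with NumPy, selects for each pixel the most frequently observed intensity bin. The function name, the grayscale input, and the bin_width parameter used for quantization are assumptions made for this example.

```python
import numpy as np

def voting_background(frames, levels=256, bin_width=1):
    """Per-pixel most frequent intensity over a (T, H, W) grayscale stack.

    bin_width > 1 quantizes intensities into coarser bins, which reduces
    the number of candidates at the cost of intensity resolution.
    """
    stack = np.asarray(frames).astype(np.int64)
    quantized = stack // bin_width
    n_bins = -(-levels // bin_width)  # ceiling division
    votes = np.zeros((n_bins,) + stack.shape[1:], dtype=np.int64)
    for b in range(n_bins):
        # Count, for every pixel, how many frames fall into bin b.
        votes[b] = (quantized == b).sum(axis=0)
    winner = votes.argmax(axis=0)
    # Report the center of the winning bin (the exact value when bin_width is 1).
    return winner * bin_width + bin_width // 2
```

The loop visits every candidate bin for every pixel and frame, which illustrates the cost noted above. With bin_width greater than 1 the loop is shorter, but two intensities that fall into the same bin can no longer be told apart.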
Kalman filters can also be used for background detection, K. Karmann, A. Brand, “Time-varying image processing and moving object recognition,” Elsevier Science Publish., 1990; C. Ridder, O. Munkelt, and H. Kirchner, “Adaptive background estimation and foreground detection using Kalman filtering,” Proc. ICAM, 1995; and K. Toyama, J. Krumm, B. Brumitt, B. Meyers, “Wallflower: Principles and Practice of Background Maintenance,” Proc. of Int'l Conf. on Computer Vision, pp. 255-261, 1999. A version of the Kalman filter that operates directly on the data subspace is described by J. Zhong and S. Sclaroff, in “Segmenting foreground objects from a dynamic, textured background via a robust Kalman filter,” Proc. of IEEE Int'l Conf. on Computer Vision, pp. 44-50, 2003.
A similar autoregressive model acquires properties of dynamic scenes, A. Monnet, A. Mittal, N. Paragios, and V. Ramesh, “Background modeling and subtraction of dynamic scenes,” Proc. of IEEE Int'l Conf. on Computer Vision, pp. 1305-1312, 2003.
The Kalman filter provides optimal estimates for the state of a discrete-time process that obeys a linear stochastic difference equation, e.g., the intensities of the background. The various parameters of the Kalman filter, such as a transition matrix, a process noise covariance, and a measurement noise covariance, can change at each time step. By using larger covariance values, the background adapts more quickly to illumination changes. However, the filter then becomes more sensitive to noise and to objects in the scene. Another drawback of the Kalman filter is its inability to model multimodal distributions, e.g., moving leaves or grass, or waves on water. The Kalman filter also performs poorly in the presence of large non-linearities.
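The trade-off between adaptation speed and noise sensitivity can be illustrated with a scalar Kalman filter applied independently to each pixel. The following Python sketch assumes a constant-state model (transition equal to one) and fixed scalar noise variances. These choices, and the function name, are simplifications made for this example and are not the formulation of any of the cited references.

```python
import numpy as np

def kalman_background_step(estimate, variance, frame, process_var, measurement_var):
    """One predict/update cycle of a per-pixel scalar Kalman filter.

    estimate, variance: current background estimate and its error variance (H, W).
    frame: current observed intensities (H, W).
    process_var, measurement_var: assumed noise variances (illustrative scalars).
    """
    # Predict: the background is assumed constant, so only the uncertainty grows.
    predicted_var = variance + process_var
    # Update: the gain weighs the new observation against the prediction.
    gain = predicted_var / (predicted_var + measurement_var)
    new_estimate = estimate + gain * (frame - estimate)
    new_variance = (1.0 - gain) * predicted_var
    return new_estimate, new_variance
```

A larger process_var raises the gain, so the estimate follows each new frame more closely. That helps after an illumination change, but it also pulls the background toward noise and toward any object passing through the pixel. Because the state is a single value per pixel, the filter cannot represent a pixel that alternates between two background intensities.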
Another method models the background in an image with a mixture of Gaussian distribution functions, C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” Proc. of IEEE Int'l Conf. on Computer Vision and Pattern Recognition, 1999. Rather than explicitly modeling the values of all the pixels as one particular type of distribution, the background is modeled by a pixel-wise mixture of Gaussian distribution functions to support multimodal backgrounds. Based on the persistence and the variance of each Gaussian function in the mixture, the Gaussian models that correspond to background regions are determined. Pixel values that do not fit the background distributions are considered foreground, until there is a Gaussian model that includes them in the background with sufficient and consistent supporting evidence. That method includes a learning constant and a parameter that controls the proportion of the pixels that should be accounted for by the background. The mixture-of-Gaussians method is the basis for a large number of related methods, O. Javed, K. Shafique, and M. Shah, “A hierarchical approach to robust background subtraction using color and gradient information,” MVC, pp. 22-27, 2002.
The mixture methods are adaptable to illumination changes and do not cause ghost effects. Furthermore, the mixture methods can handle multimodal backgrounds. However, their performance deteriorates when the scene is dynamic and exhibits non-stationary properties in time. Another drawback of the mixture-model-based solutions is the computational load of constructing and updating the background models. For a large number of models in the mixture, such methods become too computationally demanding to be practical.
A non-parametric method uses Gaussian kernels for modeling the density at a particular pixel, A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background subtraction,” Proc. of European Conf. on Computer Vision, pp. II:751-767, 2000.
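The general idea of a kernel density estimate at a pixel can be sketched in Python with NumPy as follows. The fixed bandwidth, the decision threshold, and the function names are assumptions made for this example. The cited reference should be consulted for how those quantities are actually chosen.

```python
import numpy as np

def kde_density(samples, value, bandwidth):
    """Gaussian kernel density estimate of `value` given past `samples`.

    samples: past intensities, shape (N,) for one pixel or (N, H, W) for an image.
    value: current intensity, scalar or (H, W).
    bandwidth: kernel standard deviation (an illustrative fixed value).
    """
    samples = np.asarray(samples, dtype=np.float64)
    diff = (value - samples) / bandwidth
    kernels = np.exp(-0.5 * diff**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=0)

def is_foreground(samples, value, bandwidth, threshold):
    """Declare foreground where the estimated background density is low."""
    return kde_density(samples, value, bandwidth) < threshold
```

Because the estimate is a sum of kernels centered on the observed samples, a pixel whose history alternates between two intensities yields a density with two peaks, and no number of modes has to be specified in advance.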
Another method integrates optical flow in the modeling of the dynamic characteristics, A. Mittal and N. Paragios, “Motion-Based background subtraction using adaptive kernel density estimation,” Proc. Int'l Conf. on Computer Vision and Pattern Recognition, 2004.
Statistical methods can use color, global and dynamic features to enhance object detection, C. Jiang and M. O. Ward, “Shadow identification,” Proc. of IEEE Int'l Conf. on Computer Vision and Pattern Recognition, pp. 606-612, 1992, and Stauder, R. Mech, and J. Ostermann, “Detection of moving cast shadows for object segmentation,” IEEE Transactions on Multimedia, vol. 1, no. 1, pp. 65-76, March 1999.
Therefore, it is desired to improve over prior art foreground detection methods.
Intrinsic Images
One interpretation of a scene states that every image is the product of characteristics of the scene. Then, an ‘intrinsic’ image is a decomposition of an image that reflects one of the characteristics in the scene, H. G. Barrow and J. M. Tenenbaum, “Recovering intrinsic scene characteristics from images,” Computer Vision Systems, Academic Press, pp. 3-26, 1978.
The decomposition of input images I_t of a scene into a reflectance image R and illumination images L_t can be expressed as the product
I_t = R · L_t.  (1)
The reflectance image R contains the reflectance values of the scene, and the illumination images L_t contain the illumination intensities. Because the illumination images L_t represent the distribution of incident lighting in the scene, while the reflectance image R depicts the surface reflectance properties of the scene, this representation is useful for analyzing and manipulating the reflectance and lighting properties of the scene as acquired in the input images.
In the prior art, intrinsic images deal with spatial characteristics within a single image of a scene, such as illumination, and reflectance as visible in image texture, and not the temporal evolution of foreground objects in the scene itself.
As shown in FIG. 1B, one decomposition method estimates a maximum likelihood to produce reflectance images 102 and illumination images 103 from a sequence of input images 101 acquired from a fixed point under significant variations in lighting conditions, Y. Weiss, “Deriving intrinsic images from image sequences,” Proc. of IEEE Int'l Conf. on Computer Vision, pp. 68-75, July 2001. Note that lighting from the right (R) and left (L) is clearly visible in the illumination images.
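The product in equation (1) becomes a sum in the logarithmic domain, which makes a temporal estimate of the reflectance straightforward to sketch. The following Python example is a deliberately simplified illustration and is not the method of Weiss as published, which operates on derivative-filter outputs. Here a temporal median is applied directly to the log intensities. The function name and the small constant eps are assumptions made for this example.

```python
import numpy as np

def median_log_decomposition(images, eps=1e-6):
    """Split a (T, H, W) stack of positive images into R and L_t, with I_t = R * L_t.

    In the log domain, log I_t = log R + log L_t. The per-pixel temporal
    median of log I_t is used as the estimate of log R (a simplification).
    eps guards against log(0) and is an illustrative choice.
    """
    images = np.asarray(images, dtype=np.float64)
    log_images = np.log(images + eps)
    log_reflectance = np.median(log_images, axis=0)
    # Whatever the static term does not explain is attributed to illumination.
    log_illumination = log_images - log_reflectance
    return np.exp(log_reflectance), np.exp(log_illumination)
```

With this simplification the reflectance is recovered only up to the median illumination at each pixel, so the split between R and L_t is exact only when that median equals one. The product of the two outputs reproduces the input images regardless.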
That method has been extended to derive varying reflectance images and corresponding illumination images from a sequence of images, Y. Matsushita, K. Nishino, K. Ikeuchi, and S. Masao, “Illumination normalization with time-dependent intrinsic images for video surveillance,” Proc. of IEEE Int'l Conf. on Computer Vision and Pattern Recognition, 2004. Matsushita et al. also describe an illumination eigenspace, which captures the illumination variations. In that method, the scene is static; the only things detected are variations in external factors of the scene, such as lighting conditions. The motion of foreground objects is not considered.
Another method recovers an illumination invariant image, which is similar to a reflectance image, from a single color image, G. D. Finlayson, S. D. Hordley and M. S. Drew, “Removing Shadows from Images,” Proc. of European Conf. on Computer Vision Vol. 4, pp. 823-836, 2002. Finlayson et al. assume the input image contains both non-shadowed surfaces and shadows cast on those surfaces. They calculate an angle for an ‘invariant direction’ in a log-chromaticity space by minimizing an entropy of the color distribution.
Another method uses multiple cues to recover shading and reflectance images from a single image, M. Tappen, W. Freeman, E. Adelson, “Recovering Shading and Reflectance from a single image,” NIPS, 2002. Tappen et al. use both color information and a classifier trained to recognize gray-scale patterns. Each image derivative is classified as being caused by shading or by a change in the surface's reflectance. Note that this method also does not consider the motion of foreground objects in a scene.
Another deterministic method uses gray levels, and local and static features, M. Kilger, “A shadow handler in a video-based real-time traffic monitoring system,” Proc. of IEEE Workshop on Applications of Computer Vision, pp. 11-18, 1992, and D. Koller, K. Daniilidis, and H. Nagel, “Model-based object tracking in monocular image sequences of road traffic scenes,” Int'l Journal of Computer Vision, vol. 10, pp. 257-281, 1993.
In general, prior art intrinsic images do not consider motion of foreground objects in a scene. Therefore, it is desired to provide improved intrinsic images that do reflect motion variations in the scene itself.
Furthermore, it is desired to use the improved intrinsic images to improve foreground detection.