1. Field of the Invention
The subject invention relates to video imaging and, more specifically, to automatic detection and segmentation of foreground in video streams.
2. Description of the Related Art
Automatic understanding of events happening at a site is the ultimate goal for intelligent visual surveillance systems. Higher-level understanding of events requires that certain lower-level computer vision tasks be performed. These may include identification and classification of moving objects, tracking of moving objects, such as people, and understanding of interactions among people. To achieve many of these tasks, it is necessary to develop a fast and reliable moving-object segmentation method for dynamic video scenes.
Background subtraction is a conventional and effective approach to detecting moving objects. Many researchers have proposed methods to address issues with background subtraction. One prior art method proposes a three-frame differencing operation to determine regions of legitimate motion, followed by adaptive background subtraction to extract the moving region. According to another prior art method, each pixel is modeled as a mixture of Gaussians, and an on-line approximation is used to update the model. Yet another prior art method uses nonparametric kernel density estimation to model the intensity distribution of each pixel, and another calculates the normalized cross-correlation on the foreground region for shadow removal, and uses a threshold to avoid detecting shadow in dark areas. The last method is based on an assumption that the image produced by the pixel level background subtraction contains all possible foreground regions; however, this assumption is not valid when the pixel level background subtraction fails due to, e.g., similar color between the foreground and background. Moreover, the threshold in the last method is sensitive to various scene changes. Consequently, misclassified pixels are ignored and are not used in the normalized cross-correlation calculation. Further information about these methods can be found in:
[1] R. Collins, et al., A system for video surveillance and monitoring: VSAM final report, Carnegie Mellon University, Technical Report: CMU-RI-TR-00-12, 2000.
[2] C. Stauffer, W. Eric L. Grimson, Learning Patterns of Activity Using Real-Time Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 22, Issue 8, August 2000, 747-757.
[3] A. Elgammal, et al., Background and Foreground Modeling using Non-parametric Kernel Density Estimation for Video Surveillance, Proceedings of the IEEE, 2002, 90(7):1151-1163.
[4] Ying-li Tian, Max Lu, and Arun Hampapur, Robust and Efficient Foreground Analysis for Real-time Video Surveillance, IEEE Computer Vision and Pattern Recognition, San Diego, June, 2005.
[5] Michael Harville, A Framework for High-Level Feedback to Adaptive, Per-Pixel, Mixture-of-Gaussian Background Models, ECCV 2002: 543-560.
[6] Dengsheng Zhang, Guojun Lu, Segmentation of moving objects in image sequence: A review, Circuits, Systems, and Signal Processing, Volume 20, Number 2, 2001.3.
[7] Philippe Noriega, Olivier Bernier, Real Time Illumination Invariant Background Subtraction Using Local Kernel Histograms, BMVC 2006.
[8] Toufiq Parag, Ahmed Elgammal, and Anurag Mittal, A Framework for Feature Selection for Background Subtraction, Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2, Pages 1916-1923.
[9] J. P. Lewis, Fast normalized cross-correlation, In Vision Interface, 1995.
The entire disclosures of all of which are incorporated herein by reference.
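By way of illustration only, the normalized cross-correlation test mentioned above for shadow removal may be sketched as follows. This is a minimal sketch and not the implementation of any cited reference: the zero-mean form of the correlation, the function names ncc and looks_like_shadow, the sample patch values, and the threshold value 0.95 are all assumptions chosen for the example.

```python
import math


def ncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-size patches.

    Patches are given as flat lists of intensities. The zero-mean form
    is an assumption made for this sketch.
    """
    n = len(patch_a)
    if n == 0 or n != len(patch_b):
        raise ValueError("patches must be non-empty and of equal size")
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    dev_a = [a - mean_a for a in patch_a]
    dev_b = [b - mean_b for b in patch_b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    # A perfectly flat patch has no texture to correlate; report 0.0.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical threshold; a real system would tune this per scene.
SHADOW_NCC_THRESHOLD = 0.95


def looks_like_shadow(bg_patch, in_patch, threshold=SHADOW_NCC_THRESHOLD):
    """Treat a patch as shadow when its texture still matches the background."""
    return ncc(bg_patch, in_patch) >= threshold


# Illustrative 3x3 patches, flattened row by row (values are made up).
background = [100, 120, 140, 160, 180, 200, 220, 240, 90]
shadowed = [0.6 * v for v in background]                   # same texture, darker
different = [200, 90, 220, 100, 240, 120, 140, 180, 160]   # different texture
```

In this sketch a uniformly darkened patch keeps a correlation close to 1 with the background and is treated as shadow, while a patch with different texture is not. As noted above, such a test helps only for pixels that the pixel level subtraction has already marked as foreground.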
The prior art methods discussed above are all based on pixel level subtraction. A natural downside is that these methods compare only the per-pixel difference during foreground subtraction and therefore ignore local region information. As a result, the methods often fail in situations such as:
(1) Similar color between input image and background;
(2) Shadows;
(3) Sudden illumination changes;
(4) Random motion (e.g., shaking leaves in the wind).
That is, even when region information was used in the comparison, the choice of region for the comparison was based on the pixel level comparison. Consequently, the regional information was not used for any of the pixels that were erroneously classified as equivalent to the background.
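The first failure case listed above (similar color between input image and background) may be illustrated with the following minimal sketch of pixel level subtraction. The function name pixel_level_mask, the intensity values, and the threshold value 25 are hypothetical and are used only for this example.

```python
def pixel_level_mask(background, frame, threshold):
    """Mark a pixel as foreground (1) when |frame - background| > threshold.

    Each pixel is compared in isolation; no neighboring pixels are consulted.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Hypothetical 3x4 grayscale background of uniform intensity.
background = [
    [100, 100, 100, 100],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]

# An object occupies the two middle columns. Its top and bottom rows differ
# strongly from the background (180), but its middle row is close in
# intensity to the background (110).
frame = [
    [100, 180, 180, 100],
    [100, 110, 110, 100],
    [100, 180, 180, 100],
]

THRESHOLD = 25  # hypothetical value chosen for this example

mask = pixel_level_mask(background, frame, THRESHOLD)
```

Here the middle row of the object is classified as background because each of its pixels differs from the background by less than the threshold. Any region-based step that is applied only to the pixels marked as foreground would never examine that row.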
Accordingly, there is a need in the art for a more reliable and robust method for accurately identifying foreground pixels in an input video stream.