In vision systems, it is a fundamental and crucial problem to extract moving objects from a video. Typical applications in which object extraction is necessary include surveillance, traffic monitoring, teleconferencing, human-machine interface, and video editing. There are many other image processing applications where it is important to be able to distinguish moving pixels from static pixels. It is common to refer to the moving object as a ‘foreground’ object. However, it should be understood that the relative depth at which the object moves with respect to the ‘background’ is not a factor.
Background subtraction is the most common technique for moving object extraction. The idea is to subtract a current image from a static image representing the ‘background’ in a scene. The subtraction should reveal the silhouettes of moving objects. Background subtraction is typically performed as a pre-processing step for object recognition and tracking. Most prior art background subtraction methods are based on determining a difference in pixel intensity values between two images.
Although prior art methods work well under stable conditions, they are susceptible to global and local changes in color and intensity. If the color in portions of the background of an image changes, then the pixel intensity levels change accordingly, and those portions may be misclassified as being associated with the foreground or a moving object. These misclassifications cause the subsequent processes, e.g., recognition and tracking, to fail because accuracy and efficiency are crucial to those tasks.
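The basic intensity-difference operation can be illustrated with a minimal sketch. The function name `subtract_background`, the threshold of 25 gray levels, and the synthetic 4x4 images are assumptions chosen for illustration only; they are not taken from any of the cited methods.

```python
import numpy as np

def subtract_background(frame, background, threshold=25):
    """Return a boolean mask that is True where the frame differs from the background."""
    # Cast to a signed type so the difference of two uint8 images cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# Synthetic example: a uniform background and a bright 2x2 "object".
background = np.full((4, 4), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200
mask = subtract_background(frame, background)
```

Because the decision rests only on an intensity difference, any change in illumination larger than the threshold would be marked as foreground in the same way as a real moving object.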
In the prior art, computational barriers have limited the complexity of real-time video processing applications. As a consequence, most prior art systems were either too slow to be practical, or restricted to controlled situations.
Recently, faster computers have enabled more complex and robust systems for analyzing videos in real-time. Such systems can reveal real world processes under varying conditions. A practical system should not depend on the placement of cameras, the content of the scene, or lighting effects. The system should be able to track slow moving objects in a cluttered scene with a moving background, overlapping or occluded objects, and objects in shadows and subject to lighting changes, while old objects leave the scene and new objects enter.
Non-adaptive background subtraction methods are undesirable because they require manual initialization to acquire the static background image. Without initialization, errors in the background accumulate over time, making non-adaptive methods useless for unsupervised, long-term tracking applications where there are significant changes in the background of the scene.
Adaptive background subtraction averages pixel intensities over time to generate the background image, which is an approximation of the background at any one time. While adaptive background subtraction is effective in situations where objects move continuously and the background is visible a significant portion of the time, it fails when there are many moving objects, particularly if the moving objects move slowly. Adaptive methods also fail to handle multi-modal backgrounds, i.e., multiple alternating backgrounds. In addition, they recover slowly when the background is uncovered, and they use a single, predetermined threshold for the entire scene.
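One common form of temporal averaging is an exponential running average, sketched below. The function name `update_background` and the blending coefficient `alpha=0.05` are illustrative assumptions; the prior art methods discussed here may average differently.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Blend the current frame into the background estimate (exponential running average)."""
    return (1.0 - alpha) * background + alpha * frame.astype(np.float64)

# The true background intensity is 100, but the estimate starts at 0,
# as if an object had just uncovered this part of the scene.
background = np.zeros((2, 2), dtype=np.float64)
frame = np.full((2, 2), 100, dtype=np.uint8)
for _ in range(10):
    background = update_background(background, frame)
```

After ten frames the estimate has covered only part of the distance to the true value, which illustrates the slow recovery noted above. A larger `alpha` recovers faster but also absorbs slow moving objects into the background more quickly.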
Changes in lighting cause many background subtraction methods to fail. Ridder et al. in “Adaptive background estimation and foreground detection using Kalman-filtering,” ICRAM (International Conference on Recent Advances in Mechatronics), 1995, modeled each pixel with a Kalman filter. This modeling made their system less sensitive to lighting changes. While that method does have a pixel-wise automatic threshold, it still recovers slowly.
Koller et al., in “Towards robust automatic traffic scene analysis in real-time,” Proceedings of the International Conference on Pattern Recognition, 1994, adapted Kalman filtering to an automatic traffic monitoring application.
A multi-class statistical model can also be used for tracking objects by applying a single Gaussian filter to each image pixel. After proper initialization with a static indoor scene, reasonable results can be obtained. Results for outdoor scenes are not available.
Friedman et al. in “Image segmentation in video sequences: A probabilistic approach,” Proc. of the Thirteenth Conf. on Uncertainty in Artificial Intelligence, 1997, described a pixel-wise expectation-maximization (EM) method for detecting moving vehicles. Their method explicitly distributes pixel values into three classes representing the road, shadows, and vehicles. It is unclear what behavior their system would exhibit for pixels with other classifications.
Instead of using predetermined distributions, Stauffer et al., in “Adaptive background mixture models for real-time tracking,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Vol. 2, 1999, modeled each pixel as a mixture of Gaussian filters. Based on the persistence and the variance of each Gaussian filter in the mixture, they determined the Gaussian filters that could correspond to background distributions. Pixels that do not fit the background distributions are considered foreground unless there is a Gaussian filter with sufficient and consistent evidence to support a background classification. Their updating has four problems. First, it considers every pixel in the image. Second, the pixels are processed independently. Third, it uses a constant learning coefficient. Fourth, the background image is updated for every input image, whether or not updating is actually required. Updating at the frame rate can be quite time consuming.
Bowden et al., in “An improved adaptive background mixture model for real-time tracking with shadow detection,” Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems, AVBS01, September 2001, described a model based on the mixture method of Stauffer and Grimson. They used an EM process to fit a Gaussian mixture model for the update equations during initialization.
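A simplified, single-pixel sketch of a mixture-based update is given below for illustration. It is not the published method: the class name `PixelMixture`, the use of one constant coefficient `alpha` for weights, means, and variances, the 2.5 standard deviation match test, the variance floor `min_var`, and the initial component placement are all assumptions made to keep the example short.

```python
import numpy as np

class PixelMixture:
    """Simplified mixture of Gaussians for the gray level of one pixel."""

    def __init__(self, k=3, alpha=0.05, init_var=900.0, min_var=4.0, bg_fraction=0.7):
        self.alpha = alpha              # constant learning coefficient
        self.init_var = init_var        # variance given to a new component
        self.min_var = min_var          # floor so a component never collapses
        self.bg_fraction = bg_fraction  # weight mass treated as background
        self.w = np.full(k, 1.0 / k)
        self.mu = np.linspace(0.0, 255.0, k)
        self.var = np.full(k, init_var)

    def update(self, x):
        """Update the mixture with intensity x; return True if x is foreground."""
        sigma = np.sqrt(self.var)
        # Rank components: persistent (high weight), tight (low sigma) ones first.
        order = np.argsort(-(self.w / sigma))
        # The top-ranked components that cover bg_fraction of the weight are background.
        cum = np.cumsum(self.w[order])
        n_bg = min(int(np.searchsorted(cum, self.bg_fraction)) + 1, len(order))
        background = set(int(i) for i in order[:n_bg])
        matches = [int(i) for i in order if abs(x - self.mu[i]) < 2.5 * sigma[i]]
        if matches:
            m = matches[0]
            self.w *= 1.0 - self.alpha
            self.w[m] += self.alpha
            self.mu[m] += self.alpha * (x - self.mu[m])
            d = x - self.mu[m]
            self.var[m] = max((1.0 - self.alpha) * self.var[m] + self.alpha * d * d,
                              self.min_var)
            is_foreground = m not in background
        else:
            # No component explains x: replace the lowest-ranked component.
            m = int(order[-1])
            self.mu[m], self.var[m], self.w[m] = x, self.init_var, self.alpha
            is_foreground = True
        self.w /= self.w.sum()
        return is_foreground

model = PixelMixture()
for _ in range(200):
    model.update(100.0)           # long, stable background at gray level 100
steady = model.update(100.0)      # still background
changed = model.update(200.0)     # sudden new value
for _ in range(50):
    model.update(200.0)           # the new value persists
absorbed = model.update(200.0)    # now accepted as background
```

Note that `update` runs for every pixel and every frame with the same `alpha`, regardless of whether the scene has changed, which is the cost pattern criticized above.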
Horprasert et al., in “A statistical approach for real-time robust background subtraction and shadow detection,” IEEE ICCV'99 FRAME-RATE WORKSHOP, 1999, described a computational color model that considers brightness distortion and chromaticity distortion to distinguish shading background from the ordinary background or moving foreground objects. They also described a process for pixel classification and threshold selection.
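The separation of brightness from chromaticity can be sketched as a projection in RGB space. The sketch below is a simplified illustration that omits the per-channel normalization by standard deviation used in the published formulation; the function name `color_distortion` and the sample RGB values are hypothetical.

```python
import numpy as np

def color_distortion(pixel, expected):
    """Return (brightness distortion, chromaticity distortion) of an RGB pixel
    relative to the expected background color of that pixel."""
    pixel = np.asarray(pixel, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    # Brightness distortion: the scale factor that brings the expected color
    # closest to the observed pixel along the expected chromaticity line.
    alpha = pixel.dot(expected) / expected.dot(expected)
    # Chromaticity distortion: the remaining distance from that line.
    chroma = np.linalg.norm(pixel - alpha * expected)
    return alpha, chroma

expected = [120, 100, 80]
shadow_alpha, shadow_chroma = color_distortion([60, 50, 40], expected)   # darker, same hue
object_alpha, object_chroma = color_distortion([40, 40, 200], expected)  # different hue
```

A shaded background pixel keeps its chromaticity and only changes in brightness, so its chromaticity distortion stays near zero while its brightness distortion drops below one. A pixel covered by a differently colored object shows a large chromaticity distortion.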
In general, there are three major problems with prior art background image updating. First, every input image causes a background update. This wastes time. Second, every pixel in each image is considered during the updating. This wastes even more time. Third, the manner in which the background image is updated is fixed. This is sub-optimal. The invention addresses all three problems.