1. Field of the Invention
The present invention relates to processing of image sequences for use in detecting events that occur in a scene; in particular the invention deals with analysis of user-defined areas in the image sequence to detect transient objects, especially persons, and their activity in the scene.
2. Background of the Invention
Automatic detection of objects in a scene is an important problem with many applications. The popularity of image-capturing devices attached to computers has made it possible to use image analysis to detect objects in a scene. One approach is to model the transitions between the background scene and foreground objects. U.S. Pat. No. 5,465,115 by G. Conrad, B. Denenberg and G. Kramerich, disclosed a method for counting people walking through a restricted passageway, such as a door. Using the image taken from an overhead video camera, a narrow image window perpendicular to the traffic flow is analyzed. This narrow image is divided into short image segments, called “gates”, which are points of detection. The thresholded differences of these gate images across consecutive frames are used to detect persons crossing the gate. Since the gates' widths are small, one passing person would occlude several contiguous gates. The method depends heavily on the speed of the moving objects; they have to be fast enough to record significant frame-to-frame differences. The event of one person passing can even register several significant differences. Certainly, the method would not work if a person stops in the location of the window.
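As a rough illustration of the gate scheme described above, the following Python sketch treats the narrow image window as a one-dimensional row of intensities and flags each gate whose mean absolute frame-to-frame difference exceeds a threshold. The function name active_gates, the gate width, and the threshold value are illustrative assumptions and are not taken from the cited patent.

```python
def active_gates(prev_row, curr_row, gate_width, threshold):
    """Return indices of gates whose mean absolute frame difference exceeds threshold.

    prev_row and curr_row are equal-length lists of intensities taken from the
    narrow window in two consecutive frames (an assumed 1-D simplification).
    """
    n_gates = len(curr_row) // gate_width
    active = []
    for g in range(n_gates):
        start = g * gate_width
        diff = sum(abs(curr_row[i] - prev_row[i])
                   for i in range(start, start + gate_width)) / gate_width
        if diff > threshold:
            active.append(g)
    return active


# A person enters the window between two frames and covers pixels 4..9.
empty_row = [10] * 12
person_row = [10] * 4 + [90] * 6 + [10] * 2
moving = active_gates(empty_row, person_row, gate_width=2, threshold=20)

# The same person standing still: consecutive frames are identical.
stopped = active_gates(person_row, person_row, gate_width=2, threshold=20)
```

In this sketch one person triggers several contiguous gates, while a person who has stopped in the window triggers none, which matches the limitation noted above.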
The more popular approach using image analysis is to create a model of the image of the background scene and classify regions in a scene's image as either background or foreground, which is the object region. Many computer vision systems use a background image model to detect and track objects in an imaged scene. The idea is to have a system that maintains a model of the background, which is the scene without any objects in it. This model can be an image of the background or some statistics of its sub-images. Whenever a small region of an input image is significantly different from the corresponding region in the background model, the location of that region is tagged as foreground, which means part of an object region. Spatial and temporal post-processing techniques refine the localization of the object region.
In the work of S. McKenna, S. Jabri, Z. Duric, and H. Wechsler, “Tracking interacting people”, Proceedings of 4th IEEE Int'l Conference on Automatic Face and Gesture Recognition, pp. 348-353, March 2000, each color pixel in the background image is modeled as separate Red, Green, and Blue Gaussians. Each has a mean and variance which are continuously adapted to account for slow changes in outdoor illumination. The corresponding pixel in an input image is considered foreground if its value is a few standard deviations away from the mean. Gaussian models for chromaticity and gradient are added to classify object shadows as part of the background.
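A minimal Python sketch of a per-channel adaptive Gaussian test of this kind is given below. The class name ChannelGaussian, the initial variance, the adaptation rate alpha, the threshold k, and the rule that a pixel is foreground when any one channel deviates are all assumptions made for illustration; the cited paper should be consulted for the actual update equations and for the chromaticity and gradient models, which are omitted here.

```python
import math


class ChannelGaussian:
    """Running Gaussian for one color channel of one background pixel."""

    def __init__(self, mean, var=25.0, alpha=0.05, k=3.0):
        self.mean = float(mean)
        self.var = float(var)
        self.alpha = alpha  # adaptation rate (assumed value)
        self.k = k          # threshold in standard deviations (assumed value)

    def is_foreground(self, value):
        # Foreground if the value lies more than k standard deviations from the mean.
        return abs(value - self.mean) > self.k * math.sqrt(self.var)

    def update(self, value):
        # Exponential update so the model follows slow illumination changes.
        d = value - self.mean
        self.mean += self.alpha * d
        self.var = (1.0 - self.alpha) * self.var + self.alpha * d * d


def pixel_is_foreground(models, rgb):
    """Assumed combination rule: foreground if any channel deviates."""
    return any(m.is_foreground(v) for m, v in zip(models, rgb))


models = [ChannelGaussian(100), ChannelGaussian(120), ChannelGaussian(90)]
near_background = pixel_is_foreground(models, (102, 118, 93))
far_from_background = pixel_is_foreground(models, (102, 118, 140))
```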
Instead of maintaining the distribution parameters for each pixel, K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: Principles and Practice of Background Maintenance”, Proceedings of 7th IEEE Int'l Conference on Computer Vision, pp. 255-261, September 1999, maintain the past values of the pixel and use a linear function to predict the value of the background pixel in the next frame. If the pixel in the next frame deviates significantly from its predicted value, then it is considered foreground.
The gradual and sometimes sudden changes in scene illumination, plus the problem of slow-moving objects, require a more complex model for each background pixel value. C. Stauffer and W. Grimson, “Learning Patterns of Activity Using Real-Time Tracking”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 747-757, Vol. 22, August 2000, proposed multiple, adaptive Gaussians to model each pixel. Multiple Gaussians are maintained for the background, and pixels that do not match any one of these Gaussians are considered foreground.
In the work of I. Haritaoglu, D. Harwood and L. Davis, “W4: Real-Time Surveillance of People and Their Activities”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 809-830, Vol. 22, August 2000, each background pixel is represented by three values: its minimum intensity, its maximum intensity, and its maximum frame-to-frame intensity difference during a training period. The background model is first learned during this initial training period of about 20 to 40 seconds, even while objects are moving around the scene. This method is designed for scenes where an object cannot stay in one place very long; otherwise, the object will be included as part of the background. One such situation is when a car stops in the middle of the scene and stays there for minutes.
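The three-value pixel model can be sketched in Python as follows. Only the three learned values come from the description above; the function names and, in particular, the decision rule (a value is foreground when it falls outside the learned range widened by the maximum frame-to-frame difference) are assumptions made for illustration and may differ from the rule used in the cited work.

```python
def learn_pixel_model(samples):
    """Learn (minimum, maximum, max frame-to-frame difference) for one pixel."""
    lo = min(samples)
    hi = max(samples)
    max_diff = max((abs(b - a) for a, b in zip(samples, samples[1:])), default=0)
    return lo, hi, max_diff


def is_foreground(value, model):
    """Assumed rule: foreground if outside [lo - max_diff, hi + max_diff]."""
    lo, hi, max_diff = model
    return value < lo - max_diff or value > hi + max_diff


# Intensities of one pixel observed during the training period.
model = learn_pixel_model([100, 103, 101, 104, 102])
```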
U.S. Pat. No. 5,748,775 by M. Tsuchikawa, A. Sato, A. Tomono and K. Ishii, disclosed a method for continuous reconstruction of the background image for background subtraction. A temporal histogram is maintained for each pixel and a statistical method is applied to determine whether the temporal changes are due to abrupt or gradual illumination changes. Background pixels are updated using statistics of the past values, computed from the temporal histogram.
Instead of working with the whole image of the background, U.S. Pat. No. 6,546,115 by W. Ito, H. Yamada and H. Ueda, disclosed a method wherein the image view is divided into several sub-views, each of which has a maintained background image of the sub-view. The system is used for detecting objects entering the scene. Each sub-view is processed independently, with the system detecting movement in the sub-view by computing the intensity differences between the input sub-image and the corresponding background sub-image. If an object is detected in a sub-view, then the corresponding background sub-image is not updated, otherwise it is updated.
Dividing the background image into several sub-images is also employed in U.S. Pat. No. 5,684,898 by M. Brady and D. Cerny, which disclosed a method for distinguishing foreground from background pixels in an input image for background subtraction. A weighting function takes the difference between the pixel intensities of the background sub-image and those of the corresponding pixels in the input sub-image, and the weights are used to classify pixels in the input sub-image. If a background sub-image is not significantly different from the current input sub-image, it is updated. The method would work well if the objects are constantly moving in the scene.
U.S. Pat. No. 6,625,310 by A. Lipton et al. utilizes statistical modeling of pixels to segment a video frame into background and foreground. Each pixel has a statistical model for the background. Corresponding pixels of incoming video frames are tested against the background model. If the value of the current pixel does not match the background model, then it is considered foreground.
The main idea behind the image-based methods of background subtraction is that a point location in the scene is classified as occluded by a foreground object if its pixel value in the input frame is significantly different from the expected value in the background model. The model is initially learned from an image sequence and continuously updated in order to adapt to gradual changes in scene illumination.
These methods suffer from problems resulting in the corruption of the background model. A requirement of these methods is that objects should be constantly moving so as not to occlude parts of the background for long periods of time. When an object stays too long in one location, then the history of past values or their distribution becomes significantly different from that of the true background. Furthermore, when an object is in one location and the ambient illumination changes drastically, then the background model in that location could be permanently corrupted, even if the object moves later and exposes the background.
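The corruption described above can be reproduced with a very small Python sketch. It assumes a single background pixel modeled by a running average that is updated on every frame; the update rate ALPHA, the threshold THRESHOLD, and the intensity values are illustrative assumptions, and the methods cited above use more elaborate models and update rules.

```python
ALPHA = 0.1        # assumed update rate of the running average
THRESHOLD = 30.0   # assumed foreground threshold on the absolute difference


def update(background, value, alpha=ALPHA):
    """Blend the current pixel value into the background estimate."""
    return background + alpha * (value - background)


def is_foreground(background, value, threshold=THRESHOLD):
    return abs(value - background) > threshold


true_background, object_value = 100.0, 200.0
model = true_background

# The object arrives and is detected at first.
detected_on_arrival = is_foreground(model, object_value)

# The object stays in place for 50 frames while the model keeps adapting.
for _ in range(50):
    model = update(model, object_value)

# The stationary object has been absorbed into the background model.
detected_after_staying = is_foreground(model, object_value)

# When the object leaves, the exposed true background is flagged as foreground.
background_flagged_after_leaving = is_foreground(model, true_background)
```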
The root of these problems lies in the difficulty of modeling the value of a background pixel over time. The value of a pixel is not stable over a wide range of imaging conditions, especially outdoors where intensity values can gradually or drastically change and have a wide range across different times of the day and varying weather conditions. Image-capturing devices, particularly those that employ auto-exposure, also contribute to the variability of the pixel value.
Contrasts between pairs of pixels, however, are much more stable under varying illumination conditions than a pixel's value. When the ambient illumination brightens or darkens the scene, the values of individual pixels would change significantly. However, given two pixel locations A and B with a small distance between them, the contrast between A and B does not change much, even if the individual values of A and B would increase or decrease significantly. The fact that the brightness of A changes together with the brightness of B results in a stable contrast between the two.
When an object moves into the scene and occludes the pixel location C but does not occlude that of B, the result is a big change in the contrast of pixels C and B. Thus, the contrast between C and B serves as one “micro-signature” of pixel C in the background scene. The ability to detect the occlusion of C by a foreground object, plus the stability under lighting changes, makes the two-pixel contrast an effective detection feature that is invariant to changing illumination conditions.
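The behavior of a two-pixel contrast can be checked numerically with the Python sketch below. The text above does not specify a particular contrast formula, so a Michelson-style normalized difference is assumed here, and the illumination change is modeled as a common multiplicative gain; both choices, along with all intensity values, are assumptions made for illustration.

```python
def contrast(a, b):
    """Michelson-style contrast in [-1, 1]; 0.0 when both values are zero."""
    total = a + b
    return (a - b) / total if total else 0.0


# Background values at pixel locations C and B.
c_value, b_value = 80.0, 120.0
reference = contrast(c_value, b_value)

# The scene brightens by 50 percent: both values change, the contrast does not.
gain = 1.5
brightened = contrast(gain * c_value, gain * b_value)

# An object with intensity 200 occludes C but not B under the brighter light.
occluded = contrast(200.0, gain * b_value)

illumination_shift = abs(brightened - reference)
occlusion_shift = abs(occluded - reference)
```

Under these assumptions the illumination change leaves the contrast essentially unchanged, while the occlusion of C shifts it by a large amount.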
The idea of a pair of pixels is extended to a group of nearby pixel locations around a central pixel location C. The group is arranged in a circular configuration of a certain radius R, with each pixel in the group lying at a distance R from pixel location C. A contrast function can be defined, which uses the values of pixels in the group and computes the contrast of each value with that of pixel location C. The contrast function and a classification criterion determine if pixel location C is occluded by a transient object. The classification criterion compares the value of the contrast function with its expected value for the non-occluded background. That expected value is computed from a reference background image that is computed over a learning period.
The idea is further extended to multiple groups of pixels around C, where each group is arranged in a circular configuration with a different radius. The multi-scaled groups in circular configurations ensure that the occlusion of pixel location C is detected, regardless of the size of the occluding object. These groups of pixels, together with their contrast with pixel location C, are called a “multi-scale invariant” defined for a single pixel location C.
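One possible reading of the circular groups at several radii is sketched below in Python. The number of pixels per ring, the Michelson-style contrast, the mean-absolute-deviation criterion, the threshold, and the rule that any single scale may trigger the decision are all assumptions made for illustration; the exemplary embodiment may define the contrast function and the classification criterion differently.

```python
import math


def contrast(a, b):
    """Assumed Michelson-style contrast; 0.0 when both values are zero."""
    total = a + b
    return (a - b) / total if total else 0.0


def ring_offsets(radius, n_points=8):
    """Integer (dx, dy) offsets lying approximately on a circle of the radius."""
    return [(round(radius * math.cos(2 * math.pi * k / n_points)),
             round(radius * math.sin(2 * math.pi * k / n_points)))
            for k in range(n_points)]


def ring_contrasts(image, cx, cy, radius):
    """Contrast of each ring pixel with pixel C at (cx, cy).

    The caller must keep C at least `radius` pixels away from the image border.
    """
    center = image[cy][cx]
    return [contrast(image[cy + dy][cx + dx], center)
            for dx, dy in ring_offsets(radius)]


def is_occluded(frame, reference, cx, cy, radii=(2, 4), threshold=0.1):
    """Assumed criterion: occluded if any scale deviates from the reference."""
    for r in radii:
        cur = ring_contrasts(frame, cx, cy, r)
        ref = ring_contrasts(reference, cx, cy, r)
        if sum(abs(a - b) for a, b in zip(cur, ref)) / len(ref) > threshold:
            return True
    return False


# Reference background: a horizontal intensity ramp, 13 x 13 pixels.
reference = [[50.0 + 10.0 * x for x in range(13)] for _ in range(13)]

# Same scene under brighter light (common gain of 1.5).
brighter = [[1.5 * v for v in row] for row in reference]

# A uniform 7 x 7 object covering C and the whole inner ring.
with_object = [row[:] for row in reference]
for y in range(3, 10):
    for x in range(3, 10):
        with_object[y][x] = 250.0

lit_result = is_occluded(brighter, reference, 6, 6)
single_scale_result = is_occluded(with_object, reference, 6, 6, radii=(2,))
multi_scale_result = is_occluded(with_object, reference, 6, 6, radii=(2, 4))
```

In this example the object is large enough to cover the inner ring uniformly, so the radius-2 group alone does not exceed the assumed threshold, while the radius-4 group, which extends beyond the object, does.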
A region of interest, composed of a group of contiguous pixel locations, can be defined as any arbitrary shape that could be a line, a rectangle, or any closed polygon. Each pixel location in the region of interest has a classifier comparing the multi-scale invariant of that pixel location in the current frame with the multi-scale invariant of the same pixel location in the computed reference background image. With a classifier for each pixel, the system can determine whether the defined region of interest is occluded by a transient object and which part of the region of interest is covered by the object. Multiple user-defined regions of interest in the imaged scene, together with states of occlusion over time, allow users to define scene events, or “occlusion events”, that can be automatically detected by the system.
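The step from per-pixel classifiers to region states and occlusion events might be organized as in the Python sketch below. The per-pixel classifier is passed in as a function, and a trivial stand-in is used in place of a real multi-scale invariant comparison. The function names, the minimum covered fraction, and the "enter" and "exit" event labels are hypothetical and serve only to illustrate the idea.

```python
def roi_state(roi, classify, min_fraction=0.3):
    """Return (occluded, covered_pixels) for a region of interest.

    roi is a list of (x, y) pixel locations; classify(pixel) returns True when
    that pixel is judged occluded. The region counts as occluded when at least
    min_fraction of its pixels are covered (an assumed rule).
    """
    covered = [p for p in roi if classify(p)]
    return len(covered) / len(roi) >= min_fraction, covered


def occlusion_events(states):
    """Convert per-frame occluded/clear states into (frame_index, event) pairs."""
    events = []
    previous = False
    for i, state in enumerate(states):
        if state and not previous:
            events.append((i, "enter"))
        elif previous and not state:
            events.append((i, "exit"))
        previous = state
    return events


# A line-shaped region of interest, ten pixels long.
line_roi = [(x, 5) for x in range(10)]


# Stand-in classifier: pretend an object covers columns 3 through 6.
def stand_in_classifier(pixel):
    return 3 <= pixel[0] <= 6


occluded, covered = roi_state(line_roi, stand_in_classifier)
events = occlusion_events([False, True, True, False, True])
```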
For the background scene, the multi-scale invariants are computed from the reference background image learned during a learning period. Once the multi-scale invariants are learned, the key advantage of the system is that the detection of region occlusion can be done per image, independent of the temporal information in the previous images in the image sequence. This is in contrast to the image-based techniques described in the prior art where information from previous images is used to detect occluding objects and to constantly update the background image model.
The present invention is described in the following section and illustrated by an exemplary embodiment together with its accompanying drawings.