Detection of foreground objects and/or moving objects is a common part of video-based object tracking operations used in computer vision applications such as surveillance, traffic monitoring, and traffic law enforcement. Example applications include, inter alia, vehicle speed estimation, automated parking monitoring, vehicle and pedestrian counting, traffic flow estimation, measurement of vehicle and pedestrian queue statistics, and total experience time (TET) measurement in retail spaces.
Two of the most common methods of motion detection and/or foreground detection used in applications that perform analytics on video data are frame-to-frame differencing and background estimation and subtraction (“background subtraction”). The frame differencing approach detects moving objects, but typically requires tuning to a narrow range of object speeds relative to the frame rate and camera geometry. Such temporal differencing of frames is often used to detect objects in motion, but fails to detect slow-moving (relative to the video frame rate) or stationary objects, and may also fail to detect fast-moving (relative to the video frame rate) objects. The background subtraction approach detects foreground objects, which include moving objects and may include stationary objects that are not part of the static scene or background. Moving objects trigger foreground detection because their appearance typically differs from the background estimate. Background subtraction is more flexible in terms of adjusting the time scale and dynamically adjusting parameters in the background modeling. Although the term “subtraction” refers to a ‘minus’ operation in arithmetic, in computer vision and video processing applications it often refers to the removal of a component of an image. As such, the term can refer to operations including pixel-wise subtraction between images and pixel-wise statistical fit tests between pixel values in an image and a set of corresponding statistical models.
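For illustration, the frame differencing approach can be sketched as follows. The function name and threshold value are illustrative assumptions, not part of the disclosure; the sketch also shows why an object that is stationary across both frames produces no difference and therefore goes undetected.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Detect motion by thresholding the absolute pixel-wise difference
    between two consecutive grayscale frames (threshold is illustrative)."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# An object appearing between frames is detected ...
prev = np.zeros((4, 4), dtype=np.uint8)
curr = prev.copy()
curr[1:3, 1:3] = 200
mask = frame_difference_mask(prev, curr)
# ... but differencing a frame with itself (a stationary scene) yields nothing.
stationary_mask = frame_difference_mask(curr, curr)
```

Because `stationary_mask` is all zeros, a vehicle that parks in the scene vanishes from the motion mask after one frame, which is the limitation discussed above.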
While background estimation can be performed in several ways, the subject disclosure focuses on approaches that construct statistical models describing background pixel behavior. According to this approach, a statistical model (e.g., a parametric density model such as a Gaussian Mixture Model, or a non-parametric density model such as a kernel-based estimate) describing the historical behavior of values for each pixel is constructed and updated continuously with each incoming frame at a rate controlled by a predetermined learning rate factor. Foreground detection is performed by determining a measure of fit of each pixel value in the incoming frame relative to its corresponding constructed statistical model: pixels that do not fit their corresponding background model are considered foreground pixels, and vice versa. The main limitation of this approach is that the choice of learning rate involves a tradeoff between how fast the model is updated and the range of object speeds that can be supported by the model. Specifically, too slow a learning rate would mean that the background estimate cannot adapt quickly enough to fast changes in the appearance of the scene (e.g., changes in lighting, weather, etc.); conversely, too fast a learning rate would cause objects that stay stationary for long periods (relative to the frame rate and the learning rate) to be absorbed into the background estimate.
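A minimal sketch of such a pixel-wise statistical model follows, using a single Gaussian per pixel rather than a full mixture for brevity; the class name, the learning rate `alpha`, the fit-test multiplier `k`, and the initial variance are all illustrative assumptions. The sketch demonstrates the absorption problem: with a fast learning rate, a stationary object is detected at first but is pulled into the background statistics within a few frames.

```python
import numpy as np

class PixelGaussianModel:
    """One Gaussian (mean, variance) per pixel. The learning rate alpha
    controls how quickly the background estimate absorbs new observations."""

    def __init__(self, first_frame, alpha=0.5, k=2.5, init_var=100.0):
        self.alpha = alpha
        self.k = k
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(first_frame.shape, init_var)

    def apply(self, frame):
        x = frame.astype(np.float64)
        # Fit test: pixels far from their model (in std deviations) are foreground.
        fg = np.abs(x - self.mean) > self.k * np.sqrt(self.var)
        # Blind update at rate alpha: every pixel is updated, foreground or not.
        self.mean = (1 - self.alpha) * self.mean + self.alpha * x
        self.var = (1 - self.alpha) * self.var + self.alpha * (x - self.mean) ** 2
        return fg.astype(np.uint8)

# A bright object appears at pixel (1, 1) and stays put.
model = PixelGaussianModel(np.zeros((4, 4), dtype=np.uint8), alpha=0.5)
obj = np.zeros((4, 4), dtype=np.uint8)
obj[1, 1] = 200
first_mask = model.apply(obj)   # detected on arrival
model.apply(obj)
later_mask = model.apply(obj)   # absorbed: fast alpha erased the detection
```

Lowering `alpha` delays the absorption but, per the tradeoff above, also slows adaptation to genuine scene changes such as lighting shifts.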
Statistical modeling of the historical behavior of pixel values is often used as a method for background estimation that is more robust than other adaptive methods such as temporal frame averaging, which, while less computationally intensive, is incapable of supporting wide ranges of object speeds or of accurately describing a background image whose pixel-value distribution is not unimodal.
Thus, a main weakness of traditional model-based background estimation algorithms is that the learning parameter α has to be carefully selected for the expected range of object velocity in the scene relative to the frame rate. As noted above, too slow a learning rate would mean that the background estimate cannot adapt quickly enough to fast changes in the appearance of the scene, and too fast a learning rate would cause objects that stay stationary for long periods (relative to the frame rate and the learning rate) to be absorbed into the background estimate.
With reference to FIG. 1, a traditional process for background estimation, background updating, and foreground detection is illustrated. This prior art system 10 utilizes video data from a plurality of frames, including a set of precedent frames 12 (Ft−1 through Ft−4), to establish a background model BGt 14. A current frame Ft 16 is provided to the fit test module 18 along with the background model 14 to generate the current foreground mask FGt 20. The next background model BGt+1 22 is generated by the background model update module 24 using only the current background model 14 and the current frame 16 as inputs. It is noteworthy that the current foreground mask 20 is not employed by the background update module 24 to generate the next background model 22. The background models 14, 22 are sets of pixel-wise statistical models, the foreground mask 20 is a binary image representation, and the frames 12, 16 are grayscale or color images.
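The data flow of FIG. 1 can be sketched as the loop below, assuming for simplicity a single Gaussian per pixel (function names, `alpha`, and `k` are illustrative). The key point mirrored from the figure is that the update step receives only the model and the current frame, never the foreground mask, so a persistent foreground object steadily leaks into the background statistics.

```python
import numpy as np

def fit_test(model, frame, k=2.5):
    """Fit test module: flag pixels deviating from their per-pixel model."""
    return (np.abs(frame - model["mean"]) > k * np.sqrt(model["var"])).astype(np.uint8)

def model_update(model, frame, alpha=0.05):
    """Model update module: blends EVERY pixel of the current frame into
    the model. The foreground mask is not consulted, so a stationary
    foreground object is gradually absorbed into the background."""
    mean = (1 - alpha) * model["mean"] + alpha * frame
    var = (1 - alpha) * model["var"] + alpha * (frame - mean) ** 2
    return {"mean": mean, "var": var}

# Synthetic sequence: an object of intensity 100 parks at pixel (1, 1).
model = {"mean": np.zeros((4, 4)), "var": np.full((4, 4), 100.0)}
frame = np.zeros((4, 4))
frame[1, 1] = 100.0
for _ in range(6):
    fg_mask = fit_test(model, frame)     # FGt from BGt and Ft
    model = model_update(model, frame)   # BGt+1 from BGt and Ft only
```

After a few iterations the model mean at the parked object's pixel has drifted well away from the true background value, while untouched pixels remain correct.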
There is thus a need for improved methods and systems for background estimation from video image data that respond quickly to variations in the background, such as environment changes or lighting changes, while avoiding the problem of absorbing foreground objects into the background estimation model. More particularly, such systems and methods would compose a background model from a separate statistical model for each pixel, wherein each pixel-wise statistical model is updated only when that particular pixel does not contain a foreground object.
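Such a mask-gated update can be sketched as follows; the function name and `alpha` value are illustrative assumptions. In contrast to the blind update of FIG. 1, the per-pixel statistics are frozen wherever the foreground mask indicates a detected object, so a parked object cannot corrupt its pixels' background model.

```python
import numpy as np

def selective_update(mean, var, frame, fg_mask, alpha=0.05):
    """Update each pixel's statistical model only where the foreground
    mask is zero; pixels covered by a detected foreground object retain
    their current background statistics unchanged."""
    is_bg = fg_mask == 0
    new_mean = np.where(is_bg, (1 - alpha) * mean + alpha * frame, mean)
    new_var = np.where(is_bg, (1 - alpha) * var + alpha * (frame - new_mean) ** 2, var)
    return new_mean, new_var

# A frame of intensity 100 arrives; pixel (0, 0) is flagged as foreground.
mean = np.zeros((4, 4))
var = np.full((4, 4), 100.0)
frame = np.full((4, 4), 100.0)
fg_mask = np.zeros((4, 4), dtype=np.uint8)
fg_mask[0, 0] = 1
new_mean, new_var = selective_update(mean, var, frame, fg_mask)
```

Here the background pixels move toward the new observation while the masked pixel's model is untouched, which is exactly the behavior the preceding paragraph calls for.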
The terms “background image” and “estimated background” refer to an image having the same pixel dimensions as the current frame. The background image contains a current estimate of the background in the form of pixel-wise statistical models of the historical values of every pixel (e.g., a mixture of Gaussians). These statistical models are maintained throughout the duration of the foreground detection task and updated with every incoming frame. In order to convert this pixel-wise model set to an image representation, a single value representative of each distribution can be selected. For example, each pixel value in the background frame may be assigned a value equal to the weighted sum of the means of the individual Gaussian components. In another embodiment, the mode, the median, or, more generally, any statistic representative of the distribution can be used.
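The weighted-sum-of-means conversion mentioned above can be sketched as follows for a K-component Gaussian mixture per pixel; the function name and array layout (component weights and means stacked along the last axis) are illustrative assumptions.

```python
import numpy as np

def background_image_from_gmm(weights, means):
    """Collapse a per-pixel Gaussian mixture into a single background
    image by taking the weighted sum of the component means at each pixel.

    weights, means: arrays of shape (H, W, K), with weights summing
    to 1 over the K mixture components at each pixel.
    """
    return np.sum(weights * means, axis=-1)

# One pixel, two components: a dominant mode at 100 and a minor mode at 20.
weights = np.array([[[0.75, 0.25]]])
means = np.array([[[100.0, 20.0]]])
bg = background_image_from_gmm(weights, means)
```

Substituting the per-pixel mode or median, as the paragraph notes, simply replaces the weighted sum with a different representative statistic.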
The terms “foreground image” and/or “foreground mask” refer to a classification mask, such as, for example, a binary mask, with pixel values equal to one (1) at coordinates corresponding to a location of a detected foreground pixel. It should be noted that alternative labelings and/or representations of the foreground mask can be used, as long as pixel values corresponding to the detected foreground object have a label that distinguishes them from the background.