1. Technical Field
The present invention relates generally to video analysis and, more specifically, to a system and method for automatically screening video streams to identify events of interest.
2. Description of Related Art
With the increasing use of video surveillance and monitoring in public areas to improve safety and security, techniques for analyzing such videos are becoming increasingly important. Various techniques have been utilized or proposed for video analysis. The current generation of closed-circuit television (CCTV) systems is primarily a visual aid for a control operator, who analyzes the video for unusual patterns of activity and takes specific control actions. However, as the number of deployed cameras increases, monitoring all the video streams simultaneously becomes increasingly difficult and the likelihood of missing significant events of interest is quite high. Automated video analysis using Computer Vision techniques is therefore of interest.
There has been significant research in modules and systems for video surveillance and monitoring in recent years. These surveillance systems generally involve several fundamental steps: change detection and segmentation (to identify objects that differ from the background in the scene), tracking (using motion analysis to identify and track people/objects), illumination adaptation (to adapt to illumination changes when the system is deployed in outdoor settings, and to handle shadows in both indoor and outdoor settings), event detection (action detection), and reasoning.
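The steps above can be sketched, in greatly simplified form, as a processing pipeline. The function names, stub logic, and thresholds below are hypothetical placeholders for illustration only, not part of any deployed surveillance system:

```python
import numpy as np

def detect_changes(frame, background, tau=25):
    # Change detection / segmentation: foreground mask vs. a background frame.
    # tau is an arbitrary illustrative threshold.
    return np.abs(frame.astype(int) - background.astype(int)) > tau

def track(mask):
    # Tracking stub: here reduced to the foreground pixel count per frame.
    return int(mask.sum())

def detect_events(track_history, crowd_threshold=50):
    # Event detection stub: flag frame indices whose foreground area is large.
    return [i for i, n in enumerate(track_history) if n > crowd_threshold]

# Hypothetical end-to-end run on synthetic frames.
background = np.zeros((10, 10), dtype=np.uint8)
frames = [background.copy() for _ in range(3)]
frames[1][1:9, 1:9] = 255  # a large bright region ("crowd") in frame 1
history = [track(detect_changes(f, background)) for f in frames]
events = detect_events(history)  # frame 1 is flagged
```

A real system would replace each stub with the corresponding module (e.g., connected-component tracking, illumination-adaptive background models), but the data flow between the steps is as sketched.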
Analyzing video for use in surveillance situations imposes requirements such as real-time processing on compressed video streams, low cost, and robustness to camera viewpoint. Many surveillance scenes involving intermittent high traffic, for example a subway platform, have illumination conditions characterized by near-static situations mixed with occasional sudden changes due to changes in the platform state (e.g., extreme ambient illumination changes, shadowing, etc. due to train arrivals/departures in the scene). In addition, the video signal is corrupted by factors such as low-quality cameras, noise during signal transmission, and quantization due to compression.
We now provide a survey of related art in the field. The 2-D motion detection problem has been widely investigated since the earliest days of Computer Vision, as it provides a good basis for higher-level tasks such as motion estimation, tracking, robotics, and depth recovery. Prior literature on object detection using motion cues can be classified from two viewpoints: 1) where a reference frame of the background scene is available, motion detection is equivalent to background subtraction, which aims at locating the areas of the image domain that differ from the background reference frame; and 2) where the background scene changes dynamically such that an image of the background is not available, the problem is equivalent to change detection, where the proposed solutions are based on the inter-frame difference (or on update methods that statistically model and update the changing scene).
Simple approaches for change detection use thresholding techniques. The motion detection map is obtained by applying a pixel-wise (or block-wise) thresholding criterion to the observed difference image. However, such approaches are not robust to noise. In addition, the automatic determination of the threshold is an issue.
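As a concrete illustration, a minimal pixel-wise thresholding scheme of the kind described above might look as follows. The threshold value tau is a hand-picked assumption, which is precisely the difficulty noted above:

```python
import numpy as np

def threshold_change_map(frame, reference, tau=25):
    # Mark a pixel as changed when the absolute grayscale difference
    # from the reference frame exceeds a fixed threshold tau.
    # Choosing tau automatically, and coping with noise, are the
    # weaknesses of this simple scheme.
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    return (diff > tau).astype(np.uint8)  # 1 = changed, 0 = static

# Toy example: an 8x8 background with a bright 3x3 "object".
background = np.full((8, 8), 100, dtype=np.uint8)
frame = background.copy()
frame[2:5, 2:5] = 200
mask = threshold_change_map(frame, background, tau=25)  # 3x3 block of ones
```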
This issue was addressed by performing statistical analysis on the observed distribution of the difference frame. The statistical analysis approximates the frame-difference value distribution with a mixture model, where Gaussian or Laplacian component distributions are assumed for pixels under the different hypotheses: e.g., pixels corresponding to static objects or to mobile objects. A motion detection map can then be determined automatically via Bayes' rule, using the observed difference frame (i.e., the data) and the a posteriori probabilities of the different hypotheses given the data. While these methods are improvements over ad-hoc pixel-based classification schemes, they suffer from locality, since higher-order interactions across pixels are not modeled and decisions are made independently at each pixel.
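A minimal sketch of such a mixture-based classifier, assuming a two-component Gaussian mixture with hand-picked (rather than estimated) parameters, is shown below; in practice the mixture parameters would be fitted to the observed difference histogram, e.g. with expectation-maximization:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Gaussian density, written out to keep the example self-contained.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_motion_map(diff, p_static=0.7, mu_s=0.0, sig_s=5.0,
                     mu_m=40.0, sig_m=15.0):
    # Classify each difference value as static vs. mobile by comparing
    # (unnormalized) posteriors under the two mixture components.
    # All parameters here are illustrative assumptions.
    post_static = p_static * gaussian_pdf(diff, mu_s, sig_s)
    post_mobile = (1.0 - p_static) * gaussian_pdf(diff, mu_m, sig_m)
    return (post_mobile > post_static).astype(np.uint8)  # 1 = mobile

diff = np.array([0.0, 2.0, 10.0, 35.0, 60.0])
labels = bayes_motion_map(diff)  # small differences -> static, large -> mobile
```

Note that the decision is still made independently at each pixel, which is the locality weakness discussed above.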
This constraint can be addressed by using more complex models that introduce local interactions between neighboring pixels. For example, the use of Markov chains was proposed, where motion detection was viewed as a statistical estimation problem. However, these methods were constrained to interactions along rows or columns and hence had limited applicability. In addition, the use of spatial filters was proposed for situations where some a priori knowledge is available. Although these approaches demonstrate very good performance in controlled environments, they lack generality and are not able to deal with deformations or global illumination changes.
A further attempt to solve the motion detection and tracking problem formulated spatial (local) interaction constraints in the form of a Markov Random Field model. In this framework, the motion detection map is obtained by maximizing the a posteriori segmentation probability of a joint probability density function (incorporating the local Markov property) for the likelihood of label assignments given the observations. The main advantages of this approach are that it is less affected by the presence of noise and that it provides a global segmentation criterion. The optimization problem turns out to be equivalent to the minimization of a global objective function and is usually performed using stochastic (Mean-Field, Simulated Annealing) or deterministic relaxation algorithms (Iterated Conditional Modes, Highest Confidence First). However, although the Markov Random Field-based objective function is a very powerful model, it is usually computationally expensive, which may be perceived as a handicap.
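The deterministic relaxation variant can be illustrated with a small Iterated Conditional Modes (ICM) sketch. The data term and the Potts-style smoothness term below are simplified assumptions, not a specific published formulation; the point is that neighborhood interactions let the labeling suppress isolated noise responses that purely pixel-wise decisions would keep:

```python
import numpy as np

def icm_smooth(initial, diff, tau=25.0, beta=2.0, iters=5):
    # Greedily update each pixel's label to minimize a local energy:
    # a data term (how far the difference value is from the threshold)
    # plus a Potts smoothness term penalizing disagreement with the
    # four neighbors. The weights tau and beta are illustrative.
    labels = initial.copy()
    h, w = labels.shape

    def data_cost(d, label):
        # Label 1 (changed) is cheap when d is large; label 0 when small.
        return (tau - d) / tau if label == 1 else (d - tau) / tau

    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                best_label, best_energy = labels[y, x], np.inf
                for label in (0, 1):
                    energy = data_cost(diff[y, x], label)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] != label:
                            energy += beta  # smoothness penalty
                    if energy < best_energy:
                        best_label, best_energy = label, energy
                labels[y, x] = best_label
    return labels

# A 3x3 true object plus one isolated noisy pixel; ICM removes the noise.
diff = np.full((8, 8), 5.0)
diff[2:5, 2:5] = 50.0
diff[7, 0] = 50.0                      # isolated noise response
init = (diff > 25.0).astype(np.uint8)  # pixel-wise thresholding keeps it
result = icm_smooth(init, diff)        # ICM relabels it as background
```

Even this toy version visits every pixel repeatedly, hinting at the computational expense noted above; full MRF optimization over video-rate streams is correspondingly heavier.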
Accordingly, an efficient and accurate real-time video analysis technique for identifying events of interest, and particularly, events of interest in high-traffic video streams, which does not suffer from locality and which can handle deformations and global illumination changes, is highly desirable.