The invention relates to video surveillance systems and, more particularly, to a method and apparatus for tracking motion in three-dimensional space in such systems.
In an era of growing concern about terrorism and corporate scandal, video surveillance has become an essential part of security. While surveillance cameras can be useful for recording wrongdoing, their greater value is realized when they help prevent it.
Generally speaking, a fixed camera provides video surveillance for only a spatially limited area. As the area to be covered grows, so does the number of cameras needed for a given level of surveillance. For instance, using fixed cameras to monitor the border between two nations might require thousands of cameras, and the same could be true for protecting a large corporation. While most office buildings do not require thousands of cameras for adequate surveillance, tens or hundreds may be needed.
A source of problems with the prior art is that security personnel must monitor these cameras, either in real time or during replay, and an individual can watch only a limited number of cameras efficiently. For instance, studies have shown that an average person can watch only four to five cameras at an acceptable level of efficiency.
Most surveillance is concerned only with moving objects. Signal processing of video images in general, and for identifying moving objects in particular, is not new. For example, U.S. Pat. No. 5,930,379 discloses modeling an object as a branched kinematic chain composed of links connected at joints. Groups of pixels having like motion parameters are assigned to the links, and the motion parameters are estimated iteratively until the groups of pixels and their motion parameters converge and can be identified with the moving object.
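The iterative grouping described above can be illustrated, in greatly simplified form, by alternating two steps until the grouping stops changing: assign each pixel to the link whose motion best matches its own, then re-estimate each link's motion from its assigned pixels. This sketch is an assumption-laden illustration and not the method of U.S. Pat. No. 5,930,379: each link is reduced to a single 2-D translation, the joint constraints of the kinematic chain are ignored, per-pixel motion vectors are assumed to be given, and the function name `group_by_motion` is hypothetical.

```python
import numpy as np


def group_by_motion(flow, init_motions, max_iter=50):
    """Group per-pixel motion vectors into K translational "links".

    flow: (N, 2) per-pixel motion vectors (assumed to be given).
    init_motions: (K, 2) initial motion estimate for each link.
    Returns (labels, motions): the link index of each pixel and the
    final motion of each link.
    """
    flow = np.asarray(flow, dtype=float)
    motions = np.asarray(init_motions, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assign each pixel to the link whose motion is closest to its own.
        dist = np.linalg.norm(flow[:, None, :] - motions[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        # Converged: the grouping no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Re-estimate each link's motion from the pixels assigned to it.
        for k in range(len(motions)):
            members = labels == k
            if members.any():
                motions[k] = flow[members].mean(axis=0)
    return labels, motions
```

Because each link here is a pure translation, the loop reduces to k-means clustering on motion vectors; a faithful version would instead estimate a rigid motion per link subject to the joint constraints.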
U.S. Pat. No. 5,987,154 discloses detecting a moving object, calculating the local extremes of curvature of the boundaries of the moving object, comparing the local extremes with a stored model of a human head in order to find regions shaped like a human head, and identifying the head with a surrounding shape.
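The first steps of this approach, locating local extremes of curvature along a moving object's boundary, can be sketched as follows. The following are assumptions and are not taken from the cited patent: the boundary is already available as a closed polygon of (x, y) points, curvature is approximated by the turning angle at each vertex, and "extremes" are taken to be local maxima of the absolute turning angle. The comparison against a stored head model is not shown, and the helper names are hypothetical.

```python
import math


def turning_angles(contour):
    """Signed turning angle at each vertex of a closed polygonal boundary.

    contour is a list of (x, y) points; the angle between the incoming and
    outgoing edges approximates the local curvature at that vertex.
    """
    n = len(contour)
    angles = []
    for i in range(n):
        (x0, y0), (x1, y1) = contour[i - 1], contour[i]
        x2, y2 = contour[(i + 1) % n]
        turn = math.atan2(y2 - y1, x2 - x1) - math.atan2(y1 - y0, x1 - x0)
        # Wrap into [-pi, pi) so left and right turns differ only in sign.
        angles.append((turn + math.pi) % (2 * math.pi) - math.pi)
    return angles


def curvature_extremes(contour):
    """Indices of vertices whose |curvature| is a local maximum along the boundary."""
    k = [abs(a) for a in turning_angles(contour)]
    n = len(k)
    return [i for i in range(n) if k[i] > k[i - 1] and k[i] >= k[(i + 1) % n]]
```

On the eight-point square `[(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]`, only the four corners (indices 0, 2, 4 and 6) are returned; the edge midpoints have zero turning angle.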
U.S. Pat. No. 6,049,619 discloses a stratified moving object detection technique whose complexity increases gradually with scene complexity. From least to most complex, it handles: (i) scenarios in which the camera-induced motion can be modeled by a single two-dimensional parametric transformation, (ii) scenarios in which the camera-induced motion can be modeled in terms of a small number of layers of parametric transformations, and (iii) general three-dimensional scenes, in which a more complete parallax motion analysis is required.
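Scenario (i) can be illustrated with a minimal sketch: fit one global 2-D affine transformation (one possible parametric model) to point correspondences between two frames by least squares, then treat points that the model fails to explain as candidate independently moving objects. The correspondences, the choice of an affine model, the residual threshold, and the function names are assumptions for illustration only. Plain least squares is sensitive to outliers, so a practical system would more likely use a robust estimator.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix A with dst ~= A @ [x, y, 1]."""
    src = np.asarray(src, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])  # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(X, np.asarray(dst, dtype=float), rcond=None)
    return params.T


def independent_motion(src, dst, threshold=2.0):
    """Flag points whose motion the single global affine model fails to explain."""
    src = np.asarray(src, dtype=float)
    A = fit_affine(src, dst)
    predicted = np.hstack([src, np.ones((len(src), 1))]) @ A.T
    residual = np.linalg.norm(predicted - np.asarray(dst, dtype=float), axis=1)
    return residual > threshold
```

For example, on a 5 x 5 grid translated by (3, -2) with the center point pushed an extra 20 pixels to the right, only the center point is flagged: the least-squares fit absorbs just 1/25 of its displacement.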
U.S. Pat. No. 6,081,606 discloses processing a sequence of images and generating a flow field representing the motion within a scene. The flow field is a vector representation of the motion of the scene that represents both the magnitude and the direction of the motion. The flow field is generated by correlating at least two frames in the sequence of images. This flow field is analyzed by a flow field segmentor to determine the magnitude and direction of motion within the scene and segment the motion information from the static portions of the scene. An alarm detector then processes the motion information to determine if an alarm should be generated based upon the motion information.
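The processing chain described for U.S. Pat. No. 6,081,606 (correlate frames into a flow field, segment it by magnitude, then decide whether to raise an alarm) might be sketched as follows. This is an illustrative assumption, not the patented implementation: frames are grayscale NumPy arrays, correlation is performed by exhaustive block matching with a sum-of-absolute-differences score, and the block size, search range, magnitude threshold, and alarm rule (`min_blocks`) are hypothetical parameters.

```python
import numpy as np


def block_flow(prev, curr, block=8, search=4):
    """Per-block displacement (dy, dx) from prev to curr, found by exhaustive
    search for the shift with the smallest sum of absolute differences (SAD)."""
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    h, w = prev.shape
    rows, cols = h // block, w // block
    flow = np.zeros((rows, cols, 2))
    for r in range(rows):
        for c in range(cols):
            y0, x0 = r * block, c * block
            ref = prev[y0:y0 + block, x0:x0 + block]
            # Start from zero displacement so flat regions are not given
            # spurious motion when several shifts tie.
            best = np.abs(curr[y0:y0 + block, x0:x0 + block] - ref).sum()
            best_d = (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate window falls outside the frame
                    sad = np.abs(curr[y:y + block, x:x + block] - ref).sum()
                    if sad < best:
                        best, best_d = sad, (dy, dx)
            flow[r, c] = best_d
    return flow


def analyze(prev, curr, min_mag=1.0, min_blocks=1):
    """Segment moving blocks by flow magnitude and raise a simple alarm."""
    flow = block_flow(prev, curr)
    magnitude = np.hypot(flow[..., 0], flow[..., 1])
    direction = np.arctan2(flow[..., 0], flow[..., 1])  # radians
    moving = magnitude >= min_mag  # separate motion from static blocks
    alarm = int(moving.sum()) >= min_blocks
    return flow, magnitude, direction, moving, alarm
```

Exhaustive block matching is simple but slow; it is used here only because it makes the "correlating at least two frames" step explicit.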
U.S. Pat. Nos. 6,188,777 and 6,445,810 disclose marking and tracking regions of homogeneous color. In one approach, each image received at a primary camera is initially represented with pixels corresponding to the red, green, and blue channels of the image, and is converted into a “log color-opponent” space. This space can represent the approximate hue of skin color, as well as its log intensity value. More specifically, (R, G, B) tuples are converted into tuples of the form $(l(G),\ l(R) - l(G),\ l(B) - (l(R) + l(G))/2)$, where $l(x)$ denotes a logarithm function. In another approach, a lookup table is precomputed for all input values, quantizing the classification score (skin similarity value) to 8 bits and the input color channel values to 6, 7, or 8 bits.
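The log color-opponent conversion and the precomputed lookup table described above can be sketched as follows. The conversion follows the tuple given in the text; the use of the natural logarithm, the `eps` offset that avoids log(0), the bin-center quantization, and the example score function are assumptions, and all function names are hypothetical. With 6 bits per channel the table holds 64 × 64 × 64 = 262,144 one-byte scores.

```python
import numpy as np


def log_opponent(rgb, eps=1.0):
    """Map (..., 3) RGB values to (l(G), l(R) - l(G), l(B) - (l(R) + l(G)) / 2).

    l is taken to be the natural log of (value + eps); eps is an assumed
    offset that keeps l finite for zero-valued channels.
    """
    logs = np.log(np.asarray(rgb, dtype=float) + eps)
    lr, lg, lb = logs[..., 0], logs[..., 1], logs[..., 2]
    return np.stack([lg, lr - lg, lb - (lr + lg) / 2], axis=-1)


def build_lut(score_fn, bits=6):
    """Precompute an 8-bit score for every quantized (R, G, B) input.

    score_fn is a hypothetical skin-similarity function returning values in
    [0, 1]; each channel is quantized to `bits` bits.
    """
    levels = 1 << bits
    step = 256 / levels
    centers = (np.arange(levels) + 0.5) * step  # bin centers on the 0..255 scale
    r, g, b = np.meshgrid(centers, centers, centers, indexing="ij")
    scores = score_fn(np.stack([r, g, b], axis=-1))
    return np.clip(np.round(scores * 255), 0, 255).astype(np.uint8)


def lookup(lut, rgb, bits=6):
    """Score 8-bit RGB pixels with a single table lookup."""
    q = np.asarray(rgb, dtype=np.uint8) >> (8 - bits)  # keep the top bits
    return lut[q[..., 0], q[..., 1], q[..., 2]]
```

For a gray pixel (R = G = B) the two opponent components are zero, so only the log-intensity component carries information; this is what lets the space separate hue from brightness.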
U.S. Pat. No. 6,504,951 discloses classifying potential sky pixels in the image by color, extracting connected components of the potential sky pixels, eliminating ones of the connected components that have a texture above a predetermined texture threshold, computing desaturation gradients of the connected components, and comparing the desaturation gradients of the connected components with a predetermined desaturation gradient for sky to identify true sky regions in the image.
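The first three steps of this pipeline (classify candidate sky pixels by color, extract connected components, discard textured components) might be sketched as follows. The color rule (blue brighter than red and green), the use of intensity standard deviation as the texture measure, and the threshold value are hypothetical stand-ins rather than the criteria of U.S. Pat. No. 6,504,951, and the desaturation-gradient comparison is omitted.

```python
from collections import deque

import numpy as np


def candidate_sky(rgb):
    """Hypothetical color rule: blue exceeds both red and green."""
    rgb = np.asarray(rgb, dtype=float)
    return (rgb[..., 2] > rgb[..., 0]) & (rgb[..., 2] > rgb[..., 1])


def connected_components(mask):
    """4-connected components of a boolean mask, each an (n, 2) array of (row, col)."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    components = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            seen[r, c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            components.append(np.array(pixels))
    return components


def smooth_sky_components(rgb, texture_threshold=10.0):
    """Keep candidate components whose intensity standard deviation (an assumed
    texture measure) does not exceed texture_threshold."""
    rgb = np.asarray(rgb, dtype=float)
    intensity = rgb.mean(axis=-1)
    kept = []
    for pixels in connected_components(candidate_sky(rgb)):
        if intensity[pixels[:, 0], pixels[:, 1]].std() <= texture_threshold:
            kept.append(pixels)
    return kept
```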
Comaniciu et al., “Distribution Free Decomposition of Multivariate Data,” Pattern Analysis & Applications, 2:22-30 (1999) discloses using a mean shift technique to decompose multivariate data. An iterative procedure, together with density estimation functions, is used to reduce a large dataset to the few points that best describe the data.
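A minimal sketch of mean-shift mode seeking in the spirit of the cited paper: every sample is repeatedly moved to the kernel-weighted mean of the data until it stops moving, and samples that converge to the same point are merged, so a large dataset is summarized by a few modes. The Gaussian kernel, the fixed bandwidth, the merge radius of half a bandwidth, and the function name are assumptions. Each iteration costs O(n) per sample, so this direct form suits only small datasets.

```python
import numpy as np


def mean_shift_modes(data, bandwidth=1.0, tol=1e-5, max_iter=300):
    """Move every sample uphill on a Gaussian kernel density estimate until it
    stops moving, then merge the resulting points into distinct modes."""
    data = np.asarray(data, dtype=float)
    modes = []
    for x in data:
        for _ in range(max_iter):
            # Gaussian kernel weights of all samples relative to x.
            w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            shifted = w @ data / w.sum()  # weighted mean = one mean-shift step
            done = np.linalg.norm(shifted - x) < tol
            x = shifted
            if done:
                break
        # Points converging within half a bandwidth are treated as one mode.
        if all(np.linalg.norm(m - x) >= bandwidth / 2 for m in modes):
            modes.append(x)
    return np.array(modes)
```

For two tight clusters of three points each near (0, 0) and (10, 10), the default bandwidth of 1.0 yields two modes, one near each cluster center.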
Grimson et al., “Using Adaptive Tracking to Classify and Monitor Activities in a Site,” Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 22-31, (1998) discloses using camera coordinates of objects that pass through the fields of view of cameras, along with time, to find correspondences between the cameras. Once all the camera views are mapped onto one camera view, this mosaic camera view can be mapped onto a virtual overhead plane. Coordinates in the virtual overhead plane are used to track moving objects.
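The final mapping onto a virtual overhead plane can be illustrated with a planar homography, assuming that tracked objects move on a flat ground plane. The sketch estimates the homography from four or more image/ground correspondences with the direct linear transform and then maps image coordinates into overhead coordinates. How the correspondences are obtained (in the cited work, from objects observed passing through the views over time) is not shown, and the function names are hypothetical.

```python
import numpy as np


def fit_homography(src, dst):
    """Direct linear transform: 3x3 H with dst ~ H @ [x, y, 1] (needs >= 4 points)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary overall scale


def to_overhead(H, points):
    """Map image (x, y) points to overhead-plane coordinates."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]
```

With real pixel coordinates, the points are usually normalized (translated and scaled) before the DLT for numerical stability; the sketch skips that step.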
Horprasert et al., “A Robust Background Subtraction and Shadow Detection,” Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan (January 2000) discloses using chromaticity data separately from intensity data to perform background subtraction. All colors are treated as lying on a line in a three-dimensional space, and the difference between two colors is calculated as the distance between their color lines.
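A simplified sketch of a brightness/chromaticity decomposition of this kind: for an observed color I and an expected background color E, the brightness distortion alpha scales E to the point on E's color line (the line through the origin and E) nearest to I, and the chromaticity distortion is the remaining distance from I to that line. Thresholding the two values separates background, shadow (darker, same chromaticity), highlight (brighter, same chromaticity), and foreground. The thresholds and label codes are hypothetical, and the sketch omits the per-pixel normalization by color variation that a full implementation would apply.

```python
import numpy as np

BACKGROUND, SHADOW, HIGHLIGHT, FOREGROUND = 0, 1, 2, 3


def distortions(observed, expected):
    """Brightness distortion alpha and chromaticity distortion cd.

    alpha * E is the point on E's color line closest to the observed color I;
    cd is the distance from I to that line.
    """
    obs = np.asarray(observed, dtype=float)
    ref = np.asarray(expected, dtype=float)
    alpha = np.sum(obs * ref, axis=-1) / np.sum(ref * ref, axis=-1)
    cd = np.linalg.norm(obs - alpha[..., None] * ref, axis=-1)
    return alpha, cd


def classify(observed, expected, tau_cd=10.0, alpha_low=0.5, band=0.1):
    """Label pixels using hypothetical thresholds (no per-channel normalization)."""
    alpha, cd = distortions(observed, expected)
    fg = (cd > tau_cd) | (alpha < alpha_low)  # wrong chromaticity, or far too dark
    labels = np.full(alpha.shape, BACKGROUND)
    labels[~fg & (alpha < 1 - band)] = SHADOW  # same chromaticity, darker
    labels[~fg & (alpha > 1 + band)] = HIGHLIGHT  # same chromaticity, brighter
    labels[fg] = FOREGROUND
    return labels
```

Against a gray background of (100, 100, 100), the observed colors (100, 100, 100), (60, 60, 60), (130, 130, 130), (200, 0, 0) and (20, 20, 20) are labeled background, shadow, highlight, foreground and foreground respectively.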
Stauffer et al., “Adaptive Background Mixture Models for Real-Time Tracking,” Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 246-252 (1999) discloses performing background subtraction with models that change over time. More specifically, each pixel is represented by multiple Gaussian distributions.
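A per-pixel sketch of an adaptive Gaussian-mixture background model in the style of Stauffer et al.: each observed value either updates the matching Gaussian (within 2.5 standard deviations) or replaces the least probable one, and the highest-ranked components whose weights sum past a threshold `t` are treated as background. This is a simplified illustration for a single grayscale pixel. The parameter values are assumptions, and the mean/variance learning rate is simplified to the weight learning rate `alpha`.

```python
import numpy as np


class PixelMixture:
    """Adaptive mixture-of-Gaussians background model for one grayscale pixel."""

    def __init__(self, k=3, alpha=0.05, t=0.7, init_var=225.0, init_weight=0.05):
        self.alpha, self.t = alpha, t  # learning rate, background weight fraction
        self.init_var, self.init_weight = init_var, init_weight
        self.w = np.zeros(k)  # component weights (all unused at start)
        self.mu = np.zeros(k)
        self.var = np.full(k, init_var)

    def update(self, x):
        """Fold intensity x into the model; return True if x is background."""
        # Rank components by w / sigma; the first B whose weights sum past t
        # are taken to model the background.
        order = np.argsort(-self.w / np.sqrt(self.var))
        b = int(np.searchsorted(np.cumsum(self.w[order]), self.t, side="right")) + 1
        background = {int(i) for i in order[:b]}
        # A component matches if x lies within 2.5 standard deviations of it.
        matches = (np.abs(x - self.mu) <= 2.5 * np.sqrt(self.var)) & (self.w > 0)
        if matches.any():
            m = int(next(i for i in order if matches[i]))
            is_background = m in background
            self.w *= 1 - self.alpha
            self.w[m] += self.alpha
            rho = self.alpha  # simplification of rho = alpha * N(x | mu, sigma)
            self.mu[m] += rho * (x - self.mu[m])
            self.var[m] += rho * ((x - self.mu[m]) ** 2 - self.var[m])
        else:
            # No match: replace the least probable component with a new one.
            is_background = False
            j = int(np.argmin(self.w / np.sqrt(self.var)))
            self.w[j], self.mu[j], self.var[j] = self.init_weight, x, self.init_var
        self.w /= self.w.sum()
        return is_background
```

A full model keeps one such mixture per pixel (and per color channel or with a multivariate Gaussian); a value that persists long enough gains weight and is gradually absorbed into the background, which is what lets the model adapt over time.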
None of these documents describes a technique, such as the one described below, for monitoring multiple cameras with fewer personnel and/or at greater efficiency by electronically filtering and alerting personnel as to which cameras show unauthorized activity and, more importantly, by relieving personnel of monitoring some authorized activity.