The invention relates generally to a system and method for identifying discrete objects within a crowded environment, and more particularly to a system of imaging devices and computer-related equipment for ascertaining the location of individuals within a crowded environment.
There is a need for the ability to segment crowded environments into individual objects. For example, the deployment of video surveillance systems is becoming ubiquitous. Digital video is useful for efficiently providing lengthy, continuous surveillance. One prerequisite for such deployment, especially in large spaces such as train stations and airports, is the ability to segment crowds into individuals. The segmentation of crowds into individuals is known. Conventional methods of segmenting crowds into individuals utilize a model-based object detection methodology that is dependent upon learned appearance models.
A number of surveillance applications require the detection and tracking of people to ensure security, safety, and site management. Examples include the estimation of queue length in retail outlets and the monitoring of entry points, bus terminals, and train stations.
Also, automatic monitoring of mass experimentation on cells involves the high throughput screening of hundreds of samples. An image of each of the samples is taken, and a review of each image region is performed. Often, this automatic monitoring of mass experimentation relates to the injection of various experimental drugs into each sample, and a review of each sample to ascertain which of the experimental drugs has given the desired effect.
Substantial progress has been made in detecting individuals in constrained settings, i.e., frames of reference in which the size and/or shape of the individuals are assumed. To achieve this, it is often assumed that the individuals in the frame of reference are well separated and that identifying foreground objects is possible using a statistical background model. Certain actions can only be detected if the location of all individuals in the frame of reference is known. However, in all of the scenarios just mentioned, the individuals actually appear in groups. Additionally, it may be desirable to know the number of individuals present in the frame of reference.
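By way of illustration only, the statistical background model mentioned above can be sketched as a per-pixel mean and standard deviation estimated from frames assumed to contain only background; a pixel is then treated as foreground when it deviates from that model by more than a chosen number of standard deviations. The function names, the grayscale list-of-lists frame layout, and the parameters k and min_sigma are assumptions introduced for this sketch and do not describe any particular prior-art system.

```python
from statistics import mean, pstdev

def fit_background(frames):
    """Estimate a per-pixel mean and standard deviation from background-only frames.

    Each frame is assumed to be a list of rows of grayscale intensities.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mu = [[mean([f[r][c] for f in frames]) for c in range(width)]
          for r in range(height)]
    sigma = [[pstdev([f[r][c] for f in frames]) for c in range(width)]
             for r in range(height)]
    return mu, sigma

def foreground_mask(frame, mu, sigma, k=3.0, min_sigma=1.0):
    """Mark a pixel as foreground when it lies more than k sigma from the model.

    min_sigma (an assumed safeguard) keeps nearly constant pixels from
    triggering on tiny intensity changes.
    """
    return [
        [abs(frame[r][c] - mu[r][c]) > k * max(sigma[r][c], min_sigma)
         for c in range(len(frame[0]))]
        for r in range(len(frame))
    ]

# Three hypothetical 2x2 background frames and one new frame with a bright pixel.
background = [
    [[10, 10], [10, 10]],
    [[12, 12], [12, 12]],
    [[11, 11], [11, 11]],
]
mu, sigma = fit_background(background)
mask = foreground_mask([[11, 50], [12, 11]], mu, sigma)
```

A model of this kind yields connected foreground regions, which is why it works best when individuals are well separated; when individuals overlap, a single region may cover several of them.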
Various techniques have been applied to construct fast and reliable individual detectors, for example, for surveillance applications. Classification techniques can be applied to decide if a given image region contains a person. The use of Support Vector Machines is one way to approach this problem. Another method for solving this problem is using a tree based classification to represent possible shapes of individuals within a group. Yet another way to determine if a region contains a person is to use dynamic point distribution models.
An alternative to modeling the appearance of an entire individual is to design detectors for specific parts of the individual, such as a head or a foot of a human or the neck of a bottle, and combine the results of the detection of those specific parts. Learning part detectors using AdaBoost and a set of weak classifiers is one approach to this problem. A learning approach is then used to combine the set of weak classifiers into body part detectors, which are further combined using a probabilistic person model. All of these approaches require a fair amount of training data to learn the parameters of the underlying model. Although these classifiers are robust to limited occlusions, they are not suitable for segmenting a group into individuals, especially when the individuals within the group have freedom of movement, such as crowds of animals or people.
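By way of illustration only, the boosting step referred to above can be sketched with generic AdaBoost over threshold classifiers ("decision stumps") on a single scalar feature. The feature values, labels, and function names below are hypothetical; the sketch shows only how weak classifiers are weighted and combined, not the part detectors or the probabilistic person model of any cited approach.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: returns polarity when x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Choose the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=5):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical scalar feature responses with labels (+1 = part present).
features = [1, 2, 3, 6, 7, 8]
labels = [-1, -1, -1, 1, 1, 1]
model = adaboost(features, labels)
```

Even in this toy form, the ensemble is only as good as the labeled examples it is trained on, which reflects the training-data requirement noted above.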
One way in which segmenting a group into individuals has been achieved is to use the information available from several frames of reference, for example, various views from multiple cameras. For example, the M2Tracker explicitly assigns the pixels in each camera view to a particular person using color histograms. A. Mittal and L. S. Davis, M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo, PROC. 7TH EUROPEAN CONF. COMPUTER VISION, Copenhagen, Denmark, vol. X, pp. 18-33 (2002).
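By way of illustration only, assigning pixels to persons by color histogram can be sketched as follows: a coarse normalized RGB histogram is built for each person from sample pixels, and a new pixel is assigned to the person whose histogram gives its color bin the highest probability. This is a simplified single-view sketch and is not the method of the cited reference, which additionally relies on multiple views and region-based stereo. The bin count, the probability floor, and all names are assumptions.

```python
from collections import Counter

def quantize(rgb, bins=4):
    """Map an (r, g, b) triple with 0-255 channels to a coarse color bin."""
    step = 256 // bins
    return tuple(channel // step for channel in rgb)

def build_histogram(pixels, bins=4):
    """Normalized histogram of quantized colors from one person's sample pixels."""
    counts = Counter(quantize(p, bins) for p in pixels)
    total = sum(counts.values())
    return {color_bin: n / total for color_bin, n in counts.items()}

def assign_pixel(rgb, histograms, bins=4, floor=1e-6):
    """Return the person whose histogram gives this pixel's bin the highest probability.

    Bins never observed for a person receive a small assumed floor probability.
    """
    color_bin = quantize(rgb, bins)
    return max(histograms, key=lambda name: histograms[name].get(color_bin, floor))

# Hypothetical sample pixels: one person in red clothing, one in blue.
histograms = {
    "person_a": build_histogram([(250, 10, 10), (240, 20, 5)]),
    "person_b": build_histogram([(10, 10, 250), (5, 30, 240)]),
}
owner_red = assign_pixel((230, 15, 12), histograms)
owner_blue = assign_pixel((20, 5, 245), histograms)
```

An assignment of this kind becomes ambiguous when two individuals wear similar colors, which is one reason such approaches draw on several views.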
Another way of resolving multiple camera views is to use the camera calibration together with a head detector to locate possible head locations. The locations of all individuals in a scene are estimated by maximizing an observation likelihood using Markov Chain Monte Carlo. In this situation, it is extremely helpful to know the location of the ground plane and the camera parameters, as the head detector is based on edge information. However, under certain imaging conditions, extracting clean edge maps can be challenging. Additionally, installing multiple cameras is expensive, and managing the acquisition and analysis of the images from all of the cameras is both complex and costly.
A traditional non-video way of counting individuals is to use turnstiles. However, the installation of the turnstiles can be costly, both in initial expense and in the loss of floor space and flexibility of entrance/egress.
As such, a need exists for a way to analyze a frame of reference to segment a group into individuals for counting and tracking without the need for multiple frames of reference. A further need exists for using this information to determine likely action scenarios of individuals within a group.