1. Technical Field
The present disclosure relates to crowd segmentation and, more specifically, to fast crowd segmentation using shape indexing.
2. Discussion of Related Art
The detection, localization and tracking of human subjects within a video stream have many practical applications. For example, detection, localization and tracking may be used to design an intuitive computer interface in which users can control a computer system with their movements. Additionally, in the field of security and surveillance, it is particularly useful to know how many human subjects appear in the video and where they are located.
Detection, localization and tracking of an isolated individual generally do not pose a particular challenge for computer vision systems; however, when the video stream includes multiple people in close proximity, where people can partially occlude each other, detection, localization and tracking can become particularly difficult. Techniques have been developed for isolating individuals from within a group, and these techniques are known as “crowd segmentation.”
Conventional approaches to crowd segmentation may be grouped into three categories. The first category includes appearance-based approaches. These approaches may involve the use of “head detection,” where the video image data stream is inspected for the occurrence of the “Ω” shape that is generally associated with the contour of a person's head and shoulders. However, when using this approach, the head cannot always be reliably detected across different viewing angles and at long distances. Accordingly, head detection techniques alone are often insufficient to accurately segment a crowd. Other appearance-based approaches may use learned local appearance descriptors. For example, B. Leibe, E. Seemann, and B. Schiele, Pedestrian Detection in Crowded Scenes, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 878-885, 2005, relates to an interest-point and local feature descriptor-based detector, followed by global grouping constraints, used to detect humans.
B. Wu and R. Nevatia, Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors, Proc. Intl. Conference on Computer Vision, 1:90-97, 2005, relates to a parts-based human detector that is extended to handle multiple humans. Such approaches of the related art are complex and computationally intensive, and may also be ineffective when used for arbitrary surveillance situations.
The second category includes grouping-based approaches, which typically use motion features to isolate tracks of people and infer their positions in frames. For example, V. Rabaud and S. Belongie, Counting Crowded Moving Objects, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1:705-711, 2006; and G. Brostow and R. Cipolla, Unsupervised Bayesian Detection of Independent Motion in Crowds, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1:594-601, 2006, relate to clustering trajectories in space and time, followed over several frames, for coherence. This method is used to count moving objects in dense crowds but is not satisfactory for localization.
The third category includes generative-model-based parameter optimization approaches, which model the image formation process as parameterized by the attributes of humans in the scene. The parameter set that best explains the observed image may then be identified.
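By way of illustration only, the general idea of identifying the parameter set that best explains an observation may be sketched as follows. This is a toy example and does not reproduce any of the cited techniques: it assumes a one-dimensional foreground mask, a hypothetical fixed silhouette width (PERSON_WIDTH), a hypothetical per-person penalty (person_penalty), and an exhaustive search in place of the annealing, particle filter, or MCMC optimizers used in the related art.

```python
from itertools import combinations

PERSON_WIDTH = 4  # hypothetical fixed silhouette width, in pixels


def render(positions, width):
    """Synthesize a 1-D foreground mask from hypothesized left edges."""
    mask = [0] * width
    for left in positions:
        for x in range(left, min(left + PERSON_WIDTH, width)):
            mask[x] = 1
    return mask


def cost(positions, observed, person_penalty=0.5):
    """Count pixel mismatches and add a small penalty per hypothesized person."""
    synthesized = render(positions, len(observed))
    mismatches = sum(s != o for s, o in zip(synthesized, observed))
    return mismatches + person_penalty * len(positions)


def best_explanation(observed, max_people=3):
    """Exhaustively search person counts and positions for the lowest cost."""
    candidates = range(len(observed) - PERSON_WIDTH + 1)
    best = ((), cost((), observed))
    for n in range(1, max_people + 1):
        for positions in combinations(candidates, n):
            c = cost(positions, observed)
            if c < best[1]:
                best = (positions, c)
    return best


# A 6-pixel-wide blob is too wide for one silhouette, so two overlapping
# people explain it better than one.
observed = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
positions, score = best_explanation(observed)
```

Even in this toy setting, the number of hypotheses grows combinatorially with the number of people and candidate positions, which suggests why searching a high-dimensional parameter space may be slow.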
J. Rittscher, P. Tu, and N. Krahnstover, Simultaneous Estimation of Segmentation and Shape, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2:486-493, 2005, relates to partitioning a given set of image features using a likelihood function that is parameterized on the shape and location of potential individuals in the scene. This approach uses a variant of the Expectation Maximization algorithm to perform global annealing-based optimization and finds maximum likelihood estimates of the model parameters and the grouping.
In A. E. Elgammal and L. S. Davis, Probabilistic Framework for Segmenting People Under Occlusion, Proc. IEEE Intl. Conf. on Computer Vision, 2:145-152, 2001, humans are assumed to be isolated as they enter the scene so that a human-specific color model can be initialized for segmentation when occlusion occurs later. One particular problem with this approach is that the initial assumption is not necessarily valid in crowded situations.
In M. Isard and J. MacCormick, Bramble: A Bayesian Multiple-Blob Tracker, Proc. IEEE Intl. Conf. on Computer Vision, 2:34-41, 2001, a generalized-cylinder-based representation is used to model humans and their appearance. The number and positions of the humans are then tracked using a particle filter.
T. Zhao and R. Nevatia, Bayesian Human Segmentation in Crowded Situations, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2:12-20, 2003, relates to a generative process where parameters including the number of people, their location and their shape are used to track individuals. This technique uses Markov Chain Monte Carlo (MCMC) to achieve global optimization by searching for maximum likelihood estimates for the model parameters. These approaches may be complicated and may involve the use of a high dimensional parameter space. Accordingly, the process for searching for the best parameters may be particularly slow.
Accordingly, the most effective techniques of the related art are highly complex and may thus require the use of costly hardware and/or may not be fast enough to perform detection, localization and tracking of individual human subjects within a crowd from a video stream in real time.