The inherent ability of the human visual system to perceive coherent motion patterns in crowded environments is remarkable. For example, G. Johansson, “Visual motion perception,” Scientific American, 14:76-78 (1975) describes experiments supporting this visual perception capability. Such experiments demonstrate the innate ability of humans to distinguish activities and count independent motions simply from two-dimensional projections of a sparse set of feature points that are manually identified on human joints. When similar experiments were conducted with video segments in which the feature points were reduced to a swarm of moving bright dots against a dark background, human observers were easily able to detect and classify the moving objects. While humans can perform such data extraction operations unconsciously, automated systems still face difficulty in detecting and counting independently moving objects based on feature point trajectories alone.
Previous approaches to detecting multiple moving objects, particularly humans, include methods based on complex shape models, generative shape models based on low-level spatial features, bag-of-features, low-level motion pattern models, body-part assembly models, and low-level feature track clustering. P. Tu and J. Rittscher, “Crowd segmentation through emergent labeling,” Proc. ECCV Workshop on Statistical Methods in Video Processing (2004) employs a different approach to crowd segmentation by arranging spatial features to form cliques. That work poses the multi-object detection problem as one of finding a set of maximal cliques in a graph. The spatial features form the graph vertices, and each edge weight corresponds to the probability that the two features it connects arise from the same individual. The spatial feature similarity measure is based on the assumption that the vertices lie on the circular contour of a human seen from above. J. Rittscher, P. Tu, and N. Krahnst, “Simultaneous estimation of segmentation and shape,” Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 486-493 (2005) discloses a variant of the expectation-maximization algorithm for the estimation of shape parameters from image observations via hidden assignment vectors of features to cliques. The features are extracted from the bounding contours of foreground blob silhouettes, and each clique (representing an individual) is parameterized as a simple bounding box.
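The clique formulation described above can be illustrated with a minimal sketch. It is not the method of the cited references. The function `same_person_probability`, its Gaussian fall-off, and the `diameter` and `threshold` parameters are hypothetical stand-ins for the circular-contour similarity measure. The sketch also simplifies the problem by thresholding the edge weights and enumerating all maximal cliques with the Bron-Kerbosch algorithm, rather than selecting a consistent set of cliques from the weighted graph.

```python
import math
from itertools import combinations


def same_person_probability(p, q, diameter=1.0):
    """Hypothetical edge weight: decays with the distance between two features.

    Stands in for the circular-contour similarity measure, which is not
    specified in enough detail here to reproduce.
    """
    return math.exp(-(math.dist(p, q) / diameter) ** 2)


def build_graph(points, threshold=0.5, diameter=1.0):
    """Connect two features when their assumed edge weight meets the threshold."""
    adjacency = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if same_person_probability(points[i], points[j], diameter) >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Two well-separated groups of feature points (illustrative coordinates).
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (5.0, 5.0), (5.2, 5.0)]
groups = maximal_cliques(build_graph(points))
print(groups)  # [[0, 1, 2], [3, 4]]
```

In this toy input each maximal clique corresponds to one candidate individual. With overlapping groups the enumeration would return overlapping cliques, which is why an approach of this kind needs a further step to assign each feature to a single clique.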
Despite recent progress in automatically detecting coherent motion patterns in a video stream, the sensitivity and accuracy of machine detection of motion patterns have not consistently matched the performance of the human visual system. In view of this, a further improvement in consistent and accurate detection of such coherent motion patterns in a video stream is desired.