The accurate classification of objects in an image or series of images is highly desirable in applications such as video surveillance or moving target detection in ground or low-altitude air vehicles (manned or unmanned). Such applications need to detect moving objects in an operating environment. Military vehicles need to automatically detect potential targets/threats that pop up or move into view and to alert the vehicle operator to these potential threats. The safe operation of (unmanned) ground vehicles requires detecting moving and stationary pedestrians/dismounted personnel in order to prevent accidents. In such applications it is desirable to verify, in an entire image or image patch (region), the presence or absence of instances of particular object classes such as cars, people, bicycles, etc. The problem is very challenging because the appearance of object instances in the same category varies substantially due to changes in pose, aspect and shape. Ideally, a representation should be flexible enough to cover a wide range of visually different object classes, each with large within-category variations, while still retaining good discriminative power between the object classes.
“Part” or “fragment” based models, which combine local image features or regions into loose geometric assemblies, offer one possible solution to this problem. Constellation models provide a probabilistic way to mix the appearance and location of local descriptors. However, one of the major limitations of constellation models is that they require an explicit enumeration over possible matchings of model features to image features. This optimal but expensive step limits the model to relatively few detected features. Thus, to keep computational requirements low, a large amount of available image information must be ignored, especially in cases where objects in an image or video stream have many parts.
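The cost of enumerating matchings can be illustrated with a small counting sketch. It assumes, purely for illustration, that each model part is matched to a distinct detected image feature and that occlusion is ignored, so the number of candidate matchings is the number of ordered selections of parts from features. The function name `num_assignments` and the chosen part and feature counts are hypothetical and do not come from any particular constellation model.

```python
import math

def num_assignments(num_features, num_parts):
    """Count ordered assignments of model parts to distinct image features.

    Assumption for illustration: every part is matched to exactly one
    detected feature and no two parts share a feature (no occlusion).
    """
    return math.perm(num_features, num_parts)

# Hypothetical model with 6 parts, evaluated against growing feature counts.
for n in (10, 20, 30, 100):
    print(f"{n} features, 6 parts: {num_assignments(n, 6)} candidate matchings")
```

Under this simplified assumption the count grows roughly as the number of features raised to the number of parts, which is consistent with the restriction to relatively few detected features noted above.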
A “bag-of-features” representation, which models an image as an orderless collection of local features, has become increasingly popular for object categorization due to its simplicity and good performance. Bag-of-features representations evolved when texton-based texture analysis models began to be applied to object recognition. “Bag-of-features” representations are analogous to “bag-of-words” representations used in document analysis, in which image patches are the visual equivalents of individual “words” and the image is treated as an unstructured set (“bag”) of patches. One bag-of-features representation known in the art is described in “Learning Compositional Categorization Models”, Proceedings European Conference on Computer Vision (ECCV06), 2006 (hereinafter “Ommer and Buhmann”). Ommer and Buhmann describe a composition of individual features as the basic unit in a bag-of-features representation. However, using individual features in a bag-of-features representation has been shown to be not very discriminative, which makes the model susceptible to classifying background features as part of a desired feature of interest. In addition, disregarding the spatial relations among local features also severely limits the descriptive ability of the representation. Moreover, such models cannot deal with large within-category variations of the same object caused by aspect, pose and shape variations.
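A generic bag-of-features representation of the kind described above can be sketched as follows. This is a minimal illustration, not the method of Ommer and Buhmann: it assumes that local patch descriptors and a codebook of “visual words” are already available (both are random placeholders here), and the function name `bag_of_features` is hypothetical. Each descriptor is assigned to its nearest visual word and the image is summarized by a normalized histogram of word counts.

```python
import numpy as np

def bag_of_features(descriptors, codebook):
    """Return a normalized histogram of visual-word counts for one image."""
    # Squared Euclidean distance from every descriptor to every codeword.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Index of the nearest visual word for each descriptor.
    words = np.argmin(dists, axis=1)
    counts = np.bincount(words, minlength=len(codebook)).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))      # placeholder: 8 visual words, 16-D
descriptors = rng.normal(size=(50, 16))  # placeholder: 50 patch descriptors

hist = bag_of_features(descriptors, codebook)

# Reordering the patches leaves the representation unchanged.
shuffled = bag_of_features(rng.permutation(descriptors), codebook)
print(np.array_equal(hist, shuffled))
```

Because the histogram depends only on which visual words occur and how often, any reordering of the patches produces the same vector. This is the sense in which the representation is orderless, and it is also why spatial relations among local features are not captured.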
Accordingly, what would be desirable, but has not yet been provided, is a method for creating a strong (i.e., highly discriminative) classifier that effectively and automatically classifies objects in one or more images of a video sequence or data stream.