As is known in the art, the rapid and accurate detection of diverse objects in cluttered, natural scenes has many real-world applications, such as robot navigation, human-computer interaction, image retrieval, and automated surveillance. One challenge is to deal with large variations in the shape and appearance of objects within an object category, as well as variations resulting from changes in viewpoint, lighting, and imaging device.
Many methods used to recognize objects have focused on texture-based interest-points, see for example, [K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1-2), 2005]. These features are typically based on quantitative measurement of filter responses, and are placed at informative regions such as corners, blobs and T-junctions. They have been used as the atomic input in the visual process of both the part-based model, see for example, [R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003], and the bag-of-features method, see: [G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004]; [L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005]; [K. Grauman and T. Darrell. Efficient image matching with distributions of local invariant features. In CVPR, 2005]; and [S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006].
Although interest-points have been very effective for wide baseline matching and single object recognition, see: [H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, May 2006]; [V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In CVPR, 2005]; and [D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004], they seem to be less well suited to categorical object detection. The main reason is that interest-points are designed to capture specific image structures, while an ideal feature representation should adapt to the shape that is common to the object category and exhibit different levels of complexity.
Recently, there has been an impressive body of work on using contour information to address these limitations. Shotton et al. [J. Shotton, A. Blake, and R. Cipolla. Contour-based learning for object detection. In ICCV, 2005] explore an object detection system that exploits only contour fragments. Opelt et al. [A. Opelt, A. Pinz, and A. Zisserman. A boundary-fragment-model for object detection. In ECCV, 2006] propose the boundary-fragment-model (BFM). Both papers use Adaboost for feature selection. Ferrari et al. [V. Ferrari, T. Tuytelaars, and L. Van Gool. Object detection by contour segment networks. In ECCV, 2006] present a family of scale-invariant shape features formed by chains of connected and roughly straight contour segments. These methods focus on the object shape and have demonstrated a promising capability for dealing with appearance variations. In fact, contour-based features have been used extensively and date back to the early model-based recognition work [E. Grimson. From Images To Surfaces: A Computational Study of the Human Early Vision System. MIT Press, Cambridge, Mass., 1981].
Other related techniques known in the art include: gradient histogram based features such as SIFT [D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004]; shape context [S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24(4):509-522, 2002]; and HOG [N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005].
The work most closely related to the present invention is the local tag arrangement (LTA) proposed by Amit et al. [Y. Amit, D. Geman, and B. Jedynak. Efficient focusing and face detection. Technical Report 459, Department of Statistics, University of Chicago, 1997] in the context of face detection. In their framework, local features are represented by spatial arrangements of edge fragments in a rectangular region.
In accordance with the present invention, a method is provided for generating a master map for a generic class of objects, comprising: selecting a subset of frequent templates from a template pool having a plurality of templates having various degrees of complexity; formulating a feature selection algorithm to determine a most discriminative template from a pre-selected one of the templates in the template pool; and generating the master map from the formulated feature selection algorithm.
In one embodiment, the degree of complexity is controllable.
In one embodiment, the templates have segment regions, each one of such regions being adapted to have therein fragments having a predetermined size and one of a predetermined plurality of different spatial orientations, and wherein the degree of complexity is varied by the number of fragments in the templates.
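By way of illustration only, a template of this kind might be held in memory as sketched below. The sketch assumes four quantized orientations and at most one fragment per segment region; neither figure is stated in the text, and the names `Template`, `from_pairs`, `complexity`, and `is_contained_in` are hypothetical.

```python
from dataclasses import dataclass

# Assumption: four quantized orientations. The text only requires
# "a predetermined plurality" of orientations.
NUM_ORIENTATIONS = 4


@dataclass(frozen=True)
class Template:
    """A set of edge fragments, each given as (segment_region, orientation).

    Fragments are stored as a sorted tuple so that templates are hashable
    and two templates with the same fragments compare equal.
    """
    fragments: tuple

    @staticmethod
    def from_pairs(pairs):
        pairs = tuple(sorted(set(pairs)))
        regions = [region for region, _ in pairs]
        # Assumption: a segment region holds at most one fragment.
        if len(regions) != len(set(regions)):
            raise ValueError("each segment region holds at most one fragment")
        for _, orientation in pairs:
            if not 0 <= orientation < NUM_ORIENTATIONS:
                raise ValueError("orientation out of range")
        return Template(pairs)

    @property
    def complexity(self):
        # The degree of complexity is the number of fragments.
        return len(self.fragments)

    def is_contained_in(self, other):
        # True when every fragment of this template also occurs in `other`.
        return set(self.fragments) <= set(other.fragments)
```

Under these assumptions, a two-fragment template is simply `Template.from_pairs([(0, 3), (2, 1)])`, and raising the complexity amounts to adding further `(segment_region, orientation)` pairs.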
In one embodiment, a method is provided for generating a master map for a generic class of objects. The method includes: (A) defining a template having segment regions, each one of such regions being adapted to have therein features having a predetermined size and one of a predetermined plurality of different spatial orientations; (B) obtaining images of different types of objects within the generic class of objects, such images being scaled to a common size and partitioned into image regions, each one of the image regions having a common region of the obtained images, such common region providing a region stack; (C) for each one of the region stacks: (a) applying the template to each one of the images in such region stack to extract, from each one of the images, features having the predetermined size and one of the predetermined plurality of different spatial orientations, to generate, for each one of the images in the region stack, an extracted template; (b) determining, from the extracted templates, a most frequent extracted template among the extracted templates having only a first predetermined number of features with a common spatial orientation; (c) recording the number of images in the region stack having the determined most frequent extracted template; (d) repeating (b) and (c) with a successively increasing predetermined number of features until the number of recorded images falls below a predetermined threshold; and (e) selecting as a master extracted template for such one of the region stacks, the one of the most frequent templates having the largest recorded number of features; (D) combining the master extracted templates for each one of the region stacks into a map for the class of objects; and (E) comparing the map with each one of a plurality of background images to remove, from the map, master extracted templates therein matching segment characteristics of the background to produce the master map for the class of objects.
In one embodiment the features are edge fragments of the object.
In one embodiment, a method is provided for generating a master map for a generic class of objects. The method partitions images of different types of objects within a class into region stacks. For each one of the stacks, the method: (a) applies a template to extract fragments having a predetermined size and one of a plurality of different spatial orientations, to generate extracted templates; (b) determines, from the extracted templates, a most frequent one thereof having only a first number of fragments with a common spatial orientation; (c) records the number of images having the determined most frequent extracted template; (d) repeats (b) and (c) with a successively increasing number of fragments until the number of recorded images falls below a threshold; and (e) selects as a master extracted template the one of the most frequent templates having the largest recorded number of fragments. The master extracted templates for the stacks are combined into a map that is then compared with background images to remove extracted templates matching segments in the background.
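Steps (b) through (e) for a single region stack can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each extracted template is assumed to be a `frozenset` of `(segment_region, orientation)` pairs, and the most frequent template at each size is approximated greedily by adding one fragment at a time rather than by searching all combinations. The function name `grow_master_template` is hypothetical.

```python
def grow_master_template(extracted, threshold):
    """Greedy sketch of steps (b)-(e) for one region stack.

    extracted: list of frozensets of (segment_region, orientation) pairs,
               one per image in the region stack (assumed representation).
    threshold: minimum number of images that must contain the template.

    Returns (template, support), where support is the number of images
    containing the returned template.
    """
    template = frozenset()
    support = len(extracted)
    candidates = set().union(*extracted) if extracted else set()
    while True:
        best, best_count = None, -1
        # (b) try each one-fragment extension and keep the most frequent;
        #     sorting makes tie-breaking deterministic.
        for fragment in sorted(candidates - template):
            grown = template | {fragment}
            # (c) count the images whose extracted template contains it
            count = sum(1 for e in extracted if grown <= e)
            if count > best_count:
                best, best_count = fragment, count
        # (d) stop once the best extension falls below the threshold
        if best is None or best_count < threshold:
            # (e) the largest template still meeting the threshold
            return template, support
        template = template | {best}
        support = best_count
```

For example, with four images whose extracted templates are `{(0,0),(1,2),(3,1)}`, `{(0,0),(1,2)}`, `{(0,0),(1,2),(2,3)}`, and `{(0,0),(3,1)}`, and a threshold of 3, the sketch returns the two-fragment template `{(0,0),(1,2)}` with a support of 3, because every three-fragment extension is present in only one image.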
In one embodiment, a method is provided for generating a master map for a generic class of objects. The method defines a template having segment regions, each one of such regions being adapted to have therein fragments having a predetermined size and one of a predetermined plurality of different spatial orientations. The method obtains images of different types of objects within the generic class of objects, such images being scaled to a common size and partitioned into image regions, each one of the image regions having a common region of the obtained images, such common region providing a region stack. For each one of the region stacks, the method: (a) applies the template to each one of the images in such region stack to extract, from each one of the images, fragments having the predetermined size and one of the predetermined plurality of different spatial orientations, to generate, for each one of the images in the region stack, an extracted template; (b) determines, from the extracted templates, a most frequent extracted template among the extracted templates having only a first predetermined number of fragments with a common spatial orientation; (c) records the number of images in the region stack having the determined most frequent extracted template; (d) repeats (b) and (c) with a successively increasing predetermined number of fragments until the number of recorded images falls below a predetermined threshold; and (e) selects as a master extracted template for such one of the region stacks, the one of the most frequent templates having the largest recorded number of fragments. The method combines the master extracted templates for each one of the region stacks into a map for the class of objects and then compares the map with a plurality of background images to remove, from the map, master extracted templates therein matching segment characteristics of the background to produce the master map for the class of objects.
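The final combining and background-comparison steps might look like the sketch below. The text does not state how a match with the background is decided, so the sketch makes two assumptions: a master extracted template matches a background image when all of its fragments appear in the same region of that image, and it is removed when the fraction of matching background images exceeds a hypothetical parameter `max_background_rate`. The function name `build_master_map` is likewise hypothetical.

```python
def build_master_map(masters, backgrounds, max_background_rate=0.1):
    """Combine per-stack master templates and drop background-like ones.

    masters:     dict mapping region-stack id to a frozenset of
                 (segment_region, orientation) pairs.
    backgrounds: list of dicts with the same keys, one per background
                 image, holding the fragments extracted from that image.
    max_background_rate: assumed cut-off on the fraction of background
                 images a template may match before it is removed.
    """
    master_map = {}
    for region, template in masters.items():
        if not template:
            continue  # nothing was learned for this region stack
        # Assumed match rule: every fragment of the template is present
        # in the same region of the background image.
        hits = sum(
            1 for bg in backgrounds
            if template <= bg.get(region, frozenset())
        )
        rate = hits / len(backgrounds) if backgrounds else 0.0
        if rate <= max_background_rate:
            master_map[region] = template
    return master_map
```

A stricter or looser match rule, for example one tolerating a missing fragment, could be substituted without changing the overall structure.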
The present invention differs from LTA in a number of ways. First, the present invention captures long range line structures (e.g., edges) instead of isolated edge pixels. Second, the present invention learns feature templates with variable complexities instead of a fixed configuration. This property is crucial since it is desirable that the feature adapt to the object shape and avoid over- or under-representation. Finally, the detection model in LTA is purely generative. It provides interpretable and repeatable features, but the discriminative power of the model is ignored. With the present invention, the method uses a hybrid of generative and discriminative models for feature selection. The learned features retain both interpretability and discriminative power.
In one embodiment, the method uses an edge-fragment based feature for object detection, where the term detection refers to both image categorization and object localization. The object is represented by a collection of templates. Each template is defined by a group of local edge fragments. In contrast to the traditional interest-point features, edge fragments can be detected stably on the object boundary despite large shape deformations, and can be matched in a manner largely invariant to illumination changes and object colors. A further advantage is that edge detection and tracing are very efficient. By exploring the local and global edge configuration, the method can drastically reduce the object search to a small number of regions of interest (ROIs) with minimal computation and few missed detections. More sophisticated classifiers can be further introduced to verify each preliminary detection.
The template is referred to as a Flexible Edge Arrangement Template (FEAT), as it offers a great deal of flexibility by varying the extent and orientation of individual edge fragments, as well as the number of edge fragments and their spatial distribution within each template. However, the richness of this template pool also renders feature selection a daunting challenge. The task is to choose a minimal subset of templates that best captures the object shape, while being distinguishable from non-objects. As noted above, the method starts from a subset of frequent templates. The subsets are selected independently on individual spatial bins. At a second stage, the method considers the joint feature statistics and uses discriminant analysis to determine the optimal feature set.
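The text does not specify the form of the second-stage analysis. Purely as a simplified stand-in, the sketch below ranks candidate templates by a per-template Fisher score computed from binary detection responses on object and non-object images. It scores each template separately and therefore does not model the joint feature statistics mentioned above. The names `fisher_scores` and `select_templates` and the parameters `k` and `eps` are hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(pos, neg, eps=1e-9):
    """Per-template Fisher score (a simplified, assumed criterion).

    pos, neg: lists of response vectors, one per object image (pos) or
              non-object image (neg); entry j is 1 if template j was
              found in that image and 0 otherwise.
    """
    scores = []
    for j in range(len(pos[0])):
        p = [row[j] for row in pos]
        q = [row[j] for row in neg]
        between = (mean(p) - mean(q)) ** 2   # separation of class means
        within = pvariance(p) + pvariance(q)  # spread inside each class
        scores.append(between / (within + eps))
    return scores


def select_templates(pos, neg, k):
    """Return the indices of the k highest-scoring templates, sorted."""
    scores = fisher_scores(pos, neg)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])
```

A template that fires on every object image and on no background image receives a very large score, while one that fires equally often on both receives a score of zero.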
The template assumes no a priori semantical or geometrical content, and can be conceptually applied to any object with distinctive shapes.
A significant difference between the method according to the invention and the techniques previously used is that the latter are all feature descriptors, while FEAT is more like a feature detector. The method uses a greedy search to construct object-specific FEATs during training. In detection, the process localizes those features purposefully instead of relying on a generic detector such as the Difference of Gaussians (DoG) or Harris corner detector. The feature can be combined with well-established local descriptors for further discrimination.
Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.