The following relates to the image classification arts, object classification arts, and so forth.
In image classification, the image is typically converted to a quantitative representation that embodies image characteristics of interest. This representation is then compared with corresponding quantitative representations of other images, and/or processed using a classifier trained to process the quantitative representation. Image classifiers have applications such as indexing of images, retrieval of images, or so forth.
The quantitative image representation is sometimes derived from so-called “patches”. In this approach, the image is divided into regions, or patches, a representative vector is constructed for each patch, and these results are concatenated or otherwise combined to generate the quantitative representation. The use of patches distributed across the image ensures that the quantitative representation is representative of the image as a whole while also including components representing smaller portions of the image. In some approaches, the patches may be of varying sizes and may overlap in order to provide components representing different objects (e.g., faces, animals, sky, or so forth) having different size scales and located at different, and possibly overlapping, places in the image.
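The patch-based approach can be illustrated with a minimal sketch. The sketch makes several assumptions that are not taken from the text above: the image is a small grayscale grid held as a list of rows, the patches form a non-overlapping grid of fixed size, and the per-patch vector is simply the mean and variance of the pixel intensities. The function names `patch_descriptor` and `image_representation` are hypothetical, and practical systems typically use richer low-level features.

```python
def patch_descriptor(patch):
    """Toy per-patch vector (an assumption): mean and variance of intensities."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, variance]


def image_representation(image, patch_size):
    """Concatenate per-patch vectors over a non-overlapping grid of patches."""
    height, width = len(image), len(image[0])
    representation = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            representation.extend(patch_descriptor(patch))
    return representation


# A 4x4 grayscale image split into four 2x2 patches.
image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [1, 3, 8, 8],
    [1, 3, 8, 8],
]
vector = image_representation(image, patch_size=2)
print(vector)  # → [0.0, 0.0, 4.0, 0.0, 2.0, 1.0, 8.0, 0.0]
```

Each pair of components in `vector` describes one patch, so the concatenated result covers the whole image while retaining components for its smaller portions. Overlapping or multi-scale patches would be obtained by varying the step and the patch size in the two loops.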
One approach that has been used is the “bag-of-visual-words” (BOV) concept, which is derived from text document classification schemes. In text document classification, a “bag of words” is suitably a vector or other representation whose components indicate occurrence frequencies (or counts) of words in the text document, and this bag of words serves as the text document representation for classification tasks. To extend this approach to image classification, a “visual” vocabulary is created. In some approaches, a visual vocabulary is obtained by clustering low-level features extracted from patches of training images, using, for instance, K-means. In other approaches, a probabilistic framework is employed: it is assumed that there exists an underlying generative model such as a Gaussian Mixture Model (GMM), and the visual vocabulary is estimated using, for instance, Expectation-Maximization (EM). In such “bag of visual words” representations, each “word” corresponds to a grouping of typical low-level features. Depending on the training images from which the vocabulary is derived, the visual words may, for instance, correspond to image features such as objects, or characteristic types of background, or so forth.
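The K-means variant of the BOV approach can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the low-level features are toy two-dimensional vectors, the clustering is a plain Lloyd's K-means seeded with the first k training features, and each feature is assigned to exactly one visual word (hard assignment). The names `kmeans`, `nearest`, and `bov_histogram` are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(feature, vocabulary):
    """Index of the visual word (centroid) closest to the feature."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(feature, vocabulary[k]))


def kmeans(features, k, iterations=10):
    """Plain Lloyd's K-means, seeded with the first k features (an assumption)."""
    centroids = [list(f) for f in features[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest(f, centroids)].append(f)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def bov_histogram(features, vocabulary):
    """Occurrence frequency of each visual word among an image's features."""
    counts = [0] * len(vocabulary)
    for f in features:
        counts[nearest(f, vocabulary)] += 1
    return [c / len(features) for c in counts]


# Toy low-level features pooled from patches of training images.
training = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0],
            [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
vocabulary = kmeans(training, k=2)

# Features extracted from the patches of a new image.
new_image = [[0.1, 0.1], [0.0, 0.3], [5.1, 5.0], [0.2, 0.2]]
histogram = bov_histogram(new_image, vocabulary)
print(histogram)  # → [0.75, 0.25]
```

The histogram plays the role that word frequencies play for a text document. In the probabilistic variant, the centroids would be replaced by the Gaussians of a GMM fitted with EM, and the hard assignment by posterior probabilities.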
The Fisher kernel (FK) is a generic framework that combines the benefits of generative and discriminative approaches. In the context of image classification, the FK has been used to extend the bag-of-visual-words (BOV) representation by going beyond count statistics. See, e.g., Perronnin et al., “Fisher kernels on visual vocabularies for image categorization” in CVPR (2007), which is incorporated herein by reference in its entirety.
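To give a sense of what “going beyond count statistics” can mean, the following sketch computes the gradient of the average log-likelihood of an image's features with respect to the means of a GMM vocabulary. It is a simplified illustration under several assumptions: the GMM has diagonal covariances and its parameters are given rather than estimated, only the gradient with respect to the means is computed, and the Fisher information normalization used in a full Fisher kernel is omitted. The function names are hypothetical and are not taken from the cited reference.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance at point x."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def posteriors(x, weights, means, variances):
    """Soft assignment of feature x to each Gaussian (visual word)."""
    joint = [w * gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def mean_gradient(features, weights, means, variances):
    """Gradient of the average log-likelihood w.r.t. every Gaussian mean,
    concatenated into one vector of length (number of Gaussians) x (dimension)."""
    dim = len(means[0])
    grad = [[0.0] * dim for _ in means]
    for x in features:
        gamma = posteriors(x, weights, means, variances)
        for k, (m, v) in enumerate(zip(means, variances)):
            for d in range(dim):
                grad[k][d] += gamma[k] * (x[d] - m[d]) / v[d]
    return [g / len(features) for row in grad for g in row]


# Assumed two-word visual vocabulary: a diagonal GMM in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

features = [[0.5, 0.0], [4.0, 3.5]]
gradient_vector = mean_gradient(features, weights, means, variances)
```

A BOV histogram for these two features would only record that each visual word was observed once. The gradient vector additionally encodes in which direction, and by how much, the features deviate from each word's mean, which is the kind of extra information a count-based representation discards.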
For further reference, the following U.S. patents and published U.S. patent applications are referenced, and each of the following patents/publications is incorporated herein by reference in its entirety: Perronnin, U.S. Pub. No. 2008/0069456 A1 published Mar. 20, 2008 and titled “Bags of visual context-dependent words for generic visual categorization”; Liu et al., U.S. Pub. No. 2009/0144033 A1 published Jun. 4, 2009 and titled “Object comparison, retrieval, and categorization methods and apparatuses”; Csurka et al., U.S. Pub. No. 2010/0040285 A1 published Feb. 18, 2010 and titled “System and method for object class localization and semantic class based image segmentation”; Perronnin et al., U.S. Pub. No. 2010/0092084 A1 published Apr. 15, 2010 and titled “Representing documents with runlength histograms”; Perronnin et al., U.S. Pub. No. 2010/0098343 A1 published Apr. 22, 2010 and titled “Modeling images as mixtures of image models”; Perronnin et al., U.S. Pub. No. 2010/0191743 A1 published Jul. 29, 2010 and titled “Contextual similarity measures for objects and retrieval, classification, and clustering using same”; de Campos et al., U.S. Pub. No. 2010/0189354 A1 titled “Modeling images as sets of weighted features”; Perronnin, U.S. Pat. No. 7,756,341 issued Jul. 13, 2010 and titled “Generic visual categorization method and system”; and Perronnin, U.S. Pat. No. 7,680,341 issued Mar. 16, 2010 and titled “Generic visual classification with gradient components-based dimensionality enhancement.”
The following sets forth improved methods and apparatuses.