The following relates to the image storage and processing arts. It is described with particular reference to classifying images based on the imaged subject matter or class. However, the following will find more general application in image classification, image content analysis, image archiving, image database management and searching, and so forth.
Widespread availability of digital cameras and other direct-digital imagers, and of optical scanners that convert film images, paper-printed images, or so forth into digital format, has led to generation of large numbers of digital images. Accordingly, there is interest in developing techniques for classifying images based on content, so as to facilitate image searches, image archiving, and like applications.
One approach that has been used is the “bag-of-words” concept derived from text document classification schemes. In text document bag-of-words classification schemes, clustering techniques are applied to group documents based on similarity in word usage. Such clustering techniques group together documents that share similar vocabularies as measured by word frequencies, word probabilities, or the like.
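As an illustrative sketch only, the word-frequency comparison underlying such text schemes might look as follows. The function names `word_frequencies` and `cosine_similarity` are hypothetical, and the use of cosine similarity as the similarity measure is an assumption made for this example; the schemes referred to above may use other measures.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each word in a text document."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Compare two word-frequency vectors (assumed similarity measure)."""
    dot = sum(value * freq_b.get(word, 0.0) for word, value in freq_a.items())
    norm_a = math.sqrt(sum(value * value for value in freq_a.values()))
    norm_b = math.sqrt(sum(value * value for value in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = word_frequencies("the camera stores the image")
doc_b = word_frequencies("the scanner stores the image")
doc_c = word_frequencies("grass grows on hills")

# Documents sharing vocabulary score higher than documents that share none.
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))
```

A clustering technique would then group documents whose frequency vectors are close under the chosen measure.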
Extension of bag-of-words approaches to image classification requires an analog to the word vocabulary. In some approaches, a visual vocabulary is obtained by clustering low-level features extracted from training images, using for instance K-means. In other approaches, a probabilistic framework is employed, and it is assumed that there exists an underlying generative model such as a Gaussian Mixture Model (GMM). In this case, the visual vocabulary is estimated using the Expectation-Maximization (EM) algorithm. In either case, each word corresponds to a grouping of typical low-level features. It is hoped that each visual word corresponds to a mid-level image feature such as a type of object (e.g., ball or sphere, rod or shaft, or so forth) or a characteristic background (e.g., starlit sky, blue sky, grass field, or so forth).
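By way of a minimal sketch of the K-means variant, and not of any particular existing implementation, a visual vocabulary might be built and applied as follows. The function names, the toy two-dimensional feature vectors, the fixed iteration count, and the random initialization are all assumptions made for illustration; real low-level features would be higher-dimensional descriptors extracted from training images.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(feature, vocabulary):
    """Index of the visual word (centroid) closest to a feature."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(feature, vocabulary[k]))


def build_vocabulary(features, num_words, iterations=20, seed=0):
    """Cluster low-level features with plain K-means; centroids are the words."""
    rng = random.Random(seed)
    vocabulary = [list(f) for f in rng.sample(features, num_words)]
    for _ in range(iterations):
        groups = [[] for _ in range(num_words)]
        for feature in features:
            groups[nearest_word(feature, vocabulary)].append(feature)
        for k, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                vocabulary[k] = [sum(col) / len(group) for col in zip(*group)]
    return vocabulary


def bag_of_words(features, vocabulary):
    """Histogram of visual-word occurrences for one image's features."""
    counts = [0] * len(vocabulary)
    for feature in features:
        counts[nearest_word(feature, vocabulary)] += 1
    return counts


# Toy training features (hypothetical values) forming two loose groups.
training = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
            [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
vocabulary = build_vocabulary(training, num_words=2)
print(bag_of_words([[0.1, 0.1], [5.0, 5.0], [5.2, 4.9]], vocabulary))
```

The resulting histogram records how often each visual word occurs in the image but, as in the text case, says nothing about where the words occur or what surrounds them.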
Existing bag-of-words image classification schemes typically do not account for context of the visual words. For example, the visual word corresponding to a generally round sphere may be recognized, but its context is not recognized (e.g., whether it is in a blue sky, suggesting that the sphere is the sun, or in a grass field, suggesting that the sphere is a game ball, or so forth). Moreover, while it is hoped that each visual word corresponds to a mid-level feature, it may in fact correspond to a lower level feature (e.g., a word may correspond to a curved edge of particular orientation, rather than to an object such as a sphere). Here again, recognition of context would be useful in making use of visual “words” that correspond to lower level features (e.g., words corresponding to several different curved edges, taken together by accounting for context, may be recognized as an assemblage representing a sphere).
Existing context-based visual classifiers have certain disadvantages. Typically, a set of contexts is identified as a kind of “context vocabulary”, in which each context is a geometrical arrangement or grouping of two or more visual words in an image. In some existing techniques, training images are analyzed to cluster contexts of words to define the context vocabulary, and the image classification entails identifying such clustered contexts in the image being classified. This approach works relatively well for well-structured objects such as bicycles, persons, and so forth. However, it does not work as well for more diffuse image components such as beach scenes, backgrounds, and so forth, because there is no single representative “context word” that represents such diffuse components well.
Moreover, accounting for context in image classification is typically computationally intensive. In a typical bag-of-words visual classification scheme, the vocabulary may contain hundreds or thousands of visual words, and the image is analyzed with respect to each of these words. Incorporating context typically increases computational complexity in a multiplicative manner: for example, if the dictionary contains N words, then the total number of potential two-word contexts is N×N, and the number grows approximately exponentially with the number of words in each context.
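The growth can be made concrete with a short calculation. The sketch below assumes that a context is an ordered tuple of words drawn from the dictionary, so that a k-word context has N to the power k possibilities; the function name and the vocabulary size of 1000 are hypothetical choices for illustration.

```python
def num_potential_contexts(vocabulary_size, words_per_context):
    """Count ordered k-word contexts over an N-word dictionary (N ** k)."""
    return vocabulary_size ** words_per_context


for k in (1, 2, 3):
    print(k, num_potential_contexts(1000, k))
# 1 1000
# 2 1000000
# 3 1000000000
```

Even with a modest dictionary, moving from single words to three-word contexts multiplies the number of candidates by a factor of one million.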