The following relates to the image storage and processing arts. It is described with particular reference to classifying images based on the imaged subject matter or class. However, the following will find more general application in image classification, image content analysis, image archiving, image database management and searching, and so forth.
Widespread availability of digital cameras and other direct-digital imagers, and of optical scanners that convert film images, paper-printed images, or so forth into digital format, has led to generation of large numbers of digital images. Accordingly, there is interest in developing techniques for classifying images based on content, so as to facilitate image searches, image archiving, and like applications.
Techniques exist for classifying textual documents based on content. For example, clustering techniques can be applied to group documents based on similarity in word usage. Such clustering techniques in effect group together documents that share similar vocabularies as measured by word frequencies, word probabilities, or the like. These clustering-based techniques have been extended to image clustering.
However, a difficulty arises in that images are not composed of “words” that readily form a vocabulary. To address this problem, it is known to define regions, sometimes called key patches, that contain features of interest. For example, if the imaging subjects are animals or people, the key patches may focus on facial aspects such as eyes, nose, and mouth, gross anatomical aspects such as hands, feet, paws, limbs, and torso regions, and so forth. Each key patch image region is analyzed to determine a feature vector or other feature-based representation, which quantifies features such as spatial frequency characteristics, average intensity, and so forth. This process is repeated for each image in a set of labeled training images to produce a set of feature vectors corresponding to the key patches. The feature vectors are clustered, and the feature vectors in each cluster are averaged or otherwise statistically combined to generate visual words of a visual vocabulary. An image classifier is then trained using the training images, such that the image classifier substantially accurately classifies the training images (respective to image class labels assigned to the training images) based on comparison of feature vectors extracted from key patches of the image with the visual vocabulary. The trained classifier is then usable to classify other input images which do not have pre-assigned class labels.
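By way of illustration only, the vocabulary-building step described above can be sketched as follows. The sketch assumes that key-patch feature vectors have already been extracted, that plain k-means is the clustering technique, and that averaging is the statistical combination; none of these choices is mandated by the foregoing description. The function names (`build_vocabulary`, `word_histogram`) and the two-component toy vectors are hypothetical, and classifier training is omitted.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centers):
    """Index of the center closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))

def build_vocabulary(vectors, n_words, n_iters=20, seed=0):
    """Cluster feature vectors with plain k-means (an assumed choice).

    The mean of each cluster serves as one visual word.
    """
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, n_words)]
    for _ in range(n_iters):
        clusters = [[] for _ in centers]
        for v in vectors:
            clusters[nearest(v, centers)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def word_histogram(vectors, vocabulary):
    """Count how many of an image's key-patch vectors map to each visual word."""
    counts = [0] * len(vocabulary)
    for v in vectors:
        counts[nearest(v, vocabulary)] += 1
    return counts

# Toy feature vectors (hypothetical values) pooled from training key patches.
training_vectors = [
    [0.10, 0.90], [0.12, 0.88], [0.09, 0.93],  # one kind of patch
    [0.80, 0.20], [0.82, 0.18], [0.79, 0.22],  # another kind of patch
]
vocabulary = build_vocabulary(training_vectors, n_words=2)

# Represent a new image by its histogram over the visual vocabulary.
image_vectors = [[0.11, 0.90], [0.81, 0.19], [0.80, 0.21]]
histogram = word_histogram(image_vectors, vocabulary)
```

A histogram of this kind plays the role that word counts play for a text document, and could be supplied to any conventional classifier as the representation of the image.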
Such image classification approaches advantageously leverage classification techniques developed for classifying text documents. However, computational scaling difficulties are encountered when classifying images. The skilled artisan recognizes that image processing is substantially more computationally intensive than textual processing. For example, identifying key patches in an input image involves performing pattern recognition of portions of the image, preferably including allowances for rotation, isotropic expansion or contraction, anisotropic expansion or contraction (such as image stretching), differences in overall intensity, and other variations that are typically observed from image to image. In contrast, the corresponding operation in text document classification is the word search, which is computationally straightforward.
The computational time for classifying an image typically scales approximately with the product C×N, where C is the number of classes and N is the number of visual words in the vocabulary. As the number of image classes (C) is increased, the size of the visual vocabulary (N) sufficient to accurately classify images typically also increases. In some image classifiers, for example, it has been found that a visual vocabulary of over one-thousand visual words is needed to provide an accuracy of above 60% in classifying images into one of fifteen classes. Because N generally increases with C, the computational time typically scales superlinearly with the number of classes (C). Thus, as the number of image classes increases, the increase in computational complexity can become prohibitive.
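One way in which the product C×N can arise is sketched below, assuming a classifier that scores each of the C classes with one weight per visual word. This linear per-class scoring, the function names, and the assumed linear growth of the vocabulary with the number of classes are hypothetical and serve only to make the scaling concrete. The value of 70 words per class is invented for illustration, chosen so that fifteen classes correspond to a vocabulary of just over one thousand visual words.

```python
def score_classes(histogram, class_weights):
    """Score one image against every class.

    Returns (scores, multiply_add_count); the count equals C * N.
    """
    ops = 0
    scores = []
    for weights in class_weights:                 # loop over C classes
        total = 0.0
        for count, w in zip(histogram, weights):  # loop over N visual words
            total += count * w
            ops += 1
        scores.append(total)
    return scores, ops

def cost(num_classes, words_per_class):
    """Multiply-adds per image if N grows linearly with C (an assumption)."""
    vocab_size = num_classes * words_per_class
    return num_classes * vocab_size

# Two classes, three visual words: 2 * 3 = 6 multiply-adds.
scores, ops = score_classes([1, 2, 0], [[1, 0, 0], [0, 1, 1]])

# Under the assumed linear growth, doubling C quadruples the cost.
cost_15 = cost(15, 70)
cost_30 = cost(30, 70)
```

Under these assumptions the cost grows as the square of the number of classes, which is one instance of the superlinear scaling noted above.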