The following relates to the image processing arts, image retrieval arts, image archiving arts, and related arts.
Automated tagging or classification of images is useful for diverse applications such as image archiving and image retrieval. In a typical approach, a number of key points or key patches are selected across the image. Each key patch is quantified by a feature vector whose elements quantitatively represent aspects of the key patch. These feature vectors are then used as inputs to a trained classifier that outputs a class label (or a vector of class label probabilities, in the case of a soft classifier) for the image.
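The typical approach described above can be illustrated with a minimal sketch. It is not drawn from any particular system: the grid sampling of patches, the two-element feature vector (mean and variance of intensity), the averaging of patch features into a single image-level descriptor, and the linear softmax classifier with its `weights` and `biases` are all assumptions chosen for brevity.

```python
import math

def extract_patches(image, patch_size, stride):
    """Yield square patches (lists of pixel rows) sampled on a regular grid."""
    height, width = len(image), len(image[0])
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            yield [row[left:left + patch_size]
                   for row in image[top:top + patch_size]]

def patch_features(patch):
    """Hypothetical feature vector: [mean intensity, intensity variance]."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, variance]

def soft_classify(feature_vectors, weights, biases):
    """Average the patch features (an assumed pooling step), then apply a
    linear softmax classifier to obtain class label probabilities."""
    count = len(feature_vectors)
    dim = len(feature_vectors[0])
    pooled = [sum(fv[i] for fv in feature_vectors) / count for i in range(dim)]
    scores = [sum(w * x for w, x in zip(class_weights, pooled)) + bias
              for class_weights, bias in zip(weights, biases)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch the whole image receives one probability vector, for example `soft_classify([patch_features(p) for p in extract_patches(image, 2, 2)], weights, biases)`, where `weights` and `biases` stand in for parameters that would be learned during classifier training.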
A problem with this global (whole-image) approach is that it is not well-suited to tagging images containing multiple subjects. For example, an image showing a person may be accurately tagged with the “person” class, while an image showing a dog may be accurately tagged with the “dog” class, but a single image showing a person walking a dog is less likely to be accurately classified.
A known approach for such problems is to segment the image into smaller regions, and to classify the regions separately. Since the size of the subject or subjects shown in the image is not known, the segmentation into regions may employ different scales of region size. Since the different region sizes have different numbers of pixels, it is also known to use different classifiers for the different region sizes, for example an image-scale classifier, a patch-scale classifier (operative for an image region containing a single patch), and additional “mid-scale” classifiers for various intermediate region sizes.
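The multi-scale segmentation described above can be sketched as follows. This is an illustrative sketch only: the non-overlapping square tiling, the function names, and the representation of the scale-specific classifiers as a mapping from region size to a callable are assumptions, and the image is assumed to be square so that the largest region size covers the whole image.

```python
def regions_at_scale(width, height, region_size):
    """Tile the image with non-overlapping square regions.

    Returns a list of (left, top, size) boxes in row-major order.
    """
    return [(left, top, region_size)
            for top in range(0, height - region_size + 1, region_size)
            for left in range(0, width - region_size + 1, region_size)]

def classify_multiscale(image, classifiers):
    """Classify every region at every scale with that scale's own classifier.

    `classifiers` is a hypothetical mapping: region size -> callable that
    takes the region's pixel rows and returns a class label.
    Results are ordered from the largest scale to the smallest.
    """
    height, width = len(image), len(image[0])
    results = []
    for size, classifier in sorted(classifiers.items(), reverse=True):
        for left, top, _ in regions_at_scale(width, height, size):
            region = [row[left:left + size]
                      for row in image[top:top + size]]
            results.append(((left, top, size), classifier(region)))
    return results
```

Each region is labeled independently of every other region in this sketch, so nothing relates the label of a large region to the labels of the smaller regions it contains.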
Such segmentation approaches have numerous deficiencies. First, there is no basis for knowing a priori which region size is best for classifying a given subject in an image. For example, in the case of an image of a person walking a dog, one might suspect that the optimal region size for classifying the dog is the region size that just encompasses the dog in the image. If, however, the dog's snout turns out to be the most “characteristic” feature of the dog (for example, possibly the feature that best distinguishes images of dogs from images of cats), then the optimal region size might instead be the region size that just encompasses the dog's snout.
Moreover, some correlations between the classifications of overlapping or contained regions of different scales might be expected. For example, the image region encompassing the dog may be (correctly) classified as “dog” while the smaller-scale regions that make up that region might be (erroneously) classified as something other than “dog”. In some cases, this pattern of a larger region classifying as “dog” while its constituent smaller regions do not may itself be characteristic of images of dogs. Existing image region classifiers do not provide any principled way to identify and utilize such correlations between image regions of different scales.