The following relates to the object organization, retrieval, and storage arts. It particularly relates to image labeling, for predicting relevant terms from a given annotation vocabulary for an image.
For a variety of applications it is desirable to be able to classify an image based on its visual content. In some cases, the images are labeled manually. For example, on photo-sharing websites, viewers or authors of the images assign their own labels based on personal perception of the image content. In other cases, fully automatic systems are used where image labels are automatically predicted without any user interaction.
Most work on image annotation, object category recognition, and image categorization has focused on methods that deal with one label or object category at a time. The image can then be annotated with one or more labels corresponding to the most probable class(es). The function that scores images for a given label is obtained by means of various machine learning algorithms, such as binary support vector machine (SVM) classifiers using different linear or non-linear kernels (J. Zhang, et al., “Local features and kernels for classification of texture and object categories: a comprehensive study,” IJCV, 73(2):213-238, 2007), nearest neighbor classifiers (M. Guillaumin, et al., “Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation,” ICCV, 2009), and ranking models trained for retrieval or annotation (D. Grangier, et al., “A discriminative kernel-based model to rank images from text queries,” PAMI, 30(8):1371-1384, 2008; J. Weston, et al., “Large scale image annotation: Learning to rank with joint word-image embeddings,” ECML, 2010).
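By way of illustration only, the one-label-at-a-time approach described above can be sketched as follows. The sketch assumes a simple linear scorer with a sigmoid output for each label; the function names, the toy vocabulary, and the hand-set weights are hypothetical and are not taken from any of the cited works, which use SVMs, nearest neighbor models, or ranking models instead.

```python
import numpy as np

def score_labels(features, weights, biases):
    """Score one image against every label with independent linear scorers.

    features: (d,) image feature vector
    weights:  (num_labels, d), one row per label's binary classifier
    biases:   (num_labels,)
    Each label is scored in isolation; no label influences another.
    """
    logits = weights @ features + biases
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid applied per label

def annotate(features, weights, biases, vocabulary, top_k=2):
    """Return the top_k (label, score) pairs with the highest scores."""
    probs = score_labels(features, weights, biases)
    order = np.argsort(-probs)[:top_k]  # indices sorted by descending score
    return [(vocabulary[i], float(probs[i])) for i in order]

# Hypothetical toy data: weights are set by hand, not learned.
vocabulary = ["boat", "water", "car"]
weights = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [-1.0, 1.0]])
biases = np.zeros(3)
features = np.array([2.0, 0.0])

labels = annotate(features, weights, biases, vocabulary)
print(labels)
```

Because each row of `weights` is applied separately, the score for one label is unaffected by the scores of the others, which is the property of such per-label predictors that matters here.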
A problem arises in classification when dealing with many classes, for example, when the aim is to assign a single label to an image from many possible ones, or when predicting the probability distribution over all labels for an image. Although there are correlations in the binary classifier outputs, since the independent predictors use the same input images for prediction, the dependencies among the labels are generally not modeled explicitly.
For example, in object class recognition, the presence of one class may suppress (or promote) the presence of another class that is negatively (or positively) correlated. In one study, the goal was to label the regions in a pre-segmented image with category labels (A. Rabinovich, et al., “Objects in context,” ICCV, 2007). In that study, a fully-connected conditional random field model over the regions was used. In another study, contextual modeling was used to filter the windows reported by object detectors for several categories (C. Desai, et al., “Discriminative models for multi-class object layout,” ICCV, 2009). The contextual model of Desai includes terms for each pair of object windows that suppress or favor spatial arrangements of the detections (e.g., boat above water is favored, but cow next to car is suppressed).
However, none of the above methods explicitly takes into account the dependencies among the image labels.