The exemplary embodiment relates to learning classifiers and finds particular application in the classification of samples, such as images, in a manner which allows new data (samples or classes) to be added at low cost.
There has been a substantial increase recently in the number of digital items that are available, such as single images and videos. These exist, for example, in broadcasting archives and social media sharing websites. Only a small fraction of these items is consistently annotated with labels which represent the content of the item, such as the objects which are recognizable within an image. Accordingly, scalable methods are desired for annotation and retrieval to enable efficient access to this large volume of data. One dataset (see, Deng, et al., “ImageNet: A large-scale hierarchical image database,” in CVPR (2009)), which contains more than 14 million images manually labeled according to 22,000 classes, has provided a valuable benchmark tool for evaluating large-scale image classification and annotation methods.
In large-scale image annotation, for example, the goal is to automatically assign a set of relevant labels to an image, such as names of objects appearing in the image, from a predefined set of labels. The general approach is to treat the assignment as a classification problem, where each label may be associated with a respective classifier which outputs a probability for the class label, given a representation of the image, such as a multidimensional vector. To ensure scalability, linear classifiers such as linear support vector machines (SVMs) are often used, sometimes in combination with dimension reduction techniques which reduce the dimensionality of the input multidimensional vector, to speed up the classification. Systems have been developed which are able to label images with labels corresponding to 10,000 or more classes (see, for example, Deng, J., et al., “What does classifying more than 10,000 image categories tell us?” in ECCV (2010), hereinafter, “Deng 2010”; Weston, J., et al., “Scaling up to large vocabulary image annotation,” in IJCAI (2011), hereinafter, “Weston”; and Sánchez, J., et al., “High-dimensional signature compression for large-scale image classification,” in CVPR (2011)).
A drawback of these methods, however, is that when images of new categories (classes) become available, new classifiers have to be trained at a relatively high computational cost. Many real-life large-scale datasets are open-ended and dynamic. This means that new potential classes appear over time, and new photos and videos continuously become available which are to be added to existing or new classes.
One method which has been adapted to large-scale classification is referred to as k-nearest neighbor (k-NN) classification. In this approach, each image in a database is represented by a multidimensional feature vector and labeled with one (or more) of a set of classes. When a new image to be labeled is presented, a representation of that image is computed. The image representation is compared with the representations of the images in the database using a suitable distance measure, to identify the nearest images, i.e., the k-NN, where k can be a suitable number such as 1, 5, or 10. The labels of the retrieved images are used to assign a class label (or probabilistic assignment of labels) to the new image. This highly non-linear and non-parametric classifier has shown good performance for image annotation, when compared with SVMs (see, Deng 2010; Weston; and Guillaumin, M., et al., “TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation,” in ICCV (2009)).
One disadvantage of the k-NN method is that the search for nearest neighbors for classification of the new image is computationally demanding for large and high-dimensional datasets. Each time a new image is received, its representation has to be compared with all the image representations in the database. While methods may be employed which limit the search to only a subset of the images, this tends to reduce the performance of the method.
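By way of illustration only, the k-NN procedure described above may be sketched as follows. This is a minimal sketch, not an implementation taken from the cited references: the function names `euclidean` and `knn_classify`, the use of Euclidean distance, the majority vote, and the toy two-dimensional feature vectors are all assumptions made for the example. The loop over the entire database for every query reflects the computational cost noted above.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, database, k=5):
    """Label `query` by majority vote among its k nearest database entries.

    `database` is a list of (feature_vector, label) pairs (hypothetical layout).
    Every stored vector is compared with the query, so the cost of a single
    classification grows with the number of stored samples.
    """
    neighbors = sorted(database, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy database with made-up 2-D feature vectors (illustrative values only).
database = [
    ([0.0, 0.1], "cat"),
    ([0.2, 0.0], "cat"),
    ([0.1, 0.3], "cat"),
    ([5.0, 5.1], "dog"),
    ([5.2, 4.9], "dog"),
]

label_near_cats = knn_classify([0.1, 0.1], database, k=3)  # "cat"
label_near_dogs = knn_classify([5.0, 5.0], database, k=3)  # "dog"
```

Adding a new sample or class in this sketch only requires appending to `database`, but each subsequent query must then scan the enlarged list.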
Another approach for addressing the classification of evolving datasets is the Nearest Class Mean (NCM) classifier. In this approach, each class is represented by its mean feature vector, i.e., the mean of all the feature vectors of the images in the database that are labeled with that class (see, Webb, A., “Statistical Pattern Recognition,” Wiley (2002); Veenman, C., et al., “LESS: a model-based classifier for sparse subspaces,” IEEE Trans. PAMI 27, pp. 1496-1500 (2005); and Zhou, X., et al., “Sift-bag kernel for video event analysis,” in ACM Multimedia (2008)). When a new image is to be labeled, its own representative feature vector is compared with the mean feature vectors of each of the classes using a suitable distance measure. The label or labels assigned to the image are based on the computed distances. The cost of computing the mean for each class is low, with respect to the cost of feature extraction, and this operation does not require accessing images of other classes. In contrast to the k-NN classifier, the NCM classifier is a linear classifier, which leads to efficient classification.
One disadvantage of this method is that the complete distribution of the training data of a class is characterized only by its mean. In practice, the performance of such classifiers on large datasets tends to be low.
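By way of illustration only, the NCM classifier described above may be sketched as follows. This is a minimal sketch under stated assumptions: the class name `NearestClassMean`, its methods, the use of Euclidean distance, and the toy feature vectors are hypothetical and are not taken from the cited references. Each class is summarized by a running sum and a count, so adding a sample or a new class touches only that class's statistics.

```python
import math


class NearestClassMean:
    """Illustrative NCM classifier: one mean feature vector per class."""

    def __init__(self):
        self.sums = {}    # label -> element-wise sum of feature vectors
        self.counts = {}  # label -> number of samples seen for that label

    def add_sample(self, features, label):
        """Update only the statistics of `label`; other classes are untouched."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + x for s, x in zip(self.sums[label], features)]
        self.counts[label] += 1

    def mean(self, label):
        """Mean feature vector of all samples added for `label`."""
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def classify(self, features):
        """Return the label whose class mean is nearest (Euclidean distance)."""
        return min(self.sums, key=lambda lab: math.dist(features, self.mean(lab)))


# Toy data with made-up 2-D feature vectors (illustrative values only).
ncm = NearestClassMean()
for vec, lab in [([0.0, 0.0], "cat"), ([2.0, 2.0], "cat"),
                 ([5.0, 5.0], "dog"), ([7.0, 7.0], "dog")]:
    ncm.add_sample(vec, lab)

first_label = ncm.classify([1.5, 0.5])   # "cat": nearest mean is [1.0, 1.0]

# A new class is added at low cost: no existing class mean is recomputed.
ncm.add_sample([10.0, 0.0], "bird")
second_label = ncm.classify([9.0, 1.0])  # "bird"
```

In this sketch a query is compared with one vector per class rather than with every stored sample, which is why classification is cheap. It also shows the limitation noted above: everything known about a class is reduced to a single mean vector.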
Aspects of the exemplary method provide a system and method of learning and applying a classifier for labeling images and other samples, which is well suited to large and evolving datasets while being computationally efficient.