Automatic image annotation is a focal problem in image processing and computer vision. Annotation systems can be developed using generative modeling [2], support vector machines, visual templates, latent space models, and more recently through joint word-image embedding and kernel learning. Most techniques depend on pre-selected training images, which require many hours of manual effort to collect.
In recent years, easy access to loosely labeled Web images has greatly simplified training data selection. Search engines retrieve potential training examples by comparing concept names with image labels (user-assigned tags or surrounding-text keywords). In this context, a concept is illustrated by all images labeled with the concept name and an image with multiple labels exemplifies co-occurring concepts. The retrieved images could be directly used to train annotation systems, except that they are often irrelevant from a machine learning perspective. FIG. 1 shows noisy images associated with the concept castle. As many as 85% of Web images can be incorrectly labeled. Even user-assigned tags are highly subjective and about 50% have no relation to visual content. Tags appear in no particular order of relevance and the most relevant tag occurs in top position in less than 10% of the images. Consequently, several strategies have been proposed to refine retrieved collections.
ImageNet is a crowd-sourcing initiative to manually validate retrieved images. This process results in few errors, but takes years to gather sufficient data for a large concept vocabulary. Algorithmic training data selection provides a necessary trade-off between efficient automation and selection accuracy, wherein potentially noisy examples are filtered using statistical learning techniques. Noise mitigation may be posed as a classification problem where a support vector machine (SVM) is trained to distinguish images tagged with a specific concept from those not tagged with that concept. Alternately, a relevance ranking problem can be formulated where images are ranked in the order of SVM classification margin or other statistical measures. For example, unsupervised clustering is useful to learn a concept-specific statistical distribution of data and rank images in the order of the chosen cluster measure (mixture likelihood or distance from the nearest prototype). Top-ranked images can be used to train annotation systems and low-ranked images are discarded as noise.
The problem of automatic training data selection is similar to statistical outlier rejection, which works on the general assumption that outliers are sparse and distinguishable from the 'normal' data represented by a statistical reference model. The high level of noise associated with user-tagged images grossly violates this assumption.
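The classical assumption can be sketched as follows: fit a reference model (here, simply the sample mean of hypothetical 2-D image features) to all data and score each point by its distance from that reference. The feature values and noise proportions below are synthetic illustrations, not data from the experiments described in this document. When outliers are sparse, they receive the highest scores; when noise is heavy, as in the Flickr example above, the reference statistics themselves are corrupted and this scheme breaks down.

```python
import numpy as np

def outlier_scores(X):
    """Score each row of X by its Euclidean distance from the sample mean.

    Classical outlier rejection assumes most samples are 'normal', so
    statistics estimated from all the data (here, the mean) form a valid
    reference model and outliers land far from it.
    """
    mu = X.mean(axis=0)
    return np.linalg.norm(X - mu, axis=1)

rng = np.random.default_rng(0)
relevant = rng.normal(0.0, 1.0, size=(95, 2))  # 'normal' data
outliers = rng.normal(8.0, 1.0, size=(5, 2))   # sparse, separable noise (5%)
X = np.vstack([relevant, outliers])

# With only 5% noise the mean stays near the relevant data, so the five
# outliers receive the largest scores and can be rejected by thresholding.
scores = outlier_scores(X)
```

At 34% noise, by contrast, the estimated mean drifts toward the noise and the score no longer separates the two populations, which is precisely the violation noted above.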
To illustrate this problem, we created a simplified two-dimensional visualization of 647 Flickr images tagged with a specific concept. FIG. 2 shows several training data selection scenarios using heat-maps, where the color of each point can be mapped to a numeric relevance score using the associated color scale. FIG. 2A depicts the selection of all user-tagged images assuming reliability of tags, an assumption that completely breaks down when compared with the manual relevance assessment in FIG. 2B. In this particular example, nearly 34% of images are noisy, highlighting the fact that noise need not be sparse or separable.¹

¹ The outlier inseparability presents an interesting perspective for manual training data selection. Even if manual selection filters out all noisy images, subsequent statistical image annotation algorithms may continue to mistake similar images for relevant examples, especially in the high-density region of feature space: a classic outcome of the semantic gap.
Support vector machines and K-Means clustering do not specifically account for noise in statistical reference learning. To apply classification-based SVM selection, an additional collection of images not tagged with the target concept is gathered as the negative class. For the SVM classifier to be effective, the chosen negative examples must match the noisy positive examples; otherwise the classifier may overfit the noise. FIG. 2C shows the SVM scores based on classification margin.
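The margin-based selection described above can be sketched with a linear SVM on synthetic 2-D features; the feature geometry and the 20% noise rate are assumptions for illustration only. Tagged images form the positive class, untagged images the negative class, and the signed distance to the decision boundary serves as the relevance score.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# Hypothetical 2-D features: 100 images tagged with the target concept
# (80 relevant, 20 noisy) versus 100 images without the tag.
pos = np.vstack([rng.normal(2.0, 1.0, size=(80, 2)),    # relevant positives
                 rng.normal(-2.0, 1.0, size=(20, 2))])  # noisy positives
neg = rng.normal(-2.0, 1.0, size=(100, 2))
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)

clf = LinearSVC(C=1.0, random_state=0).fit(X, y)

# Signed margin of each tagged image; high-margin positives are kept as
# training data and the rest are discarded as noise.
margin = clf.decision_function(pos)
keep = np.argsort(margin)[::-1][:50]  # indices of the top-50 tagged images
```

Note that this sketch works only because the noisy positives happen to resemble the negative class; if the negatives fail to match the noise distribution, as cautioned above, the classifier may instead overfit the noise.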
Given its computational efficiency and simple implementation, K-Means is commonly used to select training examples based on the proximity of an image to the nearest cluster prototype. FIG. 2D shows the output of the K-Means algorithm seeded with 20 clusters using K-Center initialization, where even the noisy examples receive a high score due to outlying clusters. A robust ranking cannot be guaranteed because of the algorithm's sensitivity to outliers and initialization conditions.
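The failure mode described above can be reproduced in a small sketch. The synthetic data, cluster count, and use of k-means++ seeding (rather than the K-Center initialization mentioned above) are assumptions for illustration: a tight cluster of noisy images captures its own prototype, so its members sit close to a centroid and receive relevance scores as high as genuinely relevant images.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical 2-D features: 180 relevant images plus a tight,
# outlying cluster of 20 noisy images.
relevant = rng.normal(0.0, 1.0, size=(180, 2))
noise = rng.normal(10.0, 0.3, size=(20, 2))
X = np.vstack([relevant, noise])

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Relevance score: negative distance to the nearest cluster prototype.
dist = km.transform(X).min(axis=1)
score = -dist
# The noise cluster claims a centroid of its own, so the noisy images
# end up with small distances (high scores) and are not filtered out.
```

Ranking by `score` therefore retains the outlying cluster, illustrating why proximity to a prototype alone cannot guarantee a robust selection.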