Image annotation generally involves assigning textual tags to visual images so that the content and style of the images may be described by the tags. It has important applications in image organization (e.g., image retrieval or image summarization) and understanding images (e.g., scene categorization, or object detection or recognition).
However, automatic image annotation is generally a challenging problem. In contrast to object recognition (e.g., recognizing objects depicted in images), the label space for image annotations may be much larger, and the distribution of tags may be highly unbalanced. Therefore, traditional approaches that involve training a classifier for each tag (e.g., label) are neither efficient nor effective. In addition, there may be a huge semantic gap between visual images and textual tags. Images with distinctly different appearances may correspond to the same tag, while visually similar images may be annotated with different tags. It is therefore not easy to model a mapping from visual representations to textual representations using a compact parametric model, especially when the images to be tagged depict diversified content. Moreover, new tags are continually being invented (e.g., from social media), and the semantic meaning of existing tags may evolve with time. How to incorporate new tags and new meanings with lowest cost is a practical consideration for an image annotation system.
Recently, image databases of unprecedentedly large scale have become available with the growth of social networks and photo sharing web sites. By exploiting rich collections of images and corresponding tags (e.g., labels) in these large databases, people have applied various non-parametric methods to infer knowledge about new images from similar ones in the database. Provided that there is a sufficiently large number of previously tagged images in the database, it is very likely there exists at least some images which are very similar to a given test image and have the same or very similar tags. In this way, the tags (e.g., label information) of the test image may be directly transferred from those similar images without using any parametric mapping functions. In addition, such non-parametric approaches may utilize very little image classifier training, which may greatly facilitate incorporation of new images and new tags. Such non-parametric approaches include scene parsing, image classification, and image annotation.
However, non-parametric methods typically achieve good performance only with a large database of tagged images. Consequently, the computational cost of matching a test image with all images in the database is very high. Also, large scale data sets constructed from web data usually contain noisy label information (e.g., from the presence of inaccurate tags) and have an unbalanced distribution (e.g., of tags), which further limits the performance of non-parametric methods.