With the advent of digital imagery, the number of digital images has been growing rapidly, and there is an increasing need to index and search these images effectively. Systems using non-textual (image) queries have been proposed, but many users find it hard to express their queries using abstract image features. Most users prefer textual queries, i.e., keyword-based image search, which is typically achieved by manually annotating images and allowing searches over these annotations with a textual query. However, manual annotation is expensive and tedious, making automatic image annotation necessary for efficient image retrieval.
Image annotation has been an active research topic in recent years due to its potential impact on both image understanding and web image retrieval. Existing relevance-model-based methods perform image annotation by maximizing the joint probability of images and words, which is calculated as an expectation projected over training images. However, the semantic gap and the dependence on training data restrict their performance and scalability.
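The joint-probability criterion described above is commonly written as a sum over the training set. The following is a sketch under assumed notation that does not appear in the text: I is the image to annotate, w is a candidate keyword from a vocabulary V, J ranges over the training set T, and w and I are assumed to be conditionally independent given J.

```latex
\[
w^{*} = \arg\max_{w \in V} P(w, I),
\qquad
P(w, I) = \sum_{J \in \mathcal{T}} P(J)\, P(w \mid J)\, P(I \mid J)
\]
```

Under this reading, every term depends on the training images J, which is why the quality and coverage of the training set bound the performance of such methods.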
Many algorithms have been proposed for automatic image annotation. In the most straightforward approach, each semantic keyword or concept is treated as an independent class and is assigned its own classifier. Methods such as linguistic indexing of pictures and image annotation using a support vector machine (SVM) or a Bayes point machine fall into this category. Other methods try to learn a relevance model associating images and keywords. Early work applied a machine translation model to translate a set of blob tokens (obtained by clustering image regions) into a set of keywords.
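The one-classifier-per-keyword approach can be sketched as follows. This is a minimal illustration, not any of the cited methods: a nearest-centroid rule stands in for the SVM or Bayes point machine, and the function names, two-dimensional feature vectors, and toy labels are all hypothetical.

```python
from math import dist


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def train_keyword_classifiers(features, labels):
    """Fit one independent binary classifier per keyword.

    Each classifier is a (positive centroid, negative centroid) pair.
    This nearest-centroid rule is an assumed stand-in for an SVM.
    """
    vocabulary = sorted({word for words in labels for word in words})
    classifiers = {}
    for word in vocabulary:
        pos = [f for f, ws in zip(features, labels) if word in ws]
        neg = [f for f, ws in zip(features, labels) if word not in ws]
        if not pos or not neg:
            continue  # a binary classifier needs examples of both classes
        classifiers[word] = (centroid(pos), centroid(neg))
    return classifiers


def annotate(classifiers, feature):
    """Return every keyword whose classifier accepts the feature vector."""
    return sorted(
        word
        for word, (pos, neg) in classifiers.items()
        if dist(feature, pos) < dist(feature, neg)
    )


# Hypothetical toy data: 2-D features and hand-made keyword sets.
features = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels = [{"sky"}, {"sky", "sea"}, {"grass"}, {"grass"}]
classifiers = train_keyword_classifiers(features, labels)
print(annotate(classifiers, (0.1, 0.95)))
```

Because each keyword is decided in isolation, the sketch also shows why this family of methods ignores correlations between words.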
Other work introduced the Cross-Media Relevance Model (CMRM), which uses the keywords shared by similar images to annotate new images. The CMRM was subsequently improved by the continuous-space relevance model (CRM) and the multiple Bernoulli relevance model (MBRM). More recently, some efforts have considered word correlation in the annotation process, such as the Coherent Language Model (CLM), Correlated Label Propagation (CLP), and the WordNet-based method (WNM).
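To illustrate how a relevance model of this kind transfers keywords from similar training images, the sketch below scores each vocabulary word by summing, over the training images, the product of a word probability and a blob probability. It is a simplified illustration in the spirit of CMRM rather than the published model: the uniform prior over training images, the interpolation weights `alpha` and `beta`, the normalization by per-image token counts, and the toy data are all assumptions.

```python
from collections import Counter


def relevance_scores(training, query_blobs, alpha=0.1, beta=0.1):
    """Estimate a joint score for every vocabulary word and the query blobs.

    training: list of (words, blobs) pairs, each a non-empty list of tokens.
    alpha, beta: assumed smoothing weights toward collection-wide frequencies.
    """
    word_totals = Counter(w for words, _ in training for w in words)
    blob_totals = Counter(b for _, blobs in training for b in blobs)
    n_words = sum(word_totals.values())
    n_blobs = sum(blob_totals.values())
    prior = 1.0 / len(training)  # uniform prior over training images (assumed)

    scores = dict.fromkeys(word_totals, 0.0)
    for words, blobs in training:
        word_counts, blob_counts = Counter(words), Counter(blobs)
        # Probability of the query blobs given this training image:
        # a product of smoothed per-blob relative frequencies.
        p_blobs = 1.0
        for blob in query_blobs:
            p_blobs *= ((1 - beta) * blob_counts[blob] / len(blobs)
                        + beta * blob_totals[blob] / n_blobs)
        for word in scores:
            p_word = ((1 - alpha) * word_counts[word] / len(words)
                      + alpha * word_totals[word] / n_words)
            scores[word] += prior * p_word * p_blobs
    return scores


# Hypothetical toy training set: keywords paired with blob tokens.
training = [
    (["tiger", "grass"], ["b1", "b2"]),
    (["sky", "plane"], ["b3", "b4"]),
    (["tiger", "forest"], ["b1", "b5"]),
]
scores = relevance_scores(training, ["b1", "b2"])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

Training images that share blobs with the query dominate the sum, so their keywords rise to the top of the ranking.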
All of the methods discussed above suffer from two problems. One is their dependence on a training dataset to learn the models. In practice, it is very difficult to obtain a well-annotated set, which severely limits their scalability. The other is the well-known semantic gap. With traditional simple associations between images (visual content features) and words, degradation of annotation performance is unavoidable.
The growth of the web has produced a huge repository of almost all kinds of data and offers solutions to many problems that once seemed unsolvable. In recent years, some researchers have begun to leverage web-scale data for image annotation. An example of such work was proposed by Wang et al. (Wang, X., Zhang, L., Jing, F., Ma, W. Y. AnnoSearch: Image Auto-Annotation by Search. International Conference on Computer Vision and Pattern Recognition, New York, USA, June, 2006.) In that work, the text-based image search requires at least one accurate keyword in order to find a set of semantically similar images. A content-based image search is then performed on the obtained image set to retrieve visually similar images. Annotations are then mined from the text descriptions (titles, URLs and surrounding text) of the retrieved images, which are similar in both semantics and visual content. However, requiring an accurate initial keyword for each image is cumbersome in practice. Moreover, the method needs to perform the content-based search on a well-built image set, which is not easy to construct and not readily accessible. Additionally, there is currently no ideal content-based search engine available. The method proposed in that article is thus of only limited use for web image annotation.
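The three stages described above (keyword search, content-based search, annotation mining) can be sketched as follows. This is an illustrative outline only, not the implementation from the cited article: Euclidean distance over toy feature vectors stands in for the content-based search, simple term counting stands in for the mining step, and the function name, corpus layout, and stopword list are hypothetical.

```python
from collections import Counter
from math import dist

# Assumed minimal stopword list for the toy descriptions below.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "and", "with"}


def annotate_by_search(query_feature, keyword, corpus, k=3, n_terms=3):
    """Annotate a query image from a corpus of described images.

    corpus: list of dicts with "text" (description) and "feature" (vector).
    """
    keyword = keyword.lower()

    # Stage 1: text-based search with the initial keyword.
    semantic = [item for item in corpus
                if keyword in item["text"].lower().split()]

    # Stage 2: content-based search within the text results;
    # keep the k visually nearest images.
    visual = sorted(
        semantic, key=lambda item: dist(query_feature, item["feature"])
    )[:k]

    # Stage 3: mine candidate annotations from the descriptions,
    # counting each term once per retrieved image.
    counts = Counter()
    for item in visual:
        terms = set(item["text"].lower().split()) - STOPWORDS - {keyword}
        counts.update(terms)

    # Sort by frequency, then alphabetically, for a deterministic result.
    return sorted(counts, key=lambda term: (-counts[term], term))[:n_terms]


# Hypothetical toy corpus of described web images.
corpus = [
    {"text": "tiger in the grass", "feature": (0.9, 0.1)},
    {"text": "tiger resting on grass", "feature": (0.8, 0.2)},
    {"text": "tiger woods at golf", "feature": (0.1, 0.9)},
    {"text": "sunset on the beach", "feature": (0.85, 0.15)},
]
annotations = annotate_by_search((0.85, 0.15), "tiger", corpus, k=2, n_terms=2)
print(annotations)
```

The sketch makes the stated limitations concrete: stage 1 returns nothing useful without an accurate `keyword`, and stage 2 presupposes a corpus with visual features already extracted.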
For the foregoing reasons, further improvement on methods for image annotation is desirable, particularly in the Web search-related context.