Image archives on the Internet are growing at a phenomenal rate. With digital cameras becoming increasingly affordable and the widespread use of home computers possessing hundreds of gigabytes of storage, individuals nowadays can easily build sizable personal digital photo collections. Photo sharing through the Internet has become a common practice. According to reports released recently, an Internet photo-sharing startup, flickr.com, has several million registered users and hosts several hundred million photos, with new photos on the order of one million being added on a daily basis. More specialized online photo-sharing communities, such as photo.net and airliners.net, also have databases on the order of millions of images entirely contributed by the users.
The Problem
Image search provided by major search engines such as Google, MSN, and Yahoo! relies on textual descriptions of images found on the Web pages containing the images and the file names of the images. These search engines do not analyze the pixel content of images and hence cannot be used to search unannotated image collections. Fully computerized or computer-assisted annotation of images by words is a crucial technology to ensure the “visibility” of images on the Internet, due to the complex and fragmented nature of the networked communities.
Example pictures from the Website flickr.com. User-supplied tags: (a) ‘dahlia’, ‘golden’, ‘gate’, ‘park’, ‘flower’, and ‘fog’; (b) ‘cameraphone’, ‘animal’, ‘dog’, and ‘tyson’.
Although owners of digital images can be requested to provide some descriptive words when depositing the images, the annotation tends to be highly subjective. Take, for example, the pictures shown in FIG. 1. The users on flickr.com annotated the first picture with the tags ‘dahlia’, ‘golden’, ‘gate’, ‘park’, ‘flower’, and ‘fog’ and the second picture with ‘cameraphone’, ‘animal’, ‘dog’, and ‘tyson’. While the first picture was taken at the Golden Gate Park near San Francisco according to the photographer, this set of annotation words can be a problem because the picture may show up when other users are searching for images of gates. Likewise, the second picture may show up when users search for photos of various camera phones.
A computerized system that accurately suggests annotation tags to users can be very useful. If a user is too busy, he or she can simply check off those relevant words and type in other words. The system can also allow trained personnel to check the words with the image content at the time a text-based query is processed. However, automatic annotation or tagging of images with a large number of concepts is extremely challenging, a major reason that real-world applications have not appeared.
Human beings draw on a great deal of background knowledge when interpreting an image. With the endowed capability of imagination, we can often see what is not captured in the image itself. For example, when we look at the picture in FIG. 2A, we know it is a race car although only a small portion of the car is shown. We can imagine in our mind the race car in three dimensions. If an individual has never seen a car or been told about cars in the past, he or she is unlikely to understand what this picture is about, even with the ability to imagine. Based on the shining paint and the color of the rubber tire, we can conclude that the race car is of very high quality. Similarly, we realize that the girl in FIG. 2B is spinning based on the perceived movements with respect to the background grassland and her posture. Human beings are not always correct in image interpretation. For example, a nice toy race car may generate the same photograph as in FIG. 2A. Computer graphics techniques can also produce a picture just like that.
Without a doubt, it is very difficult, if at all possible, to empower computers with the capability of imagining what is absent in a picture. However, we can potentially train computers by examples to recognize certain concepts. Such training techniques are valuable for annotating not only photographic images taken by home digital cameras but also the ever-increasing volume of digital images in scientific research experiments. In biomedicine, for instance, modern imaging technologies reveal to us tissues and portions of our body in finer and finer detail, and with different modalities. With the vast amount of image data we generate, it has become a serious problem to examine all the data manually. Statistical or machine learning based technologies can potentially allow computers to screen such images before scientists spend their precious time on them.
Prior Related Work
The problem of automatic image annotation is closely related to that of content-based image retrieval. Since the early 1990s, numerous approaches, from both academia and industry, have been proposed to index images using numerical features automatically extracted from the images. Smith and Chang developed a Web image retrieval system. In 2000, Smeulders et al. published a comprehensive survey of the field. Progress made in the field after 2000 is documented in a recent survey article. We review here some work closely related to ours. The references listed below are to be taken as examples only. Readers are urged to refer to survey articles for more complete references of the field.
Some initial efforts have recently been devoted to automatically annotating pictures, leveraging decades of research in computer vision, image understanding, image processing, and statistical learning. Generative modeling, statistical boosting, visual templates, Support Vector Machines, multiple instance learning, active learning, latent space models, spatial context models, feedback learning, and manifold learning have been applied to image classification, annotation, and retrieval.
Our work is closely related to generative modeling approaches. In 2002, we developed the ALIP annotation system by profiling categories of images using the 2-D Multiresolution Hidden Markov Model (MHMM). Images in every category focus on a semantic theme and are described collectively by several words, e.g., “sail, boat, ocean” and “vineyard, plant, food, grape”. A category of images is consequently referred to as a semantic concept. That is, a concept in our system is described by a set of annotation words. In our experiments, the term concept can be interchangeable with the term category (or class). To annotate a new image, its likelihood under the profiling model of each concept is computed. Descriptive words for top concepts ranked according to likelihoods are pooled and passed through a selection procedure to yield the final annotation. If the layer of word selection is omitted, ALIP essentially conducts multiple classification, where the classes are hundreds of semantic concepts.
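The annotation flow described above (score the image under each concept's model, rank the concepts, pool their words, then select) can be sketched as follows. This is only an illustrative sketch under stated assumptions: a toy diagonal Gaussian stands in for the 2-D MHMM, the concept names, feature vectors, and the functions `log_likelihood` and `annotate` are hypothetical, and the word-selection step is simplified to counting how often a word occurs among the top-ranked concepts.

```python
import math
from collections import Counter

# Hypothetical concept profiles. Each concept carries its annotation words and
# a toy diagonal-Gaussian model (an assumption standing in for the 2-D MHMM).
CONCEPTS = {
    "sailing": {"words": ["sail", "boat", "ocean"],
                "mean": [0.2, 0.7], "std": [0.1, 0.1]},
    "beach": {"words": ["ocean", "sand", "sky"],
              "mean": [0.3, 0.6], "std": [0.1, 0.1]},
    "vineyard": {"words": ["vineyard", "plant", "food", "grape"],
                 "mean": [0.8, 0.2], "std": [0.1, 0.1]},
}


def log_likelihood(features, model):
    """Log-likelihood of a feature vector under a diagonal Gaussian."""
    total = 0.0
    for x, mu, sigma in zip(features, model["mean"], model["std"]):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total


def annotate(features, concepts, top_k=2, max_words=3):
    """Rank concepts by likelihood, pool their words, keep the most frequent."""
    ranked = sorted(
        concepts,
        key=lambda name: log_likelihood(features, concepts[name]),
        reverse=True,
    )
    top = ranked[:top_k]
    # Simplified selection step (assumption): prefer words shared by more
    # of the top-ranked concepts.
    counts = Counter(w for name in top for w in concepts[name]["words"])
    words = [w for w, _ in counts.most_common(max_words)]
    return top, words


top_concepts, suggested_words = annotate([0.22, 0.68], CONCEPTS)
```

If the selection step is dropped and only `top_concepts[0]` is kept, the sketch reduces to plain classification over the concepts, which mirrors the remark above about omitting the word-selection layer.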
Classifying images into a large number of categories has also been explored recently by Chen et al. for the purpose of pure classification and by Carneiro et al. for annotation using multiple instance learning. Barnard et al. aimed at modeling the relationship between segmented regions in images and annotation words. A generative model for producing image segments and words is built based on individually annotated images. Given a segmented image, words are ranked and chosen according to their posterior probabilities under the estimated model. Several forms of the generative model were experimented with and compared against each other.
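Ranking words by posterior probability given a segmented image can be illustrated with a small mixture model. The sketch below is not the model of Barnard et al.; it assumes, purely for illustration, that each region has been reduced to a discrete type and that a latent cluster generates both regions and words. The cluster table and the function `word_posteriors` are hypothetical.

```python
# Hypothetical toy model: each latent cluster has a prior, a distribution over
# discrete region types, and a distribution over annotation words.
CLUSTERS = [
    {"prior": 0.5,
     "region": {"blue": 0.6, "green": 0.1, "white": 0.3},
     "word": {"sky": 0.5, "water": 0.4, "grass": 0.1}},
    {"prior": 0.5,
     "region": {"blue": 0.1, "green": 0.8, "white": 0.1},
     "word": {"sky": 0.1, "water": 0.1, "grass": 0.8}},
]


def word_posteriors(regions, clusters):
    """Return (word, p(word | regions)) pairs sorted by decreasing posterior."""
    # Unnormalized p(cluster | regions): prior times product of p(region | cluster).
    weights = []
    for c in clusters:
        w = c["prior"]
        for r in regions:
            w *= c["region"][r]
        weights.append(w)
    total = sum(weights)
    cluster_post = [w / total for w in weights]

    # p(word | regions) = sum over clusters of p(cluster | regions) * p(word | cluster).
    vocab = sorted({w for c in clusters for w in c["word"]})
    word_post = {
        w: sum(pc * c["word"][w] for pc, c in zip(cluster_post, clusters))
        for w in vocab
    }
    return sorted(word_post.items(), key=lambda kv: kv[1], reverse=True)


ranked_words = word_posteriors(["blue", "blue", "white"], CLUSTERS)
```

With mostly "blue" regions the first cluster dominates the posterior, so its words rise to the top of the ranking; an annotator would then keep the first few entries of `ranked_words`.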
The early research has not investigated real-time automatic annotation of images with a vocabulary of several hundred words. For example, as reported, the system takes about 15-20 minutes to annotate an image on a 1.7 GHz Intel-based processor, prohibiting its deployment in the real world for Web-scale image annotation applications. Existing systems also lack performance evaluation in real-world deployment, leaving the practical potential of automatic annotation largely unaddressed. In fact, most systems have been tested using images from the same collection as the training images, resulting in biased evaluation. In addition, because direct measurement of annotation accuracy involves labor-intensive examination, substitute quantities related to accuracy have often been used instead.