Efficient access to multimedia database content requires the ability to search and organize multimedia information. In one form of traditional image retrieval, users have to provide examples of images that they are looking for. Similar images are found based on the match of image features. This retrieval paradigm is called Content-Based Image Retrieval (CBIR). In another type of retrieval system, images are associated with a description or metadata which is used as a surrogate for the image content.
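By way of a non-limiting illustration, the query-by-example paradigm described above may be sketched as follows. The sketch assumes a coarse grayscale intensity histogram as the image feature and Euclidean distance as the match criterion; the function names, the feature choice, and the toy pixel data are hypothetical and are not drawn from any particular CBIR system.

```python
import math

def gray_histogram(pixels, bins=4, max_value=256):
    """Normalized intensity histogram of a flat list of grayscale pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // max_value, bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def query_by_example(query_pixels, database, top_k=2):
    """Rank database images by feature distance to the example image."""
    q = gray_histogram(query_pixels)
    scored = [(euclidean(q, gray_histogram(px)), name)
              for name, px in database.items()]
    return [name for _, name in sorted(scored)[:top_k]]

# Toy database: image name -> flat list of pixel intensities (assumed data).
database = {
    "dark": [10, 20, 30, 40],
    "bright": [200, 220, 240, 250],
    "mixed": [10, 20, 240, 250],
}
result = query_by_example([15, 25, 35, 45], database)
```

Because such a ranking depends only on low-level features, two images having similar histograms but unrelated subject matter would be scored as similar.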
Audio retrieval systems may also employ pattern recognition and content extraction. Typically, the content extraction operates on a direct semantic level, e.g., speech recognition, while source modeling and musical structure extraction may also be employed. Rarely, however, are implicit semantic concepts derived independently of the semantic structures expressly presented in the audio sample.
Although CBIR has been studied extensively, empirical studies have shown that image features alone are usually insufficient for finding similar images, owing to the notorious gap between low-level features and high-level semantic concepts (called the semantic gap) [21]. To reduce this gap, it has been proposed to represent the visual content of an image by region-based features (describing object-level features) instead of raw features of the whole image [5, 22, 7].
On the other hand, it is well observed that imagery often does not exist in isolation; instead, rich collateral information typically co-exists with image data in many applications. Examples include the Web, many domain-specific archived image databases (in which images carry annotations), and even consumer photo collections. To further reduce the semantic gap, multi-modal approaches to image retrieval have recently been proposed in the literature [25] to explicitly exploit the redundancy between the images and the collateral information co-existing with them. In addition to improved retrieval accuracy, another benefit of the multi-modal approaches is the added querying modalities. Users can query an image database by image, by a collateral information modality (e.g., text), or by any combination thereof.
In addition to static image retrieval, proposals and systems have been developed to handle object identification, extraction, characterization, and segment retrieval in video programs and samples. In general, these systems directly extend the static image and audio techniques, although they may benefit from synchronization and association of audio and video data (and possibly closed caption text, if available). Likewise, analysis of temporal changes allows extraction of objects, analysis of object degrees of freedom, and identification of motion planes within the signal. See, U.S. Pat. Nos. 6,850,252; 6,640,145; 6,418,424; 6,400,996; 6,081,750; 5,920,477; 5,903,454; 5,901,246; 5,875,108; 5,867,386; and 5,774,357, expressly incorporated herein by reference. Automated annotation of images and video content may be used in conjunction with MPEG-7 technologies.
Another use of object identification techniques is to permit efficient compression and/or model-based representation of objects within an image or video stream. Thus, especially in information loss-tolerant compression schemes, an image may be compressed in a vector quantized scheme by representing an object with a symbol (e.g., a word). The symbol, of course, may include a variety of parameters, for example describing scaling, translation, distortion, orientation, conformation, etc., as well as providing an error vector to establish the deviation of the original image from the symbol (or model) representation.
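The symbol-plus-error-vector representation may be illustrated by the following minimal sketch, which assumes a small hypothetical codebook mapping words to prototype feature vectors and omits the scaling, translation, and orientation parameters mentioned above. The names `encode`, `decode`, and `codebook` are illustrative only.

```python
def encode(block, codebook):
    """Choose the nearest codebook symbol and keep the residual (error vector)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    symbol = min(codebook, key=lambda s: sq_dist(block, codebook[s]))
    residual = [x - y for x, y in zip(block, codebook[symbol])]
    return symbol, residual

def decode(symbol, residual, codebook):
    """Rebuild the block from the symbol's prototype plus the error vector."""
    return [y + r for y, r in zip(codebook[symbol], residual)]

# Hypothetical codebook: symbol (a word) -> prototype feature vector.
codebook = {"sky": [200, 210, 220], "grass": [40, 160, 50]}
block = [198, 212, 221]

symbol, residual = encode(block, codebook)
reconstructed = decode(symbol, residual, codebook)
```

In a loss-tolerant scheme, the error vector may be coarsely quantized or discarded, in which case decoding returns only the prototype associated with the symbol.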
A number of approaches to automatic image annotation have been proposed in the literature [1, 10, 11, 17]. Different models and machine learning techniques have been developed to learn the correlation between image features and textual words from examples of annotated images, and then to apply the learned correlation to predict words for unseen images. The co-occurrence model [19] collects the co-occurrence counts between words and image features and uses them to predict annotation words for images. Barnard and Duygulu et al. [1, 10] improved the co-occurrence model by utilizing machine translation models. The models are correspondence extensions to Hofmann's hierarchical clustering aspect model [14, 15, 13], which incorporate multi-modal information. The models consider image annotation as a process of translation from “visual language” to text and collect the co-occurrence information by estimating the translation probabilities. The correspondence between blobs and words is learned by using statistical translation models. As noted by the authors [1], the performance of the models is strongly affected by the quality of image segmentation. More sophisticated graphical models, such as Latent Dirichlet Allocation (LDA) [3] and correspondence LDA, have also been applied to the image annotation problem recently [2]. Another way to address automatic image annotation is to apply classification approaches. The classification approaches treat each annotation word (or each semantic category) as an independent class and create a different image classification model for every word (or category).
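The counting step underlying a co-occurrence approach may be sketched as follows. This is a simplified illustration rather than the model of [19]: it assumes each image has already been reduced to a list of discrete blob identifiers (quantized region features), and it scores a candidate word by summing its relative co-occurrence frequency over the blobs of the unseen image. The blob labels and training data are hypothetical.

```python
from collections import Counter, defaultdict

def train_cooccurrence(training_set):
    """Count how often each word appears together with each blob identifier."""
    counts = defaultdict(Counter)
    for blobs, words in training_set:
        for b in blobs:
            for w in words:
                counts[b][w] += 1
    return counts

def predict_words(blobs, counts, top_k=2):
    """Score words by summing count(word, blob) / count(blob) over the blobs."""
    scores = Counter()
    for b in blobs:
        word_counts = counts.get(b, {})
        total = sum(word_counts.values())
        for w, c in word_counts.items():
            scores[w] += c / total
    return [w for w, _ in scores.most_common(top_k)]

# Hypothetical annotated training images: (blob identifiers, annotation words).
training_set = [
    (["b_sky", "b_water"], ["sky", "sea"]),
    (["b_sky", "b_grass"], ["sky", "field"]),
    (["b_grass", "b_tree"], ["field", "tree"]),
]
counts = train_cooccurrence(training_set)
predicted = predict_words(["b_sky", "b_water"], counts)
```

Because every word of an image is counted against every blob of that image, the counts are diluted when segmentation is poor, which is consistent with the sensitivity to segmentation quality noted above.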
One representative work among these approaches is automatic linguistic indexing of pictures (ALIPS) [17]. In ALIPS, the training image set is assumed to be well classified, and each category is modeled by using 2D multi-resolution hidden Markov models. The image annotation is based on nearest-neighbor classification and word occurrence counting, while the correspondence between the visual content and the annotation words is not exploited. In addition, the assumption made in ALIPS that the annotation words are semantically exclusive may not necessarily be valid.
Recently, relevance language models [11] have been successfully applied to automatic image annotation. The essential idea is to first find annotated images that are similar to a test image and then use the words shared by the annotations of the similar images to annotate the test image.
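The essential idea may be illustrated with a deliberately simplified stand-in, in which the probabilistic estimation of a relevance model is replaced by plain nearest-neighbor voting. The sketch assumes each image is represented by a single feature vector, and ranks words by the number of nearest annotated images that share them; all names and data are hypothetical.

```python
import math
from collections import Counter

def annotate_by_neighbors(test_feature, annotated, k=2, n_words=1):
    """Annotate a test image with the words most shared by its k nearest neighbors."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Keep the k annotated images closest to the test image in feature space.
    neighbors = sorted(annotated, key=lambda item: dist(test_feature, item[0]))[:k]
    # Each neighbor casts one vote for each distinct word in its annotation.
    votes = Counter(w for _, words in neighbors for w in set(words))
    return [w for w, _ in votes.most_common(n_words)]

# Hypothetical annotated collection: (feature vector, annotation words).
annotated = [
    ([0.9, 0.1], ["sky", "cloud"]),
    ([0.8, 0.2], ["sky", "plane"]),
    ([0.1, 0.9], ["grass", "cow"]),
]
words = annotate_by_neighbors([0.85, 0.15], annotated)
```

A relevance model proper would weight each training image by a probability of generating the test image rather than applying a hard cutoff at k neighbors.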
One model in this category is the Multiple-Bernoulli Relevance Model (MBRM) [11], which is based on the Continuous-space Relevance Model (CRM) [16]. In MBRM, the word probabilities are estimated using a multiple-Bernoulli model, and the image block feature probabilities are estimated using a nonparametric kernel density estimate. The reported experiments show that MBRM outperforms the earlier CRM, which assumes that the annotation words for any given image follow a multinomial distribution and applies image segmentation to obtain blobs for annotation.
It has been noted that in many cases, such as in the Web search environment, both images and word-based documents are relevant to users' querying needs. In these scenarios, multi-modal image retrieval, i.e., leveraging the collected textual information to improve image retrieval and to enhance users' querying modalities, has been shown to be very promising. Some studies have been reported on this problem.
Chang et al. [6] applied a Bayes point machine to associate words and images to support multi-modal image retrieval. In [26], latent semantic indexing is used together with both textual and visual features to extract the underlying semantic structure of Web documents. An improvement in retrieval performance is reported and attributed to the synergy of the two modalities. Recently, approaches using multi-modal information for Web image retrieval have been emerging. In [23], an iterative similarity propagation approach is proposed to explore the inter-relationships between Web images and their textual annotations for image retrieval. The mutual reinforcement of similarities between different modalities is exploited, which boosts Web image retrieval performance.
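Latent semantic indexing over combined textual and visual features may be illustrated generically as follows. This sketch is not the method of [26]; it assumes each document is a row vector that concatenates word counts with visual feature counts, and uses a truncated singular value decomposition (via NumPy) to obtain a low-rank latent space. The feature columns and values are hypothetical.

```python
import numpy as np

def lsi_embed(doc_feature_matrix, rank):
    """Project documents into a rank-`rank` latent space via truncated SVD."""
    u, s, _ = np.linalg.svd(doc_feature_matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows are documents; columns are assumed features:
# [word "beach", word "sea", word "city", visual "blue", visual "gray"]
X = np.array([
    [2.0, 1.0, 0.0, 1.0, 0.0],
    [1.0, 2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0, 2.0],
])
Z = lsi_embed(X, rank=2)

same_topic = cosine(Z[0], Z[1])   # two seaside documents
cross_topic = cosine(Z[0], Z[2])  # a seaside document versus an urban document
```

In this toy example, documents that share both words and visual features collapse onto the same latent direction, while documents from the other group fall on a separate direction.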
Appendix A provides a list of references relating to content-based image retrieval (CBIR) and latent semantic indexing, each of which is expressly incorporated herein by reference.