The representation of knowledge is important for developing computerized systems for applications in decision support, information discovery and retrieval, robotics, and artificial intelligence. In the most general sense, real world knowledge can be represented by describing semantic concepts and their relations as well as the rules for manipulating them. With the availability of audio and visual content acquisition devices, multimedia content, which is understood as content in the form of images, video, audio, graphics, text, and any combination thereof, can play an increasing role in representing real world knowledge.
Audio-visual content is typically formed from the projection of real world entities through an acquisition process involving cameras and other recording devices. In this regard, audio-visual content acquisition is comparable to the capturing of the real world by human senses. This provides a direct correspondence of human audio and visual perception with the audio-visual content. On the other hand, text or words in a language can be thought of as symbols for the real world entities. The mapping from the content level to the symbolic level by computer is quite limited and far from reaching human performance. As a result, in order to deal effectively with audio-visual material, it is necessary to model real world concepts and their relationships at both the symbolic and perceptual levels by developing explicit representations of this knowledge.
Multimedia knowledge representations have been found to be useful for aiding information retrieval. The multimedia thesaurus (MMT) demonstrated by R. Tansley, C. Bird, W. Hall, P. Lewis, and M. Weal in an article entitled “Automating the Linking of Content and Concept,” published in the Proc. of ACM Multimedia, Oct. 30–Nov. 4, 2000, is used in the MAVIS information retrieval system described by D. W. Joyce, P. H. Lewis, R. H. Tansley, M. R. Dobie, and W. Hall in an article entitled “Semiotics and Agents for Integrating and Navigating Through Media Representations of Concepts,” published in Proc. of Conference on Storage and Retrieval for Media Databases 2000, (IS&T/SPIE-2000), Vol. 3972, pp. 120–31, January 2000. The MMT consists of a network of concepts that are connected by traditional thesaurus relationships, together with sets of multimedia content that are used as signifiers of the concepts. Applications of the MMT include expanding or augmenting queries: a query might consist of a textual term such as “car”, and the representations of that concept and of its narrower concepts are used to retrieve images of cars. However, the MMT does not address how perceptual relationships, such as feature similarity among the multimedia signifiers, contribute additional relations to the multimedia knowledge representation.
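By way of illustration only, the query expansion described above can be sketched as follows. This is a minimal sketch under stated assumptions and not the MMT or MAVIS implementation: the concept names, the narrower-term table `NARROWER`, the signifier table `SIGNIFIERS`, the file names, and the functions `expand_query` and `retrieve` are all hypothetical.

```python
from collections import deque

# Hypothetical toy thesaurus: each concept maps to its narrower concepts.
NARROWER = {
    "vehicle": ["car", "bicycle"],
    "car": ["sports car", "taxi"],
}

# Hypothetical media signifiers (image files) attached to each concept.
SIGNIFIERS = {
    "car": ["car_01.jpg"],
    "sports car": ["sports_car_01.jpg", "sports_car_02.jpg"],
    "taxi": ["taxi_01.jpg"],
    "bicycle": ["bicycle_01.jpg"],
}


def expand_query(term, narrower=NARROWER):
    """Return the query term followed by all narrower concepts, breadth-first."""
    expanded = [term]
    queue = deque([term])
    while queue:
        concept = queue.popleft()
        for child in narrower.get(concept, []):
            if child not in expanded:  # guard against cycles and duplicates
                expanded.append(child)
                queue.append(child)
    return expanded


def retrieve(term, narrower=NARROWER, signifiers=SIGNIFIERS):
    """Collect the signifiers of the term and of every narrower concept."""
    results = []
    for concept in expand_query(term, narrower):
        results.extend(signifiers.get(concept, []))
    return results


print(retrieve("car"))
```

In this sketch a query for “car” returns the images attached to “car” itself and to its narrower concepts, but not those attached to the sibling concept “bicycle”. Note that only thesaurus relationships are traversed; no perceptual similarity among the signifiers is used.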
Alternatively, visual thesauri, which have been found to be useful for searching image databases, describe the similarity relationships between features of the multimedia content. W. Y. Ma and B. S. Manjunath, in an article entitled “A Texture Thesaurus for Browsing Large Aerial Photographs,” published in the Journal of the American Society for Information Science (JASIS), Vol. 49, No. 7, pp. 633–648, May 1998, described the use of a texture thesaurus, which encodes different types of textures and their similarity, for browsing an image database on the basis of the textural features of the images. However, while the texture thesaurus addresses the perceptual relationships among the textures, it does not address the use of the textures as signifiers for concepts, nor does it associate other symbols such as words with the concepts.
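As a rough illustration of similarity-based browsing of this kind, the sketch below ranks a small set of texture entries by their distance to a query feature vector. It is an assumption-laden toy and not the method of Ma and Manjunath: the two-dimensional feature vectors, the entry identifiers in `TEXTURE_CODEBOOK`, the use of Euclidean distance, and the function names are all hypothetical.

```python
import math

# Hypothetical texture entries: identifier -> toy 2-D texture feature vector.
# The identifiers carry no semantic meaning; entries relate only by similarity.
TEXTURE_CODEBOOK = {
    "t0": (0.1, 0.2),
    "t1": (0.8, 0.7),
    "t2": (0.5, 0.9),
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_textures(feature, codebook=TEXTURE_CODEBOOK):
    """Return entry identifiers ordered from most to least similar to the query."""
    return sorted(codebook, key=lambda name: distance(feature, codebook[name]))


print(rank_textures((0.75, 0.75)))
```

The ranking depends only on feature distance, which mirrors the limitation noted above: nothing in such a structure links a texture entry to a concept or to a word.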
Relationships among words can be exploited for image retrieval, as taught by Y. Alp Aslandogan, C. Thier, C. T. Yu, J. Zon, and N. Rishe in the paper entitled “Using Semantic Contents and WordNet in Image Retrieval,” published in Proc. of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 286–295, 1997. The system allows similarity searching of images based on semantic entity-relationship-attribute descriptions of the image content. WORDNET is used for expanding the query or database for matching. WORDNET is a registered trademark of Trustees of Princeton University, Princeton, New Jersey. The WORDNET system, taught by G. A. Miller in an article entitled “WordNet: A Lexical Database for English,” published in Communications of the ACM, Vol. 38, No. 11, pp. 39–41, Nov. 1995, incorporated herein by reference, is a graphical network of concepts and associated words in which the relationships among concepts are governed by the form and meaning of the words. However, WORDNET and other textual representations of knowledge do not sufficiently address the audio-visual and perceptual aspects of the concepts they model. As a result, they have limited use for searching, browsing, or summarizing multimedia information repositories.