1. Field of the Invention
The invention disclosed and claimed herein generally pertains to a method and apparatus for enriching a narrative or other descriptive message, by acquiring pertinent multimedia objects or artifacts for presentation with the narrative. More particularly, the invention pertains to a method of the above type wherein objects of multimedia content can be searched out and retrieved for presentation with different segments or portions of the narrative. Even more particularly, the invention pertains to a method of the above type that can readily be used by persons without special training or equipment, in order to enrich virtually any narrative or other descriptive message with diverse multimedia content.
2. Description of the Related Art
Different media, such as text, audio, image, and video, are used to communicate messages, ideas and concepts in computer-based communications. Currently, the majority of the communicated information is uni-modal. For example, a blog post, an e-mail message or a news article is a piece of information in a text-only format, whereas a collection of vacation photos portrays the information about the vacation only in image form. However, certain media are most suitable for communicating certain concepts. For example, seeing an image of a “sunset” is more informative than a description of that concept in text form. Ideally, one should be able to compose a message, or enrich an already existing message in one medium, partially or totally, with snippets of other, alternative media that illustrate the concepts in the message. The multimedia enhanced message could improve the user's perception of the message, or could transform the message into a form more amenable to communication in a given context.
The proliferation of multimedia content, in various application domains, provides rich repositories of media snippets. However, present systems and approaches, which might be used to access media content for message enrichment, tend to be uni-modal. Accordingly, such approaches are concerned with only a single type of content media. For example, G. Grefenstette and P. Tapanainen, in “What is a word? What is a sentence? Problems of Tokenization”, Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX'94), Budapest, Hungary 1994, provide techniques for parsing text messages and extracting tokens from them. M. Campbell, S. Ebadollahi, M. Naphade, A. P. Natsev, J. R. Smith, J. Tesic, L. Xie, and A. Haubold, in “IBM research TRECVID-2006 video retrieval system,” NIST TRECVID Workshop (Gaithersburg, Md.), November 2006, provide a system for parsing videos into their constituent elements of temporal structure, each of which is then condensed into a single image (key-frame).
Systems such as JURU use tokens of the above type, obtained from parsing text documents, as queries to retrieve textual content from repositories. In the IBM Multimedia Analysis and Retrieval System, features derived from images or structural elements of the above type are used to form queries that can be posed against a repository of images and videos. Text search methods such as that of Y. Maarek and F. Smadja, “Full Text Indexing Based on Lexical Relations: An Application: Software Libraries”, in Proceedings of 12th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1989, and its JAVA implementation JURU, search repositories of pre-indexed text documents. The IBM Multimedia Analysis and Retrieval System obtains the best matches in a repository of images and videos based on the closeness of the feature representations of those artifacts. Thus, each of these systems is directed only to objects of a single medium, such as text, video or images.
Moreover, given a collection of media objects, there are a variety of systems for stitching them together in order to compose a single document. Once again, however, such systems tend to be uni-modal in scope. For example, in A. Girgensohn, F. Shipman, L. Wilcox, “Hyper-Hitchcock: Authoring Interactive Videos and Generating Interactive Summaries”, in Proceedings of 11th ACM International Conference on Multimedia, 2003, a system is presented that aids the user in editing multiple video objects to form a single video. In Xian-Sheng HUA, Lie LU, Hong-Jiang ZHANG, “Optimization-based automated home video editing system”, in IEEE Transactions on Circuits and Systems for Video Technology, Volume: 14, Issue: 5, pages: 572-583, May 2004, a system is described that automatically extracts segments of a video and concatenates them in order to summarize home videos. Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown, in “Inferring Strategies for Sentence Ordering in Multidocument News Summarization”, Journal of Artificial Intelligence Research, 17:35-55, 2002, provide a methodology for automatically summarizing and compiling a composite text document using only a multiplicity of other text documents.
It is thus seen that each of the above-mentioned systems and approaches addresses only a single component of a system required for multimedia narrative enrichment. An end-to-end system and method, which accepts a narration as input and coordinates all necessary tasks to generate a coherent multimedia enriched narrative as an output, does not presently exist. Such necessary tasks would include acquiring multimedia objects or artifacts for different portions of the narration, and then assembling or composing the respective portions and multimedia artifacts into a coherent multimedia enriched narrative.