Advances in video and computing technologies are enabling the production of increasing amounts of multimedia information in the form of broadcast radio and television programs, video recordings, multimedia software, Web pages, and the like. The now widespread popularity of broadcast news services such as the Cable News Network (CNN), news documentary programs such as those produced by the Public Broadcasting Service (PBS), and computerized interactive media such as the Web pages provided by USA Today, has made it clear that the value of multimedia information has expanded beyond transient communication and entertainment to become a serious form of business communication, advertisement, historical record and research tool.
Researchers have long used content summaries as a valuable way to locate source material. In the case of printed media such as books or magazines, it is fairly simple to create an index or abstracts of contents at the time that the source material is edited or typeset. The indexes, abstracts, or even the full text can then be maintained in the form of computer files; indeed, in most cases, the source information now originates as word processing files. Standard database software can then be used to search for information of interest, and many such systems are in commercial operation at the present time. Some of these systems exploit techniques to automatically summarize text. These techniques include simply selecting a predetermined number of the initial words, measuring word-frequency distributions, detecting clue phrases (e.g., “in summary,” “the most important”), or reasoning about the discourse and/or rhetorical structure.
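By way of illustration only, three of the summarization techniques named above (selecting initial words, measuring word frequencies, and detecting clue phrases) might be sketched in Python as follows. This is a minimal sketch under stated assumptions and does not reproduce any particular commercial system: the function names, the stop-word list, the sentence splitter, and the clue-phrase bonus weight are all hypothetical choices made for the example. Only the two clue phrases are taken from the text.

```python
import re
from collections import Counter

# Clue phrases quoted in the text; everything else here is an illustrative assumption.
CLUE_PHRASES = ("in summary", "the most important")
STOPWORDS = {"the", "a", "an", "in", "on", "for", "about", "was", "of", "and", "to"}


def lead_summary(text, n_words=10):
    """Technique 1: keep a predetermined number of the initial words."""
    return " ".join(text.split()[:n_words])


def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, freqs, clue_bonus=2.0):
    """Techniques 2 and 3: mean word frequency, plus a bonus for a clue phrase."""
    words = content_words(sentence)
    if not words:
        return 0.0
    score = sum(freqs[w] for w in words) / len(words)
    if any(phrase in sentence.lower() for phrase in CLUE_PHRASES):
        score += clue_bonus  # hypothetical weight, not taken from the source
    return score


def summarize(text, n_sentences=1):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freqs = Counter(content_words(text))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], freqs),
        reverse=True,
    )
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


sample = (
    "The council met on Monday. Members argued about the budget for hours. "
    "In summary, the budget was approved."
)
print(lead_summary(sample, 3))
print(summarize(sample, 1))
```

In this sketch the third sentence is selected because it both contains the most frequent content word and begins with a clue phrase. Reasoning about discourse or rhetorical structure, the fourth technique mentioned, is not attempted here.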
Just as content abstraction techniques have been found to be quite valuable for managing printed text materials, multimedia sources would ideally be available with similar facilities to support real-time profiling as well as retrospective search. Unfortunately, unlike printed media, the inherently continuous and visual nature of multimedia information makes it relatively difficult to catalogue. At the present time, it is common to employ data analysts to manually view, search, extract, transcribe, summarize, visualize, annotate, and segment multimedia information sources in order to find a subject of interest or to discover trends. These manual annotation techniques suffer from problems of accuracy, consistency (when performed by more than one operator), timeliness, scalability, and cost.
Certain techniques have been developed to automatically create time indices based on visual transitions in the multimedia image stream (e.g., dissolve, fade, cut) and shot classification (e.g., news anchor shot versus story shots). See, for example, Zhang, H., et al., in “Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution,” Proceedings of ACM Multimedia '95, pp. 15-24. Certain other researchers have specifically investigated the linguistic aspects of video such as closed captioned text or transcripts, and provided for the indexing of key words with associated key video frames to create a static and/or hypertext depiction of television news. See Shahraray, B., et al., in “Automated Authoring of Hypermedia Documents of Video Programs,” Proceedings of ACM Multimedia '95, pp. 401-409, and Liang, Y., et al., in “A Practical Video Database Based on Language and Image Analysis,” AAAI Spring Symposium, 1997, pp. 127-132. More complex linguistic processing is reported in Taniguchi, et al., “An Intuitive and Efficient Access Interface to Real-Time Incoming Video Based on Automatic Indexing,” Proceedings of ACM Multimedia '95, pp. 25-34, who use Japanese topic markers such as “ex ni tsuite” and “wa” (“with regard to”, “as for”), as subject key word markers. Unfortunately, such key word indices, even when supplemented with linguistic processing to address complexities such as synonymy, polysemy, and co-reference, typically only support more traditional search and retrieval tasks. Brown, M. G., et al., in “Automatic Content Based Retrieval of Broadcast News,” Proceedings of ACM Multimedia '95, pp. 35-44, provide content-based access to video using a large scale, continuous speech recognition system to transcribe the associated audio. Hauptmann, A., et al., in “Informedia: News on Demand Multimedia Information Acquisition and Retrieval,” Intelligent Multimedia Information Retrieval, (Cambridge, Mass.: AAAI Press, 1997), pp. 215-239, perform a series of analyses including color histograms, optical flow analysis, and speech transcription.
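As a simplified illustration of how a time index might be derived from visual transitions, the following Python sketch flags an abrupt cut wherever the intensity histograms of two consecutive frames differ by more than a threshold. It is not the method of any of the works cited above. The frame representation (a non-empty flat list of grayscale values from 0 to 255), the bin count, the distance measure, and the threshold are all assumptions made for the example.

```python
def histogram(frame, bins=8):
    """Normalized intensity histogram of one frame.

    `frame` is assumed to be a non-empty flat list of grayscale values (0-255).
    """
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """Half the L1 distance between two normalized histograms (range 0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5):
    """Return indices i where frame i differs sharply from frame i - 1."""
    hists = [histogram(f) for f in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


# Two dark frames followed by two bright frames: one cut, at index 2.
dark = [10] * 16
bright = [240] * 16
print(detect_cuts([dark, dark, bright, bright]))
```

A single fixed threshold of this kind responds only to abrupt changes. Gradual transitions such as dissolves and fades spread the change over many frames and would generally require a different comparison, which this sketch does not attempt.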