A large number of applications employ multimedia information. However, it is cumbersome for users and applications to manipulate multimedia information effectively because of its nature. Multimedia information is often stored in digital data files. These files require a large amount of storage, making manipulation of the multimedia information by applications computationally expensive. If the digital data file is stored on a network, access to the digital data file by applications is hindered by limitations on network bandwidth.
In addition to the difficulties presented to applications by multimedia information, users are also challenged by multimedia information. Multimedia information, such as motion pictures or music, is time-dependent media. Because it is time-dependent, it is often not practical for users to audit an entire work. For example, if a motion picture search engine returns many results, each lasting 90 minutes or more, the user will not have time to investigate each result. In another example, a music e-commerce website may offer music for prospective buyers to audition. It is burdensome for users to listen to an entire song in order to determine whether they like it. Additionally, by providing users with access to complete songs, the website operator has essentially given away its merchandise and discouraged users from purchasing music.
In practically every application, it is desirable to have a summary of the multimedia information. One type of summary is an excerpted segment of the multimedia information. In order to be an effective summary, it is highly desirable that the segment be a good representation of the entire work. Unfortunately, existing algorithms for producing summaries do little to ensure that the summary is representative of the longer multimedia information.
One prior approach to producing a summary is to always select a specific time segment of the multimedia information for the summary. For example, this approach might always select the first 30 seconds of an audio track as the summary. The results of this crude approach may be very unsatisfying if, for example, the bulk of the audio track bears little resemblance to an idiosyncratic introduction.
Other prior approaches to automatic summarization must be tailored to the particular type of multimedia information. For video summarization, video is partitioned into segments and the segments are clustered according to their similarity to each other. The segment closest to the center of each cluster is chosen as the representative segment for the entire cluster. Other video summarization approaches attempt to summarize video using various heuristics, typically derived from analysis of closed captions accompanying the video. These approaches rely on video segmentation, or require either clustering or training.
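The cluster-based selection step described above can be illustrated with a minimal sketch. It assumes that each video segment has already been reduced to a numeric feature vector and assigned a cluster label by some earlier step; the function name `representative_segments`, the list-of-lists feature layout, and the use of Euclidean distance are all hypothetical choices made for illustration, not details taken from any particular prior system.

```python
import math
from collections import defaultdict


def representative_segments(features, labels):
    """Pick one representative segment per cluster.

    features: list of equal-length numeric vectors, one per segment (assumed).
    labels:   list of cluster labels, one per segment (assumed precomputed).
    Returns a dict mapping each label to the index of the segment whose
    feature vector is closest (Euclidean distance) to the cluster centroid.
    """
    # Group segment indices by their cluster label.
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    representatives = {}
    for label, indices in members.items():
        dim = len(features[indices[0]])
        # The centroid is the per-dimension mean of the cluster's vectors.
        centroid = [
            sum(features[i][d] for i in indices) / len(indices)
            for d in range(dim)
        ]
        # The representative is the member nearest to that centroid.
        representatives[label] = min(
            indices, key=lambda i: math.dist(features[i], centroid)
        )
    return representatives
```

Note that even this small sketch presupposes both a segmentation of the video and a clustering of the segments, which is the dependency the paragraph above points out.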
Audio summarization techniques typically begin with a segmentation phase that divides the audio into segments, usually by looking for audio features such as silence or pitch. Representative segments are then selected based on various criteria. If features such as silence or pitch are absent from a particular multimedia source, these techniques perform poorly.
Text summarization typically uses term frequency/inverse document frequency to select paragraphs, sentences, or key phrases that are both representative of the document and significantly different from other documents. This requires knowledge about the content of other documents.
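As a rough illustration of term frequency/inverse document frequency scoring, the sketch below ranks the sentences of one document against a small collection of other documents. The function names, the simple regular-expression tokenizer, and the choice to average term weights over each sentence are assumptions made for this example; practical systems differ in tokenization, weighting, and normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_sentence_scores(sentences, other_documents):
    """Score each sentence by the mean TF-IDF weight of its terms.

    Term frequency is counted within the document being summarized.
    Inverse document frequency is computed over that document plus
    other_documents, so the other documents must be available.
    """
    doc_tokens = [tok for s in sentences for tok in tokenize(s)]
    tf = Counter(doc_tokens)

    # One set of distinct terms per document in the collection.
    collection = [set(doc_tokens)] + [set(tokenize(d)) for d in other_documents]
    n_docs = len(collection)

    def idf(term):
        # Every scored term occurs in the summarized document, so df >= 1.
        df = sum(1 for doc in collection if term in doc)
        return math.log(n_docs / df)

    scores = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if not tokens:
            scores.append(0.0)
            continue
        scores.append(sum(tf[t] * idf(t) for t in tokens) / len(tokens))
    return scores
```

A sentence built from terms that are rare in the rest of the collection scores higher than one built from common terms. The `other_documents` argument makes the dependency on outside content explicit: without it, the inverse document frequency carries no information.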
It is desirable to have a method for producing automatic summaries which 1) is capable of working on any type of multimedia information; 2) produces a good representation of the entire work; 3) does not depend on specific features of the multimedia information; and 4) requires no segmentation, clustering, or training. Additionally, it is advantageous to have a method which can easily produce a summary of the desired length.