The present invention relates to a method of summarizing or abstracting video and, more particularly, to a method of using information related to a video, obtained from a source other than the video itself, to create an audio-video semantic summary of the video.
The dramatic increase in the quantity of available video, a trend which is expected to continue or accelerate, has increased the need for an automated means of summarizing video. A summary of a video could be viewed as a preview of, or in lieu of, the complete, unabridged video. Summarization could also be used as a basis for filtering large quantities of available video to create a video abstraction related to a specific subject of interest. However, to be most beneficial, the summary or abstraction should be semantically significant, capturing major events and meaning from the video.
There are three broad classes of techniques for creating video summaries. A first class of techniques produces a linear summary of a video sequence. A linear summary comprises a collection of key frames extracted from the video. Groups of similar frames, or shots, are located in the video sequence, and one or more key frames are selected from each shot to represent the content of the shot. Shot boundary detection and selection of key frames within a shot are based on low-level video analysis techniques, such as frame-to-frame variation in color distribution or the temporal position of a frame in a shot. While the creation of linear summaries can be automated, the extraction of a linear summary is not event-driven and may capture only a rough abstraction of the video. Linear summaries are useful for video sequences where events are not well defined, such as home video, but are not well suited to producing meaningful summaries of videos containing well-defined events, such as videos of sporting events.
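By way of illustration only, the linear summary technique described above can be sketched as follows. The sketch assumes that each frame is available as a flat sequence of 0 to 255 intensity values and that a shot boundary is declared when the L1 distance between consecutive frame histograms exceeds a fixed threshold. The names `histogram`, `detect_shots`, and `key_frames`, the bin count, and the threshold value are hypothetical and are not drawn from any particular prior art system.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a flat sequence of 0-255 intensity values.
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Return the normalized intensity histogram of one frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized histograms (range 0 to 2)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shots(frames: Sequence[Frame], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split frames into shots [start, end) at large frame-to-frame histogram changes."""
    if not frames:
        return []
    shots = []
    start = 0
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if histogram_distance(prev, cur) > threshold:
            shots.append((start, i))  # close the current shot at the boundary
            start = i
        prev = cur
    shots.append((start, len(frames)))
    return shots


def key_frames(shots: Sequence[Tuple[int, int]]) -> List[int]:
    """Select one key frame per shot by temporal position (the central frame)."""
    return [(start + end - 1) // 2 for start, end in shots]
```

Note that nothing in this sketch refers to the events portrayed in the video. The key frames are chosen purely from low-level statistics and temporal position, which is why such a summary may capture only a rough abstraction of the content.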
A second summary extraction technique produces a video story board. The story board is a graphic presentation of the video comprising a number of nodes and edges. Nodes are created by grouping shots, usually on the basis of some low-level visual characteristic, such as a color histogram. The edges describe relationships between the nodes and are created by human interaction with the summarizing system. While story boarding can produce meaningful summaries, it relies on human intervention to do so.
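By way of illustration only, a story board of the kind described above can be sketched as a simple graph structure. The sketch assumes that one normalized color histogram per shot has already been computed, and it groups shots greedily against the first histogram seen for each node. The class `StoryBoard`, the function `group_shots`, and the threshold value are hypothetical. Edges are not computed; they are recorded only when a human operator supplies them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class StoryBoard:
    # Each node is a list of shot indices judged visually similar.
    nodes: List[List[int]] = field(default_factory=list)
    # Each edge maps (source node, destination node) to a human-supplied relation.
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def add_edge(self, src: int, dst: int, relation: str) -> None:
        """Record a relationship between two nodes, as entered by a human operator."""
        if not (0 <= src < len(self.nodes) and 0 <= dst < len(self.nodes)):
            raise IndexError("edge endpoints must be existing nodes")
        self.edges[(src, dst)] = relation


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def group_shots(shot_histograms: Sequence[Sequence[float]], threshold: float = 0.5) -> StoryBoard:
    """Create nodes by assigning each shot to the first node with a similar histogram."""
    board = StoryBoard()
    representatives: List[Sequence[float]] = []
    for shot_index, hist in enumerate(shot_histograms):
        for node_id, rep in enumerate(representatives):
            if l1_distance(hist, rep) <= threshold:
                board.nodes[node_id].append(shot_index)
                break
        else:
            # No existing node is similar enough, so start a new one.
            representatives.append(hist)
            board.nodes.append([shot_index])
    return board
```

In this sketch the node grouping is automatic, but the graph carries no meaning until `add_edge` is called with a relation chosen by a person, which reflects the reliance on human intervention noted above.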
A third summary extraction technique involves the creation of semantic video summaries, which requires an understanding of the events in the video and, in many cases, some expertise in the domain or subject area portrayed by the video. Obtaining this understanding and expertise through automated means has, heretofore, been problematic. Smith et al., VIDEO SKIMMING FOR QUICK BROWSING BASED ON AUDIO AND IMAGE CHARACTERIZATION, Carnegie-Mellon University Tech Report, CMU-CS-95-186, 1995, describes detecting keywords in the audio track or closed captioning accompanying a video as a basis for locating meaningful video segments. However, it is difficult to select appropriate keywords, and the selected keywords may be uttered many times as part of general commentary related to the subject of the video without necessarily signaling the presence of corresponding meaningful visual images.
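By way of illustration only, keyword-based location of video segments can be sketched as follows. This is not the method of Smith et al.; it is a minimal sketch assuming that closed captions are available as (start time, end time, text) tuples sorted by start time. The function `find_keyword_segments`, the `padding` parameter, and the sample captions are hypothetical.

```python
import re
from typing import List, Sequence, Tuple

# Assumption: a caption is (start_seconds, end_seconds, text), sorted by start time.
Caption = Tuple[float, float, str]


def find_keyword_segments(
    captions: Sequence[Caption],
    keywords: Sequence[str],
    padding: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return time segments whose captions contain any keyword, merging overlaps."""
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    segments: List[Tuple[float, float]] = []
    for start, end, text in captions:
        if not any(pattern.search(text) for pattern in patterns):
            continue
        segment = (max(0.0, start - padding), end + padding)
        if segments and segment[0] <= segments[-1][1]:
            # Overlaps the previous segment, so extend it instead of adding a new one.
            segments[-1] = (segments[-1][0], max(segments[-1][1], segment[1]))
        else:
            segments.append(segment)
    return segments


captions = [
    (0.0, 4.0, "Welcome, we expect a goal or two tonight."),
    (30.0, 34.0, "He shoots... GOAL!"),
    (35.0, 38.0, "What a goal that was."),
    (60.0, 63.0, "Back to the studio."),
]
segments = find_keyword_segments(captions, ["goal"])
```

With these sample captions, the first returned segment comes from pre-game commentary rather than from an actual event, which illustrates how a keyword uttered in general commentary can select video that contains no corresponding meaningful images.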
What is desired, therefore, is a method of creating meaningful, event-driven video summaries that minimizes the need for human intervention in the summarizing process.