The present invention relates to digital video content analysis and more particularly to a system for summarizing digital video sequences as a series of representative key frames.
The increasing availability and use of video have created a need for video summaries and abstractions to aid users in effective and efficient browsing of potentially thousands of hours of video. Automation of video content analysis and extraction of key representative content to create summaries has increased in significance as video has evolved from an analog to a digital format. Digital television, digital video libraries, and the Internet are applications where an appliance that can "view" the video and automatically summarize its content might be useful.
Generally, a sequence of video includes a series of scenes. Each scene, in turn, includes a series of adjoining video "shots." A shot is a relatively homogeneous series of individual frames produced by a single camera focusing on an object or objects of interest belonging to the same scene. Generally, automated video content analysis and extraction involve "viewing" the video sequence, dividing the sequence into a series of shots, and selecting one or more "key frames" from each of the shots to represent the content of the shot. A summary of the video sequence results when the series of key frames is displayed. The summary of the video will best represent the video sequence if the frames which are most representative of the content of each shot are selected as key frames for inclusion in the summary. Creation of a hierarchy of summaries, including a greater or lesser number of key frames from each shot, is also desirable to satisfy the differing needs of users of the video.
The first step in the summarization process has been the division of the video into a series of shots of relatively homogeneous content. Video shot transitions can be characterized by anything from abrupt transitions occurring between two consecutive frames (cuts) to more gradual transitions, such as "fades," "dissolves," and "wipes." One technique for detecting the boundaries of a shot involves counting either the number of pixels or the number of predefined areas of an image that change in value by more than a predefined threshold in a subsequent frame. When the total number of pixels or areas satisfying this first criterion exceeds a second predefined threshold, a shot boundary is declared. Statistical measures of the values of pixels in pre-specified areas of the frame have also been utilized for shot boundary detection. Pixel difference techniques can be sensitive to camera and object motion. Statistical techniques tend to be relatively slow due to the complexity of computing the statistical formulas.
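The pixel-counting technique described above can be illustrated with a minimal Python sketch. It assumes each frame is a list of rows of gray-level integers; the function names and the parameters `pixel_threshold` and `count_threshold` are hypothetical and chosen only for illustration.

```python
def count_changed_pixels(prev_frame, curr_frame, pixel_threshold):
    """Count pixels whose absolute change in value exceeds pixel_threshold."""
    return sum(
        1
        for prev_row, curr_row in zip(prev_frame, curr_frame)
        for a, b in zip(prev_row, curr_row)
        if abs(a - b) > pixel_threshold
    )


def detect_cuts_by_pixel_count(frames, pixel_threshold, count_threshold):
    """Return each index i where a boundary is declared between frames i-1 and i.

    A boundary is declared when the number of changed pixels (first
    criterion) exceeds count_threshold (second criterion).
    """
    boundaries = []
    for i in range(1, len(frames)):
        changed = count_changed_pixels(frames[i - 1], frames[i], pixel_threshold)
        if changed > count_threshold:
            boundaries.append(i)
    return boundaries


# Illustrative data (assumed): three 2x2 gray-level frames.
frames = [
    [[10, 10], [10, 10]],
    [[12, 10], [10, 11]],      # small changes only
    [[200, 200], [200, 10]],   # three pixels change sharply
]
cuts = detect_cuts_by_pixel_count(frames, pixel_threshold=20, count_threshold=2)
```

Because every pixel is compared position by position, a camera pan or a moving object can change many pixel values without any change of shot, which is the sensitivity to motion noted above.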
Histograms and histogram related statistics are the most common image representations used in shot boundary detection. Gray level histograms, color histograms, or histogram related statistics can be compared for successive frames. If the difference exceeds a predefined threshold, a shot boundary is detected. A second threshold test may also be included to detect the more gradual forms of shot transition.
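A histogram comparison with two threshold tests might be sketched as follows. The difference measure (sum of absolute bin differences), the accumulation rule used for gradual transitions, and all names and parameters are assumptions made for illustration; the techniques referred to above may differ in these details.

```python
def gray_histogram(frame, bins=4, max_value=256):
    """Gray-level histogram of a frame given as rows of ints in [0, max_value)."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value * bins // max_value] += 1
    return hist


def histogram_difference(h1, h2):
    """Sum of absolute bin-wise differences (one of several possible measures)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_boundaries(frames, cut_threshold, gradual_threshold):
    """Return (index, kind) pairs, where kind is 'cut' or 'gradual'.

    Assumed rule: a difference above cut_threshold is a cut. Consecutive
    differences above the lower gradual_threshold are accumulated, and a
    gradual transition is declared at the frame where accumulation started
    once the accumulated difference exceeds cut_threshold.
    """
    if not frames:
        return []
    results = []
    accumulated, start = 0, None
    prev = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        curr = gray_histogram(frames[i])
        d = histogram_difference(prev, curr)
        if d > cut_threshold:
            results.append((i, "cut"))
            accumulated, start = 0, None
        elif d > gradual_threshold:
            if start is None:
                start = i
            accumulated += d
            if accumulated > cut_threshold:
                results.append((start, "gradual"))
                accumulated, start = 0, None
        else:
            accumulated, start = 0, None
        prev = curr
    return results


# Illustrative data (assumed): 1x4 frames with a cut, then a slow change.
frames = [
    [[0, 0, 0, 0]],
    [[0, 0, 0, 0]],
    [[200, 200, 200, 200]],
    [[200, 200, 200, 130]],
    [[200, 200, 130, 130]],
    [[200, 130, 130, 130]],
    [[130, 130, 130, 130]],
]
found = detect_boundaries(frames, cut_threshold=6, gradual_threshold=1)
```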
Selecting one or more key frames which best represent the relatively homogeneous frames of a shot has been more problematic than defining shot boundaries. Lagendijk et al. in a paper entitled VISUAL SEARCH IN A SMASH SYSTEM, Proceedings of the International Conference on Image Processing, pages 671-674, 1996, describe a process in which shot boundaries are determined by monitoring cumulative image histogram differences over time. The frames of each shot are temporally divided into groups reflecting the pre-specified number of key frames to be extracted from each shot. The frame at the middle of each group of frames is then selected as the key frame for that group. The selection of a key frame is arbitrary and may not represent the most "important" or "meaningful" frame of the group. Also, this process must be performed "off-line" with storage of the entire video for "review" and establishment of shot boundaries, followed by temporal segmentation of shots and then extraction of key frames. For key frame extraction, the stored video must be loaded into a processing buffer so that the group of frames and associated key frames can be calculated. The size of a shot is limited by the size of the processing buffer.
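The temporal grouping step described above, in which the middle frame of each group is taken as its key frame, can be sketched as follows. This is a minimal illustration, not the implementation of Lagendijk et al.; the function name, the half-open index convention, and the handling of uneven group sizes are assumptions.

```python
def middle_key_frames(shot_start, shot_end, num_key_frames):
    """Split frames [shot_start, shot_end) into equal temporal groups and
    return the index of the middle frame of each group."""
    length = shot_end - shot_start
    if length <= 0 or num_key_frames <= 0:
        return []
    groups = min(num_key_frames, length)  # never more groups than frames
    keys = []
    for g in range(groups):
        group_start = shot_start + g * length // groups
        group_end = shot_start + (g + 1) * length // groups  # exclusive
        keys.append((group_start + group_end - 1) // 2)
    return keys


# A 12-frame shot summarized by 3 key frames.
keys = middle_key_frames(0, 12, 3)
```

The selection depends only on frame position and never on frame content, which is why the chosen frame may not be the most meaningful one in its group.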
In the copending application of Ratakonda, Ser. No. 08/994,558, filed Dec. 19, 1997, shot boundaries are determined by monitoring variations in the differences in image histograms over time. Individual shots are further partitioned into segments which represent highly homogeneous groups of frames. The partitioning of shots into segments is achieved through an iterative optimization process. For each video segment, the frame differing most from the key frame of the prior segment is selected as the next key frame of the summary. A key frame is selected on the basis of the frame's difference from the prior key frame and not on the basis of its representation of the other frames belonging to the segment. Like the technique proposed by Lagendijk, this technique must be performed off-line and an entire video shot must be stored for review, segment partitioning, and key frame selection. Additional memory is required to store the prior key frame for comparison.
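The selection rule described above, in which each segment contributes the frame that differs most from the key frame of the prior segment, might look like the following sketch. It assumes the segments have already been determined, represents each frame by a single feature value, and takes the first frame of the first segment as the initial key frame; none of these details is taken from the cited application.

```python
def select_key_frames(features, segments, difference):
    """Return one key frame index per segment.

    features:   per-frame feature values (for example, histograms)
    segments:   list of (start, end) index pairs, end exclusive
    difference: function returning a non-negative distance between features
    """
    keys = []
    for start, end in segments:
        if not keys:
            # Assumption: no prior key frame exists, so use the first frame.
            keys.append(start)
            continue
        prior = features[keys[-1]]
        best = max(range(start, end), key=lambda i: difference(features[i], prior))
        keys.append(best)
    return keys


# Illustrative data (assumed): scalar features and three segments.
features = [0, 1, 2, 10, 12, 11, 3, 4]
segments = [(0, 3), (3, 6), (6, 8)]
keys = select_key_frames(features, segments, lambda a, b: abs(a - b))
```

In this example frame 4 is chosen for the second segment because it is farthest from frame 0, not because it best represents frames 3 through 5.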
Zhang et al., U.S. Pat. No. 5,635,982, disclose a method in which the difference between frames is monitored and accumulated. When the accumulated difference exceeds a predefined threshold, a potential key frame is detected. The potential key frame is designated as a key frame if, in addition, the difference between the potential key frame and the previous key frame exceeds a preset threshold. Without additional processing, the locations of key frames always coincide with the beginning of a new shot.
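The two-threshold test described above can be sketched as follows. The use of the first frame as the initial key frame, the reset of the accumulator after each potential key frame, and all names and parameters are assumptions for illustration and are not taken from the cited patent.

```python
def accumulate_key_frames(features, difference, accum_threshold, key_threshold):
    """Return key frame indices using an accumulated-difference test.

    A potential key frame is detected when the accumulated frame-to-frame
    difference exceeds accum_threshold. It is designated a key frame only
    if it also differs from the previous key frame by more than
    key_threshold.
    """
    if not features:
        return []
    keys = [0]        # assumption: the first frame seeds the summary
    accumulated = 0
    for i in range(1, len(features)):
        accumulated += difference(features[i - 1], features[i])
        if accumulated > accum_threshold:
            if difference(features[i], features[keys[-1]]) > key_threshold:
                keys.append(i)
            accumulated = 0  # assumption: restart accumulation
    return keys


def abs_diff(a, b):
    """Distance between two scalar features."""
    return abs(a - b)


# Steadily drifting content yields periodic key frames.
drifting = accumulate_key_frames([0, 1, 2, 3, 4, 5, 6, 5, 4], abs_diff, 2, 2)

# Oscillating content accumulates difference but returns to the previous
# key frame's appearance, so the potential key frames are rejected.
oscillating = accumulate_key_frames([0, 2, 0, 2, 0], abs_diff, 3, 1)
```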
Smith et al., in a paper entitled VIDEO SKIMMING AND CHARACTERIZATION THROUGH THE COMBINATION OF IMAGE AND LANGUAGE UNDERSTANDING TECHNIQUES, and Mauldin et al., U.S. Pat. No. 5,664,227, disclose an elaborate key frame identification technique based on context rules related to repetitiveness, degrees of motion, and audio and video content. The key frame sequences can be used to provide compact summaries of video sequences, but the method is complex and does not support creation of hierarchical video summaries.
What is desired is a technique of automated video content analysis and key frame extraction which selects key frames that are the most representative frames of each shot or segment of the video sequence. Simple implementation, conservation of computational resources, and the ability to accept a variety of inputs are desirable characteristics of such a technique. It is desired that the technique provide for content analysis and key frame extraction both "on-line" (in real time), without the need to store the entire video sequence, and "off-line." Further, a technique of conveniently creating a hierarchy of summaries, each successively containing a smaller number of the most representative frames, is desired.