The present invention relates to digital video content analysis and more particularly to a system for summarizing digital video sequences as a series of representative key frames.
The increasing availability and use of video have created a need for video summaries and abstractions to aid users in effective and efficient browsing of potentially thousands of hours of video. Automation of video content analysis and extraction of key representative content to create summaries has increased in significance as video has evolved from an analog to a digital format. Digital television, digital video libraries, and the Internet are applications where an appliance that can “view” the video and automatically summarize its content might be useful.
Generally, a sequence of video includes a series of scenes. Each scene, in turn, includes a series of adjoining video “shots” or segments. A shot or segment is a relatively homogeneous series of individual frames produced by a single camera focusing on an object or objects of interest belonging to the same scene. Generally, automated video content analysis and extraction involve “viewing” the video sequence, dividing the sequence into a series of shots, and selecting one or more “key frames” from each of the shots to represent the content of the shot. A summary of the video sequence results when the series of key frames is displayed. The summary of the video will best represent the video sequence if the frames which are most representative of the content of each shot are selected as key frames for inclusion in the summary. Creation of a hierarchy of summaries, including a greater or lesser number of key frames from each shot, is also desirable to satisfy the differing needs of users of the video.
The first step in the summarization process has been the division of the video into a series of shots or segments of relatively homogeneous content. Video shot transitions can be characterized by anything from abrupt transitions occurring between two consecutive frames (cuts) to more gradual transitions, such as “fades,” “dissolves,” and “wipes.” One technique for detecting the boundaries of a shot involves counting either the number of pixels or the number of predefined areas of an image that change in value by more than a predefined threshold in a subsequent frame. When either the total number of pixels or areas satisfying this first criterion exceeds a second predefined threshold, a shot boundary is declared. Statistical measures of the values of pixels in pre-specified areas of the frame have also been utilized for shot boundary detection. Pixel difference techniques can be sensitive to camera and object motion. Statistical techniques tend to be relatively slow due to the complexity of computing the statistical formulas.
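The two-threshold pixel counting test described above can be sketched as follows. This is only an illustrative sketch: frames are assumed to be flat lists of gray-level values of equal length, and the names `count_changed_pixels`, `find_cuts_by_pixel_count`, `pixel_threshold`, and `count_threshold` are hypothetical and do not come from any cited technique.

```python
def count_changed_pixels(prev_frame, next_frame, pixel_threshold):
    """Count pixels whose value changes by more than pixel_threshold."""
    return sum(
        1
        for a, b in zip(prev_frame, next_frame)
        if abs(a - b) > pixel_threshold
    )


def find_cuts_by_pixel_count(frames, pixel_threshold, count_threshold):
    """Return each index i where a boundary is declared between frames i-1 and i."""
    boundaries = []
    for i in range(1, len(frames)):
        changed = count_changed_pixels(frames[i - 1], frames[i], pixel_threshold)
        # Second threshold: enough pixels must have changed to declare a boundary.
        if changed > count_threshold:
            boundaries.append(i)
    return boundaries


# Four tiny 4-pixel "frames" (assumed data); the content changes abruptly at index 2.
frames = [
    [10, 10, 10, 10],
    [12, 9, 10, 11],
    [200, 190, 180, 10],
    [201, 191, 181, 12],
]
cuts = find_cuts_by_pixel_count(frames, pixel_threshold=20, count_threshold=2)
print(cuts)
```

With these assumed values, only the transition into the third frame changes more than two pixels by more than 20 gray levels, so a single boundary is reported at index 2. Because every pixel is compared position by position, a camera pan that shifts the image would also raise the count, which is the motion sensitivity noted above.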
Histograms and histogram related statistics are the most common image representations used in shot boundary detection. Gray level histograms, color histograms, or histogram related statistics can be compared for successive frames. If the difference exceeds a predefined threshold, a shot boundary is detected. A second threshold test may also be included to detect the more gradual forms of shot transition.
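A histogram comparison of successive frames might look like the sketch below. The bin count, the sum of absolute bin differences as the difference measure, and the names `gray_histogram`, `histogram_difference`, and `find_boundaries_by_histogram` are assumptions chosen for illustration; the techniques referred to above may use other histogram statistics.

```python
def gray_histogram(frame, bins=4, max_value=256):
    """Build a gray-level histogram with equal-width bins over [0, max_value)."""
    hist = [0] * bins
    bin_width = max_value / bins
    for value in frame:
        index = min(int(value / bin_width), bins - 1)
        hist[index] += 1
    return hist


def histogram_difference(h1, h2):
    """Sum of absolute bin-by-bin differences (an assumed difference measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def find_boundaries_by_histogram(frames, threshold, bins=4):
    """Return each index i where the histogram difference between frames i-1 and i exceeds threshold."""
    hists = [gray_histogram(frame, bins) for frame in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_difference(hists[i - 1], hists[i]) > threshold
    ]


# Assumed data: the second frame rearranges the pixels of the first,
# while the third frame has entirely different gray levels.
frames = [
    [10, 20, 30, 40],
    [40, 30, 20, 10],
    [200, 210, 220, 230],
]
boundaries = find_boundaries_by_histogram(frames, threshold=4)
print(boundaries)
```

In this example the rearranged second frame produces the same histogram as the first, so no boundary is reported there; only the change in gray-level distribution at index 2 exceeds the threshold.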
Selecting one or more key frames which best represent the relatively homogeneous frames of a shot has been more problematic than defining shot boundaries. Lagendijk et al. in a paper entitled VISUAL SEARCH IN A SMASH SYSTEM, Proceedings of the International Conference on Image Processing, pages 671-674, 1996, describe a process in which shot boundaries are determined by monitoring cumulative image histogram differences over time. The frames of each shot are temporally divided into groups reflecting the pre-specified number of key frames to be extracted from each shot. The frame at the middle of each group of frames is then selected as the key frame for that group. The selection of a key frame is arbitrary and may not represent the most “important” or “meaningful” frame of the group. Also, this process must be performed “off-line” with storage of the entire video for “review” and establishment of shot boundaries, followed by temporal segmentation of shots and then extraction of key frames. For key frame extraction, the stored video must be loaded into a processing buffer so that the group of frames and associated key frames can be calculated. The size of a shot is limited by the size of the processing buffer.
In the copending application of Ratakonda, Ser. No. 08/994,558, filed Dec. 19, 1997, shot boundaries are determined by monitoring variations in the differences in image histograms over time. Individual shots are further partitioned into segments which represent highly homogeneous groups of frames. The partitioning of shots into segments is achieved through an iterative optimization process. For each video segment, the frame differing most from the key frame of the prior segment is selected as the next key frame of the summary. A key frame is selected on the basis of the frame's difference from the prior key frame and not on the basis of its representation of the other frames belonging to the segment. Like the technique proposed by Lagendijk, this technique must be performed off-line and an entire video shot must be stored for review, segment partitioning, and key frame selection. Additional memory is required to store the prior key frame for comparison.
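The temporal grouping approach can be illustrated with a short sketch. The way the groups are sized here (as nearly equal as integer division allows) and the function name `middle_key_frames` are assumptions made for illustration; the cited paper may divide a shot differently.

```python
def middle_key_frames(first_frame, last_frame, num_key_frames):
    """Split frames first_frame..last_frame (inclusive) into contiguous groups
    and return the index of the middle frame of each group."""
    total = last_frame - first_frame + 1
    if num_key_frames < 1 or num_key_frames > total:
        raise ValueError("num_key_frames must be between 1 and the shot length")
    key_frames = []
    for group in range(num_key_frames):
        start = first_frame + (group * total) // num_key_frames
        end = first_frame + ((group + 1) * total) // num_key_frames - 1
        # The middle frame is chosen purely by position, not by content.
        key_frames.append((start + end) // 2)
    return key_frames


# A 90-frame shot (frames 0 through 89) with three key frames requested.
print(middle_key_frames(0, 89, 3))
```

For this shot the groups are frames 0-29, 30-59, and 60-89, giving key frames 14, 44, and 74. The sketch needs both ends of the shot before it can place any key frame, which reflects the off-line requirement, and the choice never looks at frame content, which is why the selection is characterized above as arbitrary.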
Zhang et al., U.S. Pat. No. 5,635,982, disclose a method in which the difference between frames is monitored and accumulated. When the accumulated difference exceeds a predefined threshold, a potential key frame is detected. The potential key frame is designated as a key frame if, in addition, the difference between the potential key frame and the previous key frame exceeds a preset threshold. Without additional processing, the locations of key frames always coincide with the beginning of a new shot.
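The accumulate-then-confirm test described in the preceding paragraph can be sketched as below. This is a loose illustration of that two-threshold idea only, not the patented method: the frame difference measure, the use of the first frame as the initial key frame, the reset of the accumulator after each candidate, and all names are assumptions.

```python
def frame_difference(a, b):
    """Sum of absolute differences between two frame descriptors (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_key_frames(frames, accumulate_threshold, key_threshold):
    """Return indices of frames designated as key frames."""
    key_frames = [0]  # assumption: the first frame seeds the list of key frames
    accumulated = 0
    for i in range(1, len(frames)):
        accumulated += frame_difference(frames[i - 1], frames[i])
        if accumulated > accumulate_threshold:
            # Frame i is a potential key frame; confirm it against the previous key frame.
            if frame_difference(frames[key_frames[-1]], frames[i]) > key_threshold:
                key_frames.append(i)
            accumulated = 0  # assumption: restart accumulation after each candidate
    return key_frames


# One-value descriptors (assumed data) with slow drift and then a jump at index 4.
frames = [[0], [1], [2], [3], [10], [11], [10]]
print(select_key_frames(frames, accumulate_threshold=2, key_threshold=5))
```

Here frame 3 becomes a candidate because the accumulated drift exceeds 2, but it is rejected because it differs from the previous key frame by only 3. Frame 4 passes both tests, so the result is `[0, 4]`.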
Smith et al., in a paper entitled VIDEO SKIMMING AND CHARACTERIZATION THROUGH THE COMBINATION OF IMAGE AND LANGUAGE UNDERSTANDING TECHNIQUES, and Mauldin et al., U.S. Pat. No. 5,664,227, disclose an elaborate key frame identification technique based on context rules related to repetitiveness, degrees of motion, and audio and video content. The key frame sequences can be used to provide compact summaries of video sequences, but the method is complex and does not support creation of hierarchical video summaries.
What is desired is a technique of automated video content analysis and key frame extraction which selects key frames that are the most representative frames of each shot or segment of the video sequence. Simple implementation, conservation of computational resources, and the ability to accept a variety of inputs are desirable characteristics of such a technique. It is desired that the technique provide for content analysis and key frame extraction both “on-line (in real time),” without the need to store the entire video sequence, and “off-line.” Further, a technique of conveniently creating a hierarchy of summaries, each successively containing a smaller number of the most representative frames, is desired.
The present invention overcomes the aforementioned drawbacks of the prior art by providing a method and apparatus for digital video content analysis and extraction based on analysis of feature vectors corresponding to the frames of a video sequence. In the first embodiment of the invention, a method is provided for identifying a key video frame within a segment of video having frames of relatively homogeneous content including the steps of characterizing each video frame as a feature vector; identifying a key feature vector that minimizes the distortion of the group of feature vectors; and identifying the video frame corresponding to the key feature vector as the key video frame. Key frames selected by the method of this first embodiment of the present invention are the frames which are the most representative of the content of the set of frames in each shot of a sequence.
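As a rough illustration of the first embodiment, the sketch below picks the feature vector that minimizes the distortion of the group. The distortion measure is not defined in this passage, so the sketch assumes it is the sum of squared Euclidean distances from a candidate vector to every vector in the segment; that measure and the names `squared_distance`, `distortion_about`, and `key_frame_index` are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distortion_about(candidate, vectors):
    """Assumed distortion: total squared distance from candidate to all vectors."""
    return sum(squared_distance(candidate, v) for v in vectors)


def key_frame_index(vectors):
    """Return the index of the vector that minimizes the assumed distortion."""
    return min(
        range(len(vectors)),
        key=lambda i: distortion_about(vectors[i], vectors),
    )


# Assumed 2-D feature vectors for a four-frame segment; the last frame is an outlier.
segment = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 0.0]]
print(key_frame_index(segment))
```

Under this assumed measure the candidate distortions are 86, 66, 54, and 194, so the frame at index 2 is selected. The chosen key frame is always an actual frame of the segment rather than a synthetic average.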
In the second embodiment, a method is provided for determining a second boundary of a video segment within a video sequence comprising the steps of defining a threshold distortion; locating a first frame in the video segment; defining a first feature vector representative of the first frame; including the first feature vector in a set of segment feature vectors; defining a next feature vector representative of a subsequent video frame; including the next feature vector in the set of segment feature vectors; calculating the distortion of the set of segment feature vectors resulting from including the next feature vector in the set; and comparing the distortion of the set of segment feature vectors with the threshold distortion. The steps of characterizing subsequent frames as feature vectors, adding feature vectors to the set, calculating the distortion, and comparing the distortion with the threshold are repeated until the distortion of the set of segment feature vectors has achieved some predefined relationship to the threshold distortion, thereby defining the second boundary of the segment. Prior receipt and storage of the entire video sequence are not required for the segmentation process. Key frames can be identified simultaneously with segmentation of the video by applying the methods of both the first and second embodiments.
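The incremental segmentation of the second embodiment might be sketched as follows. Several points are assumptions made for illustration: the distortion of a set is taken to be the smallest total squared distance from any member to all members, the predefined relationship is taken to be "exceeds the threshold," the boundary is placed just before the frame that caused the excess, and the names `set_distortion` and `segment_stream` are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def set_distortion(vectors):
    """Assumed set distortion: smallest total squared distance from one member to all members."""
    return min(
        sum(squared_distance(candidate, v) for v in vectors)
        for candidate in vectors
    )


def segment_stream(feature_vectors, threshold):
    """Yield (first_index, last_index) for each segment as feature vectors arrive."""
    current = []  # feature vectors of the segment being built
    start = 0
    for index, vector in enumerate(feature_vectors):
        current.append(vector)
        if len(current) > 1 and set_distortion(current) > threshold:
            # Including this vector pushed the distortion past the threshold,
            # so the segment ends at the previous frame and a new one starts here.
            yield (start, index - 1)
            current = [vector]
            start = index
    if current:
        yield (start, start + len(current) - 1)


# Assumed 1-D feature vectors: three similar frames followed by three different ones.
stream = [[0.0], [1.0], [2.0], [10.0], [11.0], [10.5]]
segments = list(segment_stream(iter(stream), threshold=5.0))
print(segments)
```

With these assumed values the output is `[(0, 2), (3, 5)]`. The function consumes an iterator and keeps only the vectors of the segment in progress, which mirrors the statement that prior receipt and storage of the entire sequence are not required.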
In the third embodiment of the present invention, a method is provided for creating summaries of video sequences including more than one key frame from each segment comprising the steps of dividing the video frames of the sequence into at least one video segment of relatively homogeneous content including at least one video frame; defining a feature vector representative of each of the video frames; ranking the feature vectors representing the frames included in each video segment according to the relative distortion produced in the set of feature vectors representing the segment by each feature vector included in the set; and including in the summary of the sequence, video frames represented by the feature vectors producing relative distortion of specified ranks. Utilizing the method of this third embodiment, a hierarchy of key frames can be identified from which hierarchical summaries of a video sequence can be created, with each successive summary including a greater number of the most representative frames from each segment.
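The ranking step of the third embodiment can be illustrated with the sketch below. As before, the distortion measure (total squared distance from a vector to every vector in the segment) is an assumption, as are the rule of taking the lowest-distortion ranks for each summary level and the names `rank_by_distortion` and `hierarchical_summaries`.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rank_by_distortion(vectors):
    """Return frame indices ordered from least to greatest assumed distortion."""
    return sorted(
        range(len(vectors)),
        key=lambda i: sum(squared_distance(vectors[i], v) for v in vectors),
    )


def hierarchical_summaries(vectors, levels):
    """For each requested size n, return the n best-ranked frames in temporal order."""
    ranking = rank_by_distortion(vectors)
    return [sorted(ranking[:n]) for n in levels]


# Assumed 2-D feature vectors for a four-frame segment.
segment = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 0.0]]
print(rank_by_distortion(segment))
print(hierarchical_summaries(segment, levels=(1, 2, 3)))
```

For this segment the ranking is `[2, 1, 0, 3]`, so the summaries of size one, two, and three are `[2]`, `[1, 2]`, and `[0, 1, 2]`. Each coarser summary is a subset of the next finer one, so a single ranking per segment is enough to produce every level of the hierarchy.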
The foregoing and other objectives, features and advantages of the invention will be more readily understood upon consideration of the following detailed description of the invention, taken in conjunction with the accompanying drawings.