This invention relates generally to videos, and more particularly to summarizing a compressed video.
It is desired to automatically generate a summary of a video, and more particularly, to generate the summary from a compressed digital video.
Compressed Video Formats
Basic standards for compressing a video as a digital signal have been adopted by the Moving Picture Experts Group (MPEG). The MPEG standards achieve high data compression rates by developing information for a full frame of the image only every so often. The full image frames, i.e., intra-coded frames, are often referred to as "I-frames" or "anchor frames," and contain full frame information independent of any other frames. Image difference frames, i.e., inter-coded frames, are often referred to as "B-frames" and "P-frames," or as "predictive frames," and are encoded between the I-frames and reflect only image differences, i.e., residues, with respect to the reference frame.
Typically, each frame of a video sequence is partitioned into smaller blocks of picture element, i.e., pixel, data. Each block is subjected to a discrete cosine transformation (DCT) function to convert the statistically dependent spatial domain pixels into independent frequency domain DCT coefficients. Respective 8×8 or 16×16 blocks of pixels, referred to as "macro-blocks," are subjected to the DCT function to provide the coded signal.
The DCT coefficients are usually energy concentrated so that only a few of the coefficients in a macro-block contain the main part of the picture information. For example, if a macro-block contains an edge boundary of an object, then the energy in that block, after transformation, as represented by the DCT coefficients, includes a relatively large DC coefficient and randomly distributed AC coefficients throughout the matrix of coefficients.
A non-edge macro-block, on the other hand, is usually characterized by a similarly large DC coefficient and a few adjacent AC coefficients which are substantially larger than other coefficients associated with that block. The DCT coefficients are typically subjected to adaptive quantization, and then are run-length and variable-length encoded. Thus, the macro-blocks of transmitted data typically include fewer than an 8×8 matrix of codewords.
The macro-blocks of inter-coded frame data, i.e., encoded P or B frame data, include DCT coefficients which represent only the differences between the predicted pixels and the actual pixels in the macro-block. Macro-blocks of intra-coded and inter-coded frame data also include information such as the level of quantization employed, a macro-block address or location indicator, and a macro-block type. The latter information is often referred to as "header" or "overhead" information.
Each P-frame is predicted from the most recently occurring I- or P-frame. Each B-frame is predicted from the I- or P-frames between which it is disposed. The predictive coding process involves generating displacement vectors, often referred to as "motion vectors," which indicate the magnitude of the displacement to the macro-block of an I-frame that most closely matches the macro-block of the B- or P-frame currently being coded. The pixel data of the matched block in the I-frame is subtracted, on a pixel-by-pixel basis, from the block of the P- or B-frame being encoded, to develop the residues. The transformed residues and the vectors form part of the encoded data for the P- and B-frames.
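As a rough illustration of the energy concentration described above, the following sketch computes a two-dimensional DCT of a single block. It assumes an orthonormal DCT-II and an 8×8 block size; the function name dct_2d is hypothetical and is not taken from any MPEG specification, and quantization and entropy coding are omitted.

```python
import math

N = 8  # block size assumed for this sketch


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        # Normalization factor that makes the transform orthonormal.
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


# A perfectly uniform block: all of its energy lands in the DC coefficient.
flat_block = [[128] * N for _ in range(N)]
coeffs = dct_2d(flat_block)
```

For the uniform block of value 128, the DC coefficient coeffs[0][0] comes out at 1024 and every AC coefficient is numerically zero, which is the extreme case of the energy concentration that makes run-length encoding effective.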
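The predictive coding step can be sketched as an exhaustive block-matching search followed by a pixel-by-pixel subtraction. This is only a minimal sketch under stated assumptions: frames are plain lists of rows of luminance values, the matching cost is the sum of absolute differences, and the names sad, best_motion_vector, and residue are hypothetical. Real encoders use faster search strategies and sub-pixel accuracy.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a reference block and a current block."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))


def best_motion_vector(ref, cur, cx, cy, size, search):
    """Exhaustively search ref for the block that best matches cur at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def residue(ref, cur, cx, cy, mv, size):
    """Current block minus the matched reference block, pixel by pixel."""
    dx, dy = mv
    return [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
             for i in range(size)] for j in range(size)]


# Toy 8x8 frames: a 2x2 patch moves two pixels to the right between frames.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
patch = [[10, 20], [30, 40]]
for j in range(2):
    for i in range(2):
        ref[3 + j][2 + i] = patch[j][i]
        cur[3 + j][4 + i] = patch[j][i]

mv = best_motion_vector(ref, cur, cx=4, cy=3, size=2, search=2)
res = residue(ref, cur, 4, 3, mv, 2)
```

In this toy case the vector points two pixels to the left in the reference frame and the residue is all zeros, so only the vector would need to be coded for that block.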
Video Analysis
Video analysis can be defined as processing a video with the intention of understanding the content of a video. The understanding of a video can range from a "low-level" syntactic understanding to a "high-level" semantic understanding.
The low-level understanding can be achieved by analyzing low-level features, such as color, motion, texture, shape, and the like. The low-level features can be used to partition the video into "shots." Herein, a shot is defined as a sequence of frames that begins when the camera is turned on and lasts until the camera is turned off. Typically, the sequence of frames in a shot captures a single "scene." The low-level features can be used to generate descriptors. The descriptors can then be used to index the video, e.g., an index of each shot in the video and perhaps its length.
A semantic understanding of the video is concerned with the genre of the content, and not its syntactic structure. For example, high-level features express whether a video is an action video, a music video, a "talking head" video, or the like.
Video Summarization
Video summarization can be defined as generating a compact representation of a video that still conveys the semantic essence of the video. The compact representation can include "key" frames or "key" segments, or a combination of key frames and segments. As an example, a video summary of a tennis match can include two frames, the first frame capturing both of the players, and the second frame capturing the winner with the trophy. A more detailed and longer summary could further include all frames that capture the match point. While it is certainly possible to generate such a summary manually, this is tedious and costly. Automatic summarization is therefore desired.
Automatic video summarization methods are well known, see S. Pfeiffer et al. in "Abstracting Digital Movies Automatically," J. Visual Comm. Image Representation, vol. 7, no. 4, pp. 345-353, December 1996, and Hanjalic et al. in "An Integrated Scheme for Automated Video Abstraction Based on Unsupervised Cluster-Validity Analysis," IEEE Trans. On Circuits and Systems for Video Technology, Vol. 9, No. 8, December 1999.
Most known video summarization methods focus exclusively on color-based summarization. Only Pfeiffer et al. have used motion, in combination with other features, to generate video summaries. However, their approach merely uses a weighted combination that overlooks possible correlation between the combined features. Some summarization methods also use motion features to extract key frames.
As shown in FIG. 1, prior art video summarization methods have mostly emphasized clustering based on color features, because color features are easy to extract and robust to noise. A typical method takes a video A 101 as input, and applies a color-based summarization process 100 to produce a video summary S(A) 102. The video summary consists of either a single summary of the entire video, or a set of interesting frames.
The method 100 typically includes the following steps. First, cluster the frames of the video according to color features. Second, arrange the clusters in an easy-to-access hierarchical data structure. Third, extract a key frame or a key sequence of frames from each of the clusters to generate the summary.
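The first and third steps above can be sketched as follows. This is a simplified stand-in, not the method of any cited reference: each frame is assumed to be represented by a normalized color histogram, clustering is a simple leader-style pass with a hypothetical L1 distance threshold, the hierarchical arrangement of the second step is omitted, and the key frame is taken to be the member closest to its cluster centroid.

```python
def l1_distance(a, b):
    """L1 distance between two color histograms of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_by_color(histograms, threshold):
    """Assign each frame to the first cluster whose leader is within threshold."""
    clusters = []  # each cluster is a list of frame indices; first entry is the leader
    for idx, hist in enumerate(histograms):
        for members in clusters:
            if l1_distance(hist, histograms[members[0]]) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])  # no close leader found: start a new cluster
    return clusters


def key_frame(histograms, members):
    """Pick the member whose histogram is closest to the cluster centroid."""
    bins = len(histograms[members[0]])
    centroid = [sum(histograms[i][b] for i in members) / len(members)
                for b in range(bins)]
    return min(members, key=lambda i: l1_distance(histograms[i], centroid))


def color_summary(histograms, threshold):
    """One key frame index per color cluster, in temporal order."""
    return sorted(key_frame(histograms, m)
                  for m in cluster_by_color(histograms, threshold))


# Six toy frames with three-bin histograms: two visually distinct groups.
histograms = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
    [0.0, 0.8, 0.2], [0.8, 0.2, 0.0], [0.0, 0.9, 0.1],
]
clusters = cluster_by_color(histograms, threshold=0.5)
summary = color_summary(histograms, threshold=0.5)
```

Here the six frames fall into two clusters, and the summary holds one representative frame index from each.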
Motion Activity Descriptor
A video can also be intuitively perceived as having various levels of activity or intensity of action. An example of a relatively high level of activity is a scoring opportunity in a sporting event video. A news reader video, on the other hand, has a relatively low level of activity. The recently proposed MPEG-7 video standard provides for a descriptor related to the motion activity in a video.
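One simple way to put a number on the level of activity in a frame is to look at the motion vectors already present in the compressed stream. The sketch below uses the average motion vector magnitude as a proxy; this choice, and the function name motion_activity_intensity, are assumptions made for illustration and are not the normative MPEG-7 descriptor definition.

```python
import math


def motion_activity_intensity(motion_vectors):
    """Average magnitude of a frame's motion vectors (illustrative proxy only)."""
    if not motion_vectors:
        return 0.0  # e.g. an intra-coded frame that carries no motion vectors
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    return sum(magnitudes) / len(magnitudes)


# Hypothetical per-macro-block motion vectors (dx, dy) for two frames.
news_reader_frame = [(0, 0), (1, 0), (0, 1), (0, 0)]
sports_frame = [(3, 4), (6, 8), (-3, -4), (0, 5)]

low = motion_activity_intensity(news_reader_frame)
high = motion_activity_intensity(sports_frame)
```

With these made-up vectors, the news reader frame scores 0.5 and the sports frame scores 6.25, matching the intuition that the latter is far more active.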
It is an objective of the present invention to provide an automatic video summarization method using motion features, specifically motion activity features by themselves and in conjunction with other low-level features such as color and texture features.
The main intuition behind the present invention is based on the following hypotheses. The motion activity of a video is a good indication of the relative difficulty of summarizing the video. The greater the amount of motion, the more difficult it is to summarize the video. A video summary can be quantitatively described by the number of frames it contains, for example, the number of key frames, or the number of frames of key segments.
The relative intensity of motion activity of a video is strongly correlated to changes in color characteristics. In other words, if the intensity of motion activity is high, there is a high likelihood that change in color characteristics is also high. If the change in color characteristics is high, then a color feature based summary will include a relatively large number of frames, and if the change in color characteristics is low, then the summary will contain fewer frames.
For example, a "talking head" video typically has a low level of motion activity and very little change in color as well. If the summarization is based on key frames, then one key frame would suffice to summarize the video. If key segments are used, then a one-second sequence of frames would suffice to visually summarize the video. On the other hand, a scoring opportunity in a sporting event would have very high intensity of motion activity and color change, and would thus take several key frames or several seconds to summarize.
More particularly, the invention provides a method that summarizes a video by first extracting intensity of motion activity from a video. It then uses the intensity of motion activity to segment the video into easy and difficult segments to summarize.
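The segmentation step can be sketched as a threshold applied to a per-frame intensity sequence, with consecutive frames on the same side of the threshold merged into one segment. The threshold value, the labels, and the function name segment_by_activity are hypothetical; the text does not specify how the boundary between easy and difficult is chosen.

```python
def segment_by_activity(intensities, threshold):
    """Split frames into runs labeled easy or difficult.

    Returns a list of (start, end, label) tuples with end exclusive.
    """
    segments = []
    start = 0
    for i in range(1, len(intensities) + 1):
        at_end = i == len(intensities)
        # Close the current run at the end of the video or when the label flips.
        if at_end or (intensities[i] > threshold) != (intensities[start] > threshold):
            label = "difficult" if intensities[start] > threshold else "easy"
            segments.append((start, i, label))
            start = i
    return segments


# Hypothetical per-frame intensity of motion activity.
intensities = [0.2, 0.3, 0.1, 4.0, 5.5, 6.0, 0.4, 0.2]
segments = segment_by_activity(intensities, threshold=1.0)
```

For this toy sequence the result is an easy segment covering frames 0 to 2, a difficult segment covering frames 3 to 5, and a second easy segment covering frames 6 and 7.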
Easy to summarize segments are represented by a single frame, or by selected frames from anywhere in the segment. Any frame will do because there is very little difference between the frames in an easy to summarize segment. A color-based summarization process is used to summarize the difficult to summarize segments. This process extracts sequences of frames from each difficult to summarize segment. The single frames and extracted sequences of frames are combined to form the summary of the video.
The combination can use temporal, spatial, or semantic ordering. In a temporal arrangement, the frames are concatenated in some temporal order, for example, first-to-last or last-to-first. In a spatial arrangement, miniatures of the frames are combined into a mosaic or some array, for example, a rectangular array, so that a single frame shows several miniatures of the selected frames of the summary. A semantically ordered summary might go from most exciting to least exciting.
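How the two kinds of segments feed into one summary can be sketched as below. The segment tuples, the choice of the first frame for an easy segment, and the stand-in every_other_frame (used here in place of a real color-based summarization process) are all assumptions made for illustration.

```python
def summarize(segments, summarize_difficult):
    """Build a list of frame indices from labeled (start, end, label) segments."""
    summary = []
    for start, end, label in segments:
        if label == "easy":
            # Any frame will do for an easy segment; this sketch takes the first.
            summary.append(start)
        else:
            # Delegate difficult segments to a color-based process.
            summary.extend(summarize_difficult(start, end))
    return summary


def every_other_frame(start, end):
    """Hypothetical placeholder for a color-based summarization process."""
    return list(range(start, end, 2))


segments = [(0, 3, "easy"), (3, 6, "difficult"), (6, 8, "easy")]
summary = summarize(segments, every_other_frame)
```

Passing the difficult-segment summarizer in as a function keeps the motion-based segmentation independent of whichever color-based process is plugged in.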
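The three arrangements can be sketched on a list of selected frame indices. The helper names are hypothetical, the mosaic is represented only as a grid of frame indices rather than as rendered miniatures, and the use of motion activity as a measure of how exciting a frame is represents an assumption made for this example.

```python
def temporal_order(frames, last_to_first=False):
    """Arrange frame indices first-to-last, or last-to-first if requested."""
    return sorted(frames, reverse=last_to_first)


def mosaic_layout(frames, columns):
    """Arrange frame indices row by row into a rectangular grid."""
    return [frames[i:i + columns] for i in range(0, len(frames), columns)]


def semantic_order(frames, activity):
    """Arrange frames from highest to lowest activity (assumed excitement proxy)."""
    return sorted(frames, key=lambda f: activity[f], reverse=True)


# Hypothetical summary frames and their motion activity scores.
frames = [30, 5, 12, 48]
activity = {5: 0.2, 12: 4.0, 30: 1.5, 48: 6.1}

first_to_last = temporal_order(frames)
grid = mosaic_layout(first_to_last, columns=2)
by_excitement = semantic_order(frames, activity)
```

The temporal arrangement yields 5, 12, 30, 48; the mosaic places those frames in a 2 by 2 grid; and the semantic arrangement yields 48, 12, 30, 5.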