It is desired to automatically generate summaries of videos, and more particularly, to generate summaries of compressed digital videos.
Compressed Video Formats
Basic standards for compressing a video as a digital signal have been adopted by the Moving Picture Experts Group (MPEG). The MPEG standards achieve high data compression rates by developing information for a full frame of the image only every so often. The full image frames, i.e., intra-coded frames, are often referred to as “I-frames” or “anchor frames,” and contain full frame information independent of any other frames. Image difference frames, i.e., inter-coded frames, are often referred to as “B-frames” and “P-frames,” or as “predictive frames,” and are encoded between the I-frames and reflect only image differences, i.e., residues, with respect to the reference frame.
Typically, each frame of a video sequence is partitioned into smaller blocks of picture elements, i.e., pixel data. Each block is subjected to a discrete cosine transformation (DCT) function to convert the statistically dependent spatial domain pixels into independent frequency domain DCT coefficients. Respective 8×8 or 16×16 blocks of pixels, referred to as “macro-blocks,” are subjected to the DCT function to provide the coded signal.
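The transformation described above can be illustrated with a minimal sketch of an orthonormal two-dimensional DCT applied to a single 8×8 block. The function name `dct_2d` and the direct (unoptimized) summation are assumptions made for illustration only; practical encoders use fast factorizations rather than this form.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(u):
        # Normalization factor that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs

# A perfectly flat block: all energy should land in the DC coefficient.
flat_block = [[128] * 8 for _ in range(8)]
flat_coeffs = dct_2d(flat_block)
print(round(flat_coeffs[0][0], 6))
```

For the flat block, the DC coefficient `flat_coeffs[0][0]` is approximately 1024 and every AC coefficient is approximately zero, which is the energy concentration that makes the subsequent quantization and entropy coding effective.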
The DCT coefficients are usually energy concentrated so that only a few of the coefficients in a macro-block contain the main part of the picture information. For example, if a macro-block contains an edge boundary of an object, then the energy in that block includes a relatively large DC coefficient and randomly distributed AC coefficients throughout the matrix of coefficients.
A non-edge macro-block, on the other hand, is usually characterized by a similarly large DC coefficient and a few adjacent AC coefficients which are substantially larger than other coefficients associated with that block. The DCT coefficients are typically subjected to adaptive quantization, and then are run-length and variable-length encoded. Thus, the macro-blocks of transmitted data typically include fewer than an 8×8 matrix of codewords.
The macro-blocks of inter-coded frame data, i.e., encoded P or B frame data, include DCT coefficients which represent only the differences between the predicted pixels and the actual pixels in the macro-block. Macro-blocks of intra-coded and inter-coded frame data also include information such as the level of quantization employed, a macro-block address or location indicator, and a macro-block type. The latter information is often referred to as “header” or “overhead” information.
Each P-frame is predicted from the last I- or P-frame. Each B-frame is predicted from the I- or P-frames between which it is disposed. The predictive coding process involves generating displacement vectors, often referred to as “motion vectors,” which indicate the magnitude of the displacement to the macro-block of an I-frame that most closely matches the macro-block of the B- or P-frame currently being coded. The pixel data of the matched block in the I-frame is subtracted, on a pixel-by-pixel basis, from the block of the P- or B-frame being encoded, to develop the residues. The transformed residues and the vectors form part of the encoded data for the P- and B-frames.
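The predictive coding process can be sketched as an exhaustive block-matching search followed by a pixel-by-pixel subtraction. The sum-of-absolute-differences cost, the search range, and the names `sad` and `find_motion_vector` are assumptions chosen for illustration; the MPEG standards do not mandate a particular search strategy.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a block of ref and a block of cur."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))

def find_motion_vector(ref, cur, cx, cy, size, search):
    """Return ((dx, dy), residue) for the block of cur whose top-left is (cx, cy).

    (dx, dy) is the displacement from the current block to the best-matching
    block in the reference frame, searched within +/- search pixels.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(ref, cur, rx, ry, cx, cy, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    _, dx, dy = best
    # Residue: current block minus the matched reference block.
    residue = [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
                for i in range(size)] for j in range(size)]
    return (dx, dy), residue

# Toy frames: a 2x2 patch moves one pixel to the right between frames.
ref_frame = [[0] * 8 for _ in range(8)]
cur_frame = [[0] * 8 for _ in range(8)]
patch = [[10, 20], [30, 40]]
for j in range(2):
    for i in range(2):
        ref_frame[3 + j][2 + i] = patch[j][i]
        cur_frame[3 + j][3 + i] = patch[j][i]

vector, residue = find_motion_vector(ref_frame, cur_frame, 3, 3, 2, 2)
print(vector, residue)
```

In this toy case the matched block lies one pixel to the left in the reference frame, so the vector is (-1, 0) and the residue is all zeros. In a real encoder the residue is generally nonzero and is what gets transformed and coded.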
Video Analysis
Video analysis can be defined as processing a video with the intention of understanding the content of a video. The understanding of a video can range from a “low-level” syntactic understanding, such as detecting segment boundaries in the video, to a “high-level” semantic understanding, such as detecting a genre of the video. The low-level understanding can be achieved by analyzing low-level features, such as color, motion, texture, shape, and the like, to generate content descriptions. The content description can then be used to index the video.
Video Summarization
Video summarization generates a compact representation of a video that conveys the semantic essence of the video. The compact representation can include “key-frames” or “key-segments,” or a combination of key-frames and key-segments. As an example, a video summary of a tennis match can include two frames, the first frame capturing both of the players, and the second frame capturing the winner with the trophy. A more detailed and longer summary could further include all frames that capture the match point. While it is certainly possible to generate such a summary manually, this is tedious and costly. Automatic summarization is therefore desired.
Automatic video summarization methods are well known, see S. Pfeiffer et al., “Abstracting Digital Movies Automatically,” J. Visual Comm. Image Representation, vol. 7, no. 4, pp. 345–353, December 1996, and Hanjalic et al., “An Integrated Scheme for Automated Video Abstraction Based on Unsupervised Cluster-Validity Analysis,” IEEE Trans. on Circuits and Systems for Video Technology, Vol. 9, No. 8, December 1999.
Most prior video summarization methods focus almost exclusively on color-based summarization. Only Pfeiffer et al. have used motion, in combination with other features, to generate video summaries. However, their approach merely uses a weighted combination that overlooks possible correlation between the combined features. Some summarization methods also use motion features to extract key-frames.
As shown in FIG. 1, prior art video summarization methods have mostly emphasized clustering based on color features, because color features are easy to extract in the compressed domain and are robust to noise. A typical method takes a video sequence A 101 as input, and applies a color-based summarization process 100 to produce a video summary S(A) 102. The video summary includes either a summary of the entire sequence, or a set of interesting segments of the sequence, or key-frames.
The method 100 typically includes the following steps. First, cluster the frames of the video according to color features. Second, arrange the clusters in an easy-to-access hierarchical data structure. Third, extract a key-frame or a key sequence from each of the clusters to generate the summary.
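The first and third steps can be illustrated with a minimal sketch. The greedy threshold clustering, the L1 histogram distance, the choice of the frame nearest the cluster centroid as key-frame, and the names `l1` and `summarize_by_color` are all assumptions made for illustration; the prior art methods cited above use their own clustering schemes, and the hierarchical data structure of the second step is replaced here by a flat list.

```python
def l1(a, b):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def summarize_by_color(histograms, threshold):
    """Cluster frames by color histogram and pick one key-frame per cluster.

    Each cluster is a list of frame indices whose first entry is its seed.
    A frame joins the first cluster whose seed lies within the threshold.
    """
    clusters = []
    for index, hist in enumerate(histograms):
        for cluster in clusters:
            if l1(histograms[cluster[0]], hist) <= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])  # no cluster was close enough

    key_frames = []
    bins = len(histograms[0])
    for cluster in clusters:
        centroid = [sum(histograms[i][k] for i in cluster) / len(cluster)
                    for k in range(bins)]
        # Key-frame: the member closest to the cluster centroid.
        key_frames.append(min(cluster, key=lambda i: l1(histograms[i], centroid)))
    return clusters, key_frames

# Five frames described by hypothetical 2-bin color histograms.
frame_histograms = [[4, 0], [3, 1], [0, 4], [1, 3], [4, 0]]
clusters, key_frames = summarize_by_color(frame_histograms, threshold=2)
print(clusters, key_frames)
```

Here the frames fall into two clusters, [0, 1, 4] and [2, 3], and one representative frame is drawn from each.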
Motion Activity Descriptor
A video can also be intuitively perceived as having various levels of activity or intensity of action. An example of a relatively high level of activity is a scoring opportunity in a sports video. On the other hand, a news reader video has a relatively low level of activity. The recently proposed MPEG-7 video standard provides for a descriptor related to the motion activity in a video.
One measure of motion activity can be the average and variance of the magnitude of the motion vectors, see Peker et al., “Automatic measurement of intensity of motion activity,” Proceedings of SPIE Conference on Storage and Retrieval for Media Databases, January 2001. However, there are many variations possible, depending on the application.
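As a minimal sketch of such a measure, the following computes the average and the variance of the motion vector magnitudes of one frame. The function name `motion_activity`, the representation of vectors as (dx, dy) pairs, and the use of the population variance are assumptions made for illustration.

```python
import math

def motion_activity(motion_vectors):
    """Return (mean, variance) of the magnitudes of (dx, dy) motion vectors."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    mean = sum(magnitudes) / len(magnitudes)
    # Population variance of the magnitudes.
    variance = sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes)
    return mean, variance

# Four macro-blocks: two static, two moving.
mean_mag, var_mag = motion_activity([(3, 4), (0, 0), (6, 8), (0, 0)])
print(mean_mag, var_mag)
```

The magnitudes here are 5, 0, 10, and 0, giving an average of 3.75 and a variance of 17.1875. A frame with uniformly small vectors, such as a news reader shot, would yield values near zero for both.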
Fidelity of a Set of Key-Frames
The simplest approach to finding a single key-frame is to select an arbitrary frame from the sequence, but single key-frame based approaches fail when the video content has more information than can be conveyed by the single frame. The first frame of a video segment can be assigned as the first key-frame, and then the frame at the greatest distance in feature space from the first frame can be assigned as the second key-frame, see M. M. Yeung and B. Liu, “Efficient Matching and Clustering of Video Shots,” Proc. IEEE ICIP, Washington D.C., 1995. Other multiple key-frame generation techniques, and a key-frame generation technique based on a measure of fidelity of a set of key-frames, are described by H. S. Chang, S. Sull and S. U. Lee, “Efficient video indexing scheme for content-based retrieval,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, 1999. The fidelity measure is defined as the Semi-Hausdorff distance ($d_{sh}$) between the set of key-frames S and the set of frames R in the video sequence. A practical definition of the Semi-Hausdorff distance is as follows.
Let the set of key-frames S include m frames $S_k$, for k=1, . . . , m, and let the set of frames R include n frames $R_i$, for i=1, . . . , n. Let the distance between two frames $S_k$ and $R_i$ be $d(S_k, R_i)$. Define $d_i$ for each frame $R_i$ as

$$d_i = \min_{k=1,\ldots,m} d(S_k, R_i).$$

Then the Semi-Hausdorff distance between S and R is given by

$$d_{sh}(S, R) = \max_{i=1,\ldots,n} d_i.$$
In other words, first, for each i, measure the distance di between the frame Ri and its best representative in the key-frame set S. Next, find the maximum of the distances di computed above. The resulting distance indicates how well the key-frame set S represents R: the better the representation, the smaller the Semi-Hausdorff distance between S and R. For example, in the trivial case, if the sets S and R are identical, then the Semi-Hausdorff distance is zero. On the other hand, a large distance indicates that at least one of the frames in R was not well represented by any of the frames in the key-frame set S.
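The two-step computation can be sketched directly. The function name `semi_hausdorff`, the passing of the frame distance as a parameter, and the scalar "frames" used in the example are assumptions made for illustration; any frame dissimilarity can be substituted.

```python
def semi_hausdorff(key_frames, frames, distance):
    """Semi-Hausdorff distance from the frame set R to the key-frame set S.

    For each frame r, find the distance to its best representative in
    key_frames, then return the largest of those distances.
    """
    return max(min(distance(s, r) for s in key_frames) for r in frames)

def abs_diff(a, b):
    # Toy distance between scalar frame features.
    return abs(a - b)

all_frames = [0, 1, 2, 10]
fidelity = semi_hausdorff([0, 10], all_frames, abs_diff)
trivial = semi_hausdorff(all_frames, all_frames, abs_diff)
print(fidelity, trivial)
```

With key-frames 0 and 10, the per-frame distances are 0, 1, 2, and 0, so the Semi-Hausdorff distance is 2. When the key-frame set equals the full frame set, the distance is zero, matching the trivial case noted above.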
Most existing dissimilarity measures satisfy the properties required for the distance over a metric space used in the above definition. One can also use a color histogram intersection metric described by M. J. Swain and D. H. Ballard, “Color indexing,” J. Computer Vision, vol. 7, no. 1, pp. 11–32, 1991, which is defined as follows.
If the K-bin color histograms of two images $f_i$ and $f_j$, each of size M×N, are $H_i$ and $H_j$, then the dissimilarity between the two images is given by

$$d(f_i, f_j) = 1 - \frac{1}{M \times N} \sum_{k=1}^{K} \min\{H_i(k), H_j(k)\}.$$

Note that the dissimilarity is within the range [0, 1].
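The histogram-intersection dissimilarity can be sketched as follows, assuming each histogram is a list of K bin counts that sum to the number of pixels M×N. The function name `histogram_dissimilarity` and its parameters are hypothetical and chosen only for illustration.

```python
def histogram_dissimilarity(hist_i, hist_j, num_pixels):
    """Return 1 minus the normalized intersection of two K-bin histograms.

    hist_i and hist_j hold per-bin pixel counts for two images of the same
    size; num_pixels is M * N, the pixel count of each image.
    """
    if len(hist_i) != len(hist_j):
        raise ValueError("histograms must have the same number of bins")
    overlap = sum(min(a, b) for a, b in zip(hist_i, hist_j))
    return 1.0 - overlap / num_pixels

# 2x2 images (4 pixels) with 3-bin histograms.
identical = histogram_dissimilarity([2, 2, 0], [2, 2, 0], 4)
partial = histogram_dissimilarity([2, 2, 0], [2, 1, 1], 4)
disjoint = histogram_dissimilarity([4, 0, 0], [0, 4, 0], 4)
print(identical, partial, disjoint)
```

Identical histograms give 0, disjoint histograms give 1, and the partially overlapping pair shares 3 of 4 pixels and gives 0.25, which illustrates that the value stays within [0, 1].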