Standard Processing Techniques
Basic standards for processing a video encoded as a digital signal have been adopted by the Moving Picture Experts Group (MPEG). The MPEG standards achieve high data compression rates by developing information for full frames of the video only at intervals. The full frames, i.e., intra-coded frames, are often referred to as “I-frames” or “reference frames,” and contain full frame information independent of any other frames. Image difference frames, i.e., inter-coded frames, are often referred to as “B-frames” and “P-frames,” or as “predictive frames.” They are encoded between the I-frames and reflect only image differences, i.e., residues, with respect to the reference frame.
Typically, during the processing, each frame of a video is partitioned into smaller blocks of picture element, i.e., pixel, data. Each block is subjected to a discrete cosine transformation (DCT) function to convert the statistically dependent spatial domain pixels into independent frequency domain DCT coefficients. Respective 8×8 or 16×16 blocks of pixels, referred to as “macro-blocks,” are subjected to the DCT function to provide the encoded signal. The energy of the DCT coefficients is usually concentrated, so that only a few of the coefficients in a macro-block contain the main part of the picture information. For example, if a macro-block contains an edge boundary of an object, then the energy in that block, after transformation, as represented by the DCT coefficients, includes a relatively large DC coefficient and randomly distributed AC coefficients throughout the matrix of coefficients.
A non-edge macro-block, on the other hand, is usually characterized by a similarly large DC coefficient and a few adjacent AC coefficients which are substantially larger than other coefficients associated with that block. The DCT coefficients are typically subjected to adaptive quantization, and then are run-length and variable-length encoded. Thus, the macro-blocks of transmitted data typically include fewer than an 8×8 matrix of code words.
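The energy concentration described above can be illustrated with a minimal sketch. It assumes an orthonormal 2-D DCT-II on 8×8 blocks and a single uniform quantization step of 16. The function names (`dct_2d`, `quantize`, `count_nonzero`), the step size, and the two test blocks are hypothetical choices made for illustration only. An actual MPEG encoder uses quantization matrices and adaptive scaling rather than one fixed step.

```python
import math

N = 8  # block size assumed for this sketch


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def quantize(coeffs, step=16):
    """Uniform quantization with one hypothetical step size for all coefficients."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


def count_nonzero(quantized):
    """Number of coefficients that survive quantization."""
    return sum(1 for row in quantized for c in row if c != 0)


# Non-edge block: a gentle horizontal ramp in brightness.
smooth = [[100 + 2 * y for y in range(N)] for _ in range(N)]
# Edge block: a sharp vertical boundary between a dark and a bright half.
edge = [[50 if y < N // 2 else 200 for y in range(N)] for _ in range(N)]

smooth_q = quantize(dct_2d(smooth))
edge_q = quantize(dct_2d(edge))

print("non-edge block, surviving coefficients:", count_nonzero(smooth_q))
print("edge block, surviving coefficients:", count_nonzero(edge_q))
```

In both cases only a small fraction of the 64 coefficients remains non-zero after quantization, which is what makes the subsequent run-length coding effective. The edge block retains more AC coefficients than the ramp, because a sharp boundary spreads energy into higher frequencies.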
The macro-blocks of inter-coded frame data, i.e., encoded P or B frame data, include DCT coefficients which represent only the differences between predicted pixels and actual pixels in the macro-block. Macro-blocks of intra-coded and inter-coded frame data also include information such as the level of quantization employed, a macro-block address or location indicator, and a macro-block type. The latter information is often referred to as “header” or “overhead” information. This provides good spatial compression of the video.
Each P-frame is predicted from the most recently occurring I- or P-frame. Each B-frame is predicted from the I- or P-frames between which the B-frame is disposed. The predictive coding process involves generating displacement vectors, often referred to as “motion vectors,” which indicate the magnitude and direction of the displacement to the macro-block of an I-frame that most closely matches the macro-block of the B- or P-frame currently being coded. The pixel data of the matched block in the I-frame are subtracted, on a pixel-by-pixel basis, from the block of the P- or B-frame being encoded, to develop the residues. The transformed residues and the vectors form part of the encoded data for the P- and B-frames. This provides good temporal compression.
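The predictive coding step can be sketched as an exhaustive block-matching search. This is a simplified illustration, not the search strategy of any particular encoder. It assumes a sum-of-absolute-differences (SAD) matching cost, a 4×4 block, and a search range of ±2 pixels. The names `sad`, `get_block`, `find_motion_vector`, and `make_frame` are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def get_block(frame, top, left, size):
    """Extract a size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def find_motion_vector(ref, cur, top, left, size=4, search=2):
    """Full search in the reference frame for the block of the current frame.

    Returns ((dy, dx), residue), where (dy, dx) points from the current
    block position to the best match in the reference frame, and residue
    is the pixel-by-pixel difference current minus matched block.
    """
    target = get_block(cur, top, left, size)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue
            candidate = get_block(ref, t, l, size)
            cost = sad(target, candidate)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx), candidate)
    _, vector, match = best
    residue = [[c - m for c, m in zip(rc, rm)] for rc, rm in zip(target, match)]
    return vector, residue


def make_frame(obj_top, obj_left, size=8):
    """Black frame containing one 3 x 3 textured object (illustrative data)."""
    frame = [[0] * size for _ in range(size)]
    for i in range(3):
        for j in range(3):
            frame[obj_top + i][obj_left + j] = 100 + 10 * i + j
    return frame


ref = make_frame(2, 1)   # object position in the reference frame
cur = make_frame(3, 3)   # object moved one row down and two columns right
vector, residue = find_motion_vector(ref, cur, top=2, left=2)
print(vector)  # (-1, -2)
```

Because the object is merely translated between the two frames, the residue of the matched block is all zeros, so only the motion vector needs to be encoded for that block.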
Video Analysis
Video analysis can be defined as processing a video with the intention of understanding the content of the video. The understanding of the video can range from a “low-level” syntactic understanding, such as detecting segment boundaries or scene changes in the video, to a “high-level” semantic understanding, such as detecting a genre of the video. The low-level understanding can be achieved by analyzing low-level features, such as color, motion, texture, shape, and the like, to generate content descriptions. The content descriptions can then be used to index the video. The high-level understanding can be encoded at the source, or in some instances derived from low-level features; see Yeo et al., “Rapid scene analysis on compressed videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, pp. 533–544, 1995, Meng et al., “CVEPS: A compressed video editing and parsing system,” ACM Multimedia Conference, 1996, and Chang et al., “Compressed-domain techniques for image/video indexing and manipulation,” IEEE International Conference on Image Processing, vol. I, pp. 314–317, 1995.
Video Summarization
Video summarization can be defined as a process that produces a compact representation of a video that still conveys the semantic essence of the video. The compact representation can include key frames or key segments, or a combination of key frames and segments. As an example, a video summary of a tennis match can include a short key segment and a key frame. The key segment captures both of the players in action during the very last winning return, and the key frame captures the winner with the trophy. A more detailed and longer summary could include all frames of the final game or of the match point. While it is certainly possible to generate such a summary manually, doing so is tedious and costly.
Automatic video summarization methods are well known; see Pfeiffer et al., “Abstracting Digital Movies Automatically,” J. Visual Comm. Image Representation, vol. 7, no. 4, pp. 345–353, December 1996, and Hanjalic et al., “An Integrated Scheme for Automated Video Abstraction Based on Unsupervised Cluster-Validity Analysis,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, no. 8, December 1999.
Most known video summarization methods focus on color-based summarization. Pfeiffer et al. also use motion, in combination with other features, to generate video summaries. However, their approach merely uses a weighted combination of the features, which overlooks possible correlation between them.
While color descriptors are robust, by definition they do not capture the motion characteristics of the video sequence. Motion descriptors, on the other hand, tend to be less robust to noise than color descriptors, and have generally not been as widely used for summarization.
The level of motion activity in a video can be a measure of how much the scene acquired by the video is changing. Therefore, the motion activity can be considered a measure of the “summarizability” of the video. For instance, a high-speed car chase will certainly have many more “changes” in it than a scene of a newscaster, and thus the car chase scene will require more resources for a visual summary than the newscaster scene.
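One simple way to make this idea concrete is sketched below. It assumes, purely for illustration, that motion activity is measured as the mean magnitude of a segment's motion vectors, and that a fixed summary duration is divided among segments in proportion to that measure. Neither the measure nor the proportional rule is taken from the text above. The names `motion_activity` and `allocate_budget`, the sample vectors, and the 60-second budget are all hypothetical.

```python
import math


def motion_activity(vectors):
    """Mean magnitude of (dy, dx) motion vectors; 0.0 if there are none."""
    if not vectors:
        return 0.0
    return sum(math.hypot(dy, dx) for dy, dx in vectors) / len(vectors)


def allocate_budget(activities, total_seconds):
    """Split a summary time budget in proportion to per-segment activity."""
    total = sum(activities.values())
    if total == 0:
        # No motion anywhere: fall back to an equal split.
        share = total_seconds / len(activities)
        return {name: share for name in activities}
    return {name: total_seconds * a / total for name, a in activities.items()}


# Illustrative motion vectors for two segments (made-up values).
segments = {
    "car_chase": [(3, 4), (6, 8), (0, 5)],
    "newscaster": [(0, 0), (0, 1), (0, 0)],
}

activities = {name: motion_activity(v) for name, v in segments.items()}
budget = allocate_budget(activities, total_seconds=60.0)

for name in segments:
    print(name, round(activities[name], 2), round(budget[name], 1))
```

Under these assumptions, the car chase segment receives nearly all of the budget, which matches the intuition that a rapidly changing scene needs more of the summary than a nearly static one.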
It is desired to adaptively process a video using content characteristics of frames in the video. During the processing, play time for the frames of the video should be allocated on the basis of those content characteristics.