1. Field of the Invention
This invention relates to methods of feature extraction, preferably in combination with scene change detection for video signal sequences of the types encountered in, for example, High Definition Television (HDTV) broadcast signals or other compressed forms of video information such as might be encountered on the world wide web communications medium.
2. Description of the Prior Art
Basic methods for compressing the bandwidth of digital color video signals have been adopted by the Moving Picture Experts Group (MPEG).
The MPEG standards achieve high data compression rates by developing information for a full frame of the image only every so often. The full image frames, or intra-coded pictures, are called "I-frames", and contain full frame information independent of any other frames. B-frames and P-frames are encoded between the I-frames and store only image differences with respect to the reference anchor frames.
Typically, each frame of a video sequence is partitioned into smaller blocks of pixel data and each block is subjected to a discrete cosine transformation (DCT) function to convert the statistically dependent spatial domain picture elements (pixels) into independent frequency domain DCT coefficients.
Respective 8×8 blocks of pixels are subjected to the Discrete Cosine Transform (DCT) to provide the coded signal. The resulting coefficients typically are subjected to adaptive quantization, and then are run-length and variable-length encoded. Thus, the blocks of transmitted data typically include fewer than an 8×8 matrix of codewords. Macroblocks of intraframe encoded data (I-frames) will also include information such as the level of quantization employed, a macroblock address or location indicator, and a macroblock type, the latter information being referred to as "header" or "overhead" information.
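By way of illustration only, the transform and quantization steps described above can be sketched as follows. This is a minimal sketch and not part of any MPEG reference implementation: the function names dct_2d and quantize are hypothetical, and a single uniform step size of 16 is assumed in place of the adaptive quantization actually used by an encoder. It shows that a block of uniform brightness produces a single nonzero coefficient, the DC coefficient in the top left corner.

```python
import math

N = 8  # block size in pixels (8x8, as described in the text)


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization (an assumption; real encoders adapt the step)."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


# A flat block: every pixel has the value 100.
flat = [[100] * N for _ in range(N)]
flat_coeffs = dct_2d(flat)          # DC term is approximately 800, AC terms near 0
levels = quantize(flat_coeffs, 16)  # hypothetical step size
nonzero = sum(1 for row in levels for q in row if q != 0)
```

With the orthonormal scaling assumed here, the DC coefficient equals eight times the mean pixel value of the block, which is why only one codeword needs to be sent for the flat block.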
The blocks of data encoded according to P or B interframe coding also consist of matrices of Discrete Cosine Coefficients. In this instance, however, the coefficients represent residues or differences between a predicted 8×8 pixel matrix and the actual 8×8 pixel matrix. These coefficients also are subjected to quantization and run- and variable-length coding. In the frame sequence, I and P frames are designated anchor frames. Each P frame is predicted from the most recently occurring anchor frame. Each B frame is predicted from one or both of the anchor frames between which it is disposed. The predictive coding process involves generating displacement vectors, which indicate which block of an anchor frame most closely matches the block of the predicted frame currently being coded. The pixel data of the matched block in the anchor frame is subtracted, on a pixel-by-pixel basis, from the block of the frame being encoded, to develop the residues. The transformed residues and the vectors comprise the coded data for the predictive frames. As with intraframe coded frames, the macroblocks include quantization, address and type information.
The results of the DCT usually are energy concentrated, so that only a few of the coefficients in a block contain the main part of the picture information. The coefficients are quantized in a known manner to effectively limit the dynamic range of certain of the coefficients, and the results are then run-length and variable-length encoded for application to a transmission medium.
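The predictive coding process described above can be illustrated, under simplifying assumptions, by an exhaustive block search. The sketch below is not the search strategy of any particular encoder: the helper names (block_at, sad, best_match, residue), the sum-of-absolute-differences matching cost, and the small 4×4 block size are all assumptions chosen to keep the example short.

```python
def block_at(frame, top, left, size):
    """Return the size x size block whose top-left pixel is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def best_match(anchor, target_block, top, left, size, search):
    """Displacement (dy, dx) of the anchor block closest to target_block."""
    height, width = len(anchor), len(anchor[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(block_at(anchor, t, l, size), target_block)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]


def residue(anchor, target_block, top, left, size, vector):
    """Subtract the matched anchor block from the block being encoded."""
    dy, dx = vector
    ref = block_at(anchor, top + dy, left + dx, size)
    return [[p - q for p, q in zip(rt, rr)] for rt, rr in zip(target_block, ref)]


# Anchor frame with a distinct value at every pixel position.
anchor = [[r * 10 + c for c in range(8)] for r in range(8)]
# The block being coded sits at (2, 2) but its content came from (1, 0).
target = block_at(anchor, 1, 0, 4)
vector = best_match(anchor, target, top=2, left=2, size=4, search=2)
res = residue(anchor, target, 2, 2, 4, vector)
```

In this contrived case the search recovers the displacement exactly, so every residue is zero and only the vector would need to be coded.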
The so-called MPEG-4 format is described in "MPEG-4 Video Verification Model Version 5.0", distributed by the Adhoc Group on MPEG-4 Video VM Editing to its members under the designation ISO/IEC JTC1/SC29/WG11 MPEG 96/N1469, November 1996. The MPEG-4 video coding format produces a variable bit rate stream at the encoder from frame to frame (as was the case with prior schemes). Since the variable bit rate stream is transmitted over a fixed rate channel, a channel buffer is employed to smooth out the bit stream. In order to prevent the buffer from overflowing or underflowing, rate control of the encoding process is employed.
With the advent of new digital video services, such as video distribution on the world wide web, there is an increasing need for signal processing techniques for identifying and extracting information regarding features of the video sequences. Identification of scene changes, whether they are abrupt or gradual, is useful for the purpose of indexing image changes; thereafter, scenes may be analyzed automatically to determine certain features or characteristics of the particular material.
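The need for rate control can be seen from a simple model of the channel buffer. The sketch below is illustrative only and is not drawn from the Verification Model: the function name simulate_buffer, the per-frame drain model, and all numeric values are assumptions. Each frame adds its bits to the buffer, and the fixed rate channel removes a constant number of bits per frame period.

```python
def simulate_buffer(frame_bits, drain_per_frame, capacity, start=0):
    """Track buffer fullness; report overflow/underflow for each frame."""
    level = start
    events = []
    for bits in frame_bits:
        level += bits                        # encoder writes the coded frame
        overflow = level > capacity          # more bits than the buffer holds
        level = min(level, capacity)
        underflow = level < drain_per_frame  # channel wants more than is stored
        level = max(level - drain_per_frame, 0)
        events.append((level, overflow, underflow))
    return events


# Hypothetical bit counts: one large frame followed by small ones.
events = simulate_buffer([900, 200, 200, 100],
                         drain_per_frame=400, capacity=1000, start=300)
```

In this example the first, large frame overflows the buffer and the last, small frame leaves it underfull, which are the two conditions that rate control of the encoding process is intended to prevent.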
In the future, it should be expected that a significant amount of digital video material will be provided in the form of compressed or coded data as described above. Operating on the video sequence information in its compressed form, rather than its decompressed or decoded form, where possible, usually permits more rapid processing because of the reduction in data size. It is advantageous to develop methods and techniques which permit operating directly on compressed data, rather than having to perform full frame decompression before other processing is performed.
It has also been known that when a block (macroblock) contains an edge boundary of an object, the energy in that block after transformation, as represented by the DCT coefficients, includes a relatively large DC coefficient (top left corner of matrix) and randomly distributed AC coefficients throughout the matrix. A non-edge block, on the other hand, usually is characterized by a similar large DC coefficient (top left corner) and a few (e.g. two) adjacent AC coefficients which are substantially larger than other coefficients associated with that block. This information relates to image changes in the spatial domain and, when combined with image difference information obtained from comparing successive frames (i.e. temporal differences), provides factors for distinguishing one video object (VO) from another. Use of the DC values of the macroblocks of an image results in a blurred version of the original image which retains much of the content of the original.
Thus, previous work in feature extraction for indexing from compressed video had mostly emphasized DC coefficient extraction. In a paper entitled "Rapid Scene Analysis on Compressed Video", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, Dec. 1995, pp. 533-544, Yeo and Liu describe an approach to scene change detection in the MPEG-2 compressed video domain, as well as review earlier efforts at detecting scene changes based on sequences of entire (uncompressed) image data, and various compressed video processing techniques of others. Yeo and Liu introduced the use of spatially reduced versions of the original images, so-called DC images, and DC sequences extracted from compressed video to facilitate scene analysis operations. Their "DC image" is made up of pixels which are the average value of the pixels in a block of the original image and the DC sequence is the combination of the resulting reduced number of pixels of the DC image.
Won et al, in a paper published in Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases, January 1998, describe a method to extract features from compressed MPEG-2 video by making use of the bits expended on the DC coefficients to locate edges in the frames. However, their work is limited to I-frames only. Kobla et al describe a method in the same Proceedings using the DC image extraction of Yeo et al to form video trails that characterize the video clips. Feng et al (IEEE International Conference on Image Processing, Vol. 11, pp. 821-824, Sep. 16-19, 1996), use the bit allocation across the macroblocks of MPEG-2 frames to detect abrupt scene changes, without extracting DC images. Feng et al's technique is computationally the simplest since it does not require significant computation in addition to that required for parsing the compressed bitstream.
In accordance with inventions of the present inventors and a co-worker, which are described in recently filed, commonly owned applications, computationally simple methods have been devised which employ combinations of certain aspects of Feng et al's approach and Yeo et al's approach to give accurate and simple scene change detection. Advantageously, techniques that make use of bit allocation information in accordance with the methods of the present invention are employed, preferably in accordance with the scene change detection techniques, to extract feature information.
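The notion of a DC image can be illustrated in the pixel domain as follows. This is a minimal sketch rather than the procedure of Yeo and Liu, who obtain the values from the compressed data; the function name dc_image and the 16×16 test frame are assumptions. Each output pixel is the average of one 8×8 block of the input.

```python
def dc_image(frame, block=8):
    """Replace each block x block tile of the frame by its average value."""
    rows, cols = len(frame) // block, len(frame[0]) // block
    out = []
    for br in range(rows):
        out_row = []
        for bc in range(cols):
            total = sum(frame[br * block + i][bc * block + j]
                        for i in range(block) for j in range(block))
            out_row.append(total / (block * block))
        out.append(out_row)
    return out


# Hypothetical 16x16 frame made of four flat quadrants.
frame = [[(10 if c < 8 else 20) if r < 8 else (30 if c < 8 else 40)
          for c in range(16)] for r in range(16)]
reduced = dc_image(frame)  # a 2x2 image, one pixel per 8x8 block
```

The reduced image has one sixty-fourth as many pixels as the original, which is the source of the computational savings in operating on DC images and DC sequences.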
It should be noted that the DC image extraction based technique is good for I-frames since the extraction of the DC values from I-frames is relatively simple. However, for P-frames, additional computation is needed.
It has been determined that, once a suspected scene/object change has been accurately located in a group of consecutive frames/objects by use of a DC image extraction based technique, application of an appropriate bit allocation-based technique and/or an appropriate DC residual coefficient processing technique to P-frame information in the vicinity of the suspected scene change quickly and accurately locates the cut point. This combined method is applicable to either MPEG-2 sequences or MPEG-4 multiple object sequences. In the MPEG-4 case, it has been found to be advantageous to use a weighted sum of the changes in each object of the frame, using the area of each object as the weighting factor.
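The area-weighted combination for the MPEG-4 case may be sketched as follows. The sketch rests on stated assumptions: the function name weighted_change is hypothetical, the per-object change measures and areas are made-up values, and the division by total area is a normalization added here for convenience rather than something the text prescribes.

```python
def weighted_change(changes, areas):
    """Area-weighted sum of per-object changes, normalized by total area."""
    if len(changes) != len(areas):
        raise ValueError("one area is required per object")
    total_area = sum(areas)
    if total_area == 0:
        raise ValueError("total object area must be positive")
    return sum(c * a for c, a in zip(changes, areas)) / total_area


# Hypothetical frame with two video objects: a large background object
# with a change measure of 1200 and a small foreground object with 400.
frame_change = weighted_change(changes=[1200, 400], areas=[300, 100])
```

Weighting by area lets a change in a large object dominate the frame-level measure, so that a small object changing abruptly is less likely to be mistaken for a change of the whole scene.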