The present invention relates generally to extracting features from a sequence of video frames, and more particularly to extracting the spatial distribution of motion activity in a compressed video.
Basic standards for compressing the bandwidth of digital color video signals have been adopted by the Moving Picture Experts Group (MPEG). The MPEG standards achieve high compression rates by developing information for a full frame of the image only every so often. The full image frames, i.e., intra-coded frames, are often referred to as “I-frames” or “anchor frames,” and contain full frame information independent of any other frame. Image difference frames, i.e., inter-coded frames, are often referred to as “B-frames” and “P-frames,” or as “predictive frames,” and are encoded between the I-frames and reflect only image differences, i.e., residues, with respect to the reference frame.
Typically, each frame of a video sequence is partitioned into smaller blocks of picture elements, i.e., pixel data. Each block is subjected to a discrete cosine transformation (DCT) function that converts the statistically dependent spatial domain pixels into independent frequency domain DCT coefficients. Respective 8×8 or 16×16 blocks of pixels, referred to as “macro-blocks,” are subjected to the DCT function to provide the coded signal.
The DCT coefficients are usually energy concentrated so that only a few of the coefficients in a macro-block represent the main part of the picture information. For example, if a macro-block contains an edge boundary of an object, the energy in that block after transformation, i.e., as represented by the DCT coefficients, includes a relatively large DC coefficient and randomly distributed AC coefficients throughout the matrix of coefficients.
A non-edge macro-block, on the other hand, is usually characterized by a similarly large DC coefficient and a few adjacent AC coefficients that are substantially larger than other coefficients associated with that block. The DCT coefficients are typically subjected to adaptive quantization, and then are run-length and variable-length encoded for the transmission medium. Thus, the macro-blocks of transmitted data typically include fewer than an 8×8 matrix of codewords.
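The energy concentration described above can be illustrated with a minimal sketch. It assumes an orthonormal 2-D DCT-II computed directly from its definition, not the fast transform or the quantization of any particular MPEG codec. The function names `dct_2d` and `energy_share` and the ramp-shaped test block are hypothetical, introduced only for illustration.

```python
import math

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (list of equal-length rows)."""
    n = len(block)
    # basis[k][i] = a(k) * cos((2i + 1) * k * pi / (2n)),
    # with a(0) = sqrt(1/n) and a(k) = sqrt(2/n) otherwise.
    basis = [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
              * math.cos((2 * i + 1) * k * math.pi / (2 * n))
              for i in range(n)] for k in range(n)]
    # Separable transform: coeffs = basis * block * basis^T
    tmp = [[sum(basis[k][i] * block[i][j] for i in range(n)) for j in range(n)]
           for k in range(n)]
    return [[sum(tmp[k][j] * basis[l][j] for j in range(n)) for l in range(n)]
            for k in range(n)]

def energy_share(coeffs, count):
    """Fraction of total squared magnitude held by the `count` largest coefficients."""
    squares = sorted((c * c for row in coeffs for c in row), reverse=True)
    return sum(squares[:count]) / sum(squares)

# Hypothetical 8x8 block: a smooth horizontal ramp of luminance values.
ramp = [[100 + 2 * j for j in range(8)] for _ in range(8)]
coeffs = dct_2d(ramp)
dc = coeffs[0][0]                # 8 * mean pixel value for an 8x8 block
share = energy_share(coeffs, 4)  # energy held by the 4 largest coefficients
```

For this smooth block, nearly all of the energy falls in the DC coefficient and a few low-frequency AC coefficients, which is why run-length coding of the quantized coefficients is effective.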
The macro-blocks of inter-coded frame data, i.e., encoded P or B frame data, include DCT coefficients which represent only the differences between predicted pixels and the actual pixels in the macro-block. Macro-blocks of intra-coded and inter-coded frame data also include information such as the level of quantization employed, a macro-block address or location indicator, and a macro-block type. The latter information is often referred to as “header” or “overhead” information.
Each P frame is predicted from the most recently occurring I or P frame. Each B frame is predicted from the I or P frames between which it is disposed. The predictive coding process involves generating displacement vectors, often referred to as “motion vectors,” which indicate the magnitude of the displacement to the macro-block of an I frame most closely matching the macro-block of the B or P frame currently being coded. The pixel data of the matched block in the I frame is subtracted, on a pixel-by-pixel basis, from the block of the P or B frame being encoded, to develop the residues. The transformed residues and the vectors form part of the encoded data for the P and B frames.
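The predictive coding step can be sketched as an exhaustive block-matching search. This is a simplified illustration, not the search strategy of any specific encoder: the sum of absolute differences (SAD) as the matching cost, the search range, and the names `block_at`, `sad`, and `find_motion_vector` are all assumptions made for the example.

```python
def block_at(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def find_motion_vector(ref, cur, top, left, size, search):
    """Return ((dy, dx), residue) for the block of `cur` at (top, left)."""
    target = block_at(cur, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            # Skip candidate blocks that fall outside the reference frame.
            if t < 0 or l < 0 or t + size > len(ref) or l + size > len(ref[0]):
                continue
            cost = sad(target, block_at(ref, t, l, size))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    dy, dx = best[1]
    match = block_at(ref, top + dy, left + dx, size)
    # Residue: current block minus the matched reference block, pixel by pixel.
    residue = [[c - m for c, m in zip(rc, rm)] for rc, rm in zip(target, match)]
    return best[1], residue

# Hypothetical frames: a 2x2 object moves down one row and right two columns.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
patch = [[10, 20], [30, 40]]
for i in range(2):
    for j in range(2):
        ref[2 + i][2 + j] = patch[i][j]
        cur[3 + i][4 + j] = patch[i][j]

vector, residue = find_motion_vector(ref, cur, top=3, left=4, size=2, search=2)
```

Here the vector points from the block being coded back to its best match in the reference frame, and the residue is all zeros because the match is exact.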
Older video standards, such as ISO MPEG-1 and MPEG-2, are relatively low-level specifications primarily dealing with temporal and spatial compression of video signals. With these standards, one can achieve high compression ratios over a wide range of applications. Newer video coding standards, such as MPEG-4, see “Information Technology - Generic coding of audio/visual objects,” ISO/IEC FDIS 14496-2 (MPEG4 Visual), November 1998, allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOP). These emerging standards are intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. For example, one might want to extract features from a particular type of video object, or to perform for a particular class of video objects.
With the advent of new digital video services, such as video distribution on the Internet, there is an increasing need for signal processing techniques for identifying information in video sequences, either at the frame or object level, for example, identification of activity.
Feature Extraction
Previous work in feature extraction for video indexing from compressed data has primarily emphasized DC coefficient extraction. In a paper entitled “Rapid Scene Analysis on Compressed Video,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, December 1995, pages 533-544, Yeo and Liu describe an approach to scene change detection in the MPEG-2 compressed video domain. The authors also review earlier efforts at detecting scene changes based on sequences of entire uncompressed image data, and various compressed video processing techniques of others. Yeo and Liu introduced the use of spatially reduced versions of the original images, so-called DC images, and DC sequences extracted from compressed video to facilitate scene analysis operations. Their “DC image” is made up of pixels which are the average value of the pixels in a block of the original image and the DC sequence is the combination of the reduced number of pixels of the DC image. It should be noted that the DC image extraction based technique is good for I-frames since the extraction of the DC values from I-frames is relatively simple. However, for other types of frames, additional computation is needed.
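The notion of a DC image as a block-averaged, spatially reduced frame can be sketched as follows. The sketch assumes decoded pixel values and averages them directly. In the compressed domain the same value would instead be read from the DC coefficient of each block of an I-frame, which, for an orthonormal 8×8 DCT, is eight times the block mean. The function name `dc_image` and the tiny test frame are hypothetical.

```python
def dc_image(frame, block=8):
    """Replace each block x block tile of `frame` by its mean pixel value."""
    rows, cols = len(frame) // block, len(frame[0]) // block
    return [[sum(frame[r * block + i][c * block + j]
                 for i in range(block) for j in range(block)) / (block * block)
             for c in range(cols)]
            for r in range(rows)]

# Hypothetical 4x4 frame reduced with 2x2 blocks for readability.
frame = [[1, 3, 10, 10],
         [5, 7, 10, 10],
         [0, 0, 2, 4],
         [0, 0, 6, 8]]
reduced = dc_image(frame, block=2)  # [[4.0, 10.0], [0.0, 5.0]]
```

With 8×8 blocks the reduced image has 1/64 of the original pixel count, which is what makes scene analysis on DC sequences inexpensive.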
Won et al., in a paper published in Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases, January 1998, described a method of extracting features from compressed MPEG-2 video by making use of the bits expended on the DC coefficients to locate edges in the frames. However, their work was limited to I-frames only.
Kobla et al. describe a method in the same Proceedings using the DC image extraction of Yeo et al. to form video trails that characterize the video clips.
Feng et al. (IEEE International Conference on Image Processing, Vol. II, pp. 821-824, Sep. 16-19, 1996) used the bit allocation across the macro-blocks of MPEG-2 frames to detect abrupt scene changes, without extracting DC images. Feng et al.'s technique is computationally the simplest since it does not require significant computation beyond that required for parsing the compressed bit-stream.
U.S. Patent Applications entitled “Methods of scene change detection and fade detection for indexing of video sequences” (application Ser. No. 09/231,698, filed Jan. 14, 1999) and “Methods of Feature Extraction for Video Sequences” (application Ser. No. 09/236,838, filed Jan. 25, 1999) describe computationally simple techniques which build on certain aspects of Feng et al.'s approach and Yeo et al.'s approach to give accurate and simple scene change detection. Once a suspected scene or object change has been accurately located in a group of consecutive frames by use of a DC image extraction based technique, application of an appropriate bit allocation-based technique and/or an appropriate DC residual coefficient processing technique to P or B-frame information in the vicinity of the located scene quickly and accurately locates the cut point. This combined method is applicable to either MPEG-2 frame sequences or MPEG-4 multiple object sequences. In the MPEG-4 case, it is advantageous to use a weighted sum of the change in each object of the frame, using the area of each object as the weighting factor.
U.S. patent application Ser. No. 09/345,452 entitled “Compressed Bit-Stream Segment Identification and descriptor” filed by Divakaran et al. on Jul. 1, 1999 describes a technique where magnitudes of displacements of inter-coded frames are determined based on the bits in the compressed bit-stream associated with the inter-coded frames. The inter-coded frame includes macro-blocks. Each macro-block is associated with a respective portion of the inter-coded frame bits which represent the displacement from that macro-block to the closest matching intra-coded frame. The displacement magnitude is an average of the displacement magnitudes of all the macro-blocks associated with the inter-coded frame. The displacement magnitudes of those macro-blocks which are less than the average displacement magnitude are set to zero. The number of run lengths of the zero magnitude macro-blocks is determined and also used to identify the first inter-coded frame.
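The averaging, zeroing, and run-length steps of that technique can be sketched roughly as below. The sketch assumes the macro-block displacement magnitudes are already available as a flat list in raster-scan order. The scan order, the function name `zero_run_lengths`, and the sample values are assumptions, not details taken from the cited application.

```python
def zero_run_lengths(magnitudes):
    """Zero out below-average magnitudes and list the lengths of zero runs."""
    average = sum(magnitudes) / len(magnitudes)
    thresholded = [m if m >= average else 0 for m in magnitudes]
    runs, current = [], 0
    for m in thresholded:
        if m == 0:
            current += 1          # extend the current run of zeros
        elif current:
            runs.append(current)  # a non-zero value closes the run
            current = 0
    if current:
        runs.append(current)      # close a run that reaches the end
    return average, thresholded, runs

# Hypothetical displacement magnitudes for eight macro-blocks.
average, thresholded, runs = zero_run_lengths([0, 1, 6, 7, 0, 0, 8, 2])
```

For these values the average is 3.0, the thresholded list is `[0, 0, 6, 7, 0, 0, 8, 0]`, and the zero runs have lengths 2, 2, and 1.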
Motion Activity
Work done so far has focussed on extraction of motion information, and using the motion information for low level applications such as detecting scene changes. There still is a need to extract features for higher level applications. For example, there is a need to extract features that are indicative of the nature of the spatial distribution of the motion activity in a video sequence.
A video or animation sequence can be perceived as a slow sequence, a fast-paced sequence, an intermittent sequence, and the like. The activity feature captures this intuitive notion of ‘intensity of action’ or ‘pace of action’ in a video segment. Examples of high and low ‘activity’ are sporting events and talking heads, respectively.
A good motion activity descriptor would enable applications such as video browsing, surveillance, video content re-purposing, and content-based querying of video databases. For example, in video browsing, the activity feature can enable clustering of the video content based on a broad description of the activity. For these applications, one needs to go beyond the intensity of the motion activity to other attributes of the activity, such as the spatial and temporal distribution of activity.
The invention provides a descriptor for the spatial distribution of motion activity in a video sequence. The invention uses the magnitude of motion vectors extracted from the video sequence as a measure of the intensity of motion activity in macro-blocks in the frames of the video. A motion activity matrix Cmv, containing the magnitude of the motion vector of each macro-block, is constructed for a given P frame.
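A minimal sketch of building such a matrix is given below. It assumes the motion vectors have already been parsed from the bit-stream into a grid of (dx, dy) pairs, one per macro-block, and it takes the magnitude to be the Euclidean norm. The function name `motion_activity_matrix` and the sample vectors are hypothetical.

```python
import math

def motion_activity_matrix(motion_vectors):
    """Map a grid of (dx, dy) motion vectors to a grid of their magnitudes."""
    return [[math.hypot(dx, dy) for (dx, dy) in row] for row in motion_vectors]

# Hypothetical 2x2 grid of macro-block motion vectors for one P frame.
vectors = [[(3, 4), (0, 0)],
           [(-6, 8), (1, 0)]]
c_mv = motion_activity_matrix(vectors)  # [[5.0, 0.0], [10.0, 1.0]]
```

Intra-coded macro-blocks within a P frame carry no motion vector. How they are treated is left open here; one simple assumption would be to give them a zero vector.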
A threshold is determined for the motion activity matrix. In one embodiment, the threshold is the average magnitude per macro-block, Cmvavg. All elements of Cmv that are less than the threshold are set to zero. Other thresholds can also be used. The threshold can be the average plus some empirically determined constant to provide robustness against noise. The median of the motion vector magnitudes can also be used. This would prevent a few large values from unduly influencing a threshold based on the average. One can also use the most common motion vector magnitude, in other words, the mode. Because this is basically a clustering problem, one could use any of the well-known clustering techniques, such as K-means, neural nets, and support vector machines, to divide the motion vectors into two categories based on their magnitudes. In this case, the boundary between the two clusters can be used as the threshold.
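The threshold choices listed above can be compared side by side in a short sketch. It is illustrative only: the one-dimensional two-cluster K-means, the midpoint between centroids as the cluster boundary, and all function names are assumptions. The mode is meaningful only when the magnitudes are quantized, as they are in the integer sample data.

```python
import statistics

def flatten(matrix):
    return [v for row in matrix for v in row]

def mean_threshold(matrix, offset=0.0):
    """Average magnitude, optionally raised by an empirical constant."""
    return statistics.mean(flatten(matrix)) + offset

def median_threshold(matrix):
    return statistics.median(flatten(matrix))

def mode_threshold(matrix):
    return statistics.mode(flatten(matrix))

def two_means_threshold(matrix, iterations=20):
    """1-D K-means with two clusters; returns the midpoint of the centroids."""
    values = flatten(matrix)
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:              # all values identical: nothing to split
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break                 # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2

def apply_threshold(matrix, threshold):
    """Set every element that is less than the threshold to zero."""
    return [[v if v >= threshold else 0 for v in row] for row in matrix]

# Hypothetical motion activity matrix with three strongly moving macro-blocks.
c_mv = [[0, 1, 1],
        [2, 9, 10],
        [1, 0, 12]]
by_mean = apply_threshold(c_mv, mean_threshold(c_mv))
```

On this sample the mean is 4, the median and mode are both 1, and the two-cluster boundary is about 5.58. The median and mode keep far more low-magnitude blocks than the mean or the clustering boundary.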
Next, a histogram is constructed for the entire video sequence. The “bins” of the histogram accumulate statistics for areas of distinct and connected regions of non-zero values in the thresholded matrix. Another thresholding process is applied to the histogram, and the histogram is scaled with respect to the average size of the non-zero motions, and thus normalized with respect to the size of the frames.
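The region-measuring and histogram steps can be sketched as follows, under several stated assumptions: regions are formed with 4-connectivity, areas are binned by their fraction of the frame size using hypothetical bin edges, and the counts are normalized to sum to one. The second thresholding step and the scaling by average non-zero motion size are not reproduced, because the text above does not give enough detail to do so faithfully.

```python
from collections import deque

def region_areas(thresholded):
    """Areas of 4-connected regions of non-zero values, in raster order."""
    rows, cols = len(thresholded), len(thresholded[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if thresholded[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited non-zero element.
            seen[r][c] = True
            queue, area = deque([(r, c)]), 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and thresholded[ny][nx] != 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas

def area_histogram(frames, bin_edges=(0.1, 0.3)):
    """Histogram of region areas (as fractions of frame size) over all frames."""
    counts = [0] * (len(bin_edges) + 1)
    for frame in frames:
        size = len(frame) * len(frame[0])
        for area in region_areas(frame):
            fraction = area / size
            counts[sum(fraction > edge for edge in bin_edges)] += 1
    total = sum(counts)
    return [n / total for n in counts] if total else counts

# Two hypothetical thresholded matrices from consecutive P frames.
frame_a = [[0, 0, 0, 0],
           [0, 9, 10, 0],
           [0, 0, 12, 0],
           [7, 0, 0, 0]]
frame_b = [[5, 5, 5, 5],
           [5, 5, 5, 5],
           [0, 0, 0, 0],
           [0, 0, 0, 6]]
histogram = area_histogram([frame_a, frame_b])  # [0.5, 0.25, 0.25]
```

Expressing areas as fractions of the frame size is one way to make the bins comparable across frame sizes. Many small regions then populate the first bin, while a single large region, as camera motion might produce, lands in the last.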
With a convolution-like similarity measure, the descriptor of the invention has better precision-recall performance than the spatial activity descriptor in the current MPEG-7 experimental model.
It is also possible to capture the effects of camera motion and non-camera motion in distinct uncorrelated parts of the present descriptor. Because the feature extraction takes place in the compressed domain, it can be performed faster than prior art feature extractions from uncompressed video sequences.
In tests on the MPEG-7 test content set, which includes approximately fourteen hours of MPEG-1 encoded video content of different kinds, the present descriptor enables fast and accurate indexing of video. The descriptor is robust to noise and changes in encoding parameters such as frame size, frame rate, encoding bit rate, encoding format, and the like. This is a low-level non-semantic descriptor that gives semantic matches within the same program, and is thus very suitable for applications such as video browsing.
More particularly, the invention provides a method for describing motion activity in a video sequence. A motion activity matrix is determined for the video sequence. A threshold for the motion activity matrix is determined. Connected regions of motion vectors at least equal to the threshold are identified and measured for size. A histogram of the distribution of the sizes of the connected areas is constructed for the entire video sequence. The histogram is normalized to characterize the spatial distribution of the video sequence in a motion activity descriptor.