It should be possible to analyze a compressed video without having to decompress the video. Analyzing a compressed video should take less effort because there is less data. However, in a compressed video, none of the original picture information, such as the intensity of the pixel colors, is available. When a video is compressed according to the MPEG standards, the bit stream is converted to I-, B-, and P-frames. The I-frames store DCT information of original video frames, and the B- and P-frames store motion information and residuals after motion compensation. Although I-frames do not store motion information, static color and texture information can be propagated to the B- and P-frames by inverse motion compensation.
Compressed videos have several important characteristics useful for object analysis. First, motion information is readily available without having to estimate the motion field. Second, DCT information expresses image characteristics. However, the motion vectors are often contaminated by mismatching. In addition, the motion fields in MPEG compressed bit streams are prone to quantization errors. Therefore, motion analysis on an uncompressed video provides better information. However, decompressing a video to recover the original information is very time-consuming, and it might not be possible to do the required analysis of the compressed video in real time if the video first has to be decompressed.
In the prior art, some methods are known for analyzing compressed images. One method segments JPEG documents into specific regions, such as halftones, text, and continuous-tone pictures, see De Queiroz et al., “Optimizing block thresholding segmentation for multilayer compression of compound images,” IEEE Trans. Image Proc. pp. 1461–1471, 2000. They used a segmentation based on an encoding cost map. However, the JPEG standard only deals with single still images. Therefore, it is not possible to segment arbitrary 3D objects from still images.
Wang et al., in “Automatic face region detection in MPEG video sequences,” Electronic Imaging and Multimedia Systems, SPIE Photonics, 1996, described a process for detecting faces in an MPEG compressed video. They used chrominance, i.e., skin-tone statistics, face shape constraints, and energy distribution of the luminance values to detect and locate the faces. Their method is not general, and does not work for videos containing an unknown number of arbitrary objects of unknown color and shape.
Meng et al., in “Tools for compressed-domain video indexing and editing,” SPIE Proceedings, 2670:180–191, 1996, used a block count method to estimate parameters in a three-parameter affine global motion model. Then, they performed global motion compensation to obtain object masks, and used histogram clustering to deal with multiple objects.
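To make the notion of a three-parameter global motion model concrete, the following is a minimal sketch only. It assumes one common three-parameter form (horizontal pan `tx`, vertical pan `ty`, and zoom `s`, with block positions measured from the image center), and it fits the parameters by ordinary least squares rather than by the block count method of Meng et al. The function names and the input layout are hypothetical and chosen for illustration.

```python
def fit_global_motion(blocks):
    """Least-squares fit of u = tx + s*x, v = ty + s*y.

    blocks is a list of (x, y, u, v) tuples: block position (x, y)
    relative to the image center and its motion vector (u, v).
    This layout is an assumption made for this sketch.
    """
    n = len(blocks)
    if n < 2:
        raise ValueError("need at least two blocks")
    sx = sum(b[0] for b in blocks)
    sy = sum(b[1] for b in blocks)
    su = sum(b[2] for b in blocks)
    sv = sum(b[3] for b in blocks)
    sxx_syy = sum(b[0] ** 2 + b[1] ** 2 for b in blocks)
    sxu_syv = sum(b[0] * b[2] + b[1] * b[3] for b in blocks)
    # Normal equations reduced to a single equation in the zoom term s.
    denom = sxx_syy - (sx * sx + sy * sy) / n
    if denom == 0:
        raise ValueError("block positions must not all coincide")
    s = (sxu_syv - (sx * su + sy * sv) / n) / denom
    tx = (su - s * sx) / n
    ty = (sv - s * sy) / n
    return tx, ty, s


def motion_residuals(blocks, tx, ty, s):
    """Magnitude of each block's motion left after removing global motion."""
    return [
        ((u - (tx + s * x)) ** 2 + (v - (ty + s * y)) ** 2) ** 0.5
        for x, y, u, v in blocks
    ]
```

Blocks whose residual remains large after global motion compensation are candidates for independently moving objects, which is the intuition behind deriving object masks from the compensated motion field.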
Sukmarg et al., in “Fast algorithm to detect and segmentation in MPEG compressed domain,” IEEE TENCON, 2000, described an algorithm for detecting and segmenting foreground from background in an MPEG compressed video using motion information. Their segmentation has four main stages: initial segmentation with sequential leader and adaptive k-means clustering, region merging based on spatio-temporal similarities, foreground-background classification, and object detail extraction. Initial segmented regions are generated from 3D spatial information based on DC image and AC energy data. That information is used to cluster the image. After clusters are obtained, adaptive k-means clustering is applied until no more changes occur in each cluster. A temporal similarity is derived based on a Kolmogorov-Smirnov hypothesis test of the distribution of the temporal gradient, see An et al., “A Kolmogorov-Smirnov type statistic with applications to test for normality in time series,” International Statistics Review, 59:287–307, 1991. The hypothesis test measures the overall difference between two cumulative distribution functions. The spatio-temporal similarities are used to construct a similarity graph between regions. The graph is thresholded and clustered. A first clustering stage is used to merge regions that form cycles in the graph. A second clustering stage is used to merge regions based on the number of graph edges connecting a cluster of interest to its neighbor cluster, and the number of edges connecting regions within the cluster of interest itself.
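The Kolmogorov-Smirnov comparison of two cumulative distribution functions can be illustrated with a short sketch. This is the standard two-sample statistic computed on empirical distributions, not necessarily the exact variant used by Sukmarg et al. or An et al.; the function name `ks_statistic` and the use of plain lists of temporal gradient values are assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical cumulative
    distribution functions of the two samples (a value in [0, 1]).
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every value that occurs in either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A value near 0 indicates that the two regions have similar gradient distributions, and a value near 1 indicates that they differ strongly, so a small statistic would argue for a high temporal similarity between the regions.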
An essential step in video segmentation is partitioning the video into sequences of images called scenes or ‘shots’. A shot is a sequence of images that is consistent in terms of content. Typically, a shot comprises a sequence of frames between a camera shutter opening and closing. Shots have been identified as a fundamental unit of a video, and their detection is an initial task in video segmentation. Numerous techniques are known for shot detection.
After shots are identified, it is possible to analyze their content based on motion, color, texture, and other features.
Shot detection can be data driven or model driven. The data driven methods fall into two classes: those based on global features, and those based on spatially registered features of the images. Methods based on global features, e.g., color histograms, are insensitive to motion. However, they can fail to detect scene cuts when the images before and after the shot cut have similar global features. The methods based on spatially registered features are sensitive to moving objects, and can fail when the motion in the image is extremely slow or fast. The model driven approach is based on mathematical models.
Flickner et al., in “Query by image and video content,” IEEE Computer, pages 23–31, 1995, described shot detection with a global representation, such as color histogram and spatially related features. It should be noted that colors are not directly available in the compressed domain.
Corridoni et al., in “Automatic video segmentation through editing analysis,” Lecture Notes in Computer Science, 974:179–190, 1995, described a method based on a relative difference between frames. They expect a shot cut when a difference between two frames is much larger than a threshold difference between frames belonging to the same shot. The threshold value was determined experimentally.
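The thresholding idea can be sketched as follows. This is an illustrative simplification, not the method of Corridoni et al.: each frame is assumed to be reduced to a list of numeric features (for example, block DC values), the difference measure is a plain mean absolute difference, and the names `frame_difference` and `detect_cuts` are hypothetical.

```python
def frame_difference(frame_a, frame_b):
    """Mean absolute difference between two equally sized feature lists."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def detect_cuts(frames, threshold):
    """Return every index i where a cut is declared between
    frames[i - 1] and frames[i], i.e. where the inter-frame
    difference exceeds the threshold."""
    cuts = []
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts
```

For example, with `frames = [[10, 10], [11, 10], [90, 95], [91, 95]]` and a threshold of 20, the successive differences are 0.5, 82.0, and 0.5, so a single cut is reported at index 2. As the text notes, a suitable threshold has to be found experimentally for the material at hand.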
Nagasaka et al., in “Automatic scene-change detection method for video works,” Proc. 40th National Con. Information Processing Society of Japan, 1990, applied a template matching technique and a χ² test to the color histograms of two subsequent frames.
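A chi-square comparison of two histograms can be sketched as below. The symmetric form used here is one common variant and is an assumption; the exact statistic used by Nagasaka et al. is not given in the text. The function name `chi_square_distance` is hypothetical.

```python
def chi_square_distance(hist_a, hist_b):
    """Symmetric chi-square distance between two histograms with the
    same number of bins: sum over bins of (a - b)^2 / (a + b).
    Bins that are empty in both histograms contribute nothing."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total
```

Identical histograms give a distance of 0, and the distance grows as mass moves between bins, so a large value between the histograms of two subsequent frames suggests a scene change.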
Arman et al., in “Image processing on compressed data for large video databases,” ACM Multimedia, pp. 267–272, 1993, described a shot detection technique that operates directly on compressed video using known properties of the coefficients of the DCT.
More recent methods use DCT coefficients and motion vector information for shot detection, see Zhang et al., “Video parsing and browsing using compressed data,” Multimedia Tools and Applications, 1(1):89–111, 1995, neural networks, see Ardizzone et al., “A real-time neural approach to scene cut detection,” Proc. of IS&T/SPIE—Storage and Retrieval for Image and Video Databases IV, 1996, and reduced image sequences, see Yeo et al., “Rapid scene change detection on compressed video,” IEEE Transactions on Circuits and Systems for Video Technology, 5:533–544, 1995.
Although those methods are sufficient for segmenting a video into shots, they are insufficient for segmenting 3D objects from compressed videos.