I. Field of the Invention
The present invention relates to techniques for editing and parsing compressed digital information, and more specifically, to editing and parsing visual information in the compressed domain.
II. Description of the Related Art
With the increasing use of local area, wide area and global networks to spread information, digital video has become an essential component of many new media applications. The inclusion of video in an application often gives the application not only increased functional utility, but also an aesthetic appeal that cannot be obtained by text or audio information alone. However, while digital video greatly increases our ability to share information, it demands special technical support in processing, communication, and storage.
In order to reduce bandwidth requirements to manageable levels, video information is generally transmitted between systems in the digital environment in the form of compressed bitstreams that are in a standard format, e.g., Motion JPEG, MPEG-1, MPEG-2, H.261 or H.263. In these compressed formats, the Discrete Cosine Transform ("DCT") is utilized in order to transform N×N blocks of pixel data, where N typically is set to eight, into the DCT domain where quantization is more readily performed. Run-length encoding and entropy coding (i.e., Huffman coding or arithmetic coding) are applied to the quantized bitstream to produce a compressed bitstream which has a significantly lower bit rate than the original uncompressed source signal. The process is assisted by additional side information, in the form of motion vectors, which are used to construct frame or field-based predictions from neighboring frames or fields by taking into account the inter-frame or inter-field motion that is typically present.
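By way of illustration only, the block transform and quantization steps described above can be sketched as follows. The sketch assumes an orthonormal two-dimensional DCT-II and a single uniform quantizer step; the function names `dct_2d` and `quantize` and the step size of 16 are hypothetical and are not taken from any of the cited standards, which define their own quantization matrices.

```python
import math

N = 8  # block size; the text notes that N is typically eight


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization with one step size (an illustrative assumption)."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


# A flat gray block: all of its energy lands in the DC coefficient.
flat = [[128] * N for _ in range(N)]
coeffs = dct_2d(flat)
levels = quantize(coeffs, step=16)
```

For a uniform block only the DC level is non-zero after quantization, which is why the subsequent run-length and entropy coding stages can represent such a block with very few bits.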
In order to be usable by a receiving system, such coded bitstreams must be both parsed and decoded. For example, in the case of an MPEG-2 encoded bitstream, the bitstream must be parsed into slices and macroblocks before the information contained in the bitstream is usable by an MPEG-2 decoder. Parsed bitstream information may be used directly by an MPEG-2 decoder to reconstruct the original visual information, or may be subjected to further processing.
In the case of compressed digital video, further processing of video information can occur either in the normal, uncompressed domain or in the compressed domain. Indeed, there have been numerous attempts by others in the field to realize useful techniques for indexing and manipulating digital video information in both the uncompressed and compressed domains.
For example, in the article by S. W. Smoliar et al., "Content-Based Video Indexing and Retrieval," IEEE Multimedia, summer 1994, pp. 62-72, a color histogram comparison technique is proposed to detect scene cuts in the spatial (uncompressed) domain. In the article by B. Shahraray, "Scene Change Detection and Content-Based Sampling of Video Sequences," SPIE Conf. Digital Image Compression: Algorithms and Technologies 1995, Vol. 2419, a block-based match and motion estimation algorithm is presented.
For compressed video information, the article by F. Arman et al., "Image Processing on Compressed Data for Large Video Databases," Proceedings of ACM Multimedia '93, June 1993, pp. 267-272, proposes a technique for detecting scene cuts in JPEG compressed images by comparing the DCT coefficients of selected blocks from each frame. Likewise, the article by J. Meng et al., "Scene Change Detection in a MPEG Compressed Video Sequence," IS&T/SPIE Symposium Proceedings, Vol. 2419, February 1995, San Jose, Calif., provides a methodology for the detection of direct scene cuts based on the distribution of motion vectors, and a technique for the location of transitional scene cuts based on DCT DC coefficients. Algorithms disclosed in the article by M. M. Yeung, et al., "Video Browsing using Clustering and Scene Transitions on Compressed Sequences," IS&T/SPIE Symposium Proceedings, February 1995, San Jose, Calif., Vol. 2417, pp. 399-413, enable the browsing of video shots after scene cuts are located. However, the Smoliar et al., Shahraray, and Arman et al. references are limited to scene change detection, and the Meng et al. and Yeung et al. references do not provide any functions for editing compressed video.
Others in the field have attempted to address problems associated with camera operation and moving objects in a video sequence. For example, in the spatial domain, H. S. Sawhney, et al., "Model-Based 2D and 3D Dominant Motion Estimation for Mosaicking and Video Representation," Proc. Fifth Int'l Conf. Computer Vision, Los Alamitos, Calif., 1995, pp. 583-390, proposes to find parameters of an affine matrix and to construct a mosaic image from a sequence of video images. In a similar vein, the work by A. Nagasaka et al., "Automatic Video Indexing and Full-Video Search for Object Appearances," in E. Knuth and L. M. Wegner, editors, Video Database Systems, II, Elsevier Science Publishers B.V., North-Holland, 1992, pp. 113-127, proposes searching for object appearances and using them in a video indexing technique.
In the compressed domain, the detection of certain camera operations, e.g., zoom and pan, based on motion vectors has been proposed in both A. Akutsu et al., "Video Indexing Using Motion Vectors," SPIE Visual Communications and Image Processing 1992, Vol. 1818, pp. 1522-1530, and Y. T. Tse et al., "Global Zoom/Pan Estimation and Compensation For Video Compression," Proceedings of ICASSP 1991, pp. 2725-2728. In these proposed techniques, simple three-parameter models are employed which require two assumptions, i.e., that camera panning is slow and focal length is long. However, such restrictions make the algorithms unsuitable for general video processing.
There have also been attempts to develop techniques aimed specifically at digital video indexing. For example, in the aforementioned Smoliar et al. article, the authors propose using finite state models in order to parse and retrieve specific domain video, such as news video. Likewise, in A. Hampapur, et al., "Feature Based Digital Video Indexing," IFIP2.6 Visual Database Systems, III, Switzerland, March, 95, a feature based video indexing scheme using low level machine derivable indices to map into the set of application specific video indices is presented.
One attempt to enable users to manipulate image and video information was proposed by J. Swartz, et al., "A Resolution Independent Video Language," Proceedings of ACM Multimedia '95, pp. 179-188, as a resolution independent video language (Rivl). However, although Rivl uses group of pictures (GOPs) level direct copying whenever possible for "cut and paste" operations on MPEG video, it does not use operations in the compressed domain at frame and macroblock levels for special effects editing. Instead, most video effects in Rivl are done by decoding each frame into the pixel domain and then applying image library routines.
The techniques proposed by Swartz et al. and others which rely on performing some or all video data manipulation functions in the uncompressed domain do not provide a useful, truly comprehensive technique for indexing and manipulating digital video. As explained in S.-F. Chang, "Compressed-Domain Techniques for Image/Video Indexing and Manipulation," IEEE Intern. Conf. on Image Processing, ICIP 95, Special Session on Digital Image/Video Libraries and Video-on-demand, October 1995, Washington, D.C., the disclosure of which is incorporated by reference herein, the compressed-domain approach offers several powerful benefits.
First, implementation of the same manipulation algorithms in the compressed domain is much cheaper than in the uncompressed domain because the data rate is highly reduced in the compressed domain (e.g., a typical 20:1 to 50:1 compression ratio for MPEG). Second, given that most existing images and videos are stored in compressed form, specific manipulation algorithms can be applied to the compressed streams without full decoding of the compressed images/videos. In addition, because full decoding and re-encoding of video are not necessary, manipulating video in the compressed domain avoids the extra quality degradation inherent in the re-encoding process. Thus, as further explained in the article by the present inventors, J. Meng and S.-F. Chang, "Tools for Compressed-Domain Video Indexing and Editing," SPIE Conference on Storage and Retrieval for Image and Video Database, Vol. 2670, San Jose, Calif., February 1996, the disclosure of which is incorporated by reference herein, for MPEG compressed video editing, speed performance can be improved by more than 60 times and the video quality can be improved by about 3-4 dB if a compressed-domain approach is used rather than a traditional decode-edit-reencode approach.
A truly comprehensive technique for indexing and manipulating digital video must meet two requirements. First, the technique must provide for key content browsing and searching, in order to permit users to efficiently browse through or search for key content of the video without fully decoding and viewing the entire video stream. In this connection, "key content" refers to key frames in video sequences, prominent video objects and their associated visual features (motion, shape, color, and trajectory), or special reconstructed video models for representing video content in a video scene. Second, the technique must allow for video editing directly in the compressed domain to allow users to manipulate a specific object of interest in the video stream without having to fully decode the video. For example, the technique should permit a user to cut and paste any arbitrary segment from an existing video stream to produce a new video stream which conforms to the valid compression format.
Unfortunately, none of the prior art techniques available at present is able to meet these requirements. Thus, the prior art techniques fail to provide users who want to manipulate compressed digital video information with the necessary tools to extract a rich set of visual features associated with visual scenes and individual objects directly from compressed video so as not only to enable content based query searches, but also to allow for integration with domain knowledge for derivation of higher-level semantics.
An object of the present invention is to provide comprehensive techniques for indexing and manipulating digital video in the compressed domain.
Another object of the present invention is to provide techniques for key content browsing and searching of compressed digital video without decoding and viewing the entire video stream.
A further object of the present invention is to provide techniques which allow for video editing directly in the compressed domain.
A still further object of the present invention is to provide tools that permit users who want to manipulate compressed digital video information to extract a rich set of visual features associated with visual scenes and individual objects directly from compressed video.
Yet another object of this invention is to provide an architecture which permits users to manipulate compressed video information over a distributed network, such as the Internet.
In order to meet these and other objects which will become apparent with reference to further disclosure set forth below, the present invention provides a method for detecting moving video objects in a compressed digital bitstream which represents a sequence of fields or frames of video information for one or more previously captured scenes of video. The described method advantageously provides for analyzing a compressed bitstream to locate scene cuts so that at least one sequence of fields or frames of video information which represents a single video scene is determined. The method also provides for estimating one or more operating parameters of a camera which initially captured the video scene, by analyzing a portion of the compressed bitstream which corresponds to the video scene, and for detecting one or more moving video objects represented in the compressed bitstream by applying global motion compensation with the estimated camera operating parameters.
In a preferred process, the compressed bitstream is a bitstream compressed in accordance with the MPEG-1, MPEG-2, H.261, or H.263 video standard. In this preferred embodiment, analyzing can beneficially be accomplished by parsing the compressed bitstream into blocks of video information and associated motion vector information for each field or frame of video information which comprises the determined sequence of fields or frames of video information representative of said single scene, performing inverse motion compensation on each of the parsed blocks of video information to derive discrete cosine transform coefficients for each of the parsed blocks of video information, counting the motion vector information associated with each of the parsed blocks of video information, and determining from the counted motion vector information and discrete cosine transform coefficient information whether one of the scene cuts has occurred.
In an alternative embodiment, analyzing is performed by parsing the compressed bitstream into blocks of video information and associated motion vector information for each field or frame of video information which comprises the determined sequence of fields or frames of video information representative of the scene, and estimating is executed by approximating any zoom and any pan of the camera by determining a multi-parameter transform model applied to the parsed motion vector information. In an especially preferred process, the frame difference due to camera pan and zoom motion may be modeled by a six-parameter affine transform which represents the global motion information representative of the zoom and pan of the camera.
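By way of illustration only, one way to fit a six-parameter affine model to parsed motion vectors is an ordinary least-squares fit, sketched below. The parameterization u = a1 + a2·x + a3·y, v = a4 + a5·x + a6·y, the choice of least squares, and the names `solve3` and `fit_affine` are assumptions made for this sketch; the text above does not prescribe a particular estimator.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][c] * x[c] for c in range(i + 1, 3))) / M[i][i]
    return x


def fit_affine(points, vectors):
    """Least-squares fit of u = a1 + a2*x + a3*y and v = a4 + a5*x + a6*y.

    points:  block centers (x, y); vectors: their motion vectors (u, v).
    Returns [a1, a2, a3, a4, a5, a6].
    """
    A = [[0.0] * 3 for _ in range(3)]   # normal-equation matrix
    bu = [0.0] * 3                      # right-hand side for u
    bv = [0.0] * 3                      # right-hand side for v
    for (x, y), (u, v) in zip(points, vectors):
        row = (1.0, float(x), float(y))
        for i in range(3):
            bu[i] += row[i] * u
            bv[i] += row[i] * v
            for j in range(3):
                A[i][j] += row[i] * row[j]
    return solve3(A, bu) + solve3(A, bv)


# Synthetic field: a pan of (2, -1) combined with a uniform zoom of 0.1.
points = [(x, y) for y in (0, 16, 32) for x in (0, 16, 32, 48)]
vectors = [(2 + 0.1 * x, -1 + 0.1 * y) for x, y in points]
params = fit_affine(points, vectors)
```

In this parameterization the constant terms a1 and a4 capture pan, while the equal diagonal terms a2 and a6 capture zoom. A practical estimator would likely need to reject motion vectors belonging to moving objects before fitting, which this sketch does not attempt.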
The detecting step advantageously provides for computing local object motion for one or more moving video objects based on the global motion information and on one or more motion vectors which correspond to the one or more moving video objects. In addition, thresholding and morphological operations are preferably applied to the determined local object motion values to eliminate any erroneously sensed moving objects. Further, border points of the detected moving objects are determined to generate a bounding box for the detected moving object.
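By way of illustration only, the detecting step described above can be sketched on a grid of macroblock motion vectors. The sketch assumes that the global motion at each block is supplied by a caller-provided function, uses the Euclidean magnitude of the residual as the local motion value, and substitutes a crude "remove isolated blocks" pass for the morphological operations; the name `detect_object`, the threshold value, and the 4-neighbor rule are all hypothetical.

```python
import math


def detect_object(mvs, global_mv, thresh):
    """Return a bounding box (top, left, bottom, right) in block units, or None.

    mvs:        grid (list of rows) of per-block motion vectors (u, v)
    global_mv:  function (row, col) -> (u, v) predicted by the camera model
    thresh:     residual magnitude above which a block counts as moving
    """
    rows, cols = len(mvs), len(mvs[0])

    # Local object motion = block motion minus global (camera) motion.
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gu, gv = global_mv(r, c)
            u, v = mvs[r][c]
            mask[r][c] = math.hypot(u - gu, v - gv) > thresh

    # Stand-in for morphological cleanup: drop blocks with no flagged 4-neighbor.
    def has_neighbor(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]:
                return True
        return False

    hits = [(r, c) for r in range(rows) for c in range(cols)
            if mask[r][c] and has_neighbor(r, c)]
    if not hits:
        return None

    # Border points of the surviving blocks give the bounding box.
    rs = [r for r, _ in hits]
    cs = [c for _, c in hits]
    return (min(rs), min(cs), max(rs), max(cs))


# 4 x 5 grid under a uniform pan of (1, 0), one 2 x 2 object, one noisy block.
grid = [[(1, 0)] * 5 for _ in range(4)]
for r, c in ((1, 1), (1, 2), (2, 1), (2, 2)):
    grid[r][c] = (5, 3)
grid[0][4] = (4, 0)  # isolated outlier, removed by the cleanup pass
box = detect_object(grid, lambda r, c: (1, 0), thresh=2.0)
```

The isolated outlier at the top-right corner exceeds the threshold but has no flagged neighbor, so it does not widen the bounding box.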
The present invention also provides for an apparatus for detecting moving video objects in a compressed digital bitstream which represents a sequence of fields or frames of video information for one or more previously captured scenes of video. Usefully, the apparatus includes means for analyzing the compressed bitstream to locate scene cuts therein and to determine at least one sequence of fields or frames of video information which represents a single video scene, means for estimating one or more operating parameters for a camera which initially viewed the video scene by analyzing a portion of the compressed bitstream which corresponds to the video scene, and means for detecting one or more moving video objects represented in the compressed bitstream by applying global motion compensation with the estimated operating parameters.
A different aspect of the present invention provides techniques for dissolving an incoming scene of video information which comprises a sequence of fields or frames of compressed video information to an outgoing scene of video information which comprises a sequence of fields or frames of compressed video information. This technique advantageously provides for applying DCT domain motion compensation to obtain DCT coefficients for all blocks of video information which make up a last frame of the outgoing video scene and the first frame of the incoming video scene, and for creating a frame in the dissolve region from the DCT coefficients of the last outgoing frame and the first incoming frame.
In an especially preferred arrangement, an initial value for a weighting function is selected prior to the creation of a first frame in the dissolve region and is used in the creation of the first frame in the dissolve region. The weighting value is then incremented, and a second dissolve frame is generated from the DCT coefficients.
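By way of illustration only, the weighted creation of dissolve frames can be sketched as a linear blend of DCT coefficients, which is possible because the DCT is a linear transform. The linear ramp used for the weighting value, the flat coefficient lists, and the name `dissolve_frames` are assumptions of this sketch; the text above specifies only that a weighting value is selected and then incremented.

```python
def dissolve_frames(last_out, first_in, n_frames):
    """Blend two frames' DCT coefficients into n_frames dissolve frames.

    last_out:  DCT coefficients of the last frame of the outgoing scene
    first_in:  DCT coefficients of the first frame of the incoming scene
    Both are flat lists of equal length (an illustrative simplification).
    """
    frames = []
    for k in range(n_frames):
        # Weighting value, incremented once per dissolve frame (assumed linear).
        alpha = (k + 1) / (n_frames + 1)
        frames.append([(1 - alpha) * a + alpha * b
                       for a, b in zip(last_out, first_in)])
    return frames


# Two coefficients per frame, fading from the outgoing frame toward zero.
blend = dissolve_frames([100.0, 8.0], [0.0, 0.0], n_frames=3)
```

With three dissolve frames the weighting value takes the values 0.25, 0.5 and 0.75, so the outgoing frame's contribution falls steadily while the incoming frame's contribution rises.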
In yet another aspect of the present invention, a technique for masking a compressed frame of digital video information is provided. The technique first determines whether the frame to be masked is intra-coded, predictive-coded or bi-directionally predictive-coded. If the frame is intra-coded, the technique provides for extracting DCT coefficients for all blocks within the frame, examining a block n to determine where in the frame the block is located, setting DCT coefficients for the block to zero if the block is outside the mask region, and applying a DCT cropping algorithm to the DCT coefficients if the block is on the boundary of the mask region.
If the frame is predictive-coded or bi-directionally predictive-coded, the technique provides for examining motion vectors associated with block n to determine whether they point to blocks outside or on the mask region, and reencoding the block if a motion vector points to blocks outside, or on the boundary of, the mask region.
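By way of illustration only, the block classification used for the intra-coded case described above can be sketched as follows. The sketch assumes a rectangular mask region given in pixel coordinates and 8×8 blocks; it zeroes blocks outside the mask and merely reports boundary blocks, since the DCT cropping algorithm itself is not detailed in this section. The names `classify_block` and `mask_intra_blocks` are hypothetical.

```python
def classify_block(bx, by, mask, size=8):
    """Classify block (bx, by) against mask = (left, top, right, bottom).

    Coordinates are in pixels; right and bottom are exclusive.
    Returns "outside", "inside", or "boundary".
    """
    left, top, right, bottom = mask
    x0, y0 = bx * size, by * size
    x1, y1 = x0 + size, y0 + size
    if x1 <= left or x0 >= right or y1 <= top or y0 >= bottom:
        return "outside"
    if x0 >= left and x1 <= right and y0 >= top and y1 <= bottom:
        return "inside"
    return "boundary"


def mask_intra_blocks(blocks, mask, size=8):
    """Zero the DCT coefficients of blocks outside the mask region.

    blocks: dict mapping (bx, by) -> list of DCT coefficients.
    Returns (masked_blocks, boundary_blocks); boundary blocks are left
    unchanged here and would be passed to a DCT cropping step.
    """
    masked, boundary = {}, []
    for (bx, by), coeffs in blocks.items():
        kind = classify_block(bx, by, mask, size)
        if kind == "outside":
            masked[(bx, by)] = [0] * len(coeffs)
        else:
            masked[(bx, by)] = list(coeffs)
            if kind == "boundary":
                boundary.append((bx, by))
    return masked, boundary


# One row of blocks against a mask covering pixels x in [8, 28), y in [8, 24).
region = (8, 8, 28, 24)
blocks = {(bx, 1): [50, 3, -2, 1] for bx in range(5)}
masked, boundary = mask_intra_blocks(blocks, region)
```

In this example blocks 0 and 4 of the row fall outside the region and are zeroed, blocks 1 and 2 lie fully inside it, and block 3 straddles the right edge and is reported for cropping.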
In still another aspect of the present invention, a technique for generating a frozen frame of video information from a sequence of frames of compressed video information is provided. The technique attractively provides for selecting a frame of compressed video information to be frozen, determining whether the frame to be frozen is intra-coded, predictive-coded or bi-directionally predictive-coded, and if the frame is not intra-coded, converting it to become intra-coded, creating duplicate predictive-coded frames, and arranging the intra-coded frame and the duplicate predictive-coded frames into a sequence of compressed frames of video information.
In yet a further aspect of the present invention, a system for editing compressed video information over a distributed network is provided. The system includes a client computer, a network link for permitting said client computer to search for and locate compressed video information on said distributed network, and tools for editing a compressed bitstream of video information over the distributed network.