This invention relates generally to estimating rate-distortion, and more particularly to estimating the rate-distortion characteristics of binary shape data in a video sequence.
Recently, a number of standards have been developed for communicating visual information. For digital images, the best known standard is JPEG, see Pennebaker et al., "JPEG Still Image Compression Standard," Van Nostrand Reinhold, 1993. For video sequences, the most widely used standards include MPEG-1 (for storage and retrieval of moving pictures), MPEG-2 (for digital television) and H.263, see ISO/IEC JTC1 CD 11172, MPEG, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s - Part 2: Coding of Moving Pictures Information," 1991; LeGall, "MPEG: A Video Compression Standard for Multimedia Applications," Communications of the ACM, Vol. 34, No. 4, pp. 46-58, 1991; ISO/IEC DIS 13818-2, MPEG-2, "Information Technology - Generic Coding of Moving Pictures and Associated Audio Information - Part 2: Video," 1994; ITU-T SG XV, DRAFT H.263, "Video Coding for Low Bitrate Communication," 1996; and ITU-T SG XVI, DRAFT13 H.263+Q15-A-60 rev.0, "Video Coding for Low Bitrate Communication," 1997.
These standards are relatively low-level specifications that primarily deal with spatial compression in the case of images, and spatial and temporal compression for video sequences. As a common feature, these standards perform compression on a per frame basis. With these standards, one can achieve high compression ratios for a wide range of applications.
Newer video coding standards, such as MPEG-4 (for multimedia applications), see "Information Technology - Generic coding of audio/visual objects," ISO/IEC FDIS 14496-2 (MPEG4 Visual), Nov. 1998, allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOP). The objects can be visual, audio, natural, synthetic, primitive, compound or combinations thereof.
This emerging standard is intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. For example, one might want to "cut-and-paste" a moving figure or object from one video to another. In this type of application, it is assumed that the objects in the multimedia content have been identified through some type of segmentation process, see for example, U.S. patent application Ser. No. 09/326,750 "Method for Ordering Image Spaces to Search for Object Surfaces" filed on Jun. 4, 1999 by Lin et al.
The emergence of the MPEG-4 standard has provoked a great deal of interest in object-based encoding methodologies. One of the key requirements for object-based encoding is an efficient and flexible means for coding the shape of objects. The MPEG-4 standard has adopted a context-based arithmetic encoding (CAE) process for this purpose. For compatibility with texture coding, this process has been modified to operate at the macroblock level. A macroblock is a 16×16 group of pixels in an image or frame.
For the coding of texture, a variety of models exist. These models provide a relation between the rate and distortion that can be achieved, see for example, Chiang et al. "A new rate control scheme using quadratic rate distortion modeling," IEEE Trans. Circuits and Systems for Video Technology, February 1997, and Hang et al. "Source model for transform video coder and its application - Part I: Fundamental theory," IEEE Trans. Circuits and Systems for Video Technology, April 1997.
These models are most useful for rate control and have been successfully applied to frame-based video coding. Given some bit budget for a frame, one can find a quantizer value that meets a specified constraint on the rate. Additionally, such models can be used to analyze the source or sources to be encoded in an effort to optimize coding in a computationally efficient way. In the case of shape coding, however, no such models exist.
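As an illustration of how a texture model can turn a bit budget into a quantizer value, the following minimal sketch assumes a quadratic rate-quantizer model of the form R = x1*S/Q + x2*S/Q^2, in the spirit of the quadratic modeling cited above. The function name, the complexity measure S, the non-negative model parameters x1 and x2, and the quantizer range of 1 to 31 are all assumptions made for this example; they are not taken from the text.

```python
import math

def quantizer_for_budget(bit_budget, complexity, x1, x2, q_min=1.0, q_max=31.0):
    """Solve R = x1*S/Q + x2*S/Q**2 for Q, with R = bit_budget and S = complexity.

    Assumes x1 >= 0 and x2 >= 0 (hypothetical model parameters).
    """
    if bit_budget <= 0 or complexity <= 0:
        return q_max  # no bits to spend, or nothing to code: coarsest quantizer
    a = x2 * complexity
    b = x1 * complexity
    if a == 0:
        # The model degenerates to first order: R = x1*S/Q
        q = b / bit_budget
    else:
        # Quadratic in u = 1/Q: a*u**2 + b*u - R = 0; keep the positive root
        u = (-b + math.sqrt(b * b + 4.0 * a * bit_budget)) / (2.0 * a)
        q = 1.0 / u
    # Clamp to the assumed legal quantizer range
    return min(max(q, q_min), q_max)

# Usage: a budget of 187.5 bits with S = 10, x1 = 100, x2 = 400
q = quantizer_for_budget(187.5, 10.0, 100.0, 400.0)
```

In practice the parameters x1 and x2 would be fitted from previously coded frames; here they are fixed by hand only to make the arithmetic checkable.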
For shape, the relationship between the rate and distortion is very different from that for texture. The difference arises from the techniques used to code each type of data. In the MPEG standards, texture is coded by first partitioning the data into disjoint macroblocks. The data in these macroblocks are decorrelated using the well-known Discrete Cosine Transform (DCT), which has the property of mapping the signal energy into a small number of coefficients. In this frequency domain, loss may be introduced by quantizing the DCT coefficients. In this process, some high-frequency coefficients may become zero. At this point, the 2D macroblock of quantized DCT coefficients is organized into a 1D vector using a zigzag scanning pattern. The run-lengths of these coefficients are then entropy coded using a Huffman look-up table. In this way, long zero run-lengths can be efficiently encoded. Signal variance and the quantizer value play a major role in the final energy of the DCT coefficients. Consequently, variance-like measures have been widely used as the observed data or input for rate-distortion (R-D) or rate-quantizer models.
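The quantize, zigzag-scan, and run-length steps described above can be sketched as follows. This is a simplified illustration, not the procedure of any particular standard: the uniform quantizer that truncates toward zero, the 4×4 block size in the usage example, and all function names are assumptions, and the final Huffman table lookup is omitted.

```python
def zigzag_order(n):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Cells on one anti-diagonal share r + c; the scan direction alternates
    # from one diagonal to the next.
    return sorted(
        cells,
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )

def quantize(block, q):
    """Uniform quantization: divide by the step q and truncate toward zero."""
    return [[int(v / q) for v in row] for row in block]

def run_level_pairs(block):
    """Scan a quantized block in zigzag order and emit (zero_run, level) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs  # trailing zeros would be signaled by an end-of-block symbol

# Usage: hypothetical DCT coefficients for a 4x4 block, quantizer step 8
coeffs = [[80, 24, 4, 0],
          [-16, 9, 0, 0],
          [5, 0, 0, 0],
          [0, 0, 0, 0]]
pairs = run_level_pairs(quantize(coeffs, 8))
```

A larger quantizer step zeroes more coefficients, which shortens the list of pairs and lengthens the zero runs; this is the mechanism by which the quantizer value controls the texture rate.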
In the MPEG-4 standard, the shape data are also partitioned into disjoint macroblocks. As with texture, the macroblocks can be encoded using several modes. For simplicity, only the intra mode is described. In this mode, three different types of blocks are considered: transparent, opaque, and border blocks. Transparent and opaque blocks are signaled as a macroblock type. For the border blocks, a template of ten pixels is used to define the causal context for predicting the shape value of a current pixel. FIG. 1 shows an intra-context template of ten pixels (c0, . . . , c9) 100, and a current pixel x 101. Note the specific arrangement of the ten neighborhood pixels in rows of three, five, and two pixels, and the location of the current pixel with respect to the template.
A context C for the current pixel is determined according to:

$$C = \sum_{k} c_k \cdot 2^{k}$$
Typically, the context C ranges from 0 to 1023. The context is used to index a probability table to obtain a sequence of probabilities that are used to drive an arithmetic encoder.
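The context computation can be sketched as below. Because FIG. 1 is not reproduced here, the exact (dx, dy) offsets assigned to c0 through c9 are an assumption chosen to match the described rows of three, five, and two pixels; the function name and the choice to treat pixels outside the mask as zero are also assumptions of this example.

```python
# Assumed (dx, dy) offsets of c0..c9 relative to the current pixel x:
#        c9 c8 c7
#     c6 c5 c4 c3 c2
#     c1 c0 x
INTRA_TEMPLATE = [
    (-1, 0), (-2, 0),                               # c0, c1: current row
    (2, -1), (1, -1), (0, -1), (-1, -1), (-2, -1),  # c2..c6: row above
    (1, -2), (0, -2), (-1, -2),                     # c7..c9: two rows above
]

def intra_context(mask, x, y):
    """Return C = sum_k c_k * 2**k for the pixel at column x, row y.

    mask is a list of rows of 0/1 values. Template pixels that fall
    outside the mask are treated as 0 (an assumption of this sketch).
    """
    height, width = len(mask), len(mask[0])
    context = 0
    for k, (dx, dy) in enumerate(INTRA_TEMPLATE):
        px, py = x + dx, y + dy
        inside = 0 <= px < width and 0 <= py < height
        c_k = mask[py][px] if inside else 0
        context += c_k << k  # c_k * 2**k
    return context

# Usage: in an all-opaque 5x5 mask, an interior pixel sees ten ones
opaque = [[1] * 5 for _ in range(5)]
c = intra_context(opaque, 2, 2)
```

With ten binary template pixels the result always lies between 0 and 1023, so it can index a table of 1024 probabilities directly.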
When shape macroblocks are coded at full-resolution (16×16 pixels), this algorithm is able to achieve a lossless representation. To reduce the bit-rate, distortion can be introduced through successive down-sampling of the original macroblock by a factor of two, four, or more. In this case, the subsampling factor is transmitted along with the subsampled data, and at the decoder end, the data are upsampled back to the full-resolution.
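To make the effect of down-sampling on distortion concrete, the sketch below down-samples a binary macroblock, up-samples it again, and counts the pixels that changed. The majority-vote down-sampling, the pixel-replication up-sampling, and the use of a mismatched-pixel count as the distortion measure are simplifying assumptions for illustration; the filters defined by the standard differ.

```python
def downsample(block, factor):
    """Majority vote over each factor x factor cell (ties resolve to 1)."""
    n = len(block) // factor
    out = []
    for by in range(n):
        row = []
        for bx in range(n):
            ones = sum(block[by * factor + j][bx * factor + i]
                       for j in range(factor) for i in range(factor))
            row.append(1 if 2 * ones >= factor * factor else 0)
        out.append(row)
    return out

def upsample(block, factor):
    """Pixel replication back to the original resolution."""
    size = len(block) * factor
    return [[block[y // factor][x // factor] for x in range(size)]
            for y in range(size)]

def shape_distortion(original, factor):
    """Count pixels that differ after a down-sample / up-sample round trip."""
    recon = upsample(downsample(original, factor), factor)
    return sum(o != r
               for orow, rrow in zip(original, recon)
               for o, r in zip(orow, rrow))

# Usage: a 16x16 macroblock whose nine leftmost columns are opaque
block = [[1 if x < 9 else 0 for x in range(16)] for _ in range(16)]
d2 = shape_distortion(block, 2)
d4 = shape_distortion(block, 4)
```

A boundary that is aligned with the down-sampling grid survives the round trip unchanged, whereas the misaligned boundary in the usage example produces errors along its whole length.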
There are two major differences between texture coding and shape coding. The first difference is the entropy coding process. Texture coding uses a Huffman table to assign variable length codes to quantized DCT coefficient run-lengths, while shape coding computes a context for every pixel and associates a probability that the pixel is either zero or one. The second difference is in the way that distortion is introduced. Texture coding quantizes the DCT-domain coefficients, while shape coding down-samples the data.
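The link between per-pixel probabilities and the shape bit-rate can be illustrated with the ideal code length of an arithmetic coder, which spends about -log2(p) bits on a symbol of probability p. In the sketch below, the function name and the tiny probability table are hypothetical; a real coder would use the 1024-entry table of the standard and would add a small overhead on top of this ideal figure.

```python
import math

def ideal_shape_bits(symbols, prob_zero):
    """Sum the ideal code length, -log2(p), over a sequence of shape pixels.

    symbols   : iterable of (context, bit) pairs
    prob_zero : mapping from context to P(bit == 0 | context),
                assumed to lie strictly between 0 and 1
    """
    bits = 0.0
    for context, bit in symbols:
        p0 = prob_zero[context]
        p = p0 if bit == 0 else 1.0 - p0
        bits += -math.log2(p)
    return bits

# Usage with a hypothetical two-entry probability table
table = {0: 0.5, 5: 0.25}
total = ideal_shape_bits([(0, 0), (0, 1), (5, 0)], table)
```

Pixels that the context predicts well cost a fraction of a bit, which is why the rate of a shape macroblock depends on its boundary structure and not on a quantizer value.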
Because of these differences, new methods are required to estimate the rate-distortion characteristics of object shape.
The invention provides a method that estimates rate and distortion characteristics of a video object. First and second object shape features are respectively extracted at a first and second resolution of the video object. First and second rate distortion characteristics of the video object are respectively determined from the extracted first and second object shape features according to first and second modeling parameters. The extracted object shape features can be discrete, such as states of binary shape patterns of the video object, or the object shape features can be continuous such as a set of statistical moments representing a probability density function of the video object.
In one aspect of the invention, the video object is segmented into macroblocks, and the extracting and determining steps are performed for each of the macroblocks. The second resolution can be a downsampling of the first resolution. Alternatively, the second object shape features can be predicted from the first object shape features without performing the downsampling. Typically, the modeling parameters are acquired from a set of training video objects. The invention enables object based video encoders and transcoders, and optimal video object segmentation.
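The overall flow of extracting features at two resolutions and mapping them to rate estimates can be sketched as follows. The summary does not fix the feature definition or the form of the model, so everything specific here is an assumption made purely for illustration: the features are counts of 2×2 binary pattern states, the model is linear, and the weights are invented placeholders standing in for parameters that would be acquired from training data.

```python
def pattern_histogram(block):
    """Count the 16 possible 2x2 binary pattern states over all windows."""
    hist = [0] * 16
    for y in range(len(block) - 1):
        for x in range(len(block[0]) - 1):
            state = (block[y][x]
                     | block[y][x + 1] << 1
                     | block[y + 1][x] << 2
                     | block[y + 1][x + 1] << 3)
            hist[state] += 1
    return hist

def halve(block):
    """2:1 down-sampling by majority vote per 2x2 cell (ties resolve to 1)."""
    n = len(block) // 2
    return [[1 if (block[2 * y][2 * x] + block[2 * y][2 * x + 1]
                   + block[2 * y + 1][2 * x]
                   + block[2 * y + 1][2 * x + 1]) >= 2 else 0
             for x in range(n)] for y in range(n)]

def estimate_rate(features, weights, bias=0.0):
    """Hypothetical linear model: rate = weights . features + bias."""
    return sum(w * f for w, f in zip(weights, features)) + bias

# Placeholder parameters: uniform patterns (states 0 and 15) cost nothing,
# every boundary pattern costs 0.5 units. Real values would come from training.
weights = [0.0] + [0.5] * 14 + [0.0]

macroblock = [[1 if x < 8 else 0 for x in range(16)] for _ in range(16)]
first_features = pattern_histogram(macroblock)          # first resolution
second_features = pattern_histogram(halve(macroblock))  # second resolution
rate_first = estimate_rate(first_features, weights)
rate_second = estimate_rate(second_features, weights)
```

In this toy example the estimate at the second resolution is lower because fewer boundary patterns remain after down-sampling; a separate parameter set per resolution could be passed to `estimate_rate` in the same way.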