The present invention provides efficient motion estimation for an arbitrarily-shaped object for use in an object-based digital video coding system.
Object manipulation is one of the desirable features for multimedia applications. This functionality is available in the developing digital video compression standards, such as H.263+ and MPEG-4. For H.263+, refer to ITU-T Study Group 16, Contribution 999, Draft Text of Recommendation H.263 Version 2 ("H.263+") for Decision, Sep. 1997, incorporated herein by reference. For MPEG-4, refer to ISO/IEC 14496-2 Committee Draft (MPEG-4), "Information Technology - Coding of audio-visual objects: visual," JTC1/SC29/WG11 N2202, March 1998, incorporated herein by reference.
MPEG-4 uses a shape coding tool to process an arbitrarily shaped object known as a Video Object Plane (VOP). With shape coding, shape information, referred to as alpha planes, is obtained. Binary alpha planes are encoded by modified Content-based Arithmetic Encoding (CAE), while grey-scale alpha planes are encoded by a motion compensated Discrete Cosine Transform (DCT), similar to texture coding. An alpha plane is bounded by a rectangle that includes the shape of the VOP (Intelligent VOP formation). The bounding rectangle of the VOP is extended on the right-bottom side to multiples of 16×16 blocks, and the extended alpha samples are set to zero. The extended alpha plane is partitioned into blocks of 16×16 samples (e.g., alpha blocks) and the encoding/decoding process is performed on each alpha block.
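The extension and partitioning of the alpha plane described above can be illustrated with a minimal sketch. It assumes a binary alpha plane held as a list of rows of 0/1 samples; the function names `extend_alpha_plane` and `alpha_blocks` are hypothetical and are not taken from the MPEG-4 draft.

```python
def extend_alpha_plane(alpha, block=16):
    """Pad a binary alpha plane on the right and bottom with zeros so that
    both dimensions become multiples of `block` (illustrative helper)."""
    height = len(alpha)
    width = len(alpha[0]) if height else 0
    padded_h = -(-height // block) * block  # round up to a multiple of block
    padded_w = -(-width // block) * block
    # Extend each existing row to the right with zero-valued samples.
    extended = [list(row) + [0] * (padded_w - width) for row in alpha]
    # Append all-zero rows at the bottom.
    extended += [[0] * padded_w for _ in range(padded_h - height)]
    return extended


def alpha_blocks(extended, block=16):
    """Yield (top, left, samples) for each block x block alpha block."""
    for top in range(0, len(extended), block):
        for left in range(0, len(extended[0]), block):
            samples = [row[left:left + block]
                       for row in extended[top:top + block]]
            yield top, left, samples
```

For example, a 20-sample-wide, 10-sample-high bounding rectangle would be extended to 32×16 samples and partitioned into two alpha blocks.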
Moreover, compression of digital video objects is important in view of the bandwidth-limited channels over which such data is communicated. In particular, motion compensation is the most popular tool to reduce the temporal redundancy in video compression.
Motion estimation and motion compensation (ME/MC) generally involve matching a block of a current video frame (e.g., a current block) with a block in a search area of a reference frame (e.g., a predicted block or reference block). For predictive (P) coded images, the reference block is in a previous frame. For bi-directionally predicted (B) coded images, predicted blocks in previous and subsequent frames may be used. The displacement of the predicted block relative to the current block is the motion vector (MV), which has horizontal (x) and vertical (y) components. Positive values of the MV components indicate that the predicted block is to the right of, and below, the current block.
A motion compensated difference block is formed by subtracting the pixel values of the predicted block from those of the current block point by point. Texture coding is then performed on the difference block. The coded MV and the coded texture information of the difference block are transmitted to the decoder. The decoder can then reconstruct an approximated current block by adding the quantized difference block to the predicted block according to the MV.
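As a minimal sketch of the difference-block formation and decoder-side reconstruction just described, the following assumes frames held as lists of rows of integer pixel values and a motion vector whose positive components point right and down. The uniform quantizer `quantize` is a hypothetical stand-in for texture coding, which in practice would involve a transform such as the DCT; all function names are illustrative.

```python
def predicted_block(reference, x, y, mv, size):
    """Fetch the size x size block displaced by mv = (dx, dy) from (x, y)."""
    dx, dy = mv
    return [row[x + dx:x + dx + size]
            for row in reference[y + dy:y + dy + size]]


def difference_block(current, predicted):
    """Subtract the predicted block from the current block point by point."""
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(current, predicted)]


def quantize(residual, step):
    """Hypothetical uniform quantizer standing in for texture coding."""
    return [[round(v / step) * step for v in row] for row in residual]


def reconstruct(predicted, quantized_residual):
    """Decoder side: add the quantized difference block to the prediction."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, quantized_residual)]
```

With a quantizer step of 1 the reconstruction is exact; with a larger step the decoder obtains only an approximation of the current block, as noted above.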
The efficiency of motion compensation depends greatly on the quality of its encoding counterpart, motion prediction. Exhaustive search motion estimation is the most reliable method to predict the motion vector. However, this method is computationally very complex.
Many sub-optimum solutions have been proposed to alleviate the complexity of motion estimation. Most of them sacrifice the search quality to reduce the number of searches.
Full search motion estimation performs a search (also called block matching) for the block inside the search area in the reference picture that best describes the current block in the current picture. The displacement between the best-matched block and the current block, indicated by a motion vector, is later used in the motion compensation process to recover the current block. In other words, a block, B(z,t), at spatial position z and time t will be replaced by another block, B′(z′,t′), at position z′ in the reference picture at time t′, and with a time difference, ι (= t − t′). The motion vector MV(z,t) in this case is the displacement between z′ and z. Hence,
B(z,t) = B(z′,t′) = B(z − MV(z,t), t − ι);
and
MV(z,t) = min(D(B(z,t), B(z − MV(z,t), t − ι))), ∀ z′ ∈ search area around z.
Moreover, D(B(z,t), B(z′,t′)) is the prediction error, where "D" is a "delta". The error can be first order, e.g., an absolute difference, second order, e.g., a square difference, or any higher order. However, the complexity of the calculations increases with higher orders of the prediction error.
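The full search described above can be sketched as follows, using a first-order prediction error (sum of absolute differences). This is an illustrative sketch only: the function names are hypothetical, frames are assumed to be lists of rows of integers, the motion vector is expressed as the displacement (dx, dy) of the candidate block relative to the current block, and candidates falling outside the reference picture are simply skipped.

```python
def sad(current, reference, x, y, dx, dy, m):
    """First-order prediction error: sum of absolute pixel differences between
    the m x m current block at (x, y) and the candidate displaced by (dx, dy)."""
    total = 0
    for j in range(m):
        for i in range(m):
            total += abs(current[y + j][x + i]
                         - reference[y + dy + j][x + dx + i])
    return total


def full_search(current, reference, x, y, m, n):
    """Test every displacement within +/- n pixels and return the pair
    (best motion vector, smallest prediction error)."""
    height, width = len(reference), len(reference[0])
    best_mv, best_err = None, None
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            rx, ry = x + dx, y + dy
            # Assumption: candidates outside the reference picture are skipped.
            if rx < 0 or ry < 0 or rx + m > width or ry + m > height:
                continue
            err = sad(current, reference, x, y, dx, dy, m)
            if best_err is None or err < best_err:
                best_mv, best_err = (dx, dy), err
    return best_mv, best_err
```

Replacing the absolute difference with a squared difference would give a second-order error at a higher computational cost per pixel.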
Motion estimation is a computationally intensive process. The contribution from all pixels in the block has to be considered in the prediction error calculation. Furthermore, all possible blocks in the search area must also be matched in order to obtain a reliable motion vector. In general, a total of (2n+1)²m² pixel comparisons is involved in motion estimation of an m×m block with a search area of ±n pixels. For example, 278,784 pixel comparisons, or 1,089 block searches, are required for m = n = 16.
Moreover, motion estimation for an arbitrarily-shaped video object presents still further challenges.
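The figures quoted above follow directly from the formula, as the short calculation below shows. The function name `full_search_cost` is illustrative only.

```python
def full_search_cost(m, n):
    """Return (block searches, pixel comparisons) for a full search of an
    m x m block over a search area of +/- n pixels."""
    block_searches = (2 * n + 1) ** 2          # candidate positions
    pixel_comparisons = block_searches * m * m  # m*m pixels per candidate
    return block_searches, pixel_comparisons


print(full_search_cost(16, 16))  # (1089, 278784)
```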
There are various simpler alternatives to full search block matching in the literature. Most of them use a coarse-to-fine search strategy, e.g., hierarchical motion estimation, a three-step search, a logarithmic search, and so forth. These fast search algorithms subsample the reference picture into various scales and perform a full search starting from the coarsest scale. The subsequent searches, which occur at the finer scales, are limited to the pixels surrounding the previous motion vector. The same process is repeated until the final result at the actual scale is obtained. However, these modifications are sub-optimum since they may choose only a locally optimal solution, and they generally use a full search method as their benchmark.
Accordingly, it would be desirable to provide an improved, more efficient shape and texture motion estimation system for digital video objects. The system should exploit the irregular boundary of the object to reduce the number of searches. The system should also be general enough to be used with any fast block matching alternative. The system should be applicable to arbitrarily-shaped video coding algorithms, such as MPEG-4.
The system should provide a shaped search area that follows a shape of the video object being coded.
The system should be useable in an MPEG-4 encoder or other object-based encoder.
The present invention provides a system having the above and other advantages.
The invention relates to an efficient motion estimation technique for an arbitrarily-shaped video object that reduces the number of searches for motion estimation for shape coding and texture coding. The invention is particularly suitable for use in an MPEG-4 encoder for coding Video Object Planes (VOPs).
Essentially, the invention provides a technique for shaping the search area for motion estimation according to the shape of the video object being coded.
A method for motion estimation coding of an arbitrarily-shaped video object includes the step of: determining whether successive blocks of pixels of at least a portion of a reference video image are outside the video object, overlap the video object, or are inside the video object. Each block, such as an m×m block, has a respective reference pixel and a plurality of associated neighboring pixels.
Respective mask values corresponding to positions of the respective reference pixels in the reference video image are provided according to whether the associated blocks are outside the video object, overlap the video object, or are inside the video object. The respective mask values indicate a search region in the reference video image for motion-estimation coding of the video object that corresponds to a shape of the video object.
The successive blocks are outside the video object when alpha plane values of the pixels in the block indicate an absence of the video object.
The successive blocks overlap the video object when alpha plane values of at least one of the pixels in the block indicate an absence of the video object, and alpha plane values of at least another one of the pixels in the block indicate a presence of the video object.
Moreover, the successive blocks are inside the video object when alpha plane values of the pixels in the block indicate a presence of the video object.
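The three-way determination set out above can be sketched as follows. The sketch assumes a binary alpha plane (a list of rows, where a non-zero sample indicates a presence of the video object), takes the top, left pixel of the block as its reference pixel, and treats samples beyond the edge of the alpha plane as absent; the names used are hypothetical.

```python
OUTSIDE, OVERLAP, INSIDE = "outside", "overlap", "inside"


def classify_block(alpha, x, y, m):
    """Classify the m x m block whose top-left (reference) pixel is at (x, y)."""
    height, width = len(alpha), len(alpha[0])
    present = 0
    for j in range(y, y + m):
        for i in range(x, x + m):
            # Assumption: samples beyond the alpha plane count as absent.
            if 0 <= j < height and 0 <= i < width and alpha[j][i]:
                present += 1
    if present == 0:
        return OUTSIDE   # every pixel indicates an absence of the object
    if present == m * m:
        return INSIDE    # every pixel indicates a presence of the object
    return OVERLAP       # at least one pixel absent and at least one present
```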
The respective mask values indicate a search region for (binary) shape motion estimation coding of the video object when the associated blocks overlap the video object.
More particularly, the respective mask values indicate a search region for shape motion estimation coding of the video object when the associated blocks overlap the video object, but are not outside or inside the video object in the reference video image.
The respective mask values indicate a search region for texture motion estimation coding of the video object when the associated blocks overlap the video object or are inside the video object.
More particularly, the respective mask values indicate a search region for texture motion estimation coding of the video object when the associated blocks overlap the video object or are inside the video object, but are not outside the video object.
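By way of illustration only, the shape and texture search masks described above might be derived as in the following sketch. It assumes a binary alpha plane, a top, left reference pixel for each block, and zero-valued samples beyond the edge of the plane; the function names and the choice to scan every pixel position are assumptions made for the example.

```python
def count_object_pixels(alpha, x, y, m):
    """Count pixels of the m x m block at (x, y) that belong to the object."""
    height, width = len(alpha), len(alpha[0])
    return sum(1
               for j in range(y, min(y + m, height))
               for i in range(x, min(x + m, width))
               if alpha[j][i])


def search_masks(alpha, m):
    """Return (shape_mask, texture_mask), indexed as mask[y][x] by the
    position of each block's reference pixel."""
    height, width = len(alpha), len(alpha[0])
    shape_mask = [[0] * width for _ in range(height)]
    texture_mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            present = count_object_pixels(alpha, x, y, m)
            overlaps = 0 < present < m * m
            inside = present == m * m
            shape_mask[y][x] = int(overlaps)              # overlap only
            texture_mask[y][x] = int(overlaps or inside)  # overlap or inside
    return shape_mask, texture_mask
```

A motion estimator could then restrict its candidate positions to those for which the relevant mask value is 1.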
The video object may comprise at least first and second Video Object Planes (VOPs), where the first VOP is in the reference video image, and the second VOP is in a current video image that uses the search region for motion estimation coding.
The respective reference pixel may be a top, left pixel in each of the successive blocks.
When a common search range is used for both shape and texture motion estimation coding of the video object, the respective mask values may be set for each of the blocks to indicate the search region for texture motion estimation by ORing: (a) the alpha plane values of the respective reference pixels of the blocks which indicate the search region for shape motion estimation with (b) the alpha plane values of the respective reference pixels of the blocks which are inside the video object in the reference video image.
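One possible reading of the ORing step is sketched below. It assumes two per-position binary masks of equal size, one marking reference pixels of blocks in the shape search region and one marking reference pixels of blocks inside the video object; the names `shape_mask`, `inside_mask` and `texture_mask_from_shape` are hypothetical.

```python
def texture_mask_from_shape(shape_mask, inside_mask):
    """OR the shape search mask with the inside-object mask, position by
    position, to obtain the texture search mask."""
    return [[s | i for s, i in zip(shape_row, inside_row)]
            for shape_row, inside_row in zip(shape_mask, inside_mask)]


shape_mask = [[1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]]
inside_mask = [[0, 0, 0],
               [0, 1, 0],
               [0, 0, 0]]
texture_mask = texture_mask_from_shape(shape_mask, inside_mask)
```

In this small example the shape search region is the ring of boundary positions, and the texture search region additionally covers the interior position.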
Various options exist to improve the efficiency of the invention. For example, it is possible to examine alpha plane values of the pixels in a first block, store data corresponding to at least a portion of the alpha plane values which overlap a second block, and retrieve the stored data for use in determining whether the second block is outside, overlapping, or inside the video object. This avoids the need to repetitively determine the alpha plane values of the same pixels.
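One way such storing and retrieving might be realized is sketched below, purely as an assumption for illustration. Horizontally adjacent blocks share m−1 columns, so per-column counts of object pixels are computed once and reused while a window slides along the row. The sketch assumes a binary (0/1) alpha plane and blocks that fit entirely within it; the function names are hypothetical.

```python
def label(present, m):
    """Map a count of object pixels in an m x m block to a classification."""
    if present == 0:
        return "outside"
    if present == m * m:
        return "inside"
    return "overlap"


def classify_row(alpha, y, m):
    """Classify every m x m block whose top-left pixel lies on row y."""
    width = len(alpha[0])
    # Stored data: object-pixel count of each column over rows y .. y+m-1.
    column_counts = [sum(alpha[j][i] for j in range(y, y + m))
                     for i in range(width)]
    labels = []
    window = sum(column_counts[:m])  # count for the first block
    for x in range(width - m + 1):
        if x > 0:
            # Retrieve stored counts: add the entering column and drop the
            # leaving column instead of re-reading the shared pixels.
            window += column_counts[x + m - 1] - column_counts[x - 1]
        labels.append(label(window, m))
    return labels
```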
Additionally, if the video object is known or assumed to have a minimum size, further efficiencies can be achieved by subsampling the pixels of the reference video image according to a minimum size of the video object prior to determining whether the blocks are outside, overlapping, or inside the video object.
Or, without the need for subsampling, it is possible to examine the alpha plane values of only a portion of the pixels in the blocks (such as the outer pixels in a block) according to a minimum size of the video object.
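A minimal sketch of examining only the outer pixels of a block follows. It rests on a strong assumption that is introduced here for the example: neither the object nor the background contains a connected region small enough to fit entirely within the interior of a block, so the outer ring of pixels determines the classification. The function names are hypothetical, and the block is assumed to lie entirely within a binary alpha plane.

```python
def border_pixels(x, y, m):
    """Yield coordinates (i, j) of the outer ring of the m x m block at (x, y)."""
    for i in range(x, x + m):
        yield i, y                   # top row
        if m > 1:
            yield i, y + m - 1       # bottom row
    for j in range(y + 1, y + m - 1):
        yield x, j                   # left column, excluding corners
        if m > 1:
            yield x + m - 1, j       # right column, excluding corners


def classify_by_border(alpha, x, y, m):
    """Classify a block from its outer pixels only (valid only under the
    minimum-size assumption stated in the lead-in)."""
    values = [bool(alpha[j][i]) for i, j in border_pixels(x, y, m)]
    if not any(values):
        return "outside"
    if all(values):
        return "inside"
    return "overlap"
```

For a 16×16 block this examines 60 pixels rather than 256.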
A corresponding apparatus is also presented.