This invention relates to computers, and more particularly to improved methods and arrangements for coding/decoding object data using three-dimensional (3D) shape-adaptive discrete wavelet transform (SA-DWT) techniques.
Technology is transforming the modern economy and is also having a tremendous impact on the standard of living for ordinary people. For example, video conferencing is facilitating communication and is enabling businesses to operate over great distances more efficiently. The Internet is also transforming the way in which both companies and people conduct business. In particular, the Internet has increased communication between people and has placed extraordinary amounts of information at one's fingertips.
Not only is technology transforming the economy, but it is also increasing the standard of living for ordinary people. For example, technology has changed the way in which people are entertained. Computer technology and video technology have enabled much more realistic and advanced video games. They have also improved the technical quality of movies and other video content, and have made them more accessible to people.
Video processing is critical to all of these technologies. Video processing is the handling and manipulation of a video signal in order to achieve certain results including displaying an image on a monitor, compressing the signal for efficient storage or transmission, and manipulation of the image.
Recently, there has been a move away from frame-based coding towards object-based coding of image data. In object-based coding, a typical scene will include a plurality of visual objects that are definable in such a way that their associated image data (e.g., shape, motion and texture information) can be specially processed in a manner that further enhances the compression and/or subsequent rendering processes. Thus, for example, a person, a hand, or an automobile may be individually coded as an object. Note that, as used herein, objects may include any type of video displayable image, such as actual captured images, virtually generated images, text, etc.
Moving Picture Experts Group (MPEG) is the name of a family of standards used for coding audio-visual information (e.g., movies, video, music, etc.) in a digital compressed format. One advantage of MPEG compared to other video and audio coding formats is that MPEG files are much smaller for the same quality. This is because MPEG employs compression techniques to code frames or, as is the case in MPEG-4, to code objects as separate frame layers.
In MPEG there are three types of coded frame layers. The first type is an "I" or intra frame, which is a frame coded as a still image without using any past history. The second type is a "P" or predicted frame, which is predicted from the most recent I frame or P frame. Each macroblock of data in a P frame can either come with a vector and difference discrete cosine transform (DCT) coefficients for a close match in the last I or P frame, or it can be "intra" coded (e.g., as in the I frames). The third type is a "B" or bi-directional frame, which is predicted from the closest two I frames or P frames, e.g., one in the past and one in the future. For example, a sequence of frames may be of the form . . . IBBPBBPBBPBBIBBPBBPB . . . , which contains 12 frames from I frame to I frame. Additionally, enhancement I, P, or B frame layers may be provided to add further refinement/detail to the image. These and other features of the MPEG standard are well known.
MPEG-4 provides the capability to further define a scene as including one or more objects. Each of these objects is encoded into a corresponding elementary data bitstream using I, P, B, and enhancement frame layers. In this manner, MPEG-4 and other similarly arranged standards can be dynamically scaled up or down, as required, for example, by selectively transmitting elementary bitstreams to provide the necessary multimedia information to a client device/application.
Unfortunately, the DCT coding scheme employed in MPEG-4 provides only limited scalability with respect to both the spatial and temporal domains. In other words, the DCT coding scheme has limited capabilities for either compressing or enlarging an image and limited capabilities for making a video run faster or slower.
More recently, DCT coding schemes are being replaced with discrete wavelet transform (DWT) coding schemes. DWT coding takes advantage of both the spatial and the frequency correlation that exist in the image data to provide even better compression of the image data.
For a two-dimensional image array (i.e., a frame layer), image data compression using DWTs usually begins by decomposing or transforming the image into four subbands or subimages. Each subimage is one-fourth the size of the original image, and contains one-fourth as many data points as the original image. The image decomposition involves first performing a one-dimensional wavelet convolution along each horizontal row of pixels of the original image, thereby dividing the image into two subimages containing low frequency and high frequency information respectively. The same or a similar convolution is then applied along each vertical column of pixels of each subimage, dividing each of the previously obtained subimages into two further subimages which again correspond to low and high frequency image information.
The resulting four subimages are typically referred to as LL, LH, HL, and HH subimages. The LL subimage is the one containing low frequency information from both the vertical and horizontal wavelet convolutions. The LH subimage is the one containing low frequency image information from the horizontal wavelet convolution and high frequency image information from the vertical wavelet convolution. The HL subimage is the one containing high frequency information from the horizontal wavelet convolution and low frequency image information from the vertical wavelet convolution. The HH subimage is the one containing high frequency information from both the vertical and horizontal wavelet convolutions.
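The decomposition described above can be sketched in a few lines of Python. This is only a minimal illustration, not the filter bank of any particular codec: it assumes an unnormalized Haar wavelet (pairwise average and half-difference), an image stored as a list of rows with even dimensions, and the hypothetical helper names `haar_1d`, `transpose`, and `decompose_2d`. The subimage names follow the convention given above, in which the first letter refers to the horizontal pass and the second to the vertical pass.

```python
def haar_1d(signal):
    """One-level Haar analysis of an even-length sequence: (low, high)."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def decompose_2d(image):
    """Split an image (list of rows, even dimensions) into LL, LH, HL, HH."""
    # Horizontal pass: filter every row into a low half and a high half.
    low_rows, high_rows = [], []
    for row in image:
        low, high = haar_1d(row)
        low_rows.append(low)
        high_rows.append(high)

    # Vertical pass: filter every column of each half.
    def vertical(half):
        lows, highs = [], []
        for column in transpose(half):
            low, high = haar_1d(column)
            lows.append(low)
            highs.append(high)
        return transpose(lows), transpose(highs)

    ll, lh = vertical(low_rows)    # horizontal low; vertical low / high
    hl, hh = vertical(high_rows)   # horizontal high; vertical low / high
    return ll, lh, hl, hh


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
ll, lh, hl, hh = decompose_2d(image)
```

Each of the four results has half the width and half the height of the input, so each holds one-fourth as many data points as the original image.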
The wavelet transforms described above can be performed recursively on each successively obtained LL subimage. For practical purposes, it has generally been found that calculating four or five decomposition levels is sufficient for most situations.
To reconstruct the original image, the inverse wavelet transform is performed recursively at each decomposition level. For example, assuming a two-level compression scheme, the second decomposition level would include a subimage LL2 that is a low resolution or base representation of the original image. To obtain a higher resolution, a subimage LL1 is reconstructed by performing an inverse wavelet transform using the subimages of the second decomposition level. The original image, at the highest available resolution, can subsequently be obtained by performing the inverse transform using the subimages of the first decomposition level (but only after obtaining subimage LL1 through an inverse transform of the second decomposition level).
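The two-level example above can be made concrete with a short Python sketch. It again assumes an unnormalized Haar wavelet and image dimensions divisible by two at every level; the function names (`analyze`, `synthesize`, `forward_2d`, `inverse_2d`, `decompose`, `reconstruct`) are hypothetical and chosen only for illustration.

```python
def analyze(x):
    """One-level Haar analysis of an even-length sequence."""
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low, high


def synthesize(low, high):
    """Inverse of analyze: rebuild the interleaved sequence."""
    out = []
    for l, h in zip(low, high):
        out += [l + h, l - h]
    return out


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def forward_2d(image):
    """One decomposition level: rows first, then columns."""
    rows = [analyze(row) for row in image]
    low = [lo for lo, _ in rows]
    high = [hi for _, hi in rows]

    def vertical(half):
        cols = [analyze(col) for col in transpose(half)]
        return (transpose([lo for lo, _ in cols]),
                transpose([hi for _, hi in cols]))

    ll, lh = vertical(low)
    hl, hh = vertical(high)
    return ll, lh, hl, hh


def inverse_2d(ll, lh, hl, hh):
    """Undo one decomposition level: columns first, then rows."""
    def vertical_inverse(low_band, high_band):
        cols = [synthesize(l, h)
                for l, h in zip(transpose(low_band), transpose(high_band))]
        return transpose(cols)

    low = vertical_inverse(ll, lh)
    high = vertical_inverse(hl, hh)
    return [synthesize(l, h) for l, h in zip(low, high)]


def decompose(image, levels):
    """Recursively decompose the LL subimage; details[0] is level 1."""
    details, ll = [], image
    for _ in range(levels):
        ll, lh, hl, hh = forward_2d(ll)
        details.append((lh, hl, hh))
    return ll, details


def reconstruct(ll, details):
    """Apply the inverse transform from the coarsest level outward."""
    for lh, hl, hh in reversed(details):
        ll = inverse_2d(ll, lh, hl, hh)
    return ll


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
ll2, details = decompose(image, levels=2)
ll1 = reconstruct(ll2, details[1:])   # second-level subimages only
full = reconstruct(ll2, details)      # both levels
```

Here `ll2` is the base representation, `ll1` is the intermediate resolution recovered from the second-level subimages alone, and `full` is the original image recovered once the first-level subimages are also applied.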
The attractiveness of the wavelet approach to image compression and transmission is that subimages LH, HL, and HH contain data that can be efficiently compressed to very high compression ratios through such methods as zero-tree and arithmetic encoding.
Unfortunately, current DWT techniques also suffer from certain limitations. This is especially true for object-based coding. For example, current DWT techniques require that objects, regardless of their shape, be isolated in a bounding box (e.g., a rectangle, etc.). As such, the resulting object-based coding data will include non-object information. Because the non-object information is redundant, encoding it requires additional bits. In addition, the non-object information will likely be significantly different from the object, so the correlation among pixels located in a row or column of the bounding box will likely be reduced. Consequently, the amount of object-based coding data will likely be greater. Therefore, such object-based coding DWT techniques tend to be inefficient.
While these object-based coding DWT techniques may be suitable for some specified-quality video applications, they may not be suitable for higher quality video/multimedia applications. For example, one of the potential applications for object-based coding is to allow certain objects to be selectively scaled, or otherwise selectively processed in a special manner when compared to other objects within a scene. This may require coding of additional information associated with the object. This may also require providing the capability for an object to be considered not only in terms of a spatial video environment, but also in terms of a temporal video environment.
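The cost of the bounding box can be illustrated on a single pixel row. The Python sketch below is a simplified comparison under stated assumptions, not a description of any standardized shape-adaptive transform: it uses an unnormalized Haar wavelet, a hypothetical binary `mask` marking object pixels, and an assumed rule that an unpaired trailing sample is carried unchanged into the low band.

```python
def haar_1d(x):
    """One-level Haar analysis; an odd trailing sample joins the low band."""
    pairs = range(0, len(x) - 1, 2)
    low = [(x[i] + x[i + 1]) / 2 for i in pairs]
    high = [(x[i] - x[i + 1]) / 2 for i in pairs]
    if len(x) % 2:
        low.append(x[-1])
    return low, high


def object_runs(mask):
    """Yield (start, end) for each contiguous run of object pixels."""
    start = None
    for i, inside in enumerate(list(mask) + [0]):
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            yield start, i
            start = None


def transform_bounding_box(row):
    """Transform the whole row, object and non-object pixels alike."""
    return haar_1d(row)


def transform_object_only(row, mask):
    """Transform each run of object pixels independently."""
    return [haar_1d(row[s:e]) for s, e in object_runs(mask)]


row = [0, 100, 100, 100, 100, 0]    # flat object padded with background
mask = [0, 1, 1, 1, 1, 0]
boxed = transform_bounding_box(row)
shaped = transform_object_only(row, mask)
```

In this example the bounding-box transform produces six coefficients, two of which are large high-frequency values caused solely by the object boundary, whereas transforming only the object pixels produces four coefficients with an all-zero high band.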
Unfortunately, conventional DCT and DWT schemes tend to lack an efficient way of handling motion and texture across a sequence of an arbitrarily shaped video object.
Thus, there is a need for improved methods and arrangements for more efficiently compressing/decompressing object-based data in a video bitstream. Preferably, the methods and arrangements will be capable of handling motion and texture across a sequence of an arbitrarily shaped video object, and provide significant compression capabilities while not restricting the coding of additional object-based information.
In accordance with certain aspects of the present invention, improved methods and arrangements are provided for compressing and decompressing object-based data in a video bitstream. The methods and arrangements are capable of handling motion and texture across a sequence of an arbitrarily shaped video object. The methods and arrangements also provide significant compression capabilities while not restricting the coding of additional object-based information.
The above stated needs and others are met by various methods and arrangements that provide a three-dimensional (3D), shape-adaptive discrete wavelet transform (SA-DWT) for object-based video coding.
In accordance with certain aspects, the 3D SA-DWT selectively performs a temporal SA-DWT along a temporal direction among pixels that share a temporal correspondence with respect to a sequence of frames. The temporal correspondence can be established, for example, by applying motion estimation or other like matching techniques. The resulting temporal SA-DWT is used to treat emerging pixels, continuing pixels, terminating pixels, and colliding correspondence pixels associated with one or more objects included within the sequence of frames. The objects may include images of actual physical objects, images of virtual objects, and/or a combination of the two.
Following the temporal SA-DWT, the resulting temporal SA-DWT coefficients are placed in spatial positions within arrays that correspond to the original positions of the associated pixels within each of the video frames. Subsequently, a spatial or two-dimensional (2D) SA-DWT is applied to the temporal SA-DWT coefficients within each array.
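The temporal pass and the placement of its coefficients can be sketched as follows. This Python fragment is a heavily simplified illustration and makes several assumptions that are not taken from the description above: temporal correspondence is reduced to zero motion (a pixel corresponds to the pixel at the same position in the next frame), so colliding pixels never arise; the wavelet is an unnormalized Haar filter; each low-pass coefficient is stored at the position of the first pixel of its pair and each high-pass coefficient at the position of the second; and an unpaired pixel at the end of a thread is kept unchanged. The name `temporal_pass` and the per-frame binary `masks` are hypothetical.

```python
def temporal_pass(frames, masks):
    """Haar transform along each temporal thread of object pixels.

    frames and masks are lists of 2D lists with identical shapes.
    Returns arrays of the same shape; positions outside the object hold None.
    """
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    coeffs = [[[None] * width for _ in range(height)]
              for _ in range(num_frames)]

    for y in range(height):
        for x in range(width):
            t = 0
            while t < num_frames:
                if not masks[t][y][x]:
                    t += 1
                    continue
                # The pixel emerges at frame t; follow it until it terminates.
                end = t
                while end < num_frames and masks[end][y][x]:
                    end += 1
                # Transform consecutive pairs of continuing pixels.
                for a in range(t, end - 1, 2):
                    p, q = frames[a][y][x], frames[a + 1][y][x]
                    coeffs[a][y][x] = (p + q) / 2        # low-pass
                    coeffs[a + 1][y][x] = (p - q) / 2    # high-pass
                # An odd-length thread leaves one unpaired pixel.
                if (end - t) % 2:
                    coeffs[end - 1][y][x] = frames[end - 1][y][x]
                t = end
    return coeffs


frames = [[[10, 7]], [[14, 0]], [[20, 9]]]   # three frames of 1 x 2 pixels
masks = [[[1, 1]], [[1, 0]], [[1, 1]]]       # right pixel vanishes in frame 1
coeffs = temporal_pass(frames, masks)
```

Because every coefficient stays at the spatial position of a pixel that produced it, each output array keeps the shape of the object in its frame, and a spatial shape-adaptive transform could then be applied to each array in turn.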
The resulting 3D SA-DWT coefficients may then be further processed, as needed, to meet the requisite compression, scalability, etc., associated with a desired use, storage and/or transportation action. A corresponding inverse 3D SA-DWT can later be performed to substantially reproduce the original video sequence.