I. Field of the Invention
The present invention relates to techniques for compression of object based digital image information, and more specifically, to encoding shape information as part of an MPEG-4 or similar type coding scheme.
II. Description of the Related Art
In recent years, the field of digital video has increasingly focused on object based compression schemes. While existing commercial video compression techniques encode video (and associated audio) information on a pixel or frame basis, object based compression procedures seek to encode individual video objects, such as persons, faces and other items in a scene, in order to generate a more robust representation of such video information in the compressed domain and to permit both flexibility in scene composition and interactivity with scene content.
In particular, the International Organization for Standardization (ISO) has recently adopted the MPEG-4 video compression standard, as described in ISO document ISO/IEC JTC1/SC29/WG11N2201 (May 15, 1998), the disclosure of which is incorporated by reference herein. The MPEG-4 standard specifies the syntax of a compliant bitstream rather than the actual encoding process or any hardware used in that process. Accordingly, references to MPEG-4 encoding in the following disclosure are made with the understanding that, although the specifics of such encoding are open ended, the generated bitstream must be MPEG-4 compliant.
Under the MPEG-4 rubric, the terms Video Objects ("VOs") and Video Object Planes ("VOPs") are used to represent encoded information. VOs correspond to independently coded entities that can be accessed and manipulated by a user, e.g., by cut and paste, and which are saved as separate bitstreams. VOPs correspond to instances of a VO at a particular time.
An MPEG-4 video encoder includes two main parts, a so-called shape coder and a more traditional texture coder, both of which operate on the same VOPs. The shape coder defines the outer border of a VO at a particular time, while the texture coder encodes color information for the pixels within that VO.
Both the texture coder and the motion estimation circuitry within the MPEG-4 encoder operate similarly to the coders and circuitry used in other state-of-the-art standards, such as MPEG-2. However, although the problem of shape representation has previously been investigated in the fields of computer vision, image understanding, image compression and computer graphics, MPEG-4 represents the first standardization effort that mandates the adoption of a shape representation technique in an encoding process.
In MPEG-4, two types of shape data are used in the compression process: grey scale and binary shape information. Binary shape information is set to a constant value within any VOP, so that every pixel within the VOP is assigned one value while pixels outside of the VOP are assigned a different value. Grey scale shape information assigns to each pixel in such a VOP a value from a range of possible values, usually 0 to 255, that represents the degree of transparency of that pixel. Thus, if the binary shape data for pixels within a VOP is set to 255, the grey scale data will represent an offset from 255, with an offset of 0 meaning that the pixel is in full view and increasing offset values meaning that the VO corresponding to the VOP is either fading into or fading out of the scene.
State-of-the-art video compression techniques such as MPEG-2 divide frames of video information into macroblocks, which in turn form the basis of the compression algorithm. In MPEG-4 parlance, Binary Alpha Planes of binary shape information are divided into an array of Binary Alpha Blocks ("BABs"). Referring to FIG. 1, each BAB 100 is defined as a set of 16×16 binary pixels 110 and is composed of a 4×4 array of pixel blocks ("PBs") 120, each of which is itself a 4×4 array of pixels 110.
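As a minimal sketch of the BAB/PB structure of FIG. 1, the following Python fragment splits a 16×16 BAB, held as a list of rows, into its sixteen 4×4 PBs. The raster ordering of the PBs (left to right, then top to bottom) and the name `split_into_pbs` are illustrative assumptions, not part of the MPEG-4 text.

```python
from typing import List

Block = List[List[int]]  # a block of pixels stored row by row

BAB_SIZE = 16  # a BAB is 16x16 binary pixels
PB_SIZE = 4    # each PB is a 4x4 array of pixels


def split_into_pbs(bab: Block) -> List[Block]:
    """Return the 16 PBs of a 16x16 BAB.

    Assumption: PBs are listed in raster order, left to right and then
    top to bottom, so PB index i covers PB row i // 4 and PB column i % 4.
    """
    if len(bab) != BAB_SIZE or any(len(row) != BAB_SIZE for row in bab):
        raise ValueError("a BAB must be 16x16 pixels")
    pbs = []
    for by in range(0, BAB_SIZE, PB_SIZE):
        for bx in range(0, BAB_SIZE, PB_SIZE):
            # Slice four rows, then four columns from each of those rows.
            pbs.append([row[bx:bx + PB_SIZE] for row in bab[by:by + PB_SIZE]])
    return pbs
```

Filling each pixel with the index of the PB that should contain it is a quick way to check the ordering: every pixel of `split_into_pbs(bab)[i]` then equals `i`.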
An important step in the MPEG-4 compression process is reducing the amount of data needed to adequately represent shape information. Referring to FIG. 2, rate control and rate reduction in current MPEG-4 coding are realized through size conversion of the binary alpha information. An original BAB 210 is downsampled at a certain conversion ratio ("CR") to generate a size-compressed BAB 220. The compressed BAB is then upsampled at the same CR to generate an approximated binary alpha block BAB' 230, and the distortion between the original BAB 210 and the approximated BAB' 230 is measured. In this size conversion process, an acceptable CR is determined against a given distortion threshold $\alpha_{th}$; that is, it is necessary to ascertain whether compressing a given BAB at a specific CR yields acceptable quality.
Given the current original BAB and an approximation BAB' of that BAB at a particular size conversion value, an acceptable quality function ACQ may be defined as shown in equation (1):

$$\mathrm{ACQ}(BAB') = \min(acq_1, acq_2, \ldots, acq_{16}) \tag{1}$$

where

$$acq_i = \begin{cases} 0, & \text{if } \mathrm{SAD\_PB}_i(BAB, BAB') > 16 \cdot \alpha_{th} \\ 1, & \text{otherwise} \end{cases}$$

and $\mathrm{SAD\_PB}_i(BAB, BAB')$ is defined as the sum of absolute differences for $PB_i$, where an opaque pixel has the value 255 and a transparent pixel has the value 0. The parameter $\alpha_{th}$ takes values in {0, 16, 32, 64, ..., 256}. If $\alpha_{th} = 0$, encoding will be lossless; a value of $\alpha_{th} = 256$ means that the accepted distortion is at a maximum.
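The acceptable quality test can be sketched in Python as follows. It assumes pixel values of 0 (transparent) and 255 (opaque), and it assumes the per-PB rule used in the MPEG-4 Verification Model, namely that $acq_i$ is 0 when the SAD of $PB_i$ exceeds $16 \cdot \alpha_{th}$ and 1 otherwise. The helper names (`split_into_pbs`, `sad_pb`, `acq`) are hypothetical.

```python
from typing import List

Block = List[List[int]]
BAB_SIZE, PB_SIZE = 16, 4


def split_into_pbs(bab: Block) -> List[Block]:
    """The 16 PBs of a 16x16 BAB in raster order."""
    return [[row[bx:bx + PB_SIZE] for row in bab[by:by + PB_SIZE]]
            for by in range(0, BAB_SIZE, PB_SIZE)
            for bx in range(0, BAB_SIZE, PB_SIZE)]


def sad_pb(pb: Block, pb_approx: Block) -> int:
    """Sum of absolute differences between two 4x4 PBs (pixels are 0 or 255)."""
    return sum(abs(a - b) for ra, rb in zip(pb, pb_approx) for a, b in zip(ra, rb))


def acq(bab: Block, bab_approx: Block, alpha_th: int) -> int:
    """ACQ(BAB') = min(acq_1, ..., acq_16); 1 means the approximation is acceptable.

    Assumed rule: acq_i = 0 if SAD_PB_i > 16 * alpha_th, else 1.
    """
    per_pb = [0 if sad_pb(p, q) > 16 * alpha_th else 1
              for p, q in zip(split_into_pbs(bab), split_into_pbs(bab_approx))]
    return min(per_pb)
```

Under this rule the two extreme thresholds behave as the text describes: with $\alpha_{th} = 0$ a single wrong pixel (SAD 255) is rejected, while with $\alpha_{th} = 256$ even a completely inverted PB (SAD 16 × 255 = 4080, which is at most 4096) is accepted.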
In this process, consider the down-sampling case first. For CR=1/2, if the average pixel value in a 2×2 pixel block is equal to or greater than 128, the pixel value of the down-sampled block is set to 255, otherwise to 0. For CR=1/4, if the average pixel value in a 4×4 pixel block is equal to or greater than 128, the pixel value of the down-sampled block is set to 255, otherwise to 0.
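The down-sampling rule above is simple enough to sketch directly. The following Python function is illustrative only (the name `downsample` and the `factor` parameter are assumptions): `factor=2` stands for CR=1/2 and `factor=4` for CR=1/4, and pixel values are 0 or 255.

```python
from typing import List

Block = List[List[int]]


def downsample(bab: Block, factor: int) -> Block:
    """Down-sample a square binary alpha block whose pixels are 0 or 255.

    Each factor x factor group becomes one pixel: 255 if the group's average
    is >= 128, otherwise 0.
    """
    if factor not in (2, 4):
        raise ValueError("factor must be 2 (CR=1/2) or 4 (CR=1/4)")
    n = len(bab)
    # "average >= 128" is tested as "sum >= 128 * area" to avoid floating point.
    threshold = 128 * factor * factor
    out = []
    for y in range(0, n, factor):
        row = []
        for x in range(0, n, factor):
            total = sum(bab[y + dy][x + dx]
                        for dy in range(factor) for dx in range(factor))
            row.append(255 if total >= threshold else 0)
        out.append(row)
    return out
```

Note that a 2×2 group with exactly two opaque pixels averages 127.5 and is therefore set to 0, so ties between opaque and transparent resolve toward transparency.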
Up-sampling is carried out for any BAB having a CR other than 1. The value of an interpolated pixel is determined by examining its neighboring pixels. For the pixel value calculation, the value 0 is used for a transparent pixel and 1 for an opaque pixel. Referring to FIG. 3, pixels A, B, C, D, E, F, G, H, I, J, K and L, some of which are also mapped to the indexed coefficients $c_k$, are used to generate the values of the interpolated pixels at the top-left point (P1) 310, top-right point (P2) 320, bottom-left point (P3) 330 and bottom-right point (P4) 340, as given by equations 2-5:

$$P1 = \begin{cases} 1, & \text{if } 4A + 2(B + C + D) + (E + F + G + H + I + J + K + L) > Th[C_f] \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

$$P2 = \begin{cases} 1, & \text{if } 4B + 2(A + C + D) + (E + F + G + H + I + J + K + L) > Th[C_f] \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

$$P3 = \begin{cases} 1, & \text{if } 4C + 2(A + B + D) + (E + F + G + H + I + J + K + L) > Th[C_f] \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$$P4 = \begin{cases} 1, & \text{if } 4D + 2(A + B + C) + (E + F + G + H + I + J + K + L) > Th[C_f] \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
where the 8-bit filter coefficient $C_f$ is given by equation (6):

$$C_f = \sum_{k=0}^{7} c_k \cdot 2^k \tag{6}$$
In equations 2-6, the coefficients $c_k$ are used to compute $C_f$, while the lettered pixel values are used to compute the interpolated pixels P1, P2, P3 and P4. The two notations refer to the same values; for interpolated pixel P1, for example, F = $c_0$, E = $c_1$, L = $c_2$, K = $c_3$, J = $c_4$, I = $c_5$, H = $c_6$ and G = $c_7$. Based on the calculated $C_f$, the threshold value $Th[C_f]$ can be obtained from the look-up table given in the MPEG-4 Verification Model Version 8.0, ISO/IEC JTC1/SC29/WG11 (July 1997), as those skilled in the art will appreciate.
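The interpolation rule of equations 2-5 can be sketched as below, with binary pixels (0 transparent, 1 opaque) held in a dictionary keyed by the letters of FIG. 3. Several points are assumptions: $C_f$ is packed as $\sum_k c_k 2^k$ (consistent with its description as an 8-bit value); the $c_k$ mapping is implemented only for P1, since the text gives it only for P1; and the threshold table `th` is a caller-supplied placeholder, because the real values of the Verification Model look-up table are not reproduced here. All function names are hypothetical.

```python
from typing import Dict, Sequence

FAR = "EFGHIJKL"  # the eight outer neighbours, each weighted 1


def weighted_sum(px: Dict[str, int], center: str) -> int:
    """4*center + 2*(the other three of A-D) + (E+...+L), as in equations 2-5.

    center is "A" for P1, "B" for P2, "C" for P3 and "D" for P4.
    """
    near = [p for p in "ABCD" if p != center]
    return 4 * px[center] + 2 * sum(px[p] for p in near) + sum(px[p] for p in FAR)


def filter_context(c: Sequence[int]) -> int:
    """Assumed packing of the 8-bit value: C_f = sum of c_k * 2**k, k = 0..7."""
    if len(c) != 8:
        raise ValueError("expected eight binary coefficients c_0..c_7")
    return sum(bit << k for k, bit in enumerate(c))


def p1_context(px: Dict[str, int]) -> int:
    """C_f for P1, using c_0..c_7 = F, E, L, K, J, I, H, G."""
    return filter_context([px[p] for p in "FELKJIHG"])


def interpolate(px: Dict[str, int], center: str, cf: int, th: Sequence[int]) -> int:
    """1 (opaque) if the weighted sum exceeds Th[C_f], else 0 (transparent).

    th is a hypothetical 256-entry threshold table supplied by the caller.
    """
    return 1 if weighted_sum(px, center) > th[cf] else 0
```

With every neighbour opaque, the weighted sum reaches its maximum of 4 + 6 + 8 = 18, so the comparison against `th[cf]` is strict: a threshold of 18 yields a transparent pixel even in a fully opaque neighbourhood.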
Note that after interpolation, the pixels in the low-resolution image, e.g., pixels A-L, are not contained in the upsampled image, i.e., all pixels in the upsampled image are interpolated. When the BAB is on the left and/or top border of the VOP, the left and/or top borders are extended from the outermost pixels inside the BAB. In the case that CR=1/4, the BAB is first interpolated into the size corresponding to CR=1/2, then interpolated into the full size, i.e., representing CR=1. The decision to accept a certain CR value for a specific BAB is based on ACQ(BAB'). If the quality is acceptable, that CR is adopted for that BAB. Once the value of CR is determined, size conversion is effected with that CR value for the BAB.
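The overall CR decision of FIG. 2 can be sketched as a loop that down-samples, up-samples, and keeps the first CR whose approximation passes the ACQ test. This is a simplified illustration under stated assumptions: CR values are tried from most to least aggressive (1/4, then 1/2, then 1), the per-PB ACQ rule is the same assumed `SAD <= 16 * alpha_th` test as before, and `upsample_replicate` is a hypothetical stand-in that merely replicates pixels. It is *not* the MPEG-4 interpolation filter, which needs the Verification Model threshold table, and it up-samples CR=1/4 in one step rather than through the CR=1/2 size.

```python
from fractions import Fraction
from typing import List

Block = List[List[int]]
BAB_SIZE, PB_SIZE = 16, 4


def block_sum(bab: Block, y: int, x: int, size: int) -> int:
    return sum(bab[y + dy][x + dx] for dy in range(size) for dx in range(size))


def downsample(bab: Block, factor: int) -> Block:
    threshold = 128 * factor * factor  # "average >= 128" without floats
    return [[255 if block_sum(bab, y, x, factor) >= threshold else 0
             for x in range(0, BAB_SIZE, factor)]
            for y in range(0, BAB_SIZE, factor)]


def upsample_replicate(small: Block, factor: int) -> Block:
    # HYPOTHETICAL stand-in for the MPEG-4 interpolation filter: pixel replication.
    return [[small[y // factor][x // factor] for x in range(BAB_SIZE)]
            for y in range(BAB_SIZE)]


def acq(bab: Block, approx: Block, alpha_th: int) -> int:
    for by in range(0, BAB_SIZE, PB_SIZE):
        for bx in range(0, BAB_SIZE, PB_SIZE):
            sad = sum(abs(bab[y][x] - approx[y][x])
                      for y in range(by, by + PB_SIZE)
                      for x in range(bx, bx + PB_SIZE))
            if sad > 16 * alpha_th:  # assumed per-PB acceptance rule
                return 0
    return 1


def choose_cr(bab: Block, alpha_th: int) -> Fraction:
    """Return the first acceptable CR, trying 1/4 and then 1/2 (assumed order)."""
    for factor in (4, 2):
        approx = upsample_replicate(downsample(bab, factor), factor)
        if acq(bab, approx, alpha_th):
            return Fraction(1, factor)
    return Fraction(1)  # no size reduction
```

A fully opaque BAB survives CR=1/4 losslessly, whereas a one-pixel checkerboard is only reduced to CR=1/4 once $\alpha_{th}$ is large enough to tolerate the loss of half of its pixels.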
After the size conversion process of FIG. 2 has been accomplished, each BAB is further compressed according to one of seven different "lossless" modes, which utilize shape motion vectors ("MVs") and shape motion vector predictors ("MVPs") to perform context-based arithmetic encoding ("CAE"). In these lossless compression modes, a quantity called the motion vector difference of shape ("MVDs") is determined from MVs and MVPs such that MVDs = MVs - MVPs. The seven modes are:
1. MVDs == 0 && No Update
2. MVDs != 0 && No Update
3. all_0
4. all_255
5. intraCAE
6. MVDs == 0 && interCAE
7. MVDs != 0 && interCAE
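The list above can be captured as a small enumeration, together with the MVDs computation. The sketch below does not decide which mode an encoder should choose; it only shows how the mode labels partition on whether MVDs is zero. The names `ShapeMode`, `mvd` and `modes_consistent_with` are hypothetical, and motion vectors are assumed to be integer (x, y) pairs.

```python
from enum import Enum
from typing import List, Tuple

MV = Tuple[int, int]  # (horizontal, vertical) displacement in pixels


class ShapeMode(Enum):
    MVDS_ZERO_NO_UPDATE = 1
    MVDS_NONZERO_NO_UPDATE = 2
    ALL_0 = 3
    ALL_255 = 4
    INTRA_CAE = 5
    MVDS_ZERO_INTER_CAE = 6
    MVDS_NONZERO_INTER_CAE = 7


# Modes whose label does not depend on MVDs.
_MVD_FREE = {ShapeMode.ALL_0, ShapeMode.ALL_255, ShapeMode.INTRA_CAE}


def mvd(mvs: MV, mvp: MV) -> MV:
    """MVDs = MVs - MVPs, computed component-wise."""
    return (mvs[0] - mvp[0], mvs[1] - mvp[1])


def modes_consistent_with(mvds: MV) -> List[ShapeMode]:
    """Modes whose MVDs condition (== 0 or != 0) is satisfied by mvds."""
    if mvds == (0, 0):
        extra = {ShapeMode.MVDS_ZERO_NO_UPDATE, ShapeMode.MVDS_ZERO_INTER_CAE}
    else:
        extra = {ShapeMode.MVDS_NONZERO_NO_UPDATE, ShapeMode.MVDS_NONZERO_INTER_CAE}
    return sorted(_MVD_FREE | extra, key=lambda m: m.value)
```

For example, `mvd((3, -2), (1, 1))` gives `(2, -3)`, so only the "MVDs != 0" variants of the No Update and interCAE modes remain applicable.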
In intra-coded VOPs ("I-VOPs"), only the coding modes "all_0", "all_255" and "intraCAE" are allowed. MVPs is determined by referring to certain candidate shape MVs and texture motion vectors around the macroblock that corresponds to the current shape block. FIG. 4 illustrates the locations of such candidate macroblocks. In FIG. 4, for a current macroblock 410, three candidate texture motion vectors 411, 412, 414 that have been rounded to integer values, denoted MV1, MV2 and MV3, and three corresponding shape motion vectors 421, 422, 423, designated MVs1, MVs2 and MVs3, are shown. MVPs is determined by examining MVs1, MVs2, MVs3, MV1, MV2 and MV3, in that order, and taking the first encountered MV that is valid. If no candidate MV is valid, MVPs is regarded as 0.
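The candidate scan for MVPs reduces to "first valid vector wins". In the sketch below, an invalid candidate is represented as `None`; what makes a candidate valid in practice (for example, whether the neighbouring macroblock has a usable vector) is not modelled and is left to the caller. The texture candidates are assumed to be already rounded to integers, and the name `predict_shape_mv` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

MV = Tuple[int, int]


def predict_shape_mv(shape_cands: Sequence[Optional[MV]],
                     texture_cands: Sequence[Optional[MV]]) -> MV:
    """Return MVPs from the candidates in the order MVs1, MVs2, MVs3, MV1, MV2, MV3.

    shape_cands:   (MVs1, MVs2, MVs3); None marks an invalid candidate (assumption)
    texture_cands: (MV1, MV2, MV3), already rounded to integer values
    """
    for mv in (*shape_cands, *texture_cands):
        if mv is not None:
            return mv  # the first valid candidate becomes MVPs
    return (0, 0)  # no valid candidate: MVPs is regarded as 0
```

Because shape candidates are scanned first, a valid MVs2 takes precedence over any texture vector, even a valid MV1.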
Based on the MVPs determined above, MVs is computed by the following procedure. The motion compensation ("MC") error is computed by comparing the BAB predicted by MVPs with the current BAB. If the computed MC error is less than or equal to 16×AlphaTH, where AlphaTH is a threshold used when comparing two 4×4 sub-blocks, MVPs is directly employed as MVs and the procedure terminates. If this condition is not satisfied, MV is searched around the prediction vector MVPs. The search range is +/-16 pixels around MVPs along both the horizontal and vertical directions. The MV that minimizes the SAD is taken as MVs, and this is further interpreted as MVDs for shape.
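The two-step MVs procedure can be sketched as an early-exit test followed by a full search. The text leaves open whether the 16×AlphaTH test applies to the whole BAB or to each 4×4 sub-block; this sketch applies it to every PB, which matches the description of AlphaTH as a sub-block threshold, but that reading is an assumption. Other assumptions: pixels are 0/255, reference pixels outside the alpha plane count as transparent, the full search minimizes the SAD over the whole BAB, and ties go to the first vector in scan order. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Plane = List[List[int]]
MV = Tuple[int, int]
BAB_SIZE, PB_SIZE, SEARCH_RANGE = 16, 4, 16


def ref_pixel(ref: Plane, x: int, y: int) -> int:
    # Assumption: pixels outside the reference alpha plane are transparent (0).
    if 0 <= y < len(ref) and 0 <= x < len(ref[0]):
        return ref[y][x]
    return 0


def pb_errors(cur: Plane, ref: Plane, x0: int, y0: int, mv: MV) -> List[int]:
    """SAD of each 4x4 PB between the current BAB (at x0, y0) and its prediction by mv."""
    mx, my = mv
    errors = []
    for by in range(0, BAB_SIZE, PB_SIZE):
        for bx in range(0, BAB_SIZE, PB_SIZE):
            errors.append(sum(
                abs(cur[y][x] - ref_pixel(ref, x0 + x + mx, y0 + y + my))
                for y in range(by, by + PB_SIZE)
                for x in range(bx, bx + PB_SIZE)))
    return errors


def estimate_shape_mv(cur: Plane, ref: Plane, x0: int, y0: int,
                      mvp: MV, alpha_th: int) -> MV:
    # Step 1: keep MVPs if every PB error is within 16 * AlphaTH (assumed per-PB test).
    if max(pb_errors(cur, ref, x0, y0, mvp)) <= 16 * alpha_th:
        return mvp
    # Step 2: full search within +/-16 pixels of MVPs, minimizing the total SAD.
    best_mv: MV = mvp
    best_sad: Optional[int] = None
    for dy in range(-SEARCH_RANGE, SEARCH_RANGE + 1):
        for dx in range(-SEARCH_RANGE, SEARCH_RANGE + 1):
            mv = (mvp[0] + dx, mvp[1] + dy)
            sad = sum(pb_errors(cur, ref, x0, y0, mv))
            if best_sad is None or sad < best_sad:
                best_mv, best_sad = mv, sad
    return best_mv
```

In this sketch, an opaque 16×16 square shifted by (4, 2) in the reference plane is recovered by the search when MVPs is (0, 0) and AlphaTH is 0. With AlphaTH at 256 the early-exit test accepts (0, 0) unchanged, since no 4×4 PB error can exceed 16 × 256.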
While the current MPEG-4 shape coding technique described above takes into account the quality of the compressed bitstream in both the size conversion and lossless phases of compression, it does not address actual buffer constraints in any manner. Although the existing technique may generate a bitstream having an acceptable level of quality, that bitstream may be useless if the encoder's buffer has reached an overflow condition. Thus, there exists a need for an MPEG-4 shape coding technique that both generates a bitstream having acceptable quality and takes actual buffer constraints into consideration.