The present invention relates to the field of digital video coding technology and, more particularly, to a method and apparatus for providing an improved chroma-key shape representation of video objects of arbitrary shape.
A variety of protocols for communication, storage and retrieval of video images are known. Invariably, the protocols are developed with a particular emphasis on reducing signal bandwidth. With a reduction of signal bandwidth, storage devices are able to store more images and communications systems can send more images at a given communication rate. Reduction in signal bandwidth increases the overall capacity of the system using the signal.
However, bandwidth reduction may be associated with particular disadvantages. For instance, certain known coding systems are lossy because they introduce errors which may affect the perceptual quality of the decoded image. Others may achieve significant bandwidth reduction for certain types of images but may not achieve any bandwidth reduction for others. Accordingly, the selection of coding schemes must be carefully considered.
The Moving Picture Experts Group (MPEG) has successfully introduced two standards for coding of audiovisual information, known by the acronyms MPEG-1 and MPEG-2. MPEG is currently working on a new standard, known as MPEG-4. MPEG-4 video aims at providing standardized core technologies allowing efficient storage, transmission and manipulation of video data in multimedia environments. A detailed proposal for MPEG-4 is set forth in MPEG-4 Video Verification Model (VM) 5.0, hereby incorporated by reference.
MPEG-4 considers a scene to be a composition of video objects. In most applications, each video object represents a semantically meaningful object. Each uncompressed video object is represented as a set of Y, U, and V components (luminance and chrominance values) plus information about its shape, stored frame after frame in predefined temporal intervals. Each video object is separately coded and transmitted with other objects. As described in MPEG-4, a video object plane (VOP) is an occurrence of a video object at a given time. For a video object, two different VOPs represent snapshots of the same video object at two different times. For simplicity, the term video object is often used herein to refer to its VOP at a specific instant in time.
As an example, FIG. 1(A) illustrates a frame for coding that includes the head and shoulders of a narrator, a logo suspended within the frame and a background. FIGS. 1(B)-1(D) illustrate the frame of FIG. 1(A) broken into three VOPs. By convention, a background generally is assigned VOP0. The narrator and logo may be assigned VOP1 and VOP2 respectively. Within each VOP, all image data is coded and decoded identically.
The VOP encoder for MPEG-4 separately codes shape information and texture (luminance and chrominance) information for the video object. The shape information is encoded as an alpha map that indicates whether or not each pixel is part of the video object. The texture information is coded as luminance and chrominance values. Thus, the VOP encoder for MPEG-4 employs explicit shape coding because the shape information is coded separately from the texture information (luminance and chrominance values for each pixel). While an explicit shape coding technique can provide excellent results at high bit rates, explicit shape coding requires additional bandwidth for carrying shape information separate from texture information. Moreover, explicit shape coding performs poorly at low coding bit rates because the explicit shape information occupies a significant share of the available bandwidth, resulting in low quality texture reconstruction for the object.
As an alternative to explicitly coding shape information, implicit shape coding techniques have been proposed in which shape information is not explicitly coded. Rather, in implicit shape coding, the shape of each object can be ascertained based on the texture information. Implicit shape coding techniques provide a simpler design than explicit techniques and reasonable performance, particularly at lower bit rates. Implicit shape coding reduces signal bandwidth because shape information is not explicitly transmitted. As a result, implicit shape coding can be particularly important for low bit rate applications, such as mobile and other wireless applications.
However, implicit shape coding generally does not perform as well as explicit shape coding, particularly for more demanding scenes. For example, objects often contain color bleeding artifacts on object edges when using implicit shape coding. Also, it can be difficult to obtain lossless shapes using the implicit techniques because shape coding quality is determined by texture coding quality and is not provided explicitly. Therefore, a need exists for an improved implicit shape coding technique.
The system of the present invention can include an encoding system and a decoding system that overcome the disadvantages and drawbacks of prior systems.
An encoding system uses chroma-key shape coding to implicitly encode shape information with texture information. The encoding system includes a bounding box generator and color replacer, a DCT encoder, a quantizer, a motion estimator/compensator and a variable length coder. A video object to be encoded is enclosed by a bounding box and only macroblocks in the bounding box are processed to improve data compression. Each macroblock inside the bounding box is identified as either 1) outside the object; 2) inside the object; or 3) on the object boundary. Macroblocks outside the object are not coded to further improve data compression. For boundary macroblocks, pixels located outside the object (background pixels) are replaced with a chroma-key color K to implicitly encode the shape of the object. The luminance and chrominance values for macroblocks inside the object and on the object boundary are coded, including transforming the luminance and chrominance values to obtain DCT coefficients, and quantizing (scaling) the DCT coefficients. Motion compensation can also be performed on some macroblocks to generate motion vectors. In addition, to improve image quality, boundary macroblocks can be quantized at a finer level than other macroblocks in the bounding box. A bitstream is output from the encoding system. The bitstream can include the encoded macroblock pixel data, a code identifying the position (e.g., inside, outside or on the boundary) of each coded macroblock, the chroma-key value and thresholds, motion vectors and one or more quantizers. Where a finer quantization is applied to boundary macroblocks, the bitstream also includes a code indicating the exact quantizer used for boundary macroblocks and a code indicating the number of quantization levels for macroblocks inside the object.
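By way of illustration only, the bounding box, macroblock classification and color replacement steps described above may be sketched as follows. This is a minimal sketch and not the disclosed implementation: the function names, the 16x16 macroblock size, the representation of pixels as (Y, U, V) tuples, and the requirement that frame dimensions be multiples of the macroblock size are all assumptions made for the example. Transform coding, quantization and motion compensation are omitted.

```python
# Illustrative sketch only. All names, the 16x16 macroblock size, and the
# (Y, U, V) tuple pixel layout are assumptions, not taken from the source.
MB = 16  # assumed macroblock size in pixels


def bounding_box(mask):
    """Return (top, left, bottom, right) enclosing the object.

    The box is aligned to macroblock multiples; bottom and right are
    exclusive. Assumes the mask holds at least one object pixel and that
    the frame dimensions are multiples of MB.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    top = (rows[0] // MB) * MB
    left = (cols[0] // MB) * MB
    bottom = -(-(rows[-1] + 1) // MB) * MB  # round up to a multiple of MB
    right = -(-(cols[-1] + 1) // MB) * MB
    return top, left, bottom, right


def classify(mask, top, left):
    """Label one macroblock as 'outside', 'inside' or 'boundary'."""
    flags = [mask[r][c]
             for r in range(top, top + MB)
             for c in range(left, left + MB)]
    if not any(flags):
        return "outside"
    if all(flags):
        return "inside"
    return "boundary"


def replace_background(frame, mask, top, left, key):
    """Copy one macroblock, writing the key color over background pixels."""
    return [[frame[r][c] if mask[r][c] else key
             for c in range(left, left + MB)]
            for r in range(top, top + MB)]


def prepare_vop(frame, mask, key):
    """Return (top, left, label, pixels) for every macroblock to be coded.

    Macroblocks outside the object are skipped. For 'inside' macroblocks
    the mask is all true, so replace_background is a plain copy.
    """
    top, left, bottom, right = bounding_box(mask)
    coded = []
    for t in range(top, bottom, MB):
        for l in range(left, right, MB):
            label = classify(mask, t, l)
            if label == "outside":
                continue  # not coded
            coded.append((t, l, label,
                          replace_background(frame, mask, t, l, key)))
    return coded
```

In this sketch, each returned tuple carries the position label alongside the pixel data, mirroring the code that identifies the position of each coded macroblock in the bitstream. A later stage could use the label to select a finer quantizer for boundary macroblocks.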
A decoding system includes a variable length decoder, an inverse quantizer, a motion compensator, an inverse DCT coder, and a color extractor and shape mask detector. A bitstream is received and decoded by the decoding system to obtain both texture information (e.g., luminance and chrominance data) and shape information for a video object. The shape information is implicitly encoded. DCT coefficients and motion vectors for each macroblock are inverse quantized (rescaled) based on the codes (quantizers) identifying the specified quantizer or the specified number of quantization levels for each. The reconstructed video object is obtained by passing only the pixel values for the object (e.g., by rejecting pixel values within a predetermined range of the chroma-key). The shape of the video object is obtained by generating a binary map or shape mask (e.g., 1s or 0s) identifying each pixel as either inside the object or outside the object. A gray-scale map (shape mask) can be generated instead by using two thresholds to soften the object boundaries.
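By way of illustration only, the shape mask detection described above may be sketched as follows. This is a minimal sketch under stated assumptions: the squared Euclidean distance in (Y, U, V) space, the linear ramp between the two thresholds, the 0 to 255 gray-scale range, and all function names are hypothetical choices made for the example and are not specified above.

```python
# Illustrative sketch only. The distance measure, the linear ramp and all
# names are assumptions, not taken from the source.

def key_distance(pixel, key):
    """Squared Euclidean distance between a decoded pixel and the key color."""
    return sum((p - k) ** 2 for p, k in zip(pixel, key))


def binary_alpha(pixel, key, threshold):
    """Return 0 (outside the object) if the pixel lies within the threshold
    of the key color, otherwise 1 (inside the object)."""
    return 0 if key_distance(pixel, key) <= threshold else 1


def gray_alpha(pixel, key, t1, t2):
    """Return a 0..255 value using two thresholds, with t1 < t2.

    At or below t1 the pixel is treated as background (0); at or above t2
    it is treated as object (255); in between, a linear ramp (assumed)
    softens the boundary.
    """
    d = key_distance(pixel, key)
    if d <= t1:
        return 0
    if d >= t2:
        return 255
    return (255 * (d - t1)) // (t2 - t1)


def shape_mask(block, key, t1, t2=None):
    """Build a binary mask (one threshold) or gray-scale mask (two)."""
    if t2 is None:
        return [[binary_alpha(p, key, t1) for p in row] for row in block]
    return [[gray_alpha(p, key, t1, t2) for p in row] for row in block]


def extract_object(block, key, threshold):
    """Pass only object pixels; rejected (background) pixels become None."""
    return [[p if binary_alpha(p, key, threshold) else None for p in row]
            for row in block]
```

Because lossy texture coding perturbs the decoded key color, the threshold gives the detector a tolerance around K rather than requiring an exact match; the values chosen for the thresholds in any given system would depend on the quantization applied.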