Full-motion video displays based upon analog video signals have long been available in the form of television. With recent advances in computer processing capabilities and affordability, full-motion video displays based upon digital video signals are becoming more widely available. Digital video systems can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, and playing full-motion video sequences.
Digital video displays include large numbers of image frames that are played or rendered successively at frequencies of between 30 and 75 Hz. Each image frame is a still image formed from an array of pixels based on the display resolution of a particular system. As examples, VHS-based systems have display resolutions of 320×480 pixels, NTSC-based systems have display resolutions of 720×486 pixels, and high-definition television (HDTV) systems under development have display resolutions of 1360×1024 pixels.
The amounts of raw digital information included in video sequences are massive. Storage and transmission of these amounts of video information are infeasible with conventional personal computer equipment. Consider, for example, a digitized form of a relatively low resolution VHS image format having a 320×480 pixel resolution. A full-length motion picture of two hours in duration at this resolution corresponds to 100 gigabytes of digital video information. By comparison, conventional compact optical disks have capacities of about 0.6 gigabytes, magnetic hard disks have capacities of 1-2 gigabytes, and compact optical disks under development have capacities of up to 8 gigabytes.
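The 100 gigabyte figure can be checked with a short calculation. The sketch below assumes 24-bit color (3 bytes per pixel) and 30 frames per second; neither value is stated in the text, but together they reproduce the quoted size to within rounding.

```python
# Assumed parameters (not stated in the text): 24-bit color and 30 frames/s.
WIDTH, HEIGHT = 320, 480
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 30
DURATION_SECONDS = 2 * 60 * 60  # two-hour motion picture


def raw_video_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Return the uncompressed size in bytes of a video sequence."""
    return width * height * bytes_per_pixel * fps * seconds


total = raw_video_bytes(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND, DURATION_SECONDS
)
print(f"{total / 1e9:.1f} GB")  # prints "99.5 GB"
```

Under these assumptions the sequence would need roughly 166 conventional 0.6 gigabyte optical disks.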
To address the limitations in storing or transmitting such massive amounts of digital video information, various video compression standards or processes have been established, including MPEG-1, MPEG-2, and H.26X. These video compression techniques utilize similarities between successive image frames, referred to as temporal or interframe correlation, to provide interframe compression in which motion data and error signals are used to encode changes between frames.
In addition, the conventional video compression techniques utilize similarities within image frames, referred to as spatial or intraframe correlation, to provide intraframe compression in which the image samples within an image frame are compressed. Intraframe compression is based upon conventional processes for compressing still images, such as discrete cosine transform (DCT) encoding. This type of coding is sometimes referred to as "texture" or "transform" coding. A "texture" generally refers to a two-dimensional array of image sample values, such as an array of chrominance and luminance values or an array of alpha (opacity) values. The term "transform" in this context refers to how the image samples are transformed into spatial frequency components during the coding process. This use of the term "transform" should be distinguished from a geometric transform used to estimate scene changes in some interframe compression methods.
Interframe compression typically utilizes motion estimation and compensation to encode scene changes between frames. Motion estimation is a process for estimating the motion of image samples (e.g., pixels) between frames. Using motion estimation, the encoder attempts to match blocks of pixels in one frame with corresponding pixels in another frame. After the most similar block is found in a given search area, the change in position of the pixel locations of the corresponding pixels is approximated and represented as motion data, such as a motion vector. Motion compensation is a process for determining a predicted image and computing the error between the predicted image and the original image. Using motion compensation, the encoder applies the motion data to an image and computes a predicted image. The difference between the predicted image and the input image is called the error signal. Since the error signal is just an array of values representing the difference between image sample values, it can be compressed using the same texture coding method as used for intraframe coding of image samples.
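The motion estimation and compensation steps described above can be sketched as an exhaustive block search. This is an illustrative sketch only: the sum of absolute differences (SAD) as the similarity measure, the function names, and the search range parameter are assumptions, and practical encoders use faster search strategies.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + r][ax + c] - frame_b[by + r][bx + c])
        for r in range(size)
        for c in range(size)
    )


def estimate_motion(cur, ref, x, y, size, search):
    """Return ((dx, dy), cost) for the best match of the current block in ref."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, x, y, x, y, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            # Skip candidate blocks that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, x, y, rx, ry, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost


def compensate(cur, ref, x, y, size, mv):
    """Return the predicted block and the error block for one motion vector."""
    dx, dy = mv
    predicted = [
        [ref[y + dy + r][x + dx + c] for c in range(size)] for r in range(size)
    ]
    error = [
        [cur[y + r][x + c] - predicted[r][c] for c in range(size)]
        for r in range(size)
    ]
    return predicted, error


# Toy data: the current frame is the reference frame shifted right by one pixel.
ref = [[r * 16 + c * c for c in range(8)] for r in range(8)]
cur = [[row[0]] + row[:-1] for row in ref]

mv, cost = estimate_motion(cur, ref, x=2, y=2, size=4, search=2)
predicted, error = compensate(cur, ref, 2, 2, 4, mv)
print(mv, cost)  # prints "(-1, 0) 0"
```

Because the toy frames differ only by a one-pixel shift, the error block is all zeros, so only the motion vector would need to be coded.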
Although differing in specific implementations, the MPEG-1, MPEG-2, and H.26X video compression standards are similar in a number of respects. The following description of the MPEG-2 video compression standard is generally applicable to the others.
MPEG-2 provides interframe compression and intraframe compression based upon square blocks or arrays of pixels in video images. A video image is divided into image sample blocks called macroblocks having dimensions of 16×16 pixels. In MPEG-2, a macroblock comprises four luminance blocks (each block is 8×8 samples of luminance (Y)) and two chrominance blocks (one 8×8 sample block each for Cb and Cr).
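The macroblock layout can be illustrated by splitting a 16×16 luminance array into its four 8×8 blocks. The function name and the raster ordering of the blocks (top-left, top-right, bottom-left, bottom-right) are assumptions made for this sketch.

```python
def split_luma(macroblock_y):
    """Split a 16x16 luminance array into four 8x8 blocks in raster order."""
    blocks = []
    for by in (0, 8):
        for bx in (0, 8):
            blocks.append([row[bx:bx + 8] for row in macroblock_y[by:by + 8]])
    return blocks


# Toy macroblock whose sample value encodes its position (row * 16 + column).
y_samples = [[r * 16 + c for c in range(16)] for r in range(16)]
y_blocks = split_luma(y_samples)
print(len(y_blocks), len(y_blocks[0]), len(y_blocks[0][0]))  # prints "4 8 8"
```

Together with one 8×8 block each for Cb and Cr, this gives the six blocks per macroblock described above.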
In MPEG-2, interframe coding is performed on macroblocks. An MPEG-2 encoder performs motion estimation and compensation to compute motion vectors and block error signals. For each block M_N in an image frame N, a search is performed across the image of a next successive video frame N+1 or immediately preceding image frame N-1 (i.e., bi-directionally) to identify the most similar respective blocks M_{N+1} or M_{N-1}. The location of the most similar block relative to the block M_N is encoded with a motion vector (DX,DY). The motion vector is then used to compute a block of predicted sample values. These predicted sample values are compared with block M_N to determine the block error signal. The error signal is compressed using a texture coding method such as discrete cosine transform (DCT) encoding.
Object based video coding techniques have been proposed as an improvement to the conventional frame based coding standards. In object based coding, arbitrary shaped image features are separated from the frames in the video sequence using a method called "segmentation." The video objects or "segments" are coded independently. Object based coding can improve the compression rate because it increases the interframe correlation between video objects in successive frames. It is also advantageous for a variety of applications that require access to and tracking of objects in a video sequence.
In the object based video coding methods proposed for the MPEG-4 standard, the shape, motion and texture of video objects are coded independently. The shape of an object is represented by a binary or alpha mask that defines the boundary of the arbitrary shaped object in a video frame. The motion data of an object is similar to the motion data of MPEG-2, except that it applies to an arbitrary-shaped image of the object that has been segmented from a rectangular frame. Motion estimation and compensation are performed on blocks of a "video object plane" rather than the entire frame. The video object plane is the name for the shaped image of an object in a single frame.
The texture of a video object is the image sample information in a video object plane that falls within the object's shape. Texture coding of an object's image samples and error signals is performed using texture coding methods similar to those used in frame based coding. For example, a segmented image can be fitted into a bounding rectangle formed of macroblocks. The rectangular image formed by the bounding rectangle can be compressed just like a rectangular frame, except that transparent macroblocks need not be coded. Partially transparent blocks are coded after filling in the portions of the block that fall outside the object's shape boundary with sample values in a technique called "padding."
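The idea of padding can be illustrated with a deliberately simple fill rule. The text does not specify how the fill values are chosen, so the sketch below assumes that samples outside the shape are replaced by the integer mean of the samples inside it. The function name and the rule itself are stand-ins, not the padding method of any particular standard.

```python
def pad_block(samples, mask):
    """Fill samples outside the binary mask with the mean of samples inside it.

    Assumption: mean fill is used here only for illustration.
    """
    inside = [
        s
        for row_s, row_m in zip(samples, mask)
        for s, m in zip(row_s, row_m)
        if m
    ]
    if not inside:
        raise ValueError("fully transparent block: nothing to pad from")
    fill = sum(inside) // len(inside)
    return [
        [s if m else fill for s, m in zip(row_s, row_m)]
        for row_s, row_m in zip(samples, mask)
    ]


# A 2x2 toy block whose right column lies outside the object's shape.
block = [[10, 0], [20, 0]]
shape = [[1, 0], [1, 0]]
padded = pad_block(block, shape)
print(padded)  # prints "[[10, 15], [20, 15]]"
```

Filling the exterior with values close to the interior avoids a sharp artificial edge at the shape boundary, which would otherwise produce large high-frequency coefficients in the texture coder.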
Frame based coding techniques such as MPEG-2 and H.26X and object based coding techniques proposed for MPEG-4 are similar in that they perform intraframe and interframe coding on macroblocks. The interframe coding format for these techniques uses a special bit to indicate whether the interframe macroblock is coded. This special bit is sometimes called the COD bit or the "not coded" bit. To be consistent, we refer to this type of parameter as a COD bit or COD parameter. The COD bit indicates whether or not the encoded macroblock includes motion data and texture coded error data. In cases where the motion data and error signal data are zero, the COD bit reduces the information needed to code the macroblock because only a single bit is sent rather than additional bits indicating that the motion vector and texture data are not coded.
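The role of the COD bit can be sketched as a one-bit prefix on each interframe macroblock. In this sketch the polarity (1 meaning "not coded"), the bit-string representation, and the `encode_payload` callback are all assumptions made for illustration.

```python
def encode_macroblock(motion_vector, error_coeffs, encode_payload):
    """Return the bits for one interframe macroblock as a string of 0s and 1s.

    Assumption: COD = "1" marks a macroblock with no motion or error data.
    """
    is_skipped = motion_vector == (0, 0) and not any(error_coeffs)
    if is_skipped:
        return "1"  # a single bit represents the whole macroblock
    return "0" + encode_payload(motion_vector, error_coeffs)


def toy_payload(motion_vector, error_coeffs):
    """Hypothetical stand-in for the real motion and texture coding."""
    return "10110"


skipped_bits = encode_macroblock((0, 0), [0, 0, 0, 0], toy_payload)
coded_bits = encode_macroblock((1, -2), [0, 3, 0, 0], toy_payload)
print(skipped_bits, coded_bits)  # prints "1 010110"
```

Every coded macroblock carries the extra leading bit, so the COD bit only pays for itself when a meaningful share of macroblocks is skipped.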
In addition to the COD bit, the coding syntax for macroblocks includes coded block parameters (CBP) indicating whether the coded transform coefficients for chrominance and luminance are transmitted for the macroblock. If the transform coefficients are all zero for a block, then there is no need to send texture data for the block. The coded block parameters for chrominance (CBPC) are two bits indicating whether or not coded texture data is transmitted for each of the two chrominance blocks. The coded block parameters for luminance (CBPY) are four bits indicating whether or not coded texture data is transmitted for each of the four luminance blocks.
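The coded block parameters can be derived directly from the transform coefficients of each block. The sketch below assumes each block is a list of rows of quantized coefficients. The function names and the bit ordering are hypothetical.

```python
def block_is_coded(block):
    """Return True if the block has at least one nonzero coefficient."""
    return any(coeff != 0 for row in block for coeff in row)


def coded_block_parameters(luma_blocks, cb_block, cr_block):
    """Return (CBPY, CBPC) as lists of 0/1 flags, one flag per block."""
    cbpy = [int(block_is_coded(b)) for b in luma_blocks]
    cbpc = [int(block_is_coded(cb_block)), int(block_is_coded(cr_block))]
    return cbpy, cbpc


def make_block(value=0):
    """Build an 8x8 block that is all zero except for one coefficient."""
    block = [[0] * 8 for _ in range(8)]
    block[0][0] = value
    return block


luma = [make_block(5), make_block(), make_block(), make_block(-2)]
cbpy, cbpc = coded_block_parameters(luma, make_block(), make_block(1))
print(cbpy, cbpc)  # prints "[1, 0, 0, 1] [0, 1]"
```

Only the blocks flagged with 1 would have texture data transmitted; the others are reconstructed as all-zero coefficient blocks.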
The CBPC bits are encoded along with another flag that provides information about the type of quantization for the macroblock. These flags are combined to form a parameter called MCBPC, and MCBPC is entropy coded using an entropy coding method such as Huffman or arithmetic coding. The CBPY flags are also entropy coded using either Huffman or arithmetic coding.
While the COD bit has advantages in the coding of scenes with very little motion, it is inefficient for scenes that change frequently and have very few macroblocks with zero motion vectors (i.e., motion vectors indicating zero motion). Thus, there is a need for a more efficient application of the COD bit for these types of scenes.
The variable length code for CBPY is based on the assumption that intraframe macroblocks include more coded luminance blocks than non-coded blocks, while for interframe macroblocks, the opposite is true. This assumption is violated in some cases, which leads to inefficient coding of the CBPY flags.