Full-motion video displays based upon analog video signals have long been available in the form of television. With recent advances in computer processing capabilities and affordability, full-motion video displays based upon digital video signals are becoming more widely available. Digital video systems can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, and playing full-motion video sequences.
Digital video displays include large numbers of image frames that are played or rendered successively at frequencies of between 30 and 75 Hz. Each image frame is a still image formed from an array of pixels based on the display resolution of a particular system. As examples, VHS-based systems have display resolutions of 320×480 pixels, NTSC-based systems have display resolutions of 720×486 pixels, and high-definition television (HDTV) systems under development have display resolutions of 1360×1024 pixels.
The amounts of raw digital information included in video sequences are massive. Storage and transmission of these amounts of video information are infeasible with conventional personal computer equipment. Consider, for example, a digitized form of a relatively low resolution VHS image format having a 320×480 pixel resolution. A full-length motion picture of two hours in duration at this resolution corresponds to about 100 gigabytes of digital video information. By comparison, conventional compact optical disks have capacities of about 0.6 gigabytes, magnetic hard disks have capacities of 1–2 gigabytes, and compact optical disks under development have capacities of up to 8 gigabytes.
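By way of illustration only, the storage figure above can be checked with a short calculation. The color depth (24 bits per pixel) and frame rate (30 frames per second) used here are assumptions, since the text states only the resolution and the duration.

```python
# Assumed values (not stated in the text): 3 bytes per pixel, 30 frames per second.
WIDTH, HEIGHT = 320, 480
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 30
DURATION_SECONDS = 2 * 60 * 60  # two-hour motion picture


def raw_video_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Size in bytes of an uncompressed video sequence."""
    return width * height * bytes_per_pixel * fps * seconds


total_bytes = raw_video_bytes(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND, DURATION_SECONDS
)
print(f"{total_bytes / 1e9:.1f} GB")  # prints "99.5 GB"
```

Under these assumptions the result is consistent with the figure of roughly 100 gigabytes, which is more than 100 times the capacity of a 0.6 gigabyte compact optical disk.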
To address the limitations in storing or transmitting such massive amounts of digital video information, various video compression standards or processes have been established, including MPEG-1, MPEG-2, and H.26X. These video compression techniques utilize similarities between successive image frames, referred to as temporal or interframe correlation, to provide interframe compression in which motion data and error signals are used to encode changes between frames.
In addition, the conventional video compression techniques utilize similarities within image frames, referred to as spatial or intraframe correlation, to provide intraframe compression in which the image samples within an image frame are compressed. Intraframe compression is based upon conventional processes for compressing still images, such as discrete cosine transform (DCT) encoding. This type of coding is sometimes referred to as “texture” or “transform” coding. A “texture” generally refers to a two-dimensional array of image sample values, such as an array of chrominance and luminance values or an array of alpha (opacity) values. The term “transform” in this context refers to how the image samples are transformed into spatial frequency components during the coding process. This use of the term “transform” should be distinguished from a geometric transform used to estimate scene changes in some interframe compression methods.
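As a minimal sketch of the "transform" step described above, the following computes a two-dimensional DCT of a block of image samples. The 8×8 block size and the orthonormal scaling are assumptions chosen for illustration, and the function name `dct_2d` is hypothetical. Quantization and entropy coding of the resulting coefficients are omitted.

```python
import math

N = 8  # assumed block size


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    )
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# A flat block has no spatial variation, so only the DC coefficient is nonzero.
flat_block = [[128] * N for _ in range(N)]
flat_coeffs = dct_2d(flat_block)
```

For the flat block, all of the energy lands in the single coefficient `flat_coeffs[0][0]`, which shows why smooth image regions compress well once samples are expressed as spatial frequency components.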
Interframe compression typically utilizes motion estimation and compensation to encode scene changes between frames. Motion estimation is a process for estimating the motion of image samples (e.g., pixels) between frames. Using motion estimation, the encoder attempts to match blocks of pixels in one frame with corresponding pixels in another frame. After the most similar block is found in a given search area, the change in position of the pixel locations of the corresponding pixels is approximated and represented as motion data, such as a motion vector. Motion compensation is a process for determining a predicted image and computing the error between the predicted image and the original image. Using motion compensation, the encoder applies the motion data to an image and computes a predicted image. The difference between the predicted image and the input image is called the error signal. Since the error signal is just an array of values representing the difference between image sample values, it can be compressed using the same texture coding method as used for intraframe coding of image samples.
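The motion estimation and compensation steps described above can be sketched as follows. This is an illustrative sketch only: the exhaustive search, the sum of absolute differences (SAD) as the similarity measure, and the function names are all assumptions, not the method of any particular standard.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total


def estimate_motion(cur, ref, x, y, size, search):
    """Exhaustively search ref for the block most similar to cur's block at (x, y)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, x, y, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def compensate(cur, ref, x, y, size, mv):
    """Build the predicted block from ref and the error relative to cur."""
    dx, dy = mv
    predicted = [[ref[y + dy + j][x + dx + i] for i in range(size)] for j in range(size)]
    error = [[cur[y + j][x + i] - predicted[j][i] for i in range(size)] for j in range(size)]
    return predicted, error


def make_frame(px, py):
    """8x8 test frame containing a 2x2 bright patch at (px, py)."""
    frame = [[0] * 8 for _ in range(8)]
    for j in range(2):
        for i in range(2):
            frame[py + j][px + i] = 200
    return frame


ref_frame = make_frame(2, 2)
cur_frame = make_frame(3, 3)  # same patch moved one pixel right and down
mv = estimate_motion(cur_frame, ref_frame, 3, 3, size=2, search=2)
predicted, error = compensate(cur_frame, ref_frame, 3, 3, 2, mv)
```

In this toy example the patch is found one pixel up and to the left in the reference frame, so `mv` is `(-1, -1)` and the error block is all zeros. Only the motion vector would need to be sent for this block.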
Although differing in specific implementations, the MPEG-1, MPEG-2, and H.26X video compression standards are similar in a number of respects. The following description of the MPEG-2 video compression standard is generally applicable to the others.
MPEG-2 provides interframe compression and intraframe compression based upon square blocks or arrays of pixels in video images. A video image is divided into image sample blocks called macroblocks having dimensions of 16×16 pixels. In MPEG-2, a macroblock comprises four luminance blocks (each block is 8×8 samples of luminance (Y)) and two chrominance blocks (one 8×8 sample block each for Cb and Cr).
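The macroblock organization described above can be sketched as a simple data structure. The raster ordering of the four luminance blocks and the names `split_luma` and `make_macroblock` are assumptions made for illustration. The chrominance blocks are taken to be already subsampled to 8×8, as implied by one block each for Cb and Cr.

```python
def split_luma(luma_16x16):
    """Split a 16x16 luminance array into four 8x8 blocks (assumed raster order)."""
    blocks = []
    for top in (0, 8):
        for left in (0, 8):
            blocks.append([row[left:left + 8] for row in luma_16x16[top:top + 8]])
    return blocks


def make_macroblock(luma_16x16, cb_8x8, cr_8x8):
    """Group four Y blocks and one block each of Cb and Cr into one macroblock."""
    return {"Y": split_luma(luma_16x16), "Cb": cb_8x8, "Cr": cr_8x8}


# Example: luminance samples numbered 0..255 so block boundaries are easy to see.
luma = [[r * 16 + c for c in range(16)] for r in range(16)]
chroma = [[0] * 8 for _ in range(8)]
macroblock = make_macroblock(luma, chroma, chroma)
```

Each macroblock therefore carries six 8×8 blocks in total, which is the unit on which the coded block flags discussed later operate.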
In MPEG-2, interframe coding is performed on macroblocks. An MPEG-2 encoder performs motion estimation and compensation to compute motion vectors and block error signals. For each block M_N in an image frame N, a search is performed across the image of a next successive video frame N+1 or immediately preceding image frame N−1 (i.e., bi-directionally) to identify the most similar respective blocks M_(N+1) or M_(N−1). The location of the most similar block relative to the block M_N is encoded with a motion vector (DX,DY). The motion vector is then used to compute a block of predicted sample values. These predicted sample values are compared with block M_N to determine the block error signal. The error signal is compressed using a texture coding method such as discrete cosine transform (DCT) encoding.
Object-based video coding techniques have been proposed as an improvement to the conventional frame-based coding standards. In object-based coding, arbitrarily shaped image features are separated from the frames in the video sequence using a method called “segmentation.” The video objects or “segments” are coded independently. Object-based coding can improve the compression rate because it increases the interframe correlation between video objects in successive frames. It is also advantageous for a variety of applications that require access to and tracking of objects in a video sequence.
In the object-based video coding methods proposed for the MPEG-4 standard, the shape, motion and texture of video objects are coded independently. The shape of an object is represented by a binary or alpha mask that defines the boundary of the arbitrarily shaped object in a video frame. The motion of an object is similar to the motion data of MPEG-2, except that it applies to an arbitrarily shaped image of the object that has been segmented from a rectangular frame. Motion estimation and compensation are performed on blocks of a “video object plane” rather than the entire frame. The video object plane is the name for the shaped image of an object in a single frame.
The texture of a video object is the image sample information in a video object plane that falls within the object's shape. Texture coding of an object's image samples and error signals is performed using texture coding methods similar to those used in frame-based coding. For example, a segmented image can be fitted into a bounding rectangle formed of macroblocks. The rectangular image formed by the bounding rectangle can be compressed just like a rectangular frame, except that transparent macroblocks need not be coded. Partially transparent blocks are coded after filling in the portions of the block that fall outside the object's shape boundary with sample values in a technique called “padding.”
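The handling of transparent and partially transparent blocks described above can be sketched as follows. The fill rule used here (the mean of the samples inside the shape) is an assumption chosen for simplicity and is not the padding rule of any particular standard. The function names are hypothetical, and a 2×2 block is used only to keep the example short.

```python
def classify_block(mask_block):
    """Classify a block from its binary shape mask (1 = inside the object)."""
    flat = [m for row in mask_block for m in row]
    if not any(flat):
        return "transparent"  # nothing to code
    if all(flat):
        return "opaque"       # coded like a block of a rectangular frame
    return "boundary"         # partially transparent, needs padding


def pad_block(samples, mask_block):
    """Fill samples outside the shape with the mean of the samples inside it."""
    inside = [
        s
        for sample_row, mask_row in zip(samples, mask_block)
        for s, m in zip(sample_row, mask_row)
        if m
    ]
    fill = round(sum(inside) / len(inside))
    return [
        [s if m else fill for s, m in zip(sample_row, mask_row)]
        for sample_row, mask_row in zip(samples, mask_block)
    ]


samples = [[10, 20], [30, 99]]  # 99 lies outside the object and is arbitrary
mask = [[1, 1], [1, 0]]
kind = classify_block(mask)
padded = pad_block(samples, mask) if kind == "boundary" else samples
```

Here the block is classified as a boundary block, and the one sample outside the shape is replaced by 20, the mean of the three samples inside it.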
Frame-based coding techniques such as MPEG-2 and H.26X and object-based coding techniques proposed for MPEG-4 are similar in that they perform intraframe and interframe coding on macroblocks. Each macroblock includes a series of overhead parameters that provide information about the macroblock. As an example, FIG. 1 shows macroblock parameters used in the header of an interframe macroblock. The COD parameter (10) is a single bit indicating whether the interframe macroblock is coded. In particular, this bit indicates whether or not the encoded macroblock includes motion data and texture coded error data. In cases where the motion and error signal data are zero, the COD bit reduces the information needed to code the macroblock because only a single bit is sent rather than additional bits indicating that the motion vector and texture data are not coded.
In addition to the COD bit, the coding syntax for macroblocks includes coded block parameters (CBP) indicating whether the coded transform coefficients for chrominance and luminance are transmitted for the macroblock. If the transform coefficients are all zero for a block, then there is no need to send texture data for the block. The Coded Block Parameters for chrominance (CBPC) are two bits indicating whether or not coded texture data is transmitted for each of the two chrominance blocks.
The CBPC bits are encoded along with another flag that provides information about the type of quantization for the macroblock. These flags are combined to form a parameter called MCBPC (12), and MCBPC is entropy coded using an entropy coding method such as Huffman or arithmetic coding.
The parameter called the AC_Pred_flag (14) is a flag indicating whether AC prediction is used in the macroblock.
The Coded Block Pattern for luminance (CBPY) (16) consists of four bits indicating whether or not coded texture data is transmitted for each of the four luminance blocks. Like the MCBPC parameter, the CBPY flags are also entropy coded using either Huffman or arithmetic coding.
After the CBPY parameter, the macroblock includes encoded motion vector data (shown as item 18 in FIG. 1). Following the motion vector data, the “block data” represents the encoded texture data for the macroblock (shown as block data 20 in FIG. 1).
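The header parameters described above can be sketched as follows, with the fields listed in the order shown in FIG. 1. This is an illustrative sketch only. The function names, the dictionary layout, and the use of a boolean `coded` field in place of the actual COD bit polarity are assumptions. The combination of CBPC with the quantization flag into MCBPC, the entropy coding of MCBPC and CBPY, and the block data itself are omitted.

```python
def block_is_coded(coeffs):
    """Texture data is needed only if some transform coefficient is nonzero."""
    return any(c != 0 for row in coeffs for c in row)


def macroblock_header(luma_blocks, cb_block, cr_block, motion_vector, ac_pred=False):
    """Derive the header fields for one interframe macroblock."""
    cbpy = [int(block_is_coded(b)) for b in luma_blocks]                   # four flags
    cbpc = [int(block_is_coded(cb_block)), int(block_is_coded(cr_block))]  # two flags
    coded = motion_vector != (0, 0) or any(cbpy) or any(cbpc)
    if not coded:
        # Zero motion and zero error: a single indication is all that is sent.
        return {"coded": False}
    return {
        "coded": True,
        "CBPC": cbpc,
        "AC_Pred_flag": int(ac_pred),
        "CBPY": cbpy,
        "motion_vector": motion_vector,
    }


zero = [[0] * 8 for _ in range(8)]
nonzero = [[0] * 8 for _ in range(8)]
nonzero[0][0] = 5  # a single nonzero coefficient

header = macroblock_header([nonzero, zero, zero, nonzero], zero, nonzero, (1, -2))
skipped = macroblock_header([zero] * 4, zero, zero, (0, 0))
```

In this example `header["CBPY"]` is `[1, 0, 0, 1]` and `header["CBPC"]` is `[0, 1]`, so texture data would be sent for only three of the six blocks. The `skipped` macroblock reduces to the single not-coded indication.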
One drawback of the coding approach illustrated in FIG. 1 is that it codes CBPC and CBPY flags separately, and therefore, does not exploit the correlation between these parameters to reduce the macroblock overhead. In addition, it does not take advantage of the spatial dependency of the coded block parameters.