Full-motion video displays based upon analog video signals have long been available in the form of television. With recent advances in computer processing capabilities and affordability, full-motion video displays based upon digital video signals are becoming more widely available. Digital video systems can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, and playing full-motion video sequences.
Digital video displays include large numbers of image frames that are played or rendered successively at frequencies of between 30 and 75 Hz. Each image frame is a still image formed from an array of pixels based on the display resolution of a particular system. As examples, VHS-based systems have display resolutions of 320×480 pixels, NTSC-based systems have display resolutions of 720×486 pixels, and high-definition television (HDTV) systems under development have display resolutions of 1360×1024 pixels.
The amounts of raw digital information included in video sequences are massive. Storage and transmission of these amounts of video information is infeasible with conventional personal computer equipment. Consider, for example, a digitized form of a relatively low resolution VHS image format having a 320×480 pixel resolution. A full-length motion picture of two hours in duration at this resolution corresponds to 100 gigabytes of digital video information. By comparison, conventional compact optical disks (CDs) have capacities of about 0.6 gigabytes, magnetic hard disks have capacities of 1-2 gigabytes, and compact optical disks under development have capacities of up to 8 gigabytes.
To address the limitations in storing or transmitting such massive amounts of digital video information, various video compression standards or processes have been established, including MPEG-1, MPEG-2, and H.26X. These video compression techniques utilize similarities between successive image frames, referred to as temporal or interframe correlation, to provide interframe compression in which motion data and error signals are used to encode changes between frames.
In addition, the conventional video compression techniques utilize similarities within image frames, referred to as spatial or intraframe correlation, to provide intraframe compression in which the image samples within an image frame are compressed. Intraframe compression is based upon conventional processes for compressing still images, such as discrete cosine transform (DCT) encoding. This type of coding is sometimes referred to as “texture” or “transform” coding. A “texture” generally refers to a two-dimensional array of image sample values, such as an array of chrominance and luminance values or an array of alpha (opacity) values. The term “transform” in this context refers to how the image samples are transformed into spatial frequency components during the coding process. This use of the term “transform” should be distinguished from a geometric transform used to estimate scene changes in some interframe compression methods.
Interframe compression typically utilizes motion estimation and compensation to encode scene changes between frames. Motion estimation is a process for estimating the motion of image samples (e.g., pixels) between frames. Using motion estimation, the encoder attempts to match blocks of pixels in one frame with corresponding pixels in another frame. After the most similar block is found in a given search area, the change in position of the pixel locations of the corresponding pixels is approximated and represented as motion data, such as a motion vector. Motion compensation is a process for determining a predicted image and computing the error between the predicted image and the original image. Using motion compensation, the encoder applies the motion data to an image and computes a predicted image. The difference between the predicted image and the input image is called the error signal. Since the error signal is just an array of values representing the difference between image sample values, it can be compressed using the same texture coding method as used for intraframe coding of image samples.
I. Macroblock-Based Intra- and Inter-Frame Compression
Although differing in specific implementations, the MPEG-1, MPEG-2, and H.26X video compression standards are similar in a number of respects. The following description of the MPEG-2 video compression standard is generally applicable to the others.
MPEG-2 provides interframe compression and intraframe compression based upon square blocks or arrays of pixels in video images. A video image is divided into image sample blocks called macroblocks having dimensions of 16×16 pixels. In MPEG-2, a macroblock comprises four luminance blocks (each block is 8×8 samples of luminance (Y)) and two chrominance blocks (one 8×8 sample block each for Cb and Cr).
In MPEG-2, interframe coding is performed on macroblocks. An MPEG-2 encoder performs motion estimation and compensation to compute motion vectors and block error signals. For each block MN in an image frame N, a search is performed across the image of a next successive video frame N+1 or immediately preceding image frame N−1 (i.e., bi-directionally) to identify the most similar respective blocks MN+1 or MN−1. The location of the most similar block relative to the block MN is encoded with a motion vector (DX,DY). The motion vector is then used to compute a block of predicted sample values. These predicted sample values are compared with block MN to determine the block error signal. The error signal is compressed using a texture coding method such as discrete cosine transform (DCT) encoding.
II. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 1, an interlaced video frame 100 includes top field 110 and bottom field 120. Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). This timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present when the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. Vertical Macroblock Alignment
As previously remarked, previously existing video systems have had a variety of display resolutions or picture sizes, such as the NTSC resolution of 720×486 pixels, etc. On the other hand, digital video compression standards generally are macroblock-based. Digital video compression standards therefore have generally restricted the vertical size or height of the video to be an integral number of macroblocks. In cases where the vertical size of input video is less than an integral number of macroblocks, the pictures are extended or padded out to an integral number of macroblocks.
In most video standards, there is a different height restriction for video sequences coded entirely in progressive mode than is imposed on interlaced or mixed interlaced/progressive content. In particular, a video sequence of entirely progressive content must have a height that is a multiple of 16 pixels (an integral number of progressive mode macroblocks). Whereas, a video sequence with interlaced or mixed interlaced/progressive content is restricted in height to multiples of 32 pixels. This is because each field of the interlaced content is required to be an integral number of macroblocks in height, such that the frame as a whole (which is composed of two fields) must be a multiple of 2 macroblocks in height. In cases where the input video does not meet these requirements, such as video captured at NTSC resolution, the video is extended to a legal height through padding prior to compression. Later, the decoder decodes the height extended data and extracts out the original video content. This padding of the video introduces overhead in the encoding as the extended data outside the original video (padding) has to be compressed and sent to the decoder.