Full-motion video displays based upon analog video signals have long been available in the form of television. With recent advances in computer processing capabilities and affordability, full-motion video displays based upon digital video signals are becoming more widely available. Digital video systems can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, and playing full-motion video sequences.
Digital video displays include large numbers of image frames that are played or rendered successively at frequencies of between 30 and 75 Hz. Each image frame is a still image formed from an array of pixels based on the display resolution of a particular system. As examples, VHS-based systems have display resolutions of 320×480 pixels, NTSC-based systems have display resolutions of 720×486 pixels, and high-definition television (HDTV) systems under development have display resolutions of 1360×1024 pixels.
The amounts of raw digital information included in video sequences are massive. Storage and transmission of these amounts of video information are infeasible with conventional personal computer equipment. Consider, for example, a digitized form of a relatively low resolution VHS image format having a 320×480 pixel resolution. A full-length motion picture of two hours in duration at this resolution corresponds to 100 gigabytes of digital video information. By comparison, conventional compact optical disks (CDs) have capacities of about 0.6 gigabytes, magnetic hard disks have capacities of 1-2 gigabytes, and compact optical disks under development have capacities of up to 8 gigabytes.
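The 100-gigabyte figure can be checked with simple arithmetic. The sketch below assumes 24 bits (3 bytes) per pixel and 30 frames per second; neither value is stated in the text, so both are illustrative assumptions, and the function name `raw_video_bytes` is hypothetical.

```python
def raw_video_bytes(width, height, seconds, fps=30, bytes_per_pixel=3):
    """Uncompressed size in bytes of a video sequence.

    fps and bytes_per_pixel are assumed defaults, not values from the text.
    """
    return width * height * bytes_per_pixel * fps * seconds


two_hours = 2 * 60 * 60  # duration in seconds
size = raw_video_bytes(320, 480, two_hours)
print(f"{size / 1e9:.1f} GB")            # size in decimal gigabytes
print(f"{size / 0.6e9:.0f} CDs needed")  # at about 0.6 GB per CD
```

Under these assumptions the result is roughly 99.5 gigabytes, consistent with the 100-gigabyte figure and far beyond the storage capacities listed above.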
To address the limitations in storing or transmitting such massive amounts of digital video information, various video compression standards or processes have been established, including MPEG-1, MPEG-2, and H.26X. These video compression techniques utilize similarities between successive image frames, referred to as temporal or interframe correlation, to provide interframe compression in which motion data and error signals are used to encode changes between frames.
In addition, the conventional video compression techniques utilize similarities within image frames, referred to as spatial or intraframe correlation, to provide intraframe compression in which the image samples within an image frame are compressed. Intraframe compression is based upon conventional processes for compressing still images, such as discrete cosine transform (DCT) encoding. This type of coding is sometimes referred to as “texture” or “transform” coding. A “texture” generally refers to a two-dimensional array of image sample values, such as an array of chrominance and luminance values or an array of alpha (opacity) values. The term “transform” in this context refers to how the image samples are transformed into spatial frequency components during the coding process. This use of the term “transform” should be distinguished from a geometric transform used to estimate scene changes in some interframe compression methods.
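As an illustration of how a transform maps image samples into spatial frequency components, the following is a minimal sketch of a separable two-dimensional DCT in plain Python. The function names `dct_1d` and `dct_2d` and the 4×4 block size are assumptions made for brevity; the standards define their own block sizes, transforms, and quantization steps, none of which are shown here.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II of a 1-D list of samples."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            s * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, s in enumerate(samples)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform each row, then each column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row order


# A flat block has no spatial variation, so only the DC term is non-zero.
flat = [[128] * 4 for _ in range(4)]
coeffs = dct_2d(flat)
```

For the flat block, all of the energy lands in the single coefficient `coeffs[0][0]`, which is the property that makes smooth image regions cheap to encode after the transform.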
Interframe compression typically utilizes motion estimation and compensation to encode scene changes between frames. Motion estimation is a process for estimating the motion of image samples (e.g., pixels) between frames. Using motion estimation, the encoder attempts to match blocks of pixels in one frame with corresponding pixels in another frame. After the most similar block is found in a given search area, the change in position of the pixel locations of the corresponding pixels is approximated and represented as motion data, such as a motion vector. Motion compensation is a process for determining a predicted image and computing the error between the predicted image and the original image. Using motion compensation, the encoder applies the motion data to an image and computes a predicted image. The difference between the predicted image and the input image is called the error signal. Since the error signal is just an array of values representing the difference between image sample values, it can be compressed using the same texture coding method as used for intraframe coding of image samples.
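The block matching and error computation described above can be sketched as follows. This is a simplified illustration, not the method of any particular standard: it assumes an exhaustive search over a small window, the sum of absolute differences (SAD) as the similarity measure, and frames stored as lists of rows. The names `sad`, `estimate_motion`, and `compensate` are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def estimate_motion(cur, ref, cx, cy, size=4, search=2):
    """Full search: find the (dx, dy) that best matches the current block."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if not (0 <= rx <= width - size and 0 <= ry <= height - size):
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def compensate(cur, ref, cx, cy, mv, size=4):
    """Predict the block from the reference and return the error signal."""
    dx, dy = mv
    return [
        [cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i] for i in range(size)]
        for j in range(size)
    ]


ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
# Current frame: the reference content shifted one pixel to the right.
cur = [[row[max(x - 1, 0)] for x in range(8)] for row in ref]
mv = estimate_motion(cur, ref, cx=2, cy=2)
error = compensate(cur, ref, 2, 2, mv)
```

In this toy case the motion vector points one pixel to the left in the reference frame and the error signal is all zeros. With real video the match is rarely exact, and the non-zero error array is what gets passed to texture coding.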
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 1, an interlaced video frame 100 includes top field 110 and bottom field 120. Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, jagged tooth-like features can appear in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding, in which the original alternating field line arrangement is preserved, is often used for stationary or low-motion interlaced video frames.
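The rearrangement between frame structure and field structure can be sketched as below. The sketch assumes a frame is a list of lines numbered from 0 with an even line count; the function names `split_fields` and `interleave_fields` are hypothetical.

```python
def split_fields(frame):
    """Separate an interlaced frame (list of lines) into its two fields.

    Lines are numbered from 0, so even-numbered lines form the top field.
    """
    top = frame[0::2]
    bottom = frame[1::2]
    return top, bottom


def interleave_fields(top, bottom):
    """Rebuild the frame by alternating top-field and bottom-field lines."""
    frame = []
    for t, b in zip(top, bottom):
        frame.extend([t, b])
    return frame


frame = ["t0", "b0", "t1", "b1"]
top, bottom = split_fields(frame)
```

Field coding would operate on `top` and `bottom` separately, while frame coding would operate on the interleaved `frame`. The two functions are inverses of each other, so no information is lost by choosing either arrangement.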
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
II. Display Ordering and Pull-down
The order in which decoded pictures are displayed is called the display order. The order in which the pictures are transmitted and decoded is called the coded order. The coded order is the same as the display order if there are no B-frames in the sequence. However, if B-frames are present, the coded order may not be the same as the display order because B-frames typically use temporally future reference frames as well as temporally past reference frames.
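The relationship between the two orders can be illustrated with a short sketch. It assumes, for simplicity, that each B-frame refers to the nearest preceding and following I- or P-frame in display order, and that pictures are labeled by strings whose first letter gives the type. The function name `coded_order` is hypothetical.

```python
def coded_order(display_order):
    """Reorder pictures so each B-frame follows both of its reference frames.

    Each picture is a string whose first letter is its type: I, P, or B.
    """
    coded, pending_b = [], []
    for picture in display_order:
        if picture[0] == "B":
            pending_b.append(picture)  # must wait for its future reference
        else:
            coded.append(picture)      # reference frame (I or P) goes first
            coded.extend(pending_b)    # then the B-frames that depend on it
            pending_b = []
    return coded + pending_b  # trailing B-frames, if any, are kept at the end


display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
coded = coded_order(display)
```

Here `P3` is moved ahead of `B1` and `B2` because the decoder needs it before it can reconstruct them. A sequence with no B-frames passes through unchanged, matching the statement that the two orders coincide in that case.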
Pull-down is a process where video frame rate is artificially increased through repeated display of the same decoded frames or fields in a video sequence. Pull-down is typically performed in conversions from film to video or vice versa, or in conversions between video formats having different frame rates. For example, pull-down is performed when 24-frame-per-second film is converted to 30-frame-per-second or 60-frame-per-second video.
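As an illustration of the film-to-video case, the sketch below expands film frames into a sequence of fields using a 2:3 cadence, in which frames alternately contribute two and three fields. The cadence and the function name `pulldown_fields` are assumptions for illustration; the text above does not specify a particular repetition pattern.

```python
def pulldown_fields(film_frames, pattern=(2, 3)):
    """Expand film frames into a field sequence by repeating fields.

    The 2:3 cadence is an assumed, commonly used pattern: frames alternately
    contribute two and three fields, so 4 frames become 10 fields.
    """
    fields = []
    for index, frame in enumerate(film_frames):
        for _ in range(pattern[index % len(pattern)]):
            # Field parity alternates across the whole output sequence.
            parity = "top" if len(fields) % 2 == 0 else "bottom"
            fields.append((frame, parity))
    return fields


fields = pulldown_fields(["A", "B", "C", "D"])
one_second = pulldown_fields(range(24))
```

Four film frames yield ten fields, so 24 film frames yield 60 fields, which pair up into 30 interlaced video frames. Frames `B` and `D` each have one field displayed twice, which is the repetition the pull-down process introduces.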
III. Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another title for MPEG 2), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union [“ITU”]. These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. These standards use (or support the use of) different combinations of intraframe and interframe decompression and compression.
A. Signaling for Field Ordering and Field/Frame Repetition in the Standards
Some international standards describe bitstream elements for signaling field display order and for signaling whether certain fields or frames are to be repeated during display. The H.262 standard uses picture coding extension elements top_field_first and repeat_first_field to indicate field display order and field display repetition. When the sequence extension syntax element progressive_sequence is set to 1 (indicating the coded video sequence contains only progressive frames), top_field_first and repeat_first_field indicate how many times a reconstructed frame is to be output (i.e., once, twice or three times) by an H.262 decoder. When progressive_sequence is 0 (indicating the coded video sequence may contain progressive or interlaced frames (frame-coded or field-coded)), top_field_first indicates which field of a reconstructed frame the decoder outputs first, and repeat_first_field indicates whether the first field in the frame is to be repeated in the output of the decoder.
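The two interpretations of these elements can be sketched as a small decision function. The text above does not say which flag combination selects one, two, or three frame outputs; the mapping used below follows a common reading of the H.262 semantics and is an assumption that should be verified against the standard. The function name `h262_output` and the list-of-strings return value are hypothetical, and constraints the standard places on when the flags may be set are not modeled.

```python
def h262_output(progressive_sequence, top_field_first, repeat_first_field):
    """Describe what a decoder outputs for one reconstructed frame."""
    if progressive_sequence:
        # The two flags jointly select a frame repetition count
        # (assumed mapping; check against the H.262 text).
        if not repeat_first_field:
            count = 1
        elif not top_field_first:
            count = 2
        else:
            count = 3
        return ["frame"] * count
    # Otherwise top_field_first selects the field order ...
    first, second = ("top", "bottom") if top_field_first else ("bottom", "top")
    fields = [first, second]
    if repeat_first_field:
        fields.append(first)  # ... and the first field is output again
    return fields
```

For example, `h262_output(0, 1, 1)` describes a frame displayed as top field, bottom field, then top field again, which is the kind of field repetition used in pull-down.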
The MPEG 4 standard describes a top_field_first element for indicating field display order. In MPEG 4, top_field_first is a video object plane syntax element that indicates which field (top or bottom) of a reconstructed video object plane the decoder outputs first.
According to draft JVT-d157 of the JVT/AVC video standard, the slice header element pic_structure takes on one of five values to identify a picture as being one of five types: progressive frame, top field, bottom field, interlaced frame with top field first in time, or interlaced frame with bottom field first in time.
B. Limitations of the Standards
These international standards are limited in that they do not allow for signaling to indicate the presence or absence of bitstream elements for (1) signaling field display order and (2) signaling whether certain fields or frames are to be repeated during display. For example, although the H.262 standard uses picture coding extension elements top_field_first and repeat_first_field, the H.262 standard does not have a mechanism to “turn off” such elements when they are not needed.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.
IV. Repeat Padding
As noted above, interframe compression is typically performed by applying motion estimation and prediction to the macroblocks in a predicted frame with respect to a reference intra-coded frame. Some previously existing video systems have permitted the motion estimation to extend beyond the active picture contents of the reference intra-coded frame. In some such cases, the video systems have derived the “content” outside the picture by repeating the pixels of the picture edge to “fill” an extended region that may be used for motion estimation purposes. For example, the bottom row of the picture is repeated to vertically expand the picture downward to fill an extended motion estimation region below the picture. Likewise, the top row and the left and right columns are repeated at the top, left, and right sides to provide extended motion estimation regions at those sides of the reference picture. This process of filling areas outside the active picture content is sometimes referred to as “repeat padding.”
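Repeat padding as described above can be sketched in a few lines. The sketch assumes the picture is a list of rows of sample values and that the same padding width is used on every side. The text does not say how corner regions are filled, so filling them with the nearest corner sample is an assumption. The function name `repeat_pad` is hypothetical.

```python
def repeat_pad(picture, pad):
    """Extend a picture by `pad` samples on every side using edge repetition."""
    # Repeat the left and right columns of each row.
    rows = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in picture]
    # Repeat the (already widened) top and bottom rows; this also fills the
    # corner regions with the nearest corner sample (an assumed choice).
    top = [list(rows[0]) for _ in range(pad)]
    bottom = [list(rows[-1]) for _ in range(pad)]
    return top + rows + bottom


picture = [[1, 2],
           [3, 4]]
padded = repeat_pad(picture, 1)
```

The padded reference is larger than the active picture by `2 * pad` samples in each dimension, so a motion search can evaluate candidate blocks that extend past the original picture boundary.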