Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 pictures per second. Each picture can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
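By way of illustration only, the bit rate arithmetic described above can be sketched as follows. The function name and the 176x144 picture size are assumptions chosen for the example; they do not come from the description above.

```python
def raw_bit_rate(width, height, pictures_per_second, bits_per_pixel=24):
    """Return the uncompressed bit rate, in bits per second."""
    pixels_per_picture = width * height
    return pixels_per_picture * bits_per_pixel * pictures_per_second


# Hypothetical example: 176x144 pictures (25,344 pixels) at 15 pictures/second.
print(raw_bit_rate(176, 144, 15))  # 9123840
```

Even this small picture size at the lower picture rate yields roughly 9 million bits/second, consistent with the figure of 5 million bits/second or more.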
Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Or, compression can be lossy, in which quality of the video suffers but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. Intra compression techniques compress individual pictures, typically called I-frames or key frames for progressive video frames. Inter compression techniques compress frames with reference to preceding and/or following frames, and inter-compressed frames are typically called predicted frames, P-frames, or B-frames.
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 1, an interlaced video frame (100) includes top field (110) and bottom field (120). Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, this timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
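The rearrangement of an interlaced frame into a field structure can be sketched, under simplifying assumptions, as follows. The function names are hypothetical, a frame is modeled as a plain list of lines, and the frame is assumed to have an even number of lines.

```python
def split_fields(frame_lines):
    """Group the lines of an interlaced frame into top and bottom fields."""
    top_field = frame_lines[0::2]     # even-numbered lines: 0, 2, 4, ...
    bottom_field = frame_lines[1::2]  # odd-numbered lines: 1, 3, 5, ...
    return top_field, bottom_field


def interleave_fields(top_field, bottom_field):
    """Restore the original alternating line arrangement of the frame."""
    frame_lines = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame_lines.append(top_line)
        frame_lines.append(bottom_line)
    return frame_lines


frame = ["line0", "line1", "line2", "line3"]
top, bottom = split_fields(frame)
print(top)     # ['line0', 'line2']
print(bottom)  # ['line1', 'line3']
```

In this sketch, field coding would operate on the two lists returned by split_fields, whereas frame coding would operate on the interleaved list directly.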
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
II. Display Ordering and Pull-Down
The order in which decoded pictures are displayed is called the display order. The order in which the pictures are transmitted and decoded is called the coded order. The coded order is the same as the display order if there are no B-frames in the sequence. However, if B-frames are present, the coded order may not be the same as the display order because B-frames typically use temporally future reference frames as well as temporally past reference frames, and a temporally future reference frame for a B-frame precedes the B-frame in coded order.
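The relationship between display order and coded order can be sketched as follows. This is a simplified illustration rather than the behavior of any particular codec: frames are modeled as hypothetical labels whose first letter gives the frame type, and each B-frame is assumed to reference the nearest I- or P-frame on either side of it in display order.

```python
def display_to_coded(display_order):
    """Reorder frame labels so each B-frame follows its future reference frame."""
    coded_order = []
    pending_b_frames = []
    for label in display_order:
        if label.startswith("B"):
            # A B-frame cannot be coded until its future reference is coded.
            pending_b_frames.append(label)
        else:
            # An I- or P-frame is coded first, then the B-frames that precede
            # it in display order.
            coded_order.append(label)
            coded_order.extend(pending_b_frames)
            pending_b_frames = []
    # Simplification: trailing B-frames with no future reference go last.
    coded_order.extend(pending_b_frames)
    return coded_order


print(display_to_coded(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

When the input contains no B-frames, the function returns the frames unchanged, matching the observation that coded order equals display order in that case.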
Pull-down is a process where video frame rate is artificially increased through repeated display of the same decoded frames or fields in a video sequence. Pull-down is typically performed in conversions from film to video or vice versa, or in conversions between video formats having different frame rates. For example, pull-down is performed when 24-frame-per-second film is converted to 30-frame-per-second or 60-frame-per-second video.
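One commonly used pull-down cadence for converting 24-frame-per-second film to 60-field-per-second video is 3:2 pull-down, in which film frames are alternately displayed for three fields and two fields. The sketch below illustrates that cadence only. It is not drawn from the description above, the function name is hypothetical, and top/bottom field parity is ignored for brevity.

```python
def pulldown_3_2(film_frames):
    """Expand film frames into a field sequence using a 3:2 cadence."""
    fields = []
    for index, frame in enumerate(film_frames):
        # Even-indexed frames are shown for three fields, odd-indexed for two.
        repeat_count = 3 if index % 2 == 0 else 2
        fields.extend([frame] * repeat_count)
    return fields


print(pulldown_3_2(["A", "B", "C", "D"]))
# ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D']
```

Four film frames become ten fields, so 24 film frames become 60 fields, which corresponds to one second of 60-field-per-second video.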
III. Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another title for MPEG 2), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union [“ITU”]. These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. Codecs designed in compliance with these standards use (or support the use of) different combinations of intra-picture and inter-picture decompression and compression.
A. Signaling for Field Ordering and Field/Frame Repetition in the Standards
Some international standards describe bitstream elements for signaling field display order and for signaling whether certain fields or frames are to be repeated during display. The H.262 standard uses picture coding extension elements top_field_first and repeat_first_field to indicate field display order and field display repetition. When the sequence extension syntax element progressive_sequence is set to ‘1’ (indicating the coded video sequence contains only progressive frames), top_field_first and repeat_first_field indicate how many times a reconstructed frame is to be output (i.e., once, twice or three times) by an H.262 decoder. When progressive_sequence is ‘0’ (indicating the coded video sequence may contain progressive or interlaced frames, whether frame-coded or field-coded), top_field_first indicates which field of a reconstructed frame the decoder outputs first, and repeat_first_field indicates whether the first field in the frame is to be repeated in the output of the decoder.
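The H.262 signaling described above can be sketched as a small decision function. The function name and return format are hypothetical. The specific mapping of flag values to one, two or three frame outputs reflects a reading of the H.262 semantics and is an assumption here, since the description above states only that the two flags together select among those outcomes. Other constraints that H.262 places on these flags are omitted.

```python
def h262_display_output(progressive_sequence, top_field_first, repeat_first_field):
    """Return the list of frames or fields a decoder outputs for one picture."""
    if progressive_sequence:
        # Progressive-only sequence: the two flags select a frame repeat count.
        if not repeat_first_field:
            count = 1
        elif not top_field_first:
            count = 2
        else:
            count = 3
        return ["frame"] * count

    # Otherwise top_field_first selects which field is output first, and
    # repeat_first_field selects whether that first field is output again.
    if top_field_first:
        fields = ["top", "bottom"]
    else:
        fields = ["bottom", "top"]
    if repeat_first_field:
        fields.append(fields[0])
    return fields


print(h262_display_output(1, 1, 1))  # ['frame', 'frame', 'frame']
print(h262_display_output(0, 1, 1))  # ['top', 'bottom', 'top']
```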
The MPEG 4 Part 2 (Visual) standard describes a top_field_first element for indicating field display order. In MPEG 4, top_field_first is a video object plane syntax element that indicates which field (top or bottom) of a reconstructed video object plane the decoder outputs first.
According to draft JVT-d157 of the JVT/AVC video standard, the slice header element pic_structure takes on one of five values to identify a picture as being one of five types: progressive frame, top field, bottom field, interlaced frame with top field first in time, or interlaced frame with bottom field first in time.
B. Hypothetical Reference Decoders in the Standards
For many video codecs and coding standards, a bitstream is compliant if it can be decoded, at least conceptually, by a mathematical model of a decoder that is connected to the output of an encoder. For example, such a model decoder is known as a hypothetical reference decoder [“HRD”] in the H.263 coding standard, and a video buffering verifier [“VBV”] in the H.262 coding standard. In general, a real decoder device (or terminal) comprises a decoder buffer, a decoder, and a display unit. If a real decoder device is constructed according to the mathematical model of the decoder, and a compliant bitstream is transmitted to the device under specific conditions, then the decoder buffer will not overflow or underflow and decoding will be performed correctly.
Some previous reference (model) decoders assume that a bitstream will be transmitted through a channel at a given constant bit rate, and will be decoded (after a given buffering delay) by a device having some given buffer size. Therefore, these models are quite inflexible in that they do not address the requirements of many of today's important video applications such as broadcasting live video, or streaming pre-encoded video on demand over network paths with various peak bit rates, to devices with various buffer sizes.
In these previous reference decoders, the video bitstream is received at a given constant bit rate (usually the average rate in bits per second of the stream) and is stored in the decoder buffer until the buffer reaches some desired level of fullness. For example, at least the data corresponding to one initial frame of video information is needed before decoding can reconstruct an output frame therefrom. This desired level is denoted as the initial decoder buffer fullness and, at a constant bit rate, is directly proportional to a transmission or start-up (buffer) delay expressed in units of time. Once this fullness is reached, the reference decoder instantaneously removes the bits for the first video frame or field of the sequence, and decodes the bits to display the frame or field. The decoder buffer may operate on a frame or a field basis. For example, the MPEG-2 Video standard manages the buffer model on a picture basis: in the progressive mode, a picture is a frame, while in the interlaced mode, it is a field. The bits for the following frames are also removed, decoded, and displayed instantaneously at subsequent time intervals.
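The constant bit rate buffer model described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the normative model of any standard: all names are hypothetical, bits arrive at a perfectly constant rate, one picture is removed instantaneously per picture interval, and the numeric values are invented for the example.

```python
def startup_delay_seconds(initial_fullness_bits, bit_rate):
    """At a constant bit rate, start-up delay is proportional to fullness."""
    return initial_fullness_bits / bit_rate


def simulate_cbr_buffer(picture_sizes_bits, bit_rate, picture_rate,
                        buffer_size_bits, initial_fullness_bits):
    """Return buffer fullness after each picture removal; raise on violation."""
    bits_per_interval = bit_rate / picture_rate
    fullness = initial_fullness_bits
    trace = []
    for size in picture_sizes_bits:
        if fullness > buffer_size_bits:
            raise OverflowError("decoder buffer overflow")
        if size > fullness:
            raise ValueError("decoder buffer underflow")
        fullness -= size                # instantaneous removal of one picture
        trace.append(fullness)
        fullness += bits_per_interval   # bits arriving before the next removal
    return trace


print(startup_delay_seconds(80_000, 1_000_000))  # 0.08
print(simulate_cbr_buffer([60_000, 30_000, 40_000], 1_000_000, 25,
                          200_000, 80_000))
# [20000, 30000.0, 30000.0]
```

In this sketch a bitstream would be considered compliant for the given rate, buffer size and initial fullness if the simulation completes without raising either exception.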
The MPEG-2 video standard includes a “vbv_delay” parameter, which is present in the header of each picture to indicate the time required to load data into the elementary stream buffer before decoding can start. However, in the case of Variable Bit Rate encoding, the vbv_delay value and the value of the bit_rate field in the MPEG-2 sequence header are often not sufficient to derive a time at which the first video access unit can be decoded. As a result, upon tuning or seeking to a location in an MPEG-2 video bitstream, decoding time for the first video access unit is typically derived from the underlying transport protocol. In the case of MPEG-2 Transport, a DTS (Decoding Time Stamp) in the PES header defines the time at which decoding should occur.
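For illustration, the conversion of a vbv_delay value to a time can be sketched as follows. The description above does not give the units. The sketch assumes, based on a reading of the MPEG-2 video specification, that vbv_delay is a 16-bit count of 90 kHz clock periods and that the value 0xFFFF is used with variable bit rate streams to signal that no usable delay is given. The function and constant names are hypothetical.

```python
CLOCK_HZ = 90_000               # assumed 90 kHz time base for vbv_delay
VBV_DELAY_UNSPECIFIED = 0xFFFF  # assumed marker used with variable bit rate


def vbv_delay_to_seconds(vbv_delay):
    """Convert a vbv_delay value to seconds, or None if it is unspecified."""
    if vbv_delay == VBV_DELAY_UNSPECIFIED:
        # No delay can be derived; timing must come from elsewhere,
        # for example a decoding time stamp in the transport layer.
        return None
    return vbv_delay / CLOCK_HZ


print(vbv_delay_to_seconds(45_000))  # 0.5
print(vbv_delay_to_seconds(0xFFFF))  # None
```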
C. Limitations of the Standards
The international standards are limited in their management of the decoder buffer. For example, an MPEG-2 encoder produces and inserts in the bitstream a delay value such as vbv_delay, which requires a time calculation. MPEG-2 time stamps are also dependent on the underlying synchronization layer (timing units and timing accuracy therefore need to be factored in). Accordingly, the resulting MPEG-2 video elementary stream cannot be carried back and forth across various transport protocols (such as ASF, MPEG-2 Systems, and RTP) without conversions, which require more calculations and negatively impact the accuracy of the decoder timing values. This is undesirable, as today's digital video distribution systems are becoming more complex and typically involve some type of transport re-mapping at some point in the delivery chain.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.