Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
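The bit-rate figure above follows from simple arithmetic. As an illustration (the frame size here is an assumed example, not one stated in the text), even a modest frame of 176×144 pixels at 24 bits per pixel and 30 frames per second exceeds 5 million bits per second:

```python
# Back-of-the-envelope raw bit rate: 176x144 is an assumed example
# frame size; 24 bits per pixel follows the text (one 8-bit luma
# sample plus two 8-bit chroma samples).
width, height = 176, 144
bits_per_pixel = 24
frames_per_second = 30

bit_rate = width * height * bits_per_pixel * frames_per_second
print(bit_rate)  # 18247680 bits per second, i.e. over 18 Mbit/s
```

Larger frame sizes push the raw bit rate higher still, which motivates the compression techniques discussed next.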
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. For progressive frames, intra-picture compression techniques compress individual frames (typically called I-frames or key frames), and inter-picture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to a preceding and/or following frame (typically called a reference or anchor frame) or frames (for B-frames).
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing in raster scan fashion (left to right, top to bottom) through successive, non-alternating lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
The primary aspect of interlaced video is that the raster scan of an entire video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. This results in each frame containing two fields representing two different time epochs. FIG. 1 shows an interlaced video frame (100) that includes top field (110) and bottom field (120). In the frame (100), the even-numbered lines (top field) are scanned starting at one time (e.g., time t), and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, this timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure for coding, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used for stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
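The field rearrangement described above can be sketched as follows: even-numbered lines (numbered from zero) form the top field and odd-numbered lines form the bottom field.

```python
# A minimal sketch of rearranging an interlaced frame into its two
# fields, as described in the text. A frame is modeled as a list of
# scan lines; lines are numbered from 0.
def split_fields(frame):
    """Return (top_field, bottom_field) for a list of scan lines."""
    top_field = frame[0::2]     # even-numbered lines, scanned starting at time t
    bottom_field = frame[1::2]  # odd-numbered lines, scanned starting at time t+1
    return top_field, bottom_field

frame = ["line0", "line1", "line2", "line3"]
top, bottom = split_fields(frame)
print(top)     # ['line0', 'line2']
print(bottom)  # ['line1', 'line3']
```

Frame coding simply leaves the original alternating line order intact rather than performing this split.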
II. Signaling Frame Type Information in Windows Media Video, Version 9
Microsoft Corporation's Windows Media Video, Version 9 [“WMV9”] includes a video encoder and a video decoder. The encoder and decoder may process progressive or interlaced video content.
For a video sequence, a one-bit sequence-layer syntax element INTERLACE specifies whether the video data is coded in progressive or interlaced mode. If INTERLACE=0, then the video frames are coded in progressive mode. If INTERLACE=1, then the video frames are coded in interlaced mode. Another sequence-layer syntax element NUMBFRAMES is a three-bit field that indicates the number of consecutive B-frames between I- or P-frames. If NUMBFRAMES=0, then there are no B-frames in the video sequence.
A compressed video frame is made up of data structured into three hierarchical layers. From top to bottom the layers are: picture, macroblock, and block. For a frame, a picture-layer syntax element PTYPE indicates whether the frame is an I-frame, P-frame, or B-frame. If NUMBFRAMES=0, then only I- and P-frames are present in the sequence, and PTYPE is signaled with a fixed-length code [“FLC”] as shown in FIG. 2A. If NUMBFRAMES is greater than 0, then B-frames are present in the sequence, and PTYPE is a variable-length code [“VLC”] indicating the picture type of the frame, as shown in FIG. 2B. Thus, the INTERLACE, NUMBFRAMES, and PTYPE elements collectively may indicate the following types of frames: progressive I-frame, interlaced I-frame, progressive P-frame, interlaced P-frame, progressive B-frame, and interlaced B-frame.
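The signaling logic above can be sketched as follows. The actual code tables appear in FIGS. 2A and 2B and are not reproduced here, so the tables below are illustrative placeholders only; what the sketch shows is how NUMBFRAMES selects between the FLC and VLC tables, and how INTERLACE and PTYPE combine to yield one of the six frame types.

```python
# Hypothetical placeholder code tables -- the real tables are in
# FIGS. 2A and 2B of the text and are NOT reproduced here.
FLC_TABLE = {"I": "0", "P": "1"}               # used when NUMBFRAMES = 0
VLC_TABLE = {"I": "110", "P": "0", "B": "10"}  # used when NUMBFRAMES > 0

def ptype_code(frame_type, numbframes):
    """Select the PTYPE code table based on NUMBFRAMES, per the text."""
    if numbframes == 0:
        # Only I- and P-frames are present; PTYPE is a fixed-length code.
        return FLC_TABLE[frame_type]
    # B-frames are present; PTYPE is a variable-length code.
    return VLC_TABLE[frame_type]

def frame_description(interlace, frame_type):
    """Combine the INTERLACE and PTYPE elements into a frame type."""
    mode = "interlaced" if interlace == 1 else "progressive"
    return f"{mode} {frame_type}-frame"

print(ptype_code("B", 3))             # '10' (placeholder VLC)
print(frame_description(0, "B"))      # progressive B-frame
```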
While the encoder and decoder are efficient for many different scenarios and types of content, there is room for improvement in several places. In particular, the encoder and decoder cannot process interlaced video frames as separate fields. Instead, the encoder and decoder process interlaced video frames using frame coding/decoding. A macroblock of an interlaced video frame includes alternating lines from both fields of the frame. The macroblock itself may be frame-coded or field-coded, but separate coding of top and bottom fields as separate pictures is not allowed. This limits inter-operability with codecs that comply with certain international standards. In addition, coding interlaced video with frame coding can be inefficient for certain kinds of content (e.g., high-motion video).
III. Signaling Picture Type Information According to Various Standards
Aside from previous WMV encoders and decoders, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another name for MPEG 2), H.263, and H.264 standards from the International Telecommunication Union [“ITU”]. An encoder and decoder complying with one of these standards typically use a combination of intra-picture and inter-picture compression and decompression. The different standards describe different signaling mechanisms for picture type information.
A. H.262 Standard
According to the H.262 standard, a progressive video frame is coded as a frame picture, and the two fields of an interlaced video frame may be coded together (as a frame picture) or as separate fields (as field pictures). [H.262 standard, section 6.1.1.1.] The three picture types are I-, P-, and B-pictures. [H.262 standard, section 6.1.1.5.]
Various rules address combinations of field pictures for interlaced video frames in the H.262 standard. [H.262 standard, section 6.1.1.4.1.] When the first picture of a coded frame is a P-field picture, the second picture of the frame is a P-field picture. When the first picture of a coded frame is a B-field picture, the second picture of the frame is a B-field picture. When the first picture of a coded frame is an I-field picture, the second picture of the frame is either an I-field picture or a P-field picture. [Id.]
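The pairing rules above can be expressed as a small validity check: given the type of the first field picture of a coded frame, only certain types are permitted for the second. This sketch encodes the rules of H.262 section 6.1.1.4.1 as restated in the text.

```python
# Permitted second-field-picture types, keyed by the type of the first
# field picture of a coded frame (H.262 standard, section 6.1.1.4.1).
ALLOWED_SECOND_FIELD = {
    "P": {"P"},        # P-field first -> second must be a P-field
    "B": {"B"},        # B-field first -> second must be a B-field
    "I": {"I", "P"},   # I-field first -> second may be an I- or P-field
}

def valid_field_pair(first, second):
    """Check whether a (first, second) field-picture pairing is allowed."""
    return second in ALLOWED_SECOND_FIELD[first]

print(valid_field_pair("I", "P"))  # True
print(valid_field_pair("P", "B"))  # False
```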
A sequence-layer syntax element “progressive_sequence,” when set to 1, indicates the sequence contains only progressive frame pictures. [H.262 standard, section 6.3.5.] When progressive_sequence is set to 0, the sequence may contain both frame pictures and field pictures, and the frame pictures may be progressive or interlaced frames. [Id.]
In a picture header, a three-bit FLC “picture_coding_type” identifies whether a picture is an I-picture, P-picture, or B-picture. [H.262 standard, section 6.3.9.] Also signaled for a picture, the “picture_structure” element is a two-bit FLC that indicates whether the picture is a top field (field picture), bottom field (field picture), or frame picture (either progressive or interlaced). [H.262 standard, section 6.3.10.] A one-bit element “progressive_frame” signaled for a picture indicates whether the two fields of the frame represent two different time instants (interlaced fields) or the same time instant (a progressive frame).
The signaling of picture type information described in the H.262 standard may be efficient for certain scenarios and types of content. For field pictures of interlaced video frames, however, the signaling of type information uses an inefficient amount of bits.
B. H.264 Standard
According to draft JVT-D157 of the H.264 standard, a slice is a number of macroblocks in a picture. A particular picture (either video frame or field) may include multiple slices. Or, the picture may include a single slice.
In the slice header for a slice, the syntax element “pic_structure” identifies the picture structure for the slice as progressive frame picture, top field picture, bottom field picture, interlaced frame picture whose top field precedes its bottom field in time, or interlaced frame picture whose bottom field precedes its top field in time. The pic_structure element is signaled as an unsigned integer Exp-Golomb-coded syntax element, which is a kind of VLC.
Also in the slice header for a slice, the syntax element “slice_type_idc” indicates the coding type of the slice as Pred (P-slice), BiPred (B-slice), Intra (I-slice), SPred (SP-slice), or Sintra (SI-slice). The slice_type_idc element is also signaled as an unsigned integer Exp-Golomb-coded syntax element.
C. Other Standards
According to the H.261 standard, a PTYPE element signals information about a complete picture (e.g., source video format) and an MTYPE element signals whether a macroblock is intra- or inter-coded. The H.261 standard does not address interlaced coding modes. Moreover, the H.261 standard does not have picture types such as I, P, and B.
The H.263 and MPEG-1 standards describe picture types (e.g., I, P, B, PB, EI, or EP in H.263; I, P, B, or D in MPEG-1) signaled per frame. These standards do not address interlaced coding modes, however.
According to the MPEG-4 standard, a VOP_coding_type element signaled per video object plane [“VOP”] indicates whether the VOP is of coding type I, P, B, or S. A VOP may contain interlaced video, but interlaced VOPs are frame coded, not field coded as separate fields.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.