Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 pictures per second. Each picture can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
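As a rough illustration of the arithmetic above, the following sketch multiplies picture size, picture rate, and bits per pixel. The 176×144 picture size is an assumed example value, not a figure taken from the text.

```python
def raw_bit_rate(width, height, pictures_per_second, bits_per_pixel=24):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * pictures_per_second * bits_per_pixel

# Assumed example: 176x144 pixels (25,344 pixels per picture), 15 pictures/second.
rate = raw_bit_rate(176, 144, 15)
print(rate)  # 9123840, i.e. roughly 9 million bits/second
```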
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which case the quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Alternatively, compression can be lossy, in which case the quality of the video suffers but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. Intra compression techniques compress individual pictures, typically called I-pictures or key pictures. Inter compression techniques compress pictures with reference to preceding and/or following pictures; pictures compressed this way are typically called predicted pictures, P-pictures, or B-pictures.
I. Inter Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Early versions of Windows Media Video, Version 9 [“WMV9”] use a similar architecture for many operations.
Inter compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 1 and 2 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 1 illustrates motion estimation for a predicted frame 110 and FIG. 2 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 1, the WMV8 encoder computes a motion vector for a macroblock 115 in the predicted frame 110. To compute the motion vector, the encoder searches in a search area 135 of a reference frame 130. Within the search area 135, the encoder compares the macroblock 115 from the predicted frame 110 to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock.
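The search described above can be sketched as an exhaustive block-matching loop. This is a minimal sketch, not the WMV8 search itself: the sum of absolute differences (SAD) as the match criterion, the integer-pixel search, and the names `sad` and `full_search` are all assumptions made for illustration.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current block and a candidate block."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total

def full_search(cur, ref, cx, cy, size=16, search_range=4):
    """Try every displacement in the search area; return (motion_vector, cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Synthetic frames: the current frame shows the reference content displaced by (2, -1).
pattern = lambda x, y: (7 * x + 13 * y) % 251
ref = [[pattern(x, y) for x in range(32)] for y in range(32)]
cur = [[pattern(x + 2, y - 1) for x in range(32)] for y in range(32)]
print(full_search(cur, ref, 12, 12, size=8, search_range=4))  # ((2, -1), 0)
```

A practical encoder would typically limit the number of candidates examined, since the exhaustive loop grows with the square of the search range.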
Since a motion vector value is often correlated with the values of spatially surrounding motion vectors, compression of the data used to transmit the motion vector information can be achieved by selecting a motion vector predictor from neighboring macroblocks and predicting the motion vector for the current macroblock using the predictor. The encoder can encode the differential between the motion vector and the predictor. After reconstructing the motion vector by adding the differential to the predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 115 using information from the reference frame 130, which is a previously reconstructed frame available at the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 115 itself.
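The differential coding of motion vectors can be sketched as follows. The text does not state how the predictor is selected from the neighbors, so the component-wise median of three neighboring motion vectors used here is an assumption, as are all function names.

```python
def median3(a, b, c):
    """Middle value of three numbers."""
    return sorted((a, b, c))[1]

def predict_mv(neighbors):
    """Assumed selection rule: component-wise median of three neighbor vectors."""
    xs, ys = zip(*neighbors)
    return (median3(*xs), median3(*ys))

def encode_mv(mv, predictor):
    """Encoder side: differential = motion vector - predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])

def decode_mv(differential, predictor):
    """Decoder side: motion vector = differential + predictor."""
    return (differential[0] + predictor[0], differential[1] + predictor[1])

neighbors = [(4, -2), (5, -1), (3, -2)]   # hypothetical neighboring motion vectors
predictor = predict_mv(neighbors)          # (4, -2)
differential = encode_mv((5, -2), predictor)
print(differential)                        # (1, 0)
print(decode_mv(differential, predictor))  # (5, -2)
```

Because neighboring vectors tend to be similar, the differential is usually small, which is what makes it cheaper to entropy code than the vector itself.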
FIG. 2 illustrates an example of computation and encoding of an error block 235 in the WMV8 encoder. The error block 235 is the difference between the predicted block 215 and the original current block 225. The encoder applies a DCT 240 to the error block 235, resulting in an 8×8 block 245 of coefficients. The encoder then quantizes 250 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 255. The encoder scans 260 the 8×8 block 255 into a one-dimensional array 265 such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding 270. The encoder selects an entropy code from one or more run/level/last tables 275 and outputs the entropy code.
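The transform, quantization, scan, and run/level/last stages can be sketched in a few small functions. This is an illustration under assumptions, not the WMV8 implementation: it uses an orthonormal floating-point DCT, a single uniform quantization step, and a conventional zigzag scan, and it stops before the table lookup that produces the actual entropy codes.

```python
import math

N = 8  # block size used in the text

def dct_2d(block):
    """Orthonormal 8x8 type-II DCT computed directly from its definition."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def quantize(coeffs, step):
    """Uniform quantization with one step size (an assumption)."""
    return [[int(round(v / step)) for v in row] for row in coeffs]

def zigzag_order(n=N):
    """(row, col) positions along anti-diagonals, alternating direction."""
    order = []
    for d in range(2 * n - 1):
        diag = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        order.extend(diag if d % 2 else diag[::-1])
    return order

def scan(block):
    """Flatten a 2-D block into a 1-D array, low frequencies first."""
    return [block[r][c] for r, c in zigzag_order()]

def run_level_last(scanned):
    """Describe each nonzero value as (zeros before it, value, is it the last one)."""
    nonzero = [i for i, v in enumerate(scanned) if v != 0]
    triplets, prev = [], -1
    for i in nonzero:
        triplets.append((i - prev - 1, scanned[i], i == nonzero[-1]))
        prev = i
    return triplets

# Hypothetical flat error block: every pixel difference equals 4.
error_block = [[4] * N for _ in range(N)]
levels = quantize(dct_2d(error_block), step=8)
print(run_level_last(scan(levels)))  # [(0, 4, True)]
```

A flat block concentrates all of its energy in the lowest-frequency coefficient, so the 64 samples collapse to a single triplet.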
FIG. 3 shows an example of a corresponding decoding process 300 for an inter-coded block. In summary of FIG. 3, a decoder decodes (310, 320) entropy-coded information representing a prediction residual using variable length decoding 310 with one or more run/level/last tables 315 and run length decoding 320. The decoder inverse scans 330 a one-dimensional array 325 storing the entropy-decoded information into a two-dimensional block 335. The decoder inverse quantizes and inverse discrete cosine transforms (together, 340) the data, resulting in a reconstructed error block 345. In a separate motion compensation path, the decoder computes a predicted block 365 using motion vector information 355 for displacement from a reference frame. The decoder combines 370 the predicted block 365 with the reconstructed error block 345 to form the reconstructed block 375.
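The decoding path can be sketched as the mirror image of the encoder stages. As before, this is a sketch under assumptions (zigzag scan, a single uniform quantization step, a floating-point inverse DCT, clamping to 8-bit samples), and it starts from already variable-length-decoded (run, level, last) triplets.

```python
import math

N = 8

def zigzag_order(n=N):
    """(row, col) positions along anti-diagonals, alternating direction."""
    order = []
    for d in range(2 * n - 1):
        diag = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        order.extend(diag if d % 2 else diag[::-1])
    return order

def run_level_decode(triplets, length=N * N):
    """Expand (run, level, last) triplets into a one-dimensional array."""
    scanned, pos = [0] * length, 0
    for run, level, last in triplets:
        pos += run            # skip the zeros that precede this level
        scanned[pos] = level
        pos += 1
        if last:
            break
    return scanned

def inverse_scan(scanned):
    """Place the one-dimensional array back into a two-dimensional block."""
    block = [[0] * N for _ in range(N)]
    for value, (r, c) in zip(scanned, zigzag_order()):
        block[r][c] = value
    return block

def dequantize(levels, step):
    return [[v * step for v in row] for row in levels]

def idct_2d(coeffs):
    """Orthonormal 8x8 inverse DCT computed directly from its definition."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            out[y][x] = sum(
                c(u) * c(v) * coeffs[u][v]
                * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                for u in range(N) for v in range(N))
    return out

def reconstruct(predicted, triplets, step):
    """Add the reconstructed error block to the predicted block, clamped to 0..255."""
    error = idct_2d(dequantize(inverse_scan(run_level_decode(triplets)), step))
    return [[min(255, max(0, int(round(p + e)))) for p, e in zip(prow, erow)]
            for prow, erow in zip(predicted, error)]

predicted = [[100] * N for _ in range(N)]   # hypothetical motion-compensated block
block = reconstruct(predicted, [(0, 4, True)], step=8)
print(block[0][0])  # 104
```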
The amount of change between the original and reconstructed frames is the distortion and the number of bits required to code the frame indicates the rate for the frame. The amount of distortion is roughly inversely proportional to the rate.
II. Interlaced Video and Progressive Video
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 4, an interlaced video frame 400 includes top field 410 and bottom field 420. Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). This timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present because the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding, in which the original alternating field line arrangement is preserved, is often used for stationary or low-motion interlaced video frames.
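The rearrangement between frame structure and field structure amounts to separating and re-interleaving alternate lines. The sketch below assumes a frame stored as a list of lines with an even line count; the function names are illustrative.

```python
def split_fields(frame):
    """Return (top_field, bottom_field): even-numbered lines and odd-numbered lines."""
    return frame[0::2], frame[1::2]

def interleave_fields(top, bottom):
    """Rebuild the frame by alternating top-field and bottom-field lines."""
    frame = []
    for top_line, bottom_line in zip(top, bottom):
        frame.extend([top_line, bottom_line])
    return frame

# Hypothetical 6-line frame; each line is filled with its own line number.
frame = [[line] * 4 for line in range(6)]
top, bottom = split_fields(frame)
print([row[0] for row in top])      # [0, 2, 4]
print([row[0] for row in bottom])   # [1, 3, 5]
print(interleave_fields(top, bottom) == frame)  # True
```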
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. Interlace P-frame Coding and Decoding in Early Versions of WMV9
Early versions of Windows Media Video, Version 9 [“WMV9”] use interlace P-frame coding and decoding. In these early versions of WMV9, interlaced P-frames can contain macroblocks encoded in field mode or in frame mode, or skipped macroblocks, with a decision generally made on a macroblock-by-macroblock basis. Two motion vectors are associated with each field-coded macroblock, and one motion vector is associated with each frame-coded macroblock. An encoder jointly encodes motion information for the blocks in the macroblock, including horizontal and vertical motion vector differential components, potentially along with other signaling information.
In the encoder, a motion vector is encoded by computing a differential between the motion vector and a motion vector predictor, which is computed based on neighboring motion vectors. And, in the decoder, the motion vector is reconstructed by adding the motion vector differential to the motion vector predictor, which is again computed (this time in the decoder) based on neighboring motion vectors.
FIGS. 5, 6, and 7 show examples of candidate predictors for motion vector prediction for frame-coded macroblocks and field-coded macroblocks, respectively, in interlaced P-frames in early versions of WMV9. FIG. 5 shows candidate predictors A, B and C for a current frame-coded macroblock in an interior position in an interlaced P-frame (not the first or last macroblock in a macroblock row, not in the top row). Predictors can be obtained from different candidate directions other than those labeled A, B, and C (e.g., in special cases such as when the current macroblock is the first macroblock or last macroblock in a row, or in the top row, since certain predictors are unavailable for such cases). For a current frame-coded macroblock, predictor candidates are calculated differently depending on whether the neighboring macroblocks are field-coded or frame-coded. For a neighboring frame-coded macroblock, the motion vector is simply taken as the predictor candidate. For a neighboring field-coded macroblock, the candidate motion vector is determined by averaging the top and bottom field motion vectors.
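The candidate computation for a current frame-coded macroblock can be sketched as below. The dictionary layout for a neighboring macroblock is hypothetical, and the text does not say how the average is rounded, so the floor division used here is an assumption.

```python
def frame_candidate(neighbor):
    """Predictor candidate contributed by one neighbor of a frame-coded macroblock."""
    if neighbor["mode"] == "frame":
        # Frame-coded neighbor: its single motion vector is used directly.
        return neighbor["mv"]
    # Field-coded neighbor: average the top and bottom field motion vectors.
    (tx, ty), (bx, by) = neighbor["top_mv"], neighbor["bottom_mv"]
    return ((tx + bx) // 2, (ty + by) // 2)   # rounding rule is an assumption

# Hypothetical neighbors A, B and C.
a = {"mode": "frame", "mv": (4, -2)}
b = {"mode": "field", "top_mv": (4, 2), "bottom_mv": (6, -4)}
c = {"mode": "frame", "mv": (0, 0)}
print([frame_candidate(n) for n in (a, b, c)])  # [(4, -2), (5, -1), (0, 0)]
```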
FIGS. 6 and 7 show candidate predictors A, B and C for a current field in a field-coded macroblock that is not the first or last macroblock in a macroblock row, and not in the top row. In FIG. 6, the current field is a bottom field, and the bottom field motion vectors in the neighboring macroblocks are used as candidate predictors. In FIG. 7, the current field is a top field, and the top field motion vectors are used as candidate predictors. Thus, for each field in a current field-coded macroblock, there are at most three motion vector predictor candidates, each coming from the same field type (e.g., top or bottom) as the current field.
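For a field of a field-coded macroblock, gathering candidates reduces to picking the same-polarity field motion vector from each available neighbor. The sketch below assumes every available neighbor is field-coded and represents an unavailable neighbor as `None`; how a frame-coded neighbor would contribute is not covered here, and the data layout is hypothetical.

```python
def field_candidates(neighbors, current_field):
    """Collect up to three same-field candidates from neighbors A, B and C."""
    if current_field not in ("top", "bottom"):
        raise ValueError("current_field must be 'top' or 'bottom'")
    key = current_field + "_mv"
    # Unavailable neighbors (None) contribute no candidate.
    return [n[key] for n in neighbors if n is not None]

# Hypothetical neighbors; B is unavailable (for example, outside the picture).
a = {"top_mv": (2, 0), "bottom_mv": (3, 1)}
b = None
c = {"top_mv": (1, -1), "bottom_mv": (2, -2)}
print(field_candidates([a, b, c], "bottom"))  # [(3, 1), (2, -2)]
print(field_candidates([a, b, c], "top"))     # [(2, 0), (1, -1)]
```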
A predictor for the current macroblock or field of the current macroblock is selected based on the candidate predictors, and a motion vector differential is calculated based on the predictor. The motion vector can be reconstructed by adding the motion vector differential to the selected motion vector predictor at either the encoder or the decoder side. Typically, luminance motion vectors are reconstructed from the encoded motion information, and chrominance motion vectors are derived from the reconstructed luminance motion vectors.
IV. Standards for Video Compression and Decompression
Aside from WMV8 and early versions of WMV9, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262, H.263, and H.264 standards from the International Telecommunication Union [“ITU”]. One of the primary methods used to achieve data compression of digital video sequences in the international standards is to reduce the temporal redundancy between pictures. These popular compression schemes (MPEG-1, MPEG-2, MPEG-4, H.261, H.263, etc.) use motion estimation and compensation. For example, a current frame is divided into uniform square regions (e.g., blocks and/or macroblocks). A matching region for each current region is specified by sending motion vector information for the region. The motion vector indicates the location of the region in a previously coded (and reconstructed) frame that is to be used as a predictor for the current region. A pixel-by-pixel difference, called the error signal, between the current region and the region in the reference frame is derived. This error signal usually has lower entropy than the original signal. Therefore, the information can be encoded at a lower rate. As in WMV8 and early versions of WMV9, since a motion vector value is often correlated with spatially surrounding motion vectors, compression of the data used to represent the motion vector information can be achieved by coding the differential between the current motion vector and a predictor based upon previously coded, neighboring motion vectors.
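The claim that the error signal usually has lower entropy than the original signal can be illustrated with a first-order Shannon entropy estimate. The sample values below are invented for the illustration, and the empirical entropy of so few samples is only a rough indicator.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical first-order entropy in bits per sample."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# Invented sample values for a current region and its motion-compensated prediction.
current = [10, 20, 30, 40, 50, 60, 70, 80]
reference = [9, 20, 31, 40, 49, 60, 71, 80]
error = [c - r for c, r in zip(current, reference)]   # [1, 0, -1, 0, 1, 0, -1, 0]

print(entropy(current))  # 3.0 bits per sample
print(entropy(error))    # 1.5 bits per sample
```

When the prediction is good, the error values cluster around zero, so fewer distinct symbols occur and each one costs fewer bits on average.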
In addition, some international standards describe motion estimation and compensation in interlaced video frames. The H.262 standard allows an interlaced video frame to be encoded as a single frame or as two fields, where the frame encoding or field encoding can be adaptively selected on a frame-by-frame basis. The H.262 standard describes field-based prediction, which is a prediction mode using only one field of a reference frame. The H.262 standard also describes dual-prime prediction, which is a prediction mode in which two forward field-based predictions are averaged for a 16×16 block in an interlaced P-picture. Section 7.6 of the H.262 standard describes “field prediction,” including selecting between two reference fields to use for motion compensation for a macroblock of a current field of an interlaced video frame. Section 7.6.3 describes motion vector prediction and reconstruction, in which a reconstructed motion vector for a given macroblock becomes the motion vector predictor for a subsequently encoded/decoded macroblock. Such motion vector prediction fails to adequately predict motion vectors for macroblocks of fields of interlaced video frames in many cases.
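The prediction scheme in which one macroblock's reconstructed motion vector becomes the predictor for the next can be sketched as a running predictor. This is a deliberately simplified sketch: it keeps a single predictor that starts at an assumed reset value of (0, 0), and it does not reproduce the standard's actual reset rules or per-field handling.

```python
def encode_row(motion_vectors, reset=(0, 0)):
    """Code each vector as a differential from the previously coded vector."""
    predictor, differentials = reset, []
    for mv in motion_vectors:
        differentials.append((mv[0] - predictor[0], mv[1] - predictor[1]))
        predictor = mv   # the reconstructed vector becomes the next predictor
    return differentials

def decode_row(differentials, reset=(0, 0)):
    """Rebuild each vector by adding its differential to the running predictor."""
    predictor, motion_vectors = reset, []
    for d in differentials:
        mv = (d[0] + predictor[0], d[1] + predictor[1])
        motion_vectors.append(mv)
        predictor = mv
    return motion_vectors

row = [(3, 1), (4, 1), (4, 2)]   # hypothetical motion vectors along a macroblock row
diffs = encode_row(row)
print(diffs)                     # [(3, 1), (1, 0), (0, 1)]
print(decode_row(diffs) == row)  # True
```

A predictor taken from only the previously coded macroblock ignores the other spatial neighbors, which is one way such prediction can be a poor fit when adjacent macroblocks or fields move differently.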
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.