Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Or, compression can be lossy, in which quality of the video suffers but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include intraframe compression and interframe compression. Intraframe compression techniques compress individual frames, typically called I-frames or key frames. Interframe compression techniques compress frames with reference to preceding and/or following frames, which are typically called predicted frames, P-frames, or B-frames.
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intraframe and interframe compression, and the WMV8 decoder uses intraframe and interframe decompression.
A. Intraframe Compression in WMV8
FIG. 1 illustrates block-based intraframe compression 100 of a block 105 of pixels in a key frame in the WMV8 encoder. A block is a set of pixels, for example, an 8×8 arrangement of pixels. The WMV8 encoder splits a key video frame into 8×8 blocks of pixels and applies an 8×8 Discrete Cosine Transform [“DCT”] 110 to individual blocks such as the block 105. A DCT is a type of frequency transform that converts the 8×8 block of pixels (spatial information) into an 8×8 block of DCT coefficients 115, which are frequency information. The DCT operation itself is lossless or nearly lossless.
The encoder then quantizes 120 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 125. For example, the encoder applies a uniform, scalar quantization step size to each coefficient. Quantization is lossy. The encoder then prepares the 8×8 block of quantized DCT coefficients 125 for entropy encoding, which is a form of lossless compression. The exact type of entropy encoding can vary depending on whether a coefficient is a DC coefficient (lowest frequency), an AC coefficient (other frequencies) in the top row or left column, or another AC coefficient.
The encoder encodes the DC coefficient 126 as a differential from the DC coefficient 136 of a neighboring 8×8 block, which is a previously encoded neighbor (e.g., top or left) of the block being encoded. (FIG. 1 shows a neighbor block 135 that is situated to the left of the block being encoded in the frame.) The encoder entropy encodes 140 the differential.
The entropy encoder can encode the left column or top row of AC coefficients as a differential from a corresponding column or row of the neighboring 8×8 block. FIG. 1 shows the left column 127 of AC coefficients encoded as a differential 147 from the left column 137 of the neighboring (to the left) block 135. The differential coding increases the chance that the differential coefficients have zero values. The remaining AC coefficients are from the block 125 of quantized DCT coefficients.
The encoder scans 150 the 8×8 block 145 of predicted, quantized AC DCT coefficients into a one-dimensional array 155 and then entropy encodes the scanned AC coefficients using a variation of run length coding 160. The encoder selects an entropy code from one or more run/level/last tables 165 and outputs the entropy code.
B. Interframe Compression in WMV8
Interframe compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 2 and 3 illustrate the block-based interframe compression for a predicted frame in the WMV8 encoder. In particular, FIG. 2 illustrates motion estimation for a predicted frame 210 and FIG. 3 illustrates compression of a prediction residual for a motion-estimated block of a predicted frame.
For example, the WMV8 encoder splits a predicted frame into 8×8 blocks of pixels. Groups of four 8×8 blocks form macroblocks. For each macroblock, a motion estimation process is performed. The motion estimation approximates the motion of the macroblock of pixels relative to a reference frame, for example, a previously coded, preceding frame. In FIG. 2, the WMV8 encoder computes a motion vector for a macroblock 215 in the predicted frame 210. To compute the motion vector, the encoder searches in a search area 235 of a reference frame 230. Within the search area 235, the encoder compares the macroblock 215 from the predicted frame 210 to various candidate macroblocks in order to find a candidate macroblock that is a good match. After the encoder finds a good matching macroblock, the encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock so the decoder can find the matching macroblock during decoding. When decoding the predicted frame 210 with motion compensation, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 215 using information from the reference frame 230. The prediction for the macroblock 215 is rarely perfect, so the encoder usually encodes 8×8 blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 215 itself.
FIG. 3 illustrates an example of computation and encoding of an error block 335 in the WMV8 encoder. The error block 335 is the difference between the predicted block 315 and the original current block 325. The encoder applies a DCT 340 to the error block 335, resulting in an 8×8 block 345 of coefficients. The encoder then quantizes 350 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 355. The quantization step size is adjustable. Quantization results in loss of precision, but not complete loss of the information for the coefficients.
The encoder then prepares the 8×8 block 355 of quantized DCT coefficients for entropy encoding. The encoder scans 360 the 8×8 block 355 into a one dimensional array 365 with 64 elements, such that coefficients are generally ordered from lowest frequency to highest frequency, which typically creates long runs of zero values.
The encoder entropy encodes the scanned coefficients using a variation of run length coding 370. The encoder selects an entropy code from one or more run/level/last tables 375 and outputs the entropy code.
FIG. 4 shows an example of a corresponding decoding process 400 for an inter-coded block. Due to the quantization of the DCT coefficients, the reconstructed block 475 is not identical to the corresponding original block. The compression is lossy.
In summary of FIG. 4, a decoder decodes (410, 420) entropy-coded information representing a prediction residual using variable length decoding 410 with one or more run/level/last tables 415 and run length decoding 420. The decoder inverse scans 430 a one-dimensional array 425 storing the entropy-decoded information into a two-dimensional block 435. The decoder inverse quantizes and inverse discrete cosine transforms (together, 440) the data, resulting in a reconstructed error block 445. In a separate motion compensation path, the decoder computes a predicted block 465 using motion vector information 455 for displacement from a reference frame. The decoder combines 470 the predicted block 465 with the reconstructed error block 445 to form the reconstructed block 475.
The amount of change between the original and reconstructed frame is termed the distortion and the number of bits required to code the frame is termed the rate for the frame. The amount of distortion is roughly inversely proportional to the rate. In other words, coding a frame with fewer bits (greater compression) will result in greater distortion, and vice versa.
C. Bi-Directional Prediction
Bi-directionally coded images (e.g., B-frames) use two images from the source video as reference (or anchor) images. For example, referring to FIG. 5, a B-frame 510 in a video sequence has a temporally previous reference frame 520 and a temporally future reference frame 530.
Some conventional encoders use five prediction modes (forward, backward, direct, interpolated and intra) to predict regions in a current B-frame. In intra mode, an encoder does not predict a macroblock from either reference image, and therefore calculates no motion vectors for the macroblock. In forward and backward modes, an encoder predicts a macroblock using either the previous or future reference frame, and therefore calculates one motion vector for the macroblock. In direct and interpolated modes, an encoder predicts a macroblock in a current frame using both reference frames. In interpolated mode, the encoder explicitly calculates two motion vectors for the macroblock. In direct mode, the encoder derives implied motion vectors by scaling the co-located motion vector in the future reference frame, and therefore does not explicitly calculate any motion vectors for the macroblock.
D. Interlace Coding
A typical interlace video frame consists of two fields scanned at different times. For example, referring to FIG. 6, an interlace video frame 600 includes top field 610 and bottom field 620. Typically, the odd-numbered lines (top field) are scanned at one time (e.g., time t) and the even-numbered lines (bottom field) are scanned at a different (typically later) time (e.g., time t+1). This arrangement can create jagged tooth-like features in regions of a frame where motion is present because the two fields are scanned at different times. On the other hand, in stationary regions, image structures in the frame may be preserved (i.e., the interlace artifacts visible in motion regions may not be visible in stationary regions). Macroblocks in interlace frames can be field-coded or frame-coded. In field-coded macroblocks, the top-field lines and bottom-field lines are rearranged, such that the top field lines appear at the top of the macroblock, and the bottom field lines appear at the bottom of the macroblock. Predicted field-coded macroblocks typically have one motion vector for each field in the macroblock. In frame-coded macroblocks, the field lines alternate between top-field lines and bottom-field lines. Predicted frame-coded macroblocks typically have one motion vector for the macroblock.
E. Standards for Video Compression and Decompression
Aside from WMV8, several international standards relate to video compression and decompression. These standards include the Motion Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262, and H.263 standards from the International Telecommunication Union [“ITU”]. Like WMV8, these standards use a combination of intraframe and interframe compression.
For example, advanced video compression or encoding techniques (including techniques in the MPEG, H.26× and WMV8 standards) are based on the exploitation of temporal coherence of typical video sequences. Image areas are tracked as they move over time, and information pertaining to the motion of these areas is compressed as part of the bit stream. Traditionally, a standard P-frame is encoded by computing and storing motion information in the form of two-dimensional displacement vectors corresponding to regularly-sized image tiles (e.g., macroblocks) For example, a macroblock may have one motion vector (a 1MV macroblock) for the macroblock or a motion vector for each of four blocks in the macroblock (a 4MV macroblock). Subsequently, the difference between the input frame and its motion compensated prediction is compressed, usually in a suitable transform domain, and added to an encoded bit stream. Typically, the motion vector component of the bitstream makes up between 10% and 30% of the size. Therefore, it can be appreciated that efficient motion vector coding is a key factor in efficient video compression.
Motion vector coding efficiency can be achieved in different ways. For example, motion vectors are often highly correlated between neighboring macroblocks. For efficiency, a motion vector of a given macroblock can be differentially coded from its prediction based on a causal neighborhood of adjacent macroblocks. A few exceptions to this general rule are observed in prior algorithms, such as those described in MPEG-4 and WMV8:                1 When the predicted motion vector lies outside a certain area (typically ±16 pixels from zero, for either component), the prediction is pulled back to the nearest point within this area.        2 When the vectors making up the causal neighborhood of the current macroblock are diverse (e.g., at motion discontinuities), the “Hybrid Motion Vector” mode is employed—the prediction is signaled by a codeword that indicates whether to use the motion vector to the top or to the left (or any other combination).        3 When a macroblock is essentially unchanged from its reference frame (i.e., a (0, 0) motion vector (no motion) and no residual components), it is indicated as being “skipped.”        4 A macroblock may be coded as intra (i.e., not differentially predicted from the previous frame). In this case, no motion vector is sent. (Otherwise, for non-skipped macroblocks that are not intra coded, a motion vector is always sent.)        5 Intra coded macroblocks are indicated by an “I/P switch”, which is jointly coded with a coded block pattern (or CBP). The CBP indicates which of the blocks making up a macroblock have attached residual information.        
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.