Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
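The arithmetic behind this estimate can be checked with a short sketch. The 176×144 frame size is an assumed example (the text says only that a frame can include tens or hundreds of thousands of pixels), and the function name is illustrative.

```python
def raw_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

# Assumed example: 176 x 144 pixels (about 25 thousand pixels per frame),
# 24 bits per pixel, 15 frames per second.
rate = raw_bit_rate(176, 144, 24, 15)
print(rate)  # 9123840
```

Even at this modest assumed frame size and the lower of the two frame rates, the raw bit rate exceeds 5 million bits/second.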
Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which case the quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Alternatively, compression can be lossy, in which case the quality of the video suffers but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include intraframe compression and interframe compression. Intraframe compression techniques compress individual frames, typically called I-frames or key frames. Interframe compression techniques compress frames with reference to preceding and/or following frames; the frames compressed in this way are typically called predicted frames, P-frames, or B-frames.
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intraframe and interframe compression, and the WMV8 decoder uses intraframe and interframe decompression.
A. Intraframe Compression in WMV8
FIG. 1 shows an example of block-based intraframe compression 100 of a block 105 of pixels in a key frame in the WMV8 encoder. For example, the WMV8 encoder splits a key video frame into 8×8 blocks of pixels and applies an 8×8 discrete cosine transform [“DCT”] 110 to individual blocks, converting the 8×8 block of pixels 105 into an 8×8 block of DCT coefficients 115. The encoder quantizes 120 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 125 which the encoder then prepares for entropy encoding.
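The transform and quantization steps can be sketched as follows. This is a minimal illustration, not the WMV8 implementation: the orthonormal DCT-II scaling, the uniform quantizer with rounding, and the step size of 16 are all assumptions, and the function names are illustrative.

```python
import math

N = 8  # block size used in the text

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (a list of N rows)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

def quantize(coeffs, step):
    """Assumed uniform quantizer: divide by the step size and round."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

# A flat block has energy only in the DC coefficient.
flat = [[100] * N for _ in range(N)]
quantized = quantize(dct_2d(flat), step=16)
print(quantized[0][0])  # 50
```

For the flat block, every AC coefficient quantizes to zero, which is why smooth image regions compress well after the transform.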
The encoder encodes the DC coefficient 126 as a differential from the DC coefficient 136 of a previously encoded neighbor (e.g., neighbor block 135) of the block being encoded. The encoder entropy encodes the differential 140. FIG. 1 shows the left column 127 of AC coefficients encoded as a differential 147 from the left column 137 of the neighboring (to the left) block 135. The remaining AC coefficients are from the block 125 of quantized DCT coefficients.
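The differential coding described above can be sketched as a subtraction on the first column of the quantized block. This sketch assumes the left neighbor is the predictor for both the DC coefficient and the left column of AC coefficients, and it stores the DC coefficient at row 0, column 0; the function names are illustrative.

```python
def predict_first_column(current, left):
    """Replace column 0 (DC plus left-column AC) with differences from the left block."""
    predicted = [row[:] for row in current]
    for r in range(len(current)):
        predicted[r][0] = current[r][0] - left[r][0]
    return predicted

def undo_first_column_prediction(predicted, left):
    """Decoder side: add the neighbor's column back to recover the coefficients."""
    restored = [row[:] for row in predicted]
    for r in range(len(predicted)):
        restored[r][0] = predicted[r][0] + left[r][0]
    return restored

def block_with_first_column(column):
    """Helper for the example: an 8x8 block that is zero outside column 0."""
    block = [[0] * 8 for _ in range(8)]
    for r, value in enumerate(column):
        block[r][0] = value
    return block

left = block_with_first_column([50, 4, -2, 0, 0, 0, 0, 0])
current = block_with_first_column([52, 3, -2, 1, 0, 0, 0, 0])
differences = predict_first_column(current, left)
print([row[0] for row in differences])  # [2, -1, 0, 1, 0, 0, 0, 0]
```

When neighboring blocks are similar, the differences are small values that an entropy coder can represent with short codes.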
The encoder scans 150 the 8×8 block 145 of predicted, quantized AC DCT coefficients into a one-dimensional array 155 and then entropy encodes the scanned AC coefficients using a variation of run length coding 160. The encoder selects an entropy code from one or more run/level/last tables 165 and outputs the entropy code 170.
B. Interframe Compression in WMV8
Interframe compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 2 and 3 illustrate the block-based interframe compression for a predicted frame in the WMV8 encoder. In particular, FIG. 2 illustrates motion estimation for a predicted frame 210 and FIG. 3 illustrates compression of a prediction residual for a motion-estimated block of a predicted frame.
For example, the WMV8 encoder splits a predicted frame into 8×8 blocks of pixels. Groups of four 8×8 blocks form macroblocks. For each macroblock, a motion estimation process is performed. The motion estimation approximates the motion of the macroblock of pixels relative to a reference frame, for example, a previously coded, preceding frame. In FIG. 2, the WMV8 encoder computes a motion vector for a macroblock 215 in the predicted frame 210. To compute the motion vector, the encoder searches in a search area 235 of a reference frame 230. Within the search area 235, the encoder compares the macroblock 215 from the predicted frame 210 to various candidate macroblocks in order to find a candidate macroblock that is a good match. After the encoder finds a good matching macroblock, the encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock so the decoder can find the matching macroblock during decoding. When decoding the predicted frame 210 with motion compensation, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 215 using information from the reference frame 230. The prediction for the macroblock 215 is rarely perfect, so the encoder usually encodes 8×8 blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 215 itself.
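The search for a matching macroblock can be sketched as an exhaustive block-matching loop. The text does not say which matching criterion or search strategy WMV8 uses, so the sum of absolute differences (SAD), the full search, the ±4 pixel search range, and the integer-pixel precision are all assumptions here; the function names and the synthetic test pattern are illustrative.

```python
def sad(current, reference, cy, cx, ry, rx, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(current[cy + j][cx + i] - reference[ry + j][rx + i])
               for j in range(size) for i in range(size))

def estimate_motion(current, reference, top, left, size=16, search=4):
    """Return ((dx, dy), cost) for the best match within the search area."""
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(current, reference, top, left, ry, rx, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return (best[1], best[2]), best[0]

def pattern(x, y):
    """Synthetic image content used only for this example."""
    return (x * x + 3 * y * y + x * y) % 251

# The current frame is the reference content displaced by (2, 1).
reference = [[pattern(x, y) for x in range(32)] for y in range(32)]
current = [[pattern(x + 2, y + 1) for x in range(32)] for y in range(32)]
print(estimate_motion(current, reference, top=8, left=8))  # ((2, 1), 0)
```

A cost of zero means the prediction is exact for this synthetic input. With real video the best cost is usually nonzero, and that remaining difference is the residual that the encoder goes on to code.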
FIG. 3 illustrates an example of computation and encoding of an error block 335 in the WMV8 encoder. The error block 335 is the difference between the predicted block 315 and the original current block 325. The encoder applies a DCT 340 to the error block 335, resulting in an 8×8 block 345 of coefficients. The encoder then quantizes 350 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 355. The quantization step size is adjustable. Quantization results in loss of precision, but not complete loss of the information for the coefficients.
The encoder then prepares the 8×8 block 355 of quantized DCT coefficients for entropy encoding. The encoder scans 360 the 8×8 block 355 into a one dimensional array 365 with 64 elements, such that coefficients are generally ordered from lowest frequency to highest frequency, which typically creates long runs of zero values.
The encoder entropy encodes the scanned coefficients using a variation of run length coding 370. The encoder selects an entropy code from one or more run/level/last tables 375 and outputs the entropy code.
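The scan and the run length step can be sketched as follows. The exact WMV8 scan pattern and code tables are not given in the text, so this sketch assumes a conventional JPEG-style zigzag order and stops at (run, level, last) triples, without mapping them to variable length codes. The function names are illustrative.

```python
def zigzag_order(n=8):
    """(row, col) positions in an assumed JPEG-style zigzag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def scan(block):
    """Flatten a 2-D block into a 1-D array, lowest frequencies first."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

def run_level_last(scanned):
    """Describe each nonzero value by (zeros before it, its value, is-last flag)."""
    nonzero = [i for i, value in enumerate(scanned) if value != 0]
    triples = []
    previous = -1
    for k, i in enumerate(nonzero):
        run = i - previous - 1
        last = 1 if k == len(nonzero) - 1 else 0
        triples.append((run, scanned[i], last))
        previous = i
    return triples

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 50, -3, 1
print(run_level_last(scan(block)))  # [(0, 50, 0), (0, -3, 0), (1, 1, 1)]
```

The 60 trailing zeros of this block cost nothing beyond the last flag on the final triple, which is the benefit of ordering coefficients so that zeros cluster at the end of the array.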
FIG. 4 shows an example of a corresponding decoding process 400 for an inter-coded block. Due to the quantization of the DCT coefficients, the reconstructed block 475 is not identical to the corresponding original block. The compression is lossy.
In summary of FIG. 4, a decoder decodes (410, 420) entropy-coded information representing a prediction residual using variable length decoding 410 with one or more run/level/last tables 415 and run length decoding 420. The decoder inverse scans 430 a one-dimensional array 425 storing the entropy-decoded information into a two-dimensional block 435. The decoder inverse quantizes and inverse discrete cosine transforms (together, 440) the data, resulting in a reconstructed error block 445. In a separate motion compensation path, the decoder computes a predicted block 465 using motion vector information 455 for displacement from a reference frame. The decoder combines 470 the predicted block 465 with the reconstructed error block 445 to form the reconstructed block 475.
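The decoding path summarized above can be sketched end to end, starting from (run, level, last) triples rather than from variable length codes, since the code tables are not given in the text. The zigzag order, the uniform inverse quantizer, the orthonormal inverse DCT, and the clipping to an 8-bit range are assumptions; the function names are illustrative.

```python
import math

N = 8

def zigzag_order(n=N):
    """(row, col) positions in an assumed JPEG-style zigzag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level_decode(triples, length=N * N):
    """Expand (run, level, last) triples into a 1-D array of coefficients."""
    array = [0] * length
    pos = 0
    for run, level, last in triples:
        pos += run            # skip the zeros that precede this level
        array[pos] = level
        pos += 1
        if last:
            break
    return array

def inverse_scan(array):
    """Place the 1-D array back into a 2-D block."""
    block = [[0] * N for _ in range(N)]
    for value, (r, c) in zip(array, zigzag_order()):
        block[r][c] = value
    return block

def dequantize(block, step):
    return [[value * step for value in row] for row in block]

def idct_2d(coeffs):
    """Orthonormal 2-D inverse DCT."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return [[sum(scale(u) * scale(v) * coeffs[u][v]
                 * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for x in range(N)] for y in range(N)]

def reconstruct_block(triples, step, predicted):
    """Add the reconstructed error block to the motion-compensated prediction."""
    error = idct_2d(dequantize(inverse_scan(run_level_decode(triples)), step))
    return [[max(0, min(255, predicted[y][x] + int(round(error[y][x]))))
             for x in range(N)] for y in range(N)]

predicted = [[100] * N for _ in range(N)]
block = reconstruct_block([(0, 2, 1)], step=8, predicted=predicted)
print(block[0][0])  # 102
```

In this example a single quantized DC level of 2 with step size 8 becomes a uniform error of 2 per pixel, which is added to the predicted block.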
The amount of change between the original and reconstructed frame is termed the distortion and the number of bits required to code the frame is termed the rate for the frame. The amount of distortion is roughly inversely proportional to the rate. In other words, coding a frame with fewer bits (greater compression) will result in greater distortion, and vice versa.
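The trade-off can be illustrated numerically. This sketch quantizes a few assumed coefficient values with two step sizes, uses mean squared error as the distortion measure, and uses the count of nonzero quantized levels as a crude stand-in for rate. Both measures, the sample values, and the function names are assumptions made for illustration.

```python
def quantize_and_reconstruct(values, step):
    """Return the quantized levels and the values recovered from them."""
    levels = [int(round(v / step)) for v in values]
    return levels, [q * step for q in levels]

def mean_squared_error(original, reconstructed):
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

coefficients = [37, -12, 5, 3, -1, 0, 0, 0]
for step in (4, 16):
    levels, recovered = quantize_and_reconstruct(coefficients, step)
    nonzero = sum(1 for q in levels if q != 0)
    print(step, nonzero, mean_squared_error(coefficients, recovered))
# step 4:  4 nonzero levels, distortion 0.5
# step 16: 2 nonzero levels, distortion 9.5
```

The larger step size leaves fewer nonzero levels to code but increases the distortion, in line with the relationship described above.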
C. Limitations of Conventional Bi-Directional Prediction
Bi-directionally coded images (e.g., B-frames) use two images from the source video as reference (or anchor) images. For a typical B-frame, one anchor frame is from the temporal past and one anchor frame is from the temporal future. For example, referring to FIG. 5, a B-frame 510 in a video sequence has a temporally previous reference frame 520 and a temporally future reference frame 530.
Some conventional encoders use five prediction modes (forward, backward, direct, interpolated and intra) to predict regions in a current B-frame. In intra mode, for example, an encoder intra-codes macroblocks. Intra-coded macroblocks are not predicted from either reference image. In the forward and backward modes, an encoder predicts macroblocks using one reference frame. Forward mode is for predicting macroblocks using the previous reference frame (e.g., previous reference frame 520), and backward mode is for predicting macroblocks using the future reference frame (e.g., future reference frame 530).
In the direct and interpolated modes, an encoder predicts macroblocks in a current frame using both previous reference frame 520 and future reference frame 530. For example, in interpolated mode, an encoder predicts macroblocks by averaging a prediction from the previous frame using a forward pointing motion vector and a prediction from the future frame using a backward pointing motion vector. For example, an encoder using interpolated mode to encode a macroblock predictively signals the actual forward and backward pointing motion vectors for the macroblock in the bit stream. In other words, two motion vectors are explicitly calculated for the macroblock and sent to the decoder (or receiver).
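The averaging step of interpolated mode can be sketched as a pixel-wise mean of the two motion-compensated predictions. The rounding rule (add one, then halve) is an assumption, since the text does not state how the average is rounded, and the function name is illustrative.

```python
def interpolated_prediction(forward_block, backward_block):
    """Average two prediction blocks pixel by pixel, rounding halves up."""
    return [[(f + b + 1) // 2 for f, b in zip(forward_row, backward_row)]
            for forward_row, backward_row in zip(forward_block, backward_block)]

# 1x2 blocks keep the example small; real blocks would be 8x8 or 16x16.
forward = [[10, 20]]    # fetched from the previous frame via the forward vector
backward = [[20, 21]]   # fetched from the future frame via the backward vector
print(interpolated_prediction(forward, backward))  # [[15, 21]]
```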
In direct mode, however, an encoder derives implied forward and backward pointing motion vectors by scaling the co-located motion vector in the future anchor frame. For example, an encoder scales the motion vector for the macroblock in the anchor frame having the same horizontal and vertical index as the macroblock currently being encoded.
FIG. 6 outlines the way in which direct mode prediction works in many prior implementations. To derive implied forward and backward motion vectors (MVF and MVB, respectively) for the macroblock 610 being encoded in the B-frame 620, an encoder scales the motion vector (MV) of the corresponding macroblock in the future reference frame 630 (e.g., a P-frame) using timestamps, as follows:
MVF = (TRB * MV) / TRD  (1)
MVB = ((TRB − TRD) * MV) / TRD  (2)
TRD is the temporal distance between the previous reference frame 640 (e.g., a P-frame) and the future reference frame 630, and TRB is the temporal distance between the current frame and previous reference frame. The encoder calculates temporal distances based on timestamps for the frames.
In the example shown in FIG. 6, TRD=2 and TRB=1. The encoder uses the two implied motion vectors to address macroblocks in the previous reference frame 640 and the future reference frame 630, and the average of these two macroblocks is used to predict the macroblock 610 being encoded. For example, in FIG. 6, where MV=(dx, dy), equations (1) and (2) give MVF=(dx/2, dy/2) and MVB=(−dx/2, −dy/2).
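Equations (1) and (2) can be evaluated directly. This sketch uses exact fractions so that it does not commit to a rounding rule, which the text does not specify; the function name and the sample motion vector are illustrative.

```python
from fractions import Fraction

def direct_mode_vectors(mv, trb, trd):
    """Scale the co-located motion vector per equations (1) and (2)."""
    dx, dy = mv
    mvf = (Fraction(trb * dx, trd), Fraction(trb * dy, trd))
    mvb = (Fraction((trb - trd) * dx, trd), Fraction((trb - trd) * dy, trd))
    return mvf, mvb

# With TRD=2 and TRB=1 as in FIG. 6, a co-located vector of (4, -6)
# yields MVF equal to (2, -3) and MVB equal to (-2, 3).
mvf, mvb = direct_mode_vectors((4, -6), trb=1, trd=2)
print(mvf == (2, -3), mvb == (-2, 3))  # True True
```

Because the scale factors depend only on TRB and TRD, the implied vectors always lie along the co-located vector, which is the constant velocity assumption in concrete form.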
Direct mode prediction of this nature imposes restrictions on the encoder and decoder. For example, the encoding and decoding of the bit stream is dependent on timestamps. This dependency can cause system architecture problems, because timestamps are more of a system layer primitive (e.g., in ASF or other streaming formats) than a bit stream primitive. By using absolute timestamps and true frame distances (e.g., TRB and TRD), prior implementations also impose a constant velocity assumption on the motion being modeled. When this assumption is inappropriate in view of the actual motion between the two reference frames, direct mode prediction can become expensive or altogether inapplicable.
D. Standards for Video Compression and Decompression
Aside from WMV8, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262, and H.263 standards from the International Telecommunication Union [“ITU”]. Like WMV8, these standards use a combination of intraframe and interframe compression.
For example, the MPEG 4 standard describes bi-directional motion compensation in video object planes, including “direct” mode motion compensation, in which motion vectors for a video object plane are derived by scaling motion vectors of co-located macroblocks in temporally previous or future video object planes (such as previous or future intra-coded or inter-coded video object planes) based on time stamps.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.