Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
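The arithmetic behind the bit rate figure can be sketched as follows. The frame size (176×144) and frame rate (15 frames per second) are illustrative assumptions, not values stated above; the 24 bits per pixel follows the three eight-bit samples described in the text.

```python
def raw_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Return the raw (uncompressed) bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


# Assumed example: a 176x144 frame (about 25 thousand pixels),
# 24 bits per pixel, 15 frames per second.
rate = raw_bit_rate(176, 144, 24, 15)
print(rate)  # 9123840, i.e., roughly 9 million bits per second
```

Even at this small, assumed frame size, the raw bit rate exceeds the 5 million bits per second mentioned above.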
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. For progressive frames, intra-picture compression techniques compress individual frames (typically called I-frames or key frames), and inter-picture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to a preceding and/or following frame (typically called a reference or anchor frame) or frames (for B-frames).
Inter-picture compression techniques often use motion estimation and motion compensation. For motion estimation, for example, an encoder divides a current predicted frame into 8×8 or 16×16 pixel units. For a unit of the current frame, a similar unit in a reference frame is found for use as a predictor. A motion vector indicates the location of the predictor in the reference frame. In other words, the motion vector for a unit of the current frame indicates the displacement between the spatial location of the unit in the current frame and the spatial location of the predictor in the reference frame. The encoder computes the sample-by-sample difference between the current unit and the predictor to determine a residual (also called error signal). If the current unit size is 16×16, the residual is divided into four 8×8 blocks. To each 8×8 residual, the encoder applies a reversible frequency transform operation, which generates a set of frequency domain (i.e., spectral) coefficients. A discrete cosine transform [“DCT”] is a type of frequency transform. The resulting blocks of spectral coefficients are quantized and entropy encoded. If the predicted frame is used as a reference for subsequent motion compensation, the encoder reconstructs the predicted frame. When reconstructing residuals, the encoder reconstructs transform coefficients (e.g., DCT coefficients) that were quantized and performs an inverse frequency transform such as an inverse DCT [“IDCT”]. The encoder performs motion compensation to compute the predictors, and combines the predictors with the residuals. During decoding, a decoder typically entropy decodes information and performs analogous operations to reconstruct residuals, perform motion compensation, and combine the predictors with the residuals.
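The motion estimation and residual computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular encoder: the exhaustive search, the sum-of-absolute-differences matching cost, and the names `sad`, `motion_search`, and `residual` are hypothetical choices made for the example, and a 4×4 unit is used instead of 8×8 or 16×16 to keep the example small.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n unit of cur at
    (cx, cy) and the n x n unit of ref at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def motion_search(cur, ref, cx, cy, n, search_range):
    """Exhaustive search (an assumed strategy) over displacements within
    +/- search_range; returns the motion vector (dx, dy) of the best match."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Only consider candidates that lie fully inside the reference.
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]


def residual(cur, ref, cx, cy, mv, n):
    """Sample-by-sample difference between the current unit and its predictor."""
    dx, dy = mv
    return [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
             for i in range(n)] for j in range(n)]


# Synthetic 16x16 frames: the current frame is the reference content
# displaced by 2 samples horizontally and 1 sample vertically.
ref = [[(7 * x + 13 * y) % 256 for x in range(16)] for y in range(16)]
cur = [[(7 * (x + 2) + 13 * (y + 1)) % 256 for x in range(16)] for y in range(16)]

mv = motion_search(cur, ref, cx=4, cy=4, n=4, search_range=3)
res = residual(cur, ref, 4, 4, mv, 4)
print(mv)  # (2, 1)
```

Because the synthetic current frame is an exact displaced copy of the reference, the residual for the matched unit is all zeros; with real video the residual is typically small but non-zero, which is why it is transformed, quantized, and entropy encoded.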
I. Inter Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
Inter compression in the WMV8 encoder uses block-based motion-compensated prediction coding followed by transform coding of the residual error. FIGS. 1 and 2 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 1 illustrates motion estimation for a predicted frame (110) and FIG. 2 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 1, the WMV8 encoder computes a motion vector for a macroblock (115) in the predicted frame (110). To compute the motion vector, the encoder searches in a search area (135) of a reference frame (130). Within the search area (135), the encoder compares the macroblock (115) from the predicted frame (110) to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock.
Since a motion vector value is often correlated with the values of spatially surrounding motion vectors, compression of the data used to transmit the motion vector information can be achieved by determining or selecting a motion vector predictor from neighboring macroblocks and predicting the motion vector for the current macroblock using the motion vector predictor. The encoder can encode the differential between the motion vector and the motion vector predictor. For example, the encoder computes the difference between the horizontal component of the motion vector and the horizontal component of the motion vector predictor, computes the difference between the vertical component of the motion vector and the vertical component of the motion vector predictor, and encodes the differences.
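The differential coding of motion vectors described above can be sketched as follows. The text does not say how the predictor is determined or selected; the component-wise median of three neighboring motion vectors used here is a common choice in block-based codecs and is an assumption for this example, as are the function names.

```python
def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]


def predict_mv(left, top, top_right):
    """Assumed predictor: component-wise median of three neighboring
    motion vectors, each given as (horizontal, vertical)."""
    return (median3(left[0], top[0], top_right[0]),
            median3(left[1], top[1], top_right[1]))


def encode_mv(mv, predictor):
    """Differential to be entropy coded: motion vector minus predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


def decode_mv(differential, predictor):
    """Decoder side: add the differential back to the predictor."""
    return (differential[0] + predictor[0], differential[1] + predictor[1])


# Hypothetical neighboring motion vectors and current motion vector.
predictor = predict_mv(left=(4, -2), top=(5, 0), top_right=(3, -1))
differential = encode_mv((5, -1), predictor)
print(predictor)     # (4, -1)
print(differential)  # (1, 0)
```

When neighboring motion vectors are correlated, the differentials cluster near zero, so they can be represented with shorter entropy codes than the motion vectors themselves.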
After reconstructing the motion vector by adding the differential to the motion vector predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock (115) using information from the reference frame (130), which is a previously reconstructed frame available at the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock (115) itself.
FIG. 2 illustrates an example of computation and encoding of an error block (235) in the WMV8 encoder. The error block (235) is the difference between the predicted block (215) and the original current block (225). The encoder applies a discrete cosine transform [“DCT”] (240) to the error block (235), resulting in an 8×8 block (245) of coefficients. The encoder then quantizes (250) the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients (255). The encoder scans (260) the 8×8 block (255) into a one-dimensional array (265) such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding (270). The encoder selects an entropy code from one or more run/level/last tables (275) and outputs the entropy code.
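The scan and run length coding stages can be sketched as follows. This is a sketch under assumptions: the conventional zigzag pattern is used as one scan that orders coefficients generally from lowest to highest frequency, and each non-zero coefficient is represented as a (run, level, last) triple. The actual scan patterns and code tables used by the WMV8 encoder are not given in the text, and the mapping of triples to entropy codes is omitted.

```python
def zigzag_order(n=8):
    """Positions (row, column) of an n x n block in conventional zigzag
    order: anti-diagonals in turn, alternating direction on each."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def scan(block):
    """Scan a two-dimensional block into a one-dimensional array."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_level_last(coeffs):
    """Represent each non-zero coefficient as (run, level, last), where run
    counts the zeros preceding it, level is its value, and last flags the
    final non-zero coefficient of the block."""
    symbols = []
    run = 0
    for value in coeffs:
        if value == 0:
            run += 1
        else:
            symbols.append([run, value, False])
            run = 0
    if symbols:
        symbols[-1][2] = True
    return [tuple(s) for s in symbols]


# Hypothetical quantized block with three non-zero coefficients.
quantized = [[0] * 8 for _ in range(8)]
quantized[0][0] = 12
quantized[0][1] = -3
quantized[2][0] = 5

symbols = run_level_last(scan(quantized))
print(symbols)  # [(0, 12, False), (0, -3, False), (1, 5, True)]
```

Since quantization tends to zero out high-frequency coefficients, the scanned array usually ends in a long run of zeros, which the last flag represents without any further symbols.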
FIG. 3 shows an example of a corresponding decoding process (300) for an inter-coded block. In summary, a decoder decodes (310, 320) entropy-coded information representing a prediction residual using variable length decoding (310) with one or more run/level/last tables (315) and run length decoding (320). The decoder inverse scans (330) a one-dimensional array (325) storing the entropy-decoded information into a two-dimensional block (335). The decoder inverse quantizes and inverse discrete cosine transforms (together, 340) the data, resulting in a reconstructed error block (345). In a separate motion compensation path, the decoder computes a predicted block (365) using motion vector information (355) for displacement from a reference frame. The decoder combines (370) the predicted block (365) with the reconstructed error block (345) to form the reconstructed block (375).
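The decoding path for the residual can be sketched as follows, starting from symbols that have already been variable length decoded. Several details are assumptions made for illustration and are not taken from the text: the conventional zigzag scan, uniform scalar inverse quantization with a single step size, an orthonormal floating-point IDCT, and clipping of reconstructed samples to the range 0 to 255.

```python
import math

N = 8


def zigzag_order(n=N):
    """Positions (row, column) in conventional zigzag order (assumed scan)."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def run_level_decode(symbols, length=N * N):
    """Expand (run, level, last) triples into a one-dimensional array."""
    out = []
    for run, level, last in symbols:
        out.extend([0] * run)
        out.append(level)
        if last:
            break
    out.extend([0] * (length - len(out)))
    return out


def inverse_scan(array, n=N):
    """Place a one-dimensional array back into a two-dimensional block."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(array, zigzag_order(n)):
        block[r][c] = value
    return block


def inverse_quantize(block, step):
    """Assumed uniform scalar inverse quantization."""
    return [[value * step for value in row] for row in block]


def idct_2d(coeffs, n=N):
    """Orthonormal two-dimensional inverse DCT (direct, unoptimized form)."""
    def a(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    return [[sum(a(u) * a(v) * coeffs[v][u]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                 for v in range(n) for u in range(n))
             for x in range(n)] for y in range(n)]


def combine(predicted, error):
    """Add the reconstructed error block to the predicted block, with clipping."""
    return [[min(255, max(0, p + round(e))) for p, e in zip(prow, erow)]
            for prow, erow in zip(predicted, error)]


# Hypothetical input: a single DC coefficient of level 2, step size 4,
# and a flat predicted block from motion compensation.
symbols = [(0, 2, True)]
coeffs = inverse_quantize(inverse_scan(run_level_decode(symbols)), step=4)
error_block = idct_2d(coeffs)
predicted_block = [[100] * N for _ in range(N)]
reconstructed = combine(predicted_block, error_block)
```

In this example the only non-zero coefficient is the DC term, so the reconstructed error block is flat and every sample of the reconstructed block is the predicted value plus one.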
II. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing in raster scan fashion through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
The primary aspect of interlaced video is that the raster scan of an entire video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. This results in each frame containing two fields representing two different time epochs. FIG. 4 shows an interlaced video frame (400) that includes top field (410) and bottom field (420). In the frame (400), the even-numbered lines (top field) are scanned starting at one time (e.g., time t), and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). This timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present when the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
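The rearrangement of an interlaced frame into a field structure can be sketched as follows. The frame is modeled as a list of lines with line 0 at the top, and an even number of lines is assumed; the function names are hypothetical.

```python
def split_fields(frame):
    """Group the even-numbered lines into the top field and the
    odd-numbered lines into the bottom field."""
    return frame[0::2], frame[1::2]


def interleave_fields(top, bottom):
    """Restore the original alternating line arrangement."""
    frame = []
    for top_line, bottom_line in zip(top, bottom):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame


# A tiny frame of six lines, labeled by scan time (t or t+1) and line number.
frame = ["t:0", "t+1:1", "t:2", "t+1:3", "t:4", "t+1:5"]
top_field, bottom_field = split_fields(frame)
print(top_field)     # ['t:0', 't:2', 't:4']
print(bottom_field)  # ['t+1:1', 't+1:3', 't+1:5']
```

After the split, each field contains only lines scanned starting at the same time, which is why the field arrangement avoids the jagged features that motion produces between adjacent lines of the frame.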
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. Previous Coding and Decoding in a WMV Encoder and Decoder
Previous software for a WMV encoder and decoder, released in executable form, has used coding and decoding of progressive and interlaced P-frames. While the encoder and decoder are efficient for many different encoding/decoding scenarios and types of content, there is room for improvement in several places.
A. Reference Pictures for Motion Compensation
The encoder and decoder use motion compensation for progressive and interlaced forward-predicted frames. For a progressive P-frame, motion compensation is relative to a single reference frame, which is the previously reconstructed I-frame or P-frame that immediately precedes the current P-frame. Since the reference frame for the current P-frame is known and only one reference frame is possible, information used to select between multiple reference frames is not needed.
The macroblocks of an interlaced P-frame may be field-coded or frame-coded. In a field-coded macroblock, up to two motion vectors are associated with the macroblock, one for the top field and one for the bottom field. In a frame-coded macroblock, up to one motion vector is associated with the macroblock. For a frame-coded macroblock in an interlaced P-frame, motion compensation is relative to a single reference frame, which is the previously reconstructed I-frame or P-frame that immediately precedes the current P-frame. For a field-coded macroblock in an interlaced P-frame, motion compensation is still relative to the single reference frame, but only the lines of the top field of the reference frame are considered for a motion vector for the top field of the field-coded macroblock, and only the lines of the bottom field of the reference frame are considered for a motion vector for the bottom field of the field-coded macroblock. Again, since the reference frame is known and only one reference frame is possible, information used to select between multiple reference frames is not needed.
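Motion compensation for a field-coded macroblock, as described above, can be sketched as follows. This is an illustration under assumptions rather than the behavior of the actual encoder and decoder: motion vectors are taken to be integer displacements measured in lines of the respective field, the macroblock starts on an even frame line, no bounds checking or sub-sample interpolation is performed, and a 4×4 unit stands in for a macroblock to keep the example small. The function name is hypothetical.

```python
def predict_field_coded(ref_frame, x, y, mv_top, mv_bottom, n):
    """Predict an n x n field-coded unit at frame position (x, y).
    Top-field lines are fetched only from the top field of the reference
    frame, and bottom-field lines only from its bottom field."""
    top_ref = ref_frame[0::2]     # even lines of the reference frame
    bottom_ref = ref_frame[1::2]  # odd lines of the reference frame
    field_y = y // 2              # first line of the unit within each field
    predicted = []
    for k in range(n // 2):
        dx, dy = mv_top
        predicted.append(top_ref[field_y + k + dy][x + dx:x + dx + n])
        dx, dy = mv_bottom
        predicted.append(bottom_ref[field_y + k + dy][x + dx:x + dx + n])
    return predicted


# Synthetic 8x8 reference frame whose sample values encode (line, column).
ref_frame = [[10 * line + col for col in range(8)] for line in range(8)]

predicted = predict_field_coded(ref_frame, x=2, y=2,
                                mv_top=(1, 0), mv_bottom=(0, 1), n=4)
```

In this example the top-field lines of the prediction come from frame lines 2 and 4 of the reference, shifted one sample to the right, while the bottom-field lines come from frame lines 5 and 7, one field line lower than the co-located position.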
In certain encoding/decoding scenarios (e.g., high bit rate interlaced video with lots of motion), limiting motion compensation for forward prediction to be relative to a single reference frame can hurt overall compression efficiency.
B. Signaling Macroblock Information
The encoder and decoder use signaling of macroblock information for progressive or interlaced P-frames.