Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
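As a rough illustration of the arithmetic above, the following sketch computes the raw bit rate as pixels per frame times bits per pixel times frames per second. The two frame sizes (176×144 and 352×288) are assumptions chosen only to fall in the "tens or hundreds of thousands of pixels" range; they are not taken from the text.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the raw (uncompressed) bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

# Assumed example sizes: 176x144 (25,344 pixels) and 352x288 (101,376 pixels).
small_rate = raw_bit_rate(176, 144, 15)   # 15 frames per second
large_rate = raw_bit_rate(352, 288, 30)   # 30 frames per second

print(small_rate)  # 9123840
print(large_rate)  # 72990720
```

Even the smaller assumed frame size at 15 frames per second exceeds 5 million bits per second.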
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, a frame-coded interlaced video frame (having alternating lines for video fields), or an interlaced video field. Intra-picture compression techniques compress individual pictures (typically called I-pictures or key pictures), and inter-picture compression techniques compress pictures (typically called predicted pictures, P-pictures, or B-pictures) with reference to preceding and/or following pictures (typically called reference or anchor pictures).
Intra-picture compression techniques often use a frequency transform and quantization to exploit spatial redundancy within a picture. For example, an encoder divides an intra-coded picture into 8×8 pixel blocks. To each 8×8 block, the encoder applies a frequency transform, which generates a set of frequency domain (i.e., spectral) coefficients. The resulting spectral coefficients are quantized and entropy encoded. During decoding, a decoder typically performs the inverse of the encoder operations. For example, the decoder performs entropy decoding, inverse quantization, and an inverse frequency transform.
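The intra-picture steps above can be sketched for a single 8×8 block as follows. This is a minimal sketch, not the method of any particular codec: it assumes an orthonormal DCT-II as the frequency transform and a single uniform quantization step size, and it omits entropy coding entirely. All function names are hypothetical.

```python
import math

N = 8  # block size used in the example above

def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix C (assumed transform for this sketch)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

C = dct_matrix()
CT = transpose(C)

def forward_transform(block):
    """Spatial samples -> spectral coefficients (C * block * C^T)."""
    return matmul(matmul(C, block), CT)

def inverse_transform(coeffs):
    """Spectral coefficients -> spatial samples (C^T * coeffs * C)."""
    return matmul(matmul(CT, coeffs), C)

def quantize(coeffs, step):
    """Uniform quantization (assumed); this is the lossy step."""
    return [[int(round(v / step)) for v in row] for row in coeffs]

def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]

def encode_block(block, step):
    return quantize(forward_transform(block), step)

def decode_block(levels, step):
    samples = inverse_transform(dequantize(levels, step))
    return [[int(round(v)) for v in row] for row in samples]

# A flat block has all of its energy in the single DC coefficient.
flat = [[128] * N for _ in range(N)]
flat_levels = encode_block(flat, step=8)
flat_decoded = decode_block(flat_levels, step=8)
```

For the flat block, only the top-left (DC) level is nonzero, which is the kind of spatial redundancy the transform is meant to expose.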
Inter-picture compression techniques often use motion estimation and motion compensation to exploit temporal redundancy between pictures. For example, for motion estimation an encoder divides a current predicted picture into 16×16 macroblocks. For a macroblock of the current picture, a similar area in a reference picture is found for use as a predictor. A motion vector indicates the location of the predictor in the reference picture. In other words, the motion vector for the macroblock of the current picture indicates the displacement between the spatial location of the macroblock in the current picture and the spatial location of the predictor in the reference picture. The encoder computes the sample-by-sample difference between the current macroblock and the predictor to determine a residual (also called error signal). To blocks of the residual, the encoder applies a frequency transform. The resulting spectral coefficients are quantized and entropy encoded. During decoding, a decoder typically performs the inverse of various encoder operations. For example, for a residual, the decoder performs entropy decoding, an inverse quantization, and an inverse frequency transform. The decoder also performs motion compensation and combines the predictors with reconstructed residuals. If an intra-coded or inter-coded picture is used as a reference for subsequent motion compensation, the encoder also reconstructs the picture.
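The motion estimation and compensation steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses an exhaustive (full) search with a sum-of-absolute-differences cost, integer-sample motion vectors, and a search range of ±4 samples, none of which is specified in the text. The transform and quantization of the residual are omitted, so the reconstruction here is exact. All names are hypothetical.

```python
import random

def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
               for y in range(size) for x in range(size))

def estimate_motion(cur, ref, cx, cy, size=16, search_range=4):
    """Full search: return the (dx, dy) displacement with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall partly outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv

def predictor_block(ref, cx, cy, mv, size=16):
    """Fetch the predictor that the motion vector points to in the reference."""
    dx, dy = mv
    return [[ref[cy + dy + y][cx + dx + x] for x in range(size)]
            for y in range(size)]

def compute_residual(cur, ref, cx, cy, mv, size=16):
    """Sample-by-sample difference between the current macroblock and predictor."""
    pred = predictor_block(ref, cx, cy, mv, size)
    return [[cur[cy + y][cx + x] - pred[y][x] for x in range(size)]
            for y in range(size)]

def reconstruct(ref, cx, cy, mv, residual, size=16):
    """Decoder side: motion compensation plus the (here unquantized) residual."""
    pred = predictor_block(ref, cx, cy, mv, size)
    return [[pred[y][x] + residual[y][x] for x in range(size)]
            for y in range(size)]

# Demo: the current picture is the reference displaced by 2 samples
# horizontally and 1 sample vertically (edges are clamped).
rng = random.Random(0)
ref = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
cur = [[ref[min(y + 1, 31)][min(x + 2, 31)] for x in range(32)] for y in range(32)]

mv = estimate_motion(cur, ref, 8, 8)
res = compute_residual(cur, ref, 8, 8, mv)
rebuilt = reconstruct(ref, 8, 8, mv, res)
```

In the demo the search recovers the displacement and the residual is all zeros, so only the motion vector would need to be signaled for this macroblock.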
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, lines of an entire frame are scanned in raster scan fashion (left to right, top to bottom) starting at a single time instant. The lines are successive and non-alternating.
The raster scan of an interlaced video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. So, in an interlaced video frame, the even-numbered lines (top field) may be scanned starting at one time (e.g., time t), with the odd-numbered lines (bottom field) scanned starting at a different (typically later) time (e.g., time t+1). This can create jagged tooth-like features in regions of an interlaced video frame where motion is present when the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
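The field rearrangement described above can be sketched as follows, assuming a frame is represented as a list of lines with line 0 at the top. The function names are hypothetical.

```python
def split_fields(frame):
    """Separate an interlaced frame into its top and bottom fields."""
    top_field = frame[0::2]      # even-numbered lines (0, 2, 4, ...)
    bottom_field = frame[1::2]   # odd-numbered lines (1, 3, 5, ...)
    return top_field, bottom_field

def interleave_fields(top_field, bottom_field):
    """Restore the original alternating line arrangement (frame structure)."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame

frame = ["line0", "line1", "line2", "line3", "line4", "line5"]
top, bottom = split_fields(frame)
print(top)     # ['line0', 'line2', 'line4']
print(bottom)  # ['line1', 'line3', 'line5']
```

Field coding would operate on `top` and `bottom` separately, while frame coding would operate on the interleaved lines.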
II. Motion Vector Prediction in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra- and inter-compression, and the WMV8 decoder uses intra- and inter-decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
The WMV8 and WMV9 codecs use motion vector prediction to reduce the bit rate associated with signaling of motion vector information. The value of a motion vector for a current block or macroblock is often correlated with the values of motion vectors for spatially surrounding blocks or macroblocks. Motion vector compression can be achieved by determining or selecting a motion vector predictor from neighboring macroblocks or blocks, and predicting the motion vector for the current macroblock or block using the motion vector predictor. The encoder then encodes the differential between the motion vector and the motion vector predictor. For example, the encoder computes the difference between the horizontal component of the motion vector and the horizontal component of the motion vector predictor, computes the difference between the vertical component of the motion vector and the vertical component of the motion vector predictor, and encodes the differences.
A corresponding decoder uses motion vector prediction when reconstructing the motion vector. For a motion vector, the decoder determines a motion vector predictor from neighboring macroblocks or blocks (as was done in the encoder, using the same contextual information), decodes a differential for the motion vector, and reconstructs the motion vector from the motion vector predictor and differential.
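The differential signaling described above can be sketched for a single motion vector as follows. This sketch assumes motion vectors are (horizontal, vertical) integer pairs and takes the predictor as given; how the predictor is derived from neighbors, and how the differential is entropy coded, are left out. The function names are hypothetical.

```python
def encode_mv_differential(mv, predictor):
    """Encoder side: component-wise difference between the MV and its predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])

def decode_mv(differential, predictor):
    """Decoder side: rebuild the MV from the same predictor and the differential."""
    return (predictor[0] + differential[0], predictor[1] + differential[1])

predictor = (5, -2)   # derived from neighbors the same way in encoder and decoder
mv = (6, -2)          # actual motion vector of the current macroblock
diff = encode_mv_differential(mv, predictor)
print(diff)                        # (1, 0)
print(decode_mv(diff, predictor))  # (6, -2)
```

When neighboring motion is correlated, the differential tends to be small, which is what makes it cheaper to signal than the motion vector itself.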
Motion vector prediction in WMV8 and WMV9 varies depending on the location of the current macroblock (or block) in the current picture (e.g., top row, left column, interior) and whether neighbors have motion vectors for blocks or macroblocks. In WMV9, motion vector prediction also varies depending on video picture type (e.g., progressive frame, interlaced frame). Motion vector prediction in WMV8 and WMV9 provides good performance in many cases. Separate coding of fields of interlaced video frames is not supported in WMV8 and WMV9, however, so the motion vector prediction mechanisms in WMV8 and WMV9 do not address the particular requirements of motion vector prediction for separately coded fields.
III. Motion Vector Prediction in Standards
Aside from previous WMV encoders and decoders, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another name for MPEG-2), H.263, and H.264 standards from the International Telecommunication Union [“ITU”]. Each of these standards specifies some form of motion vector prediction, although the details of the motion vector prediction vary widely between the standards.
Motion vector prediction is simplest in the H.261 standard, for example, in which the motion vector predictor for the motion vector of a current macroblock is generally the motion vector of the previously coded/decoded macroblock. [H.261 standard, section 4.2.3.4.] Motion vector prediction is similar in the MPEG-1 standard. [MPEG-1 standard, sections 2.4.4.2 and D.6.2.3.]
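A previous-macroblock predictor of the kind described above can be sketched as follows. This is a simplified sketch, not a transcription of the H.261 rules: it assumes the predictor starts at (0, 0) for the first macroblock in the sequence and ignores the other conditions under which a standard may reset the predictor. The function names are hypothetical.

```python
def encode_sequence(motion_vectors):
    """Code each MV as a differential against the previously coded MV."""
    predictor = (0, 0)  # assumed starting predictor
    differentials = []
    for mv in motion_vectors:
        differentials.append((mv[0] - predictor[0], mv[1] - predictor[1]))
        predictor = mv  # the current MV predicts the next one
    return differentials

def decode_sequence(differentials):
    """Rebuild the MVs by tracking the same running predictor."""
    predictor = (0, 0)
    motion_vectors = []
    for diff in differentials:
        mv = (predictor[0] + diff[0], predictor[1] + diff[1])
        motion_vectors.append(mv)
        predictor = mv
    return motion_vectors

mvs = [(3, 1), (4, 1), (4, 2)]
diffs = encode_sequence(mvs)
print(diffs)  # [(3, 1), (1, 0), (0, 1)]
```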
H.262 specifies more complex motion vector prediction. For a given macroblock, motion vector predictors may be tracked for the forward and backward directions for the whole macroblock or for each of the top and bottom halves of the macroblock. [H.262 standard, section 7.6.3.] For a given motion vector, the motion vector predictor is still typically determined from a single neighbor. Even though separate coding of fields of interlaced video frames is supported in H.262, motion vector prediction for such separately coded fields does not effectively account for polarity changes or changes in distance between a current field and reference field(s).
Other standards (such as H.263, MPEG-4, and draft JVT-D157 of H.264) determine a motion vector predictor from multiple different neighbors with different candidate motion vector predictors. [H.263 standard, section 6.1.1; MPEG-4 standard, sections 7.5.5, 7.6.2, and F.2; JVT-D157, section 8.4.1.] These are efficient for some kinds of motion. Even when separate coding of fields of interlaced video frames is supported, however, motion vector prediction for separately coded fields does not effectively account for polarity changes or changes in distance between the current field and reference field(s).
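One common way to combine several candidates, used for example in H.263, is a component-wise median of three neighboring motion vectors. The sketch below illustrates only that idea; the choice of left, above, and above-right neighbors as arguments is an assumption for illustration, and picture-boundary and missing-neighbor handling are omitted. The function names are hypothetical.

```python
def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]

def median_predictor(left, above, above_right):
    """Component-wise median of three candidate motion vector predictors."""
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))

print(median_predictor((1, 5), (3, -2), (2, 0)))  # (2, 0)
```

Note that the resulting predictor need not equal any single candidate, since each component is selected independently.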
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.