Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits. For instance, a pixel may comprise an 8-bit luminance value (also called a luma value) that defines the grayscale component of the pixel and two 8-bit chrominance values (also called chroma values) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
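The bit-rate arithmetic above can be illustrated with a short calculation. The following is a minimal sketch; the function name and the 176×144 frame size are assumptions chosen for illustration and do not come from the description above.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the raw (uncompressed) bit rate in bits per second."""
    return width * height * frames_per_second * bits_per_pixel


# Hypothetical example: a 176x144 frame (25,344 pixels) at 15 frames per second
# with 24 bits per pixel.
print(raw_bit_rate(176, 144, 15))  # 9123840
```

Even this small frame size yields roughly 9 million bits per second, which is consistent with the figure of 5 million bits per second or more.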
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which quality of the video does not suffer, but decreases in the bit rate are limited by the complexity of the video. Or, compression can be lossy, in which quality of the video suffers, but decreases in the bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include intraframe compression and interframe compression. Intraframe compression techniques compress individual frames, typically called I-frames or key frames. Interframe compression techniques compress frames with reference to preceding and/or following frames, which are typically called predicted frames, P-frames, or B-frames.
Microsoft Corporation's Windows Media Video, Version 8 (“WMV8”) includes a video encoder and a video decoder. The WMV8 encoder uses intraframe and interframe compression, and the WMV8 decoder uses intraframe and interframe decompression. Interframe compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error.
In WMV8, a frame is represented as three pixel planes: a luminance (Y) plane of luminance pixel values and two chrominance (U, V) planes of chrominance pixel values. The resolution of the Y plane is double the resolution of the U and V planes horizontally and vertically. So, a 320 pixel×240 pixel frame has a 320 pixel×240 pixel Y plane and 160 pixel×120 pixel U and V planes.
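As an illustration of the plane layout just described, the sketch below derives the plane sizes from the frame size. The function name is hypothetical, and the sketch assumes even frame dimensions.

```python
def plane_sizes(width, height):
    """Return the (width, height) of the Y, U, and V planes of a frame.

    Assumes even frame dimensions, so that halving is exact.
    """
    luma = (width, height)
    chroma = (width // 2, height // 2)  # half resolution in each direction
    return luma, chroma, chroma


print(plane_sizes(320, 240))  # ((320, 240), (160, 120), (160, 120))
```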
The WMV8 encoder splits a predicted frame into 8×8 blocks of pixels. Groups of four 8×8 luminance blocks and two co-located 8×8 chrominance blocks (one for the U chrominance plane, and one for the V chrominance plane) form 16×16 macroblocks. Thus, each 16×16 macroblock includes four 8×8 luminance blocks and two 8×8 chrominance blocks.
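The macroblock organization can be sketched as follows. This is an illustrative sketch only: the function name, the representation of planes as lists of rows, and the raster ordering of the four luminance blocks are assumptions, not details taken from WMV8.

```python
def macroblock_blocks(y_plane, u_plane, v_plane, mb_row, mb_col):
    """Return the six 8x8 blocks of one macroblock: four Y, then U, then V.

    Planes are lists of rows. The luminance blocks are returned in raster
    order (top-left, top-right, bottom-left, bottom-right), which is an
    assumption made for this sketch.
    """
    def block(plane, top, left):
        return [row[left:left + 8] for row in plane[top:top + 8]]

    top, left = 16 * mb_row, 16 * mb_col
    luma = [block(y_plane, top + dy, left + dx)
            for dy in (0, 8) for dx in (0, 8)]
    # The chrominance planes have half the resolution, so the co-located
    # 8x8 chrominance block starts at half the luminance coordinates.
    u_block = block(u_plane, top // 2, left // 2)
    v_block = block(v_plane, top // 2, left // 2)
    return luma + [u_block, v_block]
```

The 16×16 luminance area and the two 8×8 chrominance blocks cover the same region of the picture because the chrominance planes are subsampled by two in each direction.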
For a macroblock of a predicted frame, the WMV8 encoder performs motion estimation. The motion estimation approximates the motion of a macroblock in a predicted frame by searching for and matching the macroblock in the predicted frame with a macroblock from a reference frame. In FIG. 1, for instance, the WMV8 encoder computes a motion vector for a macroblock (115) in the predicted frame (110). To compute the motion vector, the encoder searches in a search area (135) of a reference frame (130). Within the search area (135), the encoder compares the luminance values of the macroblock (115) from the predicted frame (110) to the luminance values of various candidate blocks from the reference frame (130) in order to find a good match. The WMV8 encoder may switch motion vector accuracy, and may use a search range and motion vectors with integer, half, or quarter-pixel horizontal resolution and integer or half-pixel vertical resolution. With sub-pixel accurate motion vectors, the WMV8 encoder can approximate sub-pixel motion in a video sequence.
During motion compensation, the WMV8 encoder uses the motion vectors for macroblocks of the predicted frame to determine the predictors for the macroblocks from the reference frame. For each of the motion-predicted macroblocks, the WMV8 encoder computes the difference (called the residual or error) between the original macroblock and its predictor. The WMV8 encoder splits the residual into blocks and applies lossy compression to the residual blocks. To reconstruct the motion-predicted macroblocks of the predicted frame, the WMV8 encoder decompresses the residuals and adds them to the predictors for the respective macroblocks.
The WMV8 decoder also uses the motion vectors for macroblocks of the predicted frame to determine the predictors for the macroblocks from the reference frame. To reconstruct the motion-predicted macroblocks of the predicted frame, the WMV8 decoder decompresses the residuals and adds them to the predictors for the macroblocks.
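The predictor, residual, and reconstruction steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: motion vectors are restricted to integer-pixel accuracy, the lossy compression of the residual is omitted (so the reconstruction here is exact), and all function names are hypothetical.

```python
def predictor(reference, x, y, motion_vector, size=16):
    """Fetch the size x size predictor block that the motion vector points to.

    (x, y) is the top-left corner of the block in the predicted frame and
    motion_vector is an integer-pixel (dx, dy) displacement.
    """
    dx, dy = motion_vector
    return [row[x + dx:x + dx + size]
            for row in reference[y + dy:y + dy + size]]


def residual(original_block, predictor_block):
    """Difference between the original block and its predictor."""
    return [[o - p for o, p in zip(o_row, p_row)]
            for o_row, p_row in zip(original_block, predictor_block)]


def reconstruct(predictor_block, residual_block):
    """Add the (decompressed) residual back to the predictor."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(predictor_block, residual_block)]
```

Because the encoder and the decoder both form the predictor from the same reference frame and the same motion vector, only the motion vector and the compressed residual need to be transmitted.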
During motion estimation or compensation, when a motion vector has sub-pixel accuracy (i.e., half-pixel or quarter-pixel), the WMV8 encoder or decoder must determine pixel values at sub-pixel positions in the reference frame. The WMV8 encoder or decoder generates values for sub-pixel positions using interpolation filters. FIG. 2 shows sub-pixel sample positions H0, H1, H2, which have values computed by interpolation of integer-pixel values a, b, c, . . . , p.
When operating with half-pixel motion vector accuracy, the interpolation filters used for luminance pixel values at the three distinct half-pixel positions H0, H1, H2 are:
H0=(f+g+R2)>>1  (1),
H1=(f+j+R2)>>1  (2), and
H2=(f+g+j+k+R1)>>2  (3),
where R1 and R2 are rounding control values that are controlled by a one-bit rounding-control flag that indicates the rounding mode for a particular frame. If the rounding-control flag is set to 0, then R1=2 and R2=1. If the rounding-control flag is set to 1, then R1=R2=0. The value of the rounding-control flag alternates between 1 and 0 for each P-frame. At each I frame, the value of the rounding-control flag is reset to 0. Thus, the rounding control operates on a frame-by-frame basis.
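Equations (1) through (3) and the frame-by-frame rounding control can be sketched as follows. The function names are hypothetical. The sketch also assumes that the flag toggles at each P-frame relative to the flag of the preceding frame, so that the first P-frame after an I-frame uses the value 1; the description above does not state this explicitly.

```python
def rounding_values(flag):
    """Return (R1, R2) for a one-bit rounding-control flag."""
    return (2, 1) if flag == 0 else (0, 0)


def bilinear_half_pixels(f, g, j, k, flag):
    """Return (H0, H1, H2) per equations (1)-(3) for integer pixels f, g, j, k."""
    r1, r2 = rounding_values(flag)
    h0 = (f + g + r2) >> 1
    h1 = (f + j + r2) >> 1
    h2 = (f + g + j + k + r1) >> 2
    return h0, h1, h2


def rounding_flags(frame_types):
    """Return the rounding-control flag for each frame in a string like 'IPPP'.

    ASSUMPTION: the flag is reset to 0 at each I-frame and toggled at each
    P-frame relative to the previous frame's flag.
    """
    flags = []
    flag = 0
    for frame_type in frame_types:
        flag = 0 if frame_type == "I" else 1 - flag
        flags.append(flag)
    return flags


print(bilinear_half_pixels(10, 11, 12, 14, flag=0))  # (11, 11, 12)
print(bilinear_half_pixels(10, 11, 12, 14, flag=1))  # (10, 11, 11)
```

With the flag set to 0 the results are rounded to nearest (halves rounded up), and with the flag set to 1 they are rounded down, so alternating the flag keeps rounding from being biased in one direction across successive P-frames.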
Equations 1, 2, and 3 are examples of bilinear interpolation. Bilinear interpolation is fast and tends to smooth pixel values. The smoothing may have desirable effects (such as decreasing perceptibility of quantization noise), but it can also lead to loss of valid pixel information.
For quarter-pixel motion vector resolution, the WMV8 encoder or decoder first employs bicubic filters to interpolate luminance pixel values at half-pixel positions. Bicubic interpolation is slower than bilinear interpolation, but tends to preserve edge values and result in less loss of valid pixel information. The bicubic filters for the three distinct half-pixel positions H0, H1, H2 are:
H0=(−e+9f+9g−h+8)>>4  (4),
H1=(−b+9f+9j−n+8)>>4  (5), and
H2=(−t0+9t1+9t2−t3+8)>>4  (6),
where t0, t1, t2, t3 are computed as follows:
t0=(−a+9b+9c−d+8)>>4  (7),
t1=(−e+9f+9g−h+8)>>4  (8),
t2=(−i+9j+9k−l+8)>>4  (9), and
t3=(−m+9n+9o−p+8)>>4  (10).
Equations (4)–(10) can result in output outside of the range of input values. For example, for 8-bit input (range 0 . . . 255), the series of values 0 255 255 0 produces an output value of 287 in any of equations (4)–(10). So, the WMV8 encoder or decoder clamps (or, “clips”) the output value of any of equations (4)–(10) to be within the valid range. For example, for 8-bit output values, values less than 0 are changed to 0, and values greater than 255 are changed to 255. Clamping addresses the range problem, but slows down computation. In addition, clamping results in loss of precision.
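Equations (4) through (10), together with the clamping step, can be sketched as follows. The function names and the representation of the integer pixels a through p as a 4×4 list of rows are assumptions made for illustration.

```python
def clamp(value, low=0, high=255):
    """Clip a value into the valid 8-bit range."""
    return max(low, min(high, value))


def bicubic_raw(p0, p1, p2, p3):
    """Four-tap filter (-1, 9, 9, -1) with rounding offset 8 and a 4-bit shift."""
    return (-p0 + 9 * p1 + 9 * p2 - p3 + 8) >> 4


def bicubic_tap(p0, p1, p2, p3):
    """Filter output clamped to 0..255, as described above."""
    return clamp(bicubic_raw(p0, p1, p2, p3))


def bicubic_half_pixels(grid):
    """Return (H0, H1, H2) for a 4x4 grid with rows (a..d), (e..h), (i..l), (m..p)."""
    h0 = bicubic_tap(*grid[1])                      # equation (4): e, f, g, h
    h1 = bicubic_tap(*(row[1] for row in grid))     # equation (5): b, f, j, n
    t = [bicubic_tap(*row) for row in grid]         # equations (7)-(10)
    h2 = bicubic_tap(*t)                            # equation (6)
    return h0, h1, h2


print(bicubic_raw(0, 255, 255, 0))  # 287, outside the 8-bit range
print(bicubic_tap(0, 255, 255, 0))  # 255 after clamping
```

In this sketch, the clamp is applied after every filter stage, including the intermediate values t0 through t3.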
The WMV8 encoder or decoder then computes pixel values at certain quarter-pixel positions in a subsequent stage of interpolation. These quarter-pixel locations are situated horizontally in between either two half-pixel locations or an integer-pixel location and a half-pixel location. For these quarter-pixel locations, the WMV8 encoder or decoder uses bilinear interpolation (i.e., (x+y+1)>>1) using the two horizontally neighboring half-pixel/integer-pixel locations without rounding control.
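The second interpolation stage can be sketched as follows. The function names and the sample values are assumptions made for illustration; the filter itself is the (x+y+1)>>1 expression given above.

```python
def quarter_pixel(x, y):
    """Bilinear average of two horizontally neighboring sample values.

    The rounding offset is fixed at 1; no rounding-control flag is applied.
    """
    return (x + y + 1) >> 1


def quarter_pixels_between(left_integer, half, right_integer):
    """Return the values at the 1/4 and 3/4 positions between two integer pixels.

    'half' is the previously interpolated half-pixel value between them.
    """
    return quarter_pixel(left_integer, half), quarter_pixel(half, right_integer)


# Hypothetical sample values: integer pixels 100 and 106, half-pixel value 103.
print(quarter_pixels_between(100, 103, 106))  # (102, 105)
```

Because the offset is always 1, every result of this stage is rounded in the same direction regardless of the rounding mode of the frame.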
Once luminance motion vectors are computed, the WMV8 encoder or decoder derives co-located chrominance motion vectors. Because a chrominance plane in WMV8 is half as large as a luminance plane both horizontally and vertically, luminance motion vector values must be scaled into appropriate chrominance motion vector values. In WMV8, this conversion process includes halving the luminance motion vectors and rounding the resulting chrominance motion vectors to half-pixel accuracy. Thus, luminance motion vectors having half-pixel accuracy are not converted to chrominance motion vectors having quarter-pixel accuracy. Moreover, chrominance rounding in WMV8 operates in a single mode that cannot be modified or selected by the user.
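The halving and rounding of a motion vector component can be sketched as follows. This is not the WMV8 rounding table, which is not given above. The sketch assumes luminance components expressed in quarter-pixel units and rounding to the nearest half-pixel with ties away from zero; the function name is hypothetical.

```python
def chroma_mv_component(luma_quarter_units):
    """Convert one luminance motion vector component to a chrominance component.

    Input: displacement in quarter-pixel units of the luminance plane.
    Output: displacement in half-pixel units of the chrominance plane.
    """
    magnitude = abs(luma_quarter_units)
    # Halving gives magnitude / 2 quarter-pixels, i.e. magnitude / 4 half-pixels.
    # ASSUMPTION: round to the nearest half-pixel, ties away from zero.
    half_units = (magnitude + 2) // 4
    return half_units if luma_quarter_units >= 0 else -half_units


# A luminance displacement of 1 pixel (4 quarter-pixel units) becomes a
# chrominance displacement of 1/2 pixel (1 half-pixel unit).
print(chroma_mv_component(4))  # 1
```

Whatever rounding rule is used, the output never takes a quarter-pixel value, which is the limitation noted above.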
In WMV8, the pixel values at sub-pixel positions in a reference frame may exhibit underflow or overflow in some circumstances. For example, the luminance pixel value at a quarter-pixel position may be 271 (which is outside the range of 0 . . . 255) if the neighboring integer-pixel position value is 255 and the neighboring half-pixel position value is 287, since (−0+9*255+9*255−0+8)>>4=287 and (255+287+1)>>1=271. To address this problem, after adding the residual blocks to the predictor for a macroblock, the WMV8 encoder and decoder clamp reconstructed values for the macroblock to be within the range of 0 . . . 255, if necessary.
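The overflow arithmetic and the final clamp can be checked with a short sketch. The function name and the sample residual values are assumptions made for illustration.

```python
# Worked values from the example above.
half_pixel = (-0 + 9 * 255 + 9 * 255 - 0 + 8) >> 4   # 287
quarter_pixel = (255 + half_pixel + 1) >> 1          # 271


def reconstruct_clamped(predictor_block, residual_block):
    """Add residuals to predictors and clamp each result to 0..255."""
    return [[max(0, min(255, p + r)) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(predictor_block, residual_block)]


# Hypothetical 1x3 block: an out-of-range predictor, an in-range value,
# and a sum that underflows.
print(reconstruct_clamped([[quarter_pixel, 120, 3]], [[0, 5, -10]]))
# [[255, 125, 0]]
```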
Aside from WMV8, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262, and H.263 standards from the International Telecommunication Union [“ITU”]. Like WMV8, these standards use a combination of intraframe and interframe compression, although the standards typically differ from WMV8 in the details of the compression techniques used.
Several standards (e.g., MPEG 4 and H.263) provide for half-pixel motion estimation and compensation using bilinear filters and basic rounding control. Moreover, in H.263, chrominance motion vectors which theoretically have quarter-pixel resolution (i.e., one half of the resolution of the half-pixel luminance motion vectors) are rounded to either half-pixel or full-pixel accuracy so that no quarter-pixel values are allowed in chrominance space. For additional detail about motion estimation/compensation in the standards, see the standards' specifications themselves.
Motion estimation and compensation are effective compression techniques, but the various previous motion estimation/compensation techniques (as in WMV8 and the standards discussed above) have several disadvantages, including:
(1) When computing pixel values at sub-pixel positions in reference frames, the encoders and decoders unnecessarily lose precision in intermediate values. For instance, when computing the pixel value for a quarter-pixel position in WMV8, the intermediate values at half-pixel positions are right-shifted by four bits despite the fact that a greater bit depth might be available. Further, the WMV8 encoder/decoder clamps intermediate values during the two-stage interpolation of quarter-pixel positions, which slows down computation and results in the unnecessary loss of precision.
(2) Interpolation for pixel values in quarter-pixel motion estimation and compensation is inefficient in many cases. For example, in WMV8, the calculation of a one-dimensional quarter-pixel position requires the use of a filter for a half-pixel position followed by use of a bilinear filter.
(3) The encoders and decoders fail to account for the accumulation of rounding error that might be created in multi-stage interpolation. Rounding error occurs, for example, when pixel values are repeatedly rounded down from frame to frame in a video sequence. This rounding error can cause perceptible artifacts in low-quality, low-bitrate video sequences. For instance, when the WMV8 encoder and decoder interpolate for a pixel value at a quarter-pixel position in multiple stages, rounding control is not used; instead, the results of every stage of interpolation are rounded in the same fashion.
(4) Chrominance rounding is not performed to quarter-pixel accuracy, and no control is given over chrominance motion vector rounding options. For example, the WMV8 encoder and decoder round all chrominance motion vectors to a half-pixel value and operate in only a single mode.
Given the critical importance of motion estimation and compensation to digital video, it is not surprising that motion estimation and compensation are richly developed fields. Whatever the benefits of previous motion estimation and compensation techniques, however, they do not have the advantages of the following techniques and tools.