Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
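As an illustrative calculation only, the raw bit rate follows from multiplying the number of pixels per frame, the bits per pixel, and the frame rate. The 176×144 frame size used below is an assumed example (it is not a value stated above), chosen because it falls within the "tens of thousands of pixels" range.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the raw (uncompressed) bit rate in bits per second."""
    return width * height * frames_per_second * bits_per_pixel

# Assumed example: a 176x144 frame (25,344 pixels) at 15 frames per second,
# with three 8-bit samples (24 bits) per pixel.
rate = raw_bit_rate(176, 144, 15)
print(rate)  # 9123840
```

Even at this modest frame size and the lower of the two frame rates, the result exceeds 5 million bits per second.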
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
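The combination of lossy and lossless stages can be sketched as follows. This is a minimal illustration under stated assumptions, not any particular codec: uniform quantization stands in for the lossy stage, Python's zlib (DEFLATE) stands in for the lossless stage, and the sample data, the step size, and the helper names lossy_quantize and dequantize are hypothetical.

```python
import zlib

def lossy_quantize(samples, step):
    """Lossy stage: map each 8-bit sample to the index of its nearest level."""
    return bytes((s + step // 2) // step for s in samples)

def dequantize(indices, step):
    """Rebuild approximate 8-bit samples from quantization indices."""
    return [min(255, q * step) for q in indices]

# Hypothetical 8-bit sample data, for illustration only.
samples = bytes((i * 7 + (i % 5)) % 256 for i in range(1024))
step = 16

indices = lossy_quantize(samples, step)   # approximation (information is lost)
packed = zlib.compress(indices)           # lossless representation of the approximation
restored = zlib.decompress(packed)        # exactly recovers the indices
approx = dequantize(restored, step)       # approximate version of the original samples

max_error = max(abs(a - s) for a, s in zip(approx, samples))
```

The lossless stage recovers the quantization indices exactly, so all of the error in the reconstructed samples comes from the lossy stage and is bounded by half the quantization step.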
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. For progressive frames, intra-picture compression techniques compress individual frames (typically called I-frames or key frames), and inter-picture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to a preceding and/or following frame (typically called a reference or anchor frame) or frames (for B-frames).
Inter-picture compression techniques often use motion estimation and motion compensation. For motion estimation, for example, an encoder divides a current predicted frame into 8×8 or 16×16 pixel units. For a unit of the current frame, a similar unit in a reference frame is found for use as a predictor. A motion vector [“MV”] indicates the location of the predictor in the reference frame. In other words, the MV for a unit of the current frame indicates the displacement between the spatial location of the unit in the current frame and the spatial location of the predictor in the reference frame. The encoder computes the sample-by-sample difference between the current unit and the predictor to determine a residual (also called error signal). If the current unit size is 16×16, the residual is divided into four 8×8 blocks. To each 8×8 residual, the encoder applies a reversible frequency transform operation, which generates a set of frequency domain (i.e., spectral) coefficients. A discrete cosine transform [“DCT”] is a type of frequency transform. The resulting blocks of spectral coefficients are quantized and entropy encoded.
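The motion estimation and residual steps described above can be sketched as follows. This is a simplified illustration, assuming an exhaustive integer-pixel search and a sum-of-absolute-differences matching criterion, neither of which is mandated by the description above; the function names and the synthetic frames are hypothetical, and the frequency transform, quantization, and entropy coding stages are omitted.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n unit at (cx, cy) in cur
    and the n x n candidate at (rx, ry) in ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def estimate_motion(cur, ref, x, y, n=8, search=4):
    """Exhaustive search; return the MV (dx, dy) with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:  # candidate stays inside ref
                cost = sad(cur, ref, x, y, rx, ry, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]

def residual(cur, ref, x, y, mv, n=8):
    """Sample-by-sample difference between the current unit and its predictor."""
    dx, dy = mv
    return [[cur[y + j][x + i] - ref[y + dy + j][x + dx + i]
             for i in range(n)] for j in range(n)]

# Synthetic reference frame, and a current frame whose content equals the
# reference content displaced by (+2, -1); edges are filled by clamping.
W = H = 24
ref = [[(13 * x + 31 * y + 7 * x * y) % 251 for x in range(W)] for y in range(H)]
cur = [[ref[min(max(y - 1, 0), H - 1)][min(max(x + 2, 0), W - 1)]
        for x in range(W)] for y in range(H)]

mv = estimate_motion(cur, ref, 8, 8)
res = residual(cur, ref, 8, 8, mv)
print(mv)  # (2, -1)
```

Because the current unit is an exact displaced copy of reference content in this example, the residual is all zeros; for real video the residual is generally nonzero and is what the transform, quantization, and entropy coding stages operate on.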
If the predicted frame is used as a reference for subsequent motion compensation, the encoder reconstructs the predicted frame. When reconstructing residuals, the encoder reconstructs transform coefficients (e.g., DCT coefficients) that were quantized and performs an inverse frequency transform such as an inverse DCT [“IDCT”]. The encoder performs motion compensation to compute the predictors, and combines the predictors with the residuals.
During decoding, a decoder typically entropy decodes information and performs analogous operations to reconstruct residuals, perform motion compensation, and combine the predictors with the residuals.
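The reconstruction path shared by the encoder and decoder can be sketched as follows. This is a simplified illustration: the frequency transform and its inverse are omitted, so quantization is applied directly to residual samples, and the uniform quantizer, the function names, and the synthetic data are assumptions made for the example.

```python
def motion_compensate(ref, x, y, mv, n=8):
    """Fetch the n x n predictor that the MV points to in the reference frame."""
    dx, dy = mv
    return [[ref[y + dy + j][x + dx + i] for i in range(n)] for j in range(n)]

def quantize(block, step):
    """Encoder side: map residual samples to quantization levels (lossy)."""
    return [[round(v / step) for v in row] for row in block]

def dequantize(levels, step):
    """Encoder and decoder side: rebuild approximate residual samples."""
    return [[q * step for q in row] for row in levels]

def reconstruct(ref, x, y, mv, levels, step, n=8):
    """Combine the predictor with the reconstructed residual, clamped to 8 bits."""
    pred = motion_compensate(ref, x, y, mv, n)
    res = dequantize(levels, step)
    return [[max(0, min(255, pred[j][i] + res[j][i]))
             for i in range(n)] for j in range(n)]

# Synthetic data: the current unit equals its predictor plus a small residual.
ref = [[10 + (5 * x + 3 * y) % 200 for x in range(16)] for y in range(16)]
mv = (1, -2)
step = 4
pred = motion_compensate(ref, 4, 4, mv)
cur = [[pred[j][i] + ((3 * i + 5 * j) % 11) - 5 for i in range(8)] for j in range(8)]

levels = quantize([[cur[j][i] - pred[j][i] for i in range(8)] for j in range(8)], step)
recon = reconstruct(ref, 4, 4, mv, levels, step)
max_err = max(abs(recon[j][i] - cur[j][i]) for j in range(8) for i in range(8))
```

Because the encoder runs the same reconstruct step on the same quantized levels as the decoder, both sides hold identical reference frames even though the reconstruction differs from the original by up to half a quantization step.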
Since a MV value is often correlated with the values of spatially surrounding MVs, compression of the data used to transmit the MV information can be achieved by determining or selecting a MV predictor from neighboring macroblocks and predicting the MV for the current macroblock using the MV predictor. The encoder can encode the differential [“DMV”] between the MV and the MV predictor. For example, the encoder computes the difference between the horizontal component of the MV and the horizontal component of the MV predictor, computes the difference between the vertical component of the MV and the vertical component of the MV predictor, and encodes the differences. After reconstructing the MV by adding the DMV to the MV predictor, a decoder uses the MV to compute a prediction macroblock for the macroblock using information from the reference frame, which is a previously reconstructed frame available at the encoder and the decoder.
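The differential coding described above can be sketched as follows. The component-wise median of three neighboring MVs is used here only as one common way to form a predictor; it is an assumption for the example, as are the function names and the neighbor values.

```python
def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]

def predict_mv(left, above, above_right):
    """Assumed predictor rule: component-wise median of three neighboring MVs."""
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))

def encode_dmv(mv, predictor):
    """Encoder: horizontal and vertical differences between MV and predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])

def decode_mv(dmv, predictor):
    """Decoder: reconstruct the MV by adding the DMV to the predictor."""
    return (dmv[0] + predictor[0], dmv[1] + predictor[1])

# Hypothetical neighboring MVs and current MV, as (horizontal, vertical).
predictor = predict_mv((4, -2), (6, -1), (5, -3))
mv = (6, -2)
dmv = encode_dmv(mv, predictor)
print(predictor, dmv)  # (5, -2) (1, 0)
```

When neighboring MVs are well correlated, the DMV components cluster near zero and can be represented with short codes; the decoder derives the same predictor from previously decoded MVs and recovers the MV exactly.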
I. Inter Compression in Windows Media Video, Version 9
Microsoft Corporation's Windows Media Video, Version 9 [“WMV9”] includes a video encoder and a video decoder. The encoder uses intra and inter compression, and the decoder uses intra and inter decompression. The encoder and decoder may process progressive or interlaced video content.
Various configurations are allowed for MVs and macroblocks, including one MV per macroblock (1 MV macroblock), up to four luma block MVs per macroblock (4 MV macroblock) for a progressive P-frame, and one MV per top or bottom field of a field-coded macroblock in an interlaced P-frame. The rules for computing MV predictors vary for different types of content, macroblocks, and locations in a frame. However the MV predictors are computed, the various kinds of MVs are encoded as DMVs relative to the MV predictors.
The encoder and decoder use extended range MVs in some cases. The capability to use extended range MVs is signaled at sequence layer for a video sequence. If extended range MVs are allowed in a progressive P-frame, for example, the range for MVs is signaled at picture layer for the progressive P-frame. A default MV range is used when an extended MV range is not used.
A single MVDATA element is associated with all blocks in a 1 MV macroblock. MVDATA signals whether the blocks are coded as intra or inter type. If they are coded as inter, then MVDATA also indicates the DMV. Individual blocks within a 4 MV macroblock can be coded as intra blocks. For each of the four luminance blocks of a 4 MV macroblock, the intra/inter state is signaled by a BLKMVDATA element associated with that block. For a 4 MV macroblock, a CBPCY element indicates which blocks have BLKMVDATA elements present in the bitstream.
More specifically, a MVDATA or BLKMVDATA element jointly encodes three things: (1) the horizontal DMV component; (2) the vertical DMV component; and (3) a binary “last” flag that generally indicates whether transform coefficients are present. Whether the macroblock (or block, for 4 MV) is intra or inter-coded is signaled as one of the DMV possibilities. The pseudocode in FIG. 1A illustrates how DMV information, inter/intra type, and last flag information are decoded for MVDATA or BLKMVDATA. In the pseudocode, the variable intra_flag is a binary flag indicating whether the block or macroblock is intra. The variables dmv_x and dmv_y are horizontal and vertical DMV components, respectively. The variables k_x and k_y are fixed lengths for extended range MVs, whose values vary as shown in the table in FIG. 1B. The variable halfpel_flag is a binary value indicating whether half-pixel or quarter-pixel precision is used for the MV, and whose value is set based on picture layer syntax elements. Finally, the tables size_table and offset_table are arrays defined as follows:
size_table[6]={0, 2, 3, 4, 5, 8}, and
offset_table[6]={0, 1, 3, 7, 15, 31}.
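The pseudocode of FIG. 1A is not reproduced here. The following sketch shows one way a size/offset table pair of this kind can map a table index and a fixed-length bit field to a signed differential. It assumes that the least significant bit of the field carries the sign and the remaining bits extend the magnitude above the offset; that bit layout, and the helper names decode_component and magnitude_range, are assumptions for illustration. Escape coding, the intra case, half-pixel handling, and the extended-range lengths k_x and k_y are omitted.

```python
SIZE_TABLE = [0, 2, 3, 4, 5, 8]       # number of fixed-length bits per index
OFFSET_TABLE = [0, 1, 3, 7, 15, 31]   # smallest magnitude covered by each index

def decode_component(index, field):
    """Map a table index and its SIZE_TABLE[index]-bit field to a signed value.

    Assumed layout: bit 0 of the field is the sign, and the remaining bits
    are added to OFFSET_TABLE[index] to form the magnitude.
    """
    if not 0 <= field < (1 << SIZE_TABLE[index]):
        raise ValueError("field does not fit in the bits allotted to this index")
    magnitude = (field >> 1) + OFFSET_TABLE[index]
    return -magnitude if field & 1 else magnitude

def magnitude_range(index):
    """Smallest and largest magnitude reachable with the given index."""
    size = SIZE_TABLE[index]
    extra = (1 << (size - 1)) - 1 if size else 0
    return OFFSET_TABLE[index], OFFSET_TABLE[index] + extra

ranges = [magnitude_range(i) for i in range(6)]
print(ranges)  # [(0, 0), (1, 2), (3, 6), (7, 14), (15, 30), (31, 158)]
```

Under this assumed layout the magnitude ranges of successive indices are contiguous and non-overlapping, which is consistent with each offset_table entry being the first magnitude not covered by the preceding index.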
In a field-coded macroblock of an interlaced P-frame, a TOPMVDATA element is associated with the top field blocks, and a BOTMVDATA element is associated with the bottom field blocks. TOPMVDATA indicates whether the top field blocks are intra or inter. If they are inter, then TOPMVDATA also indicates the DMV for the top field blocks. Likewise, BOTMVDATA signals the inter/intra state for the bottom field blocks, and potential DMV information for the bottom field blocks. CBPCY indicates which fields have MV data elements present in the bitstream. For frame-coded macroblocks (1 MV) or field-coded macroblocks of interlaced P-frames, MVDATA, TOPMVDATA, and BOTMVDATA elements are decoded the same way as MVDATA and BLKMVDATA for MVs for progressive P-frames.
While the WMV9 encoder and WMV9 decoder are efficient for many different encoding/decoding scenarios and types of content, there is room for improvement in several places. In particular, coding of DMV information is inefficient in certain high-motion scenes with complex motion. For such scenes, MV prediction is not particularly effective, and a large number of DMVs are signaled with escape coding (i.e., the escape code and fixed length codes [“FLCs”]).
II. Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] MPEG-1, MPEG-2, and MPEG-4 standards and the H.261, H.262 (another name for MPEG-2), H.263, and H.264 standards from the International Telecommunication Union [“ITU”]. An encoder and decoder complying with one of these standards typically use motion estimation and compensation to reduce the temporal redundancy between pictures.
Each of H.261, H.262, H.263, MPEG-1, MPEG-4, and H.264 specifies some form of DMV coding and decoding, although the details of the coding and decoding vary widely between the standards. DMV coding and decoding is simplest in the H.261 standard, for example, in which one variable length code [“VLC”] represents the horizontal differential component, and another VLC represents the vertical differential component. [H.261 standard, section 4.2.3.4.] Other standards specify more complex coding and decoding for DMV information. For additional detail, see the respective standards.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.