Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
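As a worked illustration of the arithmetic above, the following Python sketch computes a raw bit rate. The 176×144 frame size and the function name raw_bit_rate are assumptions chosen for illustration and do not come from the description above.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the bit rate, in bits per second, of uncompressed video."""
    return width * height * bits_per_pixel * frames_per_second


# Hypothetical 176x144 frame (25,344 pixels) at 15 frames per second,
# with three eight-bit samples (24 bits) per pixel.
print(raw_bit_rate(176, 144, 15))  # 9123840
```

Even this small frame size at the lower frame rate exceeds 5 million bits per second.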
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. For progressive frames, intra-picture compression techniques compress individual frames (typically called I-frames or key frames), and inter-picture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to a preceding and/or following frame (typically called a reference or anchor frame) or frames (for B-frames).
Inter-picture compression techniques often use motion estimation and motion compensation. For motion estimation, for example, an encoder divides a current predicted frame into 8×8 or 16×16 pixel units. For a unit of the current frame, a similar unit in a reference frame is found for use as a predictor. A motion vector indicates the location of the predictor in the reference frame. In other words, the motion vector for a unit of the current frame indicates the displacement between the spatial location of the unit in the current frame and the spatial location of the predictor in the reference frame. The encoder computes the sample-by-sample difference between the current unit and the predictor to determine a residual (also called error signal). If the current unit size is 16×16, the residual is divided into four 8×8 blocks. To each 8×8 residual, the encoder applies a reversible frequency transform operation, which generates a set of frequency domain (i.e., spectral) coefficients. A discrete cosine transform [“DCT”] is a type of frequency transform. The resulting blocks of spectral coefficients are quantized and entropy encoded. If the predicted frame is used as a reference for subsequent motion compensation, the encoder reconstructs the predicted frame. When reconstructing residuals, the encoder reconstructs transform coefficients (e.g., DCT coefficients) that were quantized and performs an inverse frequency transform such as an inverse DCT [“IDCT”]. The encoder performs motion compensation to compute the predictors, and combines the predictors with the residuals. During decoding, a decoder typically entropy decodes information and performs analogous operations to reconstruct residuals, perform motion compensation, and combine the predictors with the residuals.
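The motion estimation and residual computation described above can be sketched as follows. This is a minimal illustration, not the method of any particular encoder: the exhaustive search, the sum-of-absolute-differences matching criterion, and all function names are assumptions, since the description above does not say how a "similar unit" is found. Frames are plain lists of rows of sample values.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current unit and a candidate."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def estimate_motion(cur, ref, cx, cy, size, search_range):
    """Exhaustively search ref around (cx, cy); return the best (dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def residual(cur, ref, cx, cy, mv, size):
    """Sample-by-sample difference between the current unit and its predictor."""
    dx, dy = mv
    return [
        [cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i] for i in range(size)]
        for j in range(size)
    ]


# Synthetic frames: cur is ref displaced by 2 samples horizontally and 1 vertically.
def sample(x, y):
    return 3 * x * x + 5 * y * y + 7 * x * y


ref = [[sample(x, y) for x in range(16)] for y in range(16)]
cur = [[sample(x + 2, y + 1) for x in range(16)] for y in range(16)]
print(estimate_motion(cur, ref, 4, 4, 4, 3))  # (2, 1)
```

Here the motion vector is expressed as the predictor location minus the current unit location, which is one possible sign convention.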
I. Inter Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
Inter compression in the WMV8 encoder uses block-based motion-compensated prediction coding followed by transform coding of the residual error. FIGS. 1 and 2 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 1 illustrates motion estimation for a predicted frame (110) and FIG. 2 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 1, the WMV8 encoder computes a motion vector for a macroblock (115) in the predicted frame (110). To compute the motion vector, the encoder searches in a search area (135) of a reference frame (130). Within the search area (135), the encoder compares the macroblock (115) from the predicted frame (110) to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock.
Since a motion vector value is often correlated with the values of spatially surrounding motion vectors, compression of the data used to transmit the motion vector information can be achieved by determining or selecting a motion vector predictor from neighboring macroblocks and predicting the motion vector for the current macroblock using the motion vector predictor. The encoder can encode the differential between the motion vector and the motion vector predictor. For example, the encoder computes the difference between the horizontal component of the motion vector and the horizontal component of the motion vector predictor, computes the difference between the vertical component of the motion vector and the vertical component of the motion vector predictor, and encodes the differences.
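The differential computation described above, and the matching reconstruction at the decoder, can be sketched as follows. Motion vectors are represented as (horizontal, vertical) tuples, and the function names are hypothetical; entropy coding of the differences is omitted.

```python
def encode_differential(motion_vector, predictor):
    """Encoder side: component-wise difference from the predictor."""
    return (motion_vector[0] - predictor[0], motion_vector[1] - predictor[1])


def reconstruct_motion_vector(differential, predictor):
    """Decoder side: add the differential back to the same predictor."""
    return (differential[0] + predictor[0], differential[1] + predictor[1])


mv = (5, -3)
predictor = (4, -1)
dmv = encode_differential(mv, predictor)
print(dmv)                                        # (1, -2)
print(reconstruct_motion_vector(dmv, predictor))  # (5, -3)
```

When neighboring motion vectors are well correlated, the differential components tend to be small, which is what makes them cheaper to entropy code than the motion vector itself.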
After reconstructing the motion vector by adding the differential to the motion vector predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock (115) using information from the reference frame (130), which is a previously reconstructed frame available at the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock (115) itself.
FIG. 2 illustrates an example of computation and encoding of an error block (235) in the WMV8 encoder. The error block (235) is the difference between the predicted block (215) and the original current block (225). The encoder applies a discrete cosine transform [“DCT”] (240) to the error block (235), resulting in an 8×8 block (245) of coefficients. The encoder then quantizes (250) the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients (255). The encoder scans (260) the 8×8 block (255) into a one-dimensional array (265) such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding (270). The encoder selects an entropy code from one or more run/level/last tables (275) and outputs the entropy code.
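The scan and run length steps described above can be sketched as follows. The description does not give the exact scan pattern or the contents of the run/level/last tables, so this sketch assumes a conventional zigzag scan and stops at producing (run, level, last) triples; the lookup of an entropy code for each triple is omitted. All function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, column) positions in a conventional zigzag order."""
    order = []
    for s in range(2 * n - 1):
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diagonal.reverse()  # even diagonals run from bottom-left to top-right
        order.extend(diagonal)
    return order


def scan(block):
    """Scan a two-dimensional block into a one-dimensional array."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_level_last(scanned):
    """Convert scanned coefficients into (run, level, last) triples.

    run   = number of zero coefficients preceding the non-zero coefficient
    level = value of the non-zero coefficient
    last  = True for the final non-zero coefficient in the block
    """
    nonzero = [i for i, value in enumerate(scanned) if value != 0]
    triples = []
    previous = -1
    for k, index in enumerate(nonzero):
        triples.append((index - previous - 1, scanned[index], k == len(nonzero) - 1))
        previous = index
    return triples


block = [[0] * 8 for _ in range(8)]
block[0][0], block[1][0], block[0][2] = 5, -2, 1
print(run_level_last(scan(block)))
# [(0, 5, False), (1, -2, False), (2, 1, True)]
```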
FIG. 3 shows an example of a corresponding decoding process (300) for an inter-coded block. In summary of FIG. 3, a decoder decodes (310, 320) entropy-coded information representing a prediction residual using variable length decoding (310) with one or more run/level/last tables (315) and run length decoding (320). The decoder inverse scans (330) a one-dimensional array (325) storing the entropy-decoded information into a two-dimensional block (335). The decoder inverse quantizes and inverse discrete cosine transforms (together, 340) the data, resulting in a reconstructed error block (345). In a separate motion compensation path, the decoder computes a predicted block (365) using motion vector information (355) for displacement from a reference frame. The decoder combines (370) the predicted block (365) with the reconstructed error block (345) to form the reconstructed block (375).
II. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing in raster scan fashion through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
The primary aspect of interlaced video is that the raster scan of an entire video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. This results in each frame containing two fields representing two different time epochs. FIG. 4 shows an interlaced video frame (400) that includes top field (410) and bottom field (420). In the frame (400), the even-numbered lines (top field) are scanned starting at one time (e.g., time t), and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). This timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present when the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
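The rearrangement of an interlaced frame into a field structure can be sketched as follows. A frame is represented as a list of lines, with line 0 at the top; the function names are hypothetical.

```python
def split_fields(frame):
    """Group even-numbered lines (top field) and odd-numbered lines (bottom field)."""
    top_field = frame[0::2]
    bottom_field = frame[1::2]
    return top_field, bottom_field


def interleave_fields(top_field, bottom_field):
    """Restore the original alternating line arrangement of the frame."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame


frame = ["line0", "line1", "line2", "line3"]
top, bottom = split_fields(frame)
print(top)     # ['line0', 'line2']
print(bottom)  # ['line1', 'line3']
```

The interleave function assumes a frame with an even number of lines, so that both fields have the same length.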
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. Previous Coding and Decoding in a WMV Encoder and Decoder
Previous software for a WMV encoder and decoder, released in executable form, has used coding and decoding of progressive and interlaced P-frames. While the encoder and decoder are efficient for many different encoding/decoding scenarios and types of content, there is room for improvement in several places.
A. Reference Pictures for Motion Compensation
The encoder and decoder use motion compensation for progressive and interlaced forward-predicted frames. For a progressive P-frame, motion compensation is relative to a single reference frame, which is the previously reconstructed I-frame or P-frame that immediately precedes the current P-frame. Since the reference frame for the current P-frame is known and only one reference frame is possible, information used to select between multiple reference frames is not needed.
The macroblocks of an interlaced P-frame may be field-coded or frame-coded. In a field-coded macroblock, up to two motion vectors are associated with the macroblock, one for the top field and one for the bottom field. In a frame-coded macroblock, up to one motion vector is associated with the macroblock. For a frame-coded macroblock in an interlaced P-frame, motion compensation is relative to a single reference frame, which is the previously reconstructed I-frame or P-frame that immediately precedes the current P-frame. For a field-coded macroblock in an interlaced P-frame, motion compensation is still relative to the single reference frame, but only the lines of the top field of the reference frame are considered for a motion vector for the top field of the field-coded macroblock, and only the lines of the bottom field of the reference frame are considered for a motion vector for the bottom field of the field-coded macroblock. Again, since the reference frame is known and only one reference frame is possible, information used to select between multiple reference frames is not needed.
In certain encoding/decoding scenarios (e.g., high bit rate interlaced video with lots of motion), limiting motion compensation for forward prediction to be relative to a single reference can hurt overall compression efficiency.
B. Signaling Macroblock Information
The encoder and decoder use signaling of macroblock information for progressive or interlaced P-frames.
1. Signaling Macroblock Information for Progressive P-frames
Progressive P-frames can be 1 MV or mixed-MV frames. A 1 MV progressive P-frame includes 1 MV macroblocks. A 1 MV macroblock has one motion vector to indicate the displacement of the predicted blocks for all six blocks in the macroblock. A mixed-MV progressive P-frame includes 1 MV and/or 4 MV macroblocks. A 4 MV macroblock has from 0 to 4 motion vectors, where each motion vector is for one of the up to four luminance blocks of the macroblock. Macroblocks in progressive P-frames can be one of three possible types: 1 MV, 4 MV, and skipped. In addition, 1 MV and 4 MV macroblocks may be intra coded. The macroblock type is indicated by a combination of picture and macroblock layer elements.
Thus, 1 MV macroblocks can occur in 1 MV and mixed-MV progressive P-frames. A single motion vector data MVDATA element is associated with all blocks in a 1 MV macroblock. MVDATA signals whether the blocks are coded as intra or inter type. If they are coded as inter, then MVDATA also indicates the motion vector differential.
If the progressive P-frame is 1 MV, then all the macroblocks in it are 1 MV macroblocks, so there is no need to individually signal the macroblock type. If the progressive P-frame is mixed-MV, then the macroblocks in it can be 1 MV or 4 MV. In this case the macroblock type (1 MV or 4 MV) is signaled for each macroblock in the frame by a bitplane at the picture layer in the bitstream. The decoded bitplane represents the 1 MV/4 MV status for the macroblocks as a plane of one-bit values in raster scan order from upper left to lower right. A value of 0 indicates that a corresponding macroblock is coded in 1 MV mode. A value of 1 indicates that the corresponding macroblock is coded in 4 MV mode. In one coding mode, 1 MV/4 MV status information is signaled per macroblock at the macroblock layer of the bitstream (instead of as a plane for the progressive P-frame).
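The interpretation of a decoded 1 MV/4 MV bitplane can be sketched as follows. The sketch assumes the bitplane has already been decoded into a flat list of one-bit values in raster scan order; how the bitplane itself is coded in the bitstream is not covered here, and the function name and the frame dimensions in macroblocks are hypothetical.

```python
def macroblock_modes(bitplane, width_in_macroblocks, height_in_macroblocks):
    """Map raster-ordered bits to a 2-D grid of modes: 0 -> 1 MV, 1 -> 4 MV."""
    if len(bitplane) != width_in_macroblocks * height_in_macroblocks:
        raise ValueError("bitplane size does not match frame dimensions")
    return [
        [
            "4MV" if bitplane[row * width_in_macroblocks + column] else "1MV"
            for column in range(width_in_macroblocks)
        ]
        for row in range(height_in_macroblocks)
    ]


# Hypothetical frame that is 3 macroblocks wide and 2 macroblocks high.
print(macroblock_modes([0, 1, 1, 0, 0, 1], 3, 2))
# [['1MV', '4MV', '4MV'], ['1MV', '1MV', '4MV']]
```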
4 MV macroblocks occur in mixed-MV progressive P-frames. Individual blocks within a 4 MV macroblock can be coded as intra blocks. For each of the four luminance blocks of a 4 MV macroblock, the intra/inter state is signaled by the block motion vector data BLKMVDATA element associated with that block. For a 4 MV macroblock, the coded block pattern CBPCY element indicates which blocks have BLKMVDATA elements present in the bitstream. The inter/intra state for the chroma blocks is derived from the luminance inter/intra states. If two or more of the luminance blocks are coded as intra then the chroma blocks are also coded as intra.
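The derivation of the chroma inter/intra state from the luminance states can be sketched as follows. The function name and the list-of-flags representation are hypothetical.

```python
def chroma_blocks_are_intra(luma_intra_flags):
    """Chroma blocks are intra if two or more of the four luminance blocks are intra."""
    if len(luma_intra_flags) != 4:
        raise ValueError("a 4 MV macroblock has four luminance blocks")
    return sum(1 for flag in luma_intra_flags if flag) >= 2


print(chroma_blocks_are_intra([True, False, False, False]))  # False
print(chroma_blocks_are_intra([True, False, True, False]))   # True
```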
In addition, the skipped/not skipped status of each macroblock in the frame is also signaled by a bitplane for the progressive P-frame. A skipped macroblock may still have associated information for hybrid motion vector prediction.
CBPCY is a variable-length code [“VLC”] that decodes to a six-bit field. CBPCY appears at different positions in the bitstream for 1 MV and 4 MV macroblocks and has different semantics for 1 MV and 4 MV macroblocks.
CBPCY is present in the 1 MV macroblock layer if: (1) MVDATA indicates that the macroblock is inter-coded, and (2) MVDATA indicates that at least one block of the 1 MV macroblock contains coefficient information (indicated by the “last” value decoded from MVDATA). If CBPCY is present, then it decodes to a six-bit field indicating which of the corresponding six blocks contain at least one non-zero coefficient.
CBPCY is always present in the 4 MV macroblock layer. The CBPCY bit positions for the luminance blocks (bits 0-3) have a slightly different meaning than the bit positions for chroma blocks (bits 4 and 5). For a bit position for a luminance block, a 0 indicates that the corresponding block does not contain motion vector information or any non-zero coefficients. For such a block, BLKMVDATA is not present, the predicted motion vector is used as the motion vector, and there is no residual data. If the motion vector predictors indicate that hybrid motion vector prediction is used, then a single bit is present indicating the motion vector predictor candidate to use. A 1 in a bit position for a luminance block indicates that BLKMVDATA is present for the block. BLKMVDATA indicates whether the block is inter or intra and, if it is inter, indicates the motion vector differential. BLKMVDATA also indicates whether there is coefficient data for the block (with the “last” value decoded from BLKMVDATA). For a bit position for a chroma block, the 0 or 1 indicates whether the corresponding block contains non-zero coefficient information.
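The 4 MV semantics of the six-bit CBPCY field described above can be sketched as follows. The sketch assumes the field has already been decoded into a list indexed by bit position (0 through 5); how those positions map onto the bits of the decoded integer is not stated above and is deliberately left out. The function name and dictionary keys are hypothetical.

```python
def interpret_cbpcy_4mv(bits):
    """Interpret a decoded CBPCY field for a 4 MV macroblock.

    bits[0..3] -> luminance blocks: 1 means BLKMVDATA is present; 0 means the
                  predicted motion vector is used and there is no residual data.
    bits[4..5] -> chroma blocks: 1 means non-zero coefficients are present.
    """
    if len(bits) != 6:
        raise ValueError("CBPCY decodes to a six-bit field")
    luma = [
        {"block": position, "blkmvdata_present": bool(bits[position])}
        for position in range(4)
    ]
    chroma = [
        {"block": position, "has_nonzero_coefficients": bool(bits[position])}
        for position in (4, 5)
    ]
    return luma, chroma


luma, chroma = interpret_cbpcy_4mv([1, 0, 0, 1, 0, 1])
print([entry["blkmvdata_present"] for entry in luma])
# [True, False, False, True]
print([entry["has_nonzero_coefficients"] for entry in chroma])
# [False, True]
```

For a luminance block whose BLKMVDATA is present, the inter/intra state, the motion vector differential, and the presence of coefficient data all come from BLKMVDATA itself, so they are not derivable from CBPCY alone.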
The encoder and decoder use code table selection for VLC tables for MVDATA, BLKMVDATA, and CBPCY, respectively.
2. Signaling Macroblock Information for Interlaced P-frames
Interlaced P-frames may have a mixture of frame-coded and field-coded macroblocks. In a field-coded macroblock, up to two motion vectors are associated with the macroblock. In a frame-coded macroblock, up to one motion vector is associated with the macroblock. If the sequence layer element INTERLACE is 1, then a picture layer element INTRLCF is present in the bitstream. INTRLCF is a one-bit element that indicates the mode used to code the macroblocks in that frame. If INTRLCF=0 then all macroblocks in the frame are coded in frame mode. If INTRLCF=1 then the macroblocks may be coded in field or frame mode, and a bitplane INTRLCMB present in the picture layer indicates the field/frame coding status for each macroblock in the interlaced P-frame.
Macroblocks in interlaced P-frames can be one of three possible types: frame-coded, field-coded, and skipped. The macroblock type is indicated by a combination of picture and macroblock layer elements.
A single MVDATA is associated with all blocks in a frame-coded macroblock. The MVDATA signals whether the blocks are coded as intra or inter type. If they are coded as inter, then MVDATA also indicates the motion vector differential.
In a field-coded macroblock, a top field motion vector data TOPMVDATA element is associated with the top field blocks, and a bottom field motion vector data BOTMVDATA element is associated with the bottom field blocks. The elements are signaled at the first block of each field. More specifically, TOPMVDATA is signaled along with the left top field block and BOTMVDATA is signaled along with left bottom field block. TOPMVDATA indicates whether the top field blocks are intra or inter. If they are inter, then TOPMVDATA also indicates the motion vector differential for the top field blocks. Likewise, BOTMVDATA signals the inter/intra state for the bottom field blocks, and potential motion vector differential information for the bottom field blocks. CBPCY indicates which fields have motion vector data elements present in the bitstream.
A skipped macroblock is signaled by a SKIPMB bitplane in the picture layer. CBPCY and the motion vector data elements are used to specify whether blocks have AC coefficients. CBPCY is present for a frame-coded macroblock of an interlaced P-frame if the “last” value decoded from MVDATA indicates that there are data following the motion vector to decode. If CBPCY is present, it decodes to a six-bit field, one bit for each of the four Y blocks, one bit for both U blocks (top field and bottom field), and one bit for both V blocks (top field and bottom field).
CBPCY is always present for a field-coded macroblock. CBPCY and the two field motion vector data elements are used to determine the presence of AC coefficients in the blocks of the macroblock. The meaning of CBPCY is the same as for frame-coded macroblocks for bits 1, 3, 4 and 5. That is, they indicate the presence or absence of AC coefficients in the right top field Y block, right bottom field Y block, top/bottom U blocks, and top/bottom V blocks, respectively. For bit positions 0 and 2, the meaning is slightly different. A 0 in bit position 0 indicates that TOPMVDATA is not present and the motion vector predictor is used as the motion vector for the top field blocks. It also indicates that the left top field block does not contain any non-zero coefficients. A 1 in bit position 0 indicates that TOPMVDATA is present. TOPMVDATA indicates whether the top field blocks are inter or intra and, if they are inter, also indicates the motion vector differential. If the “last” value decoded from TOPMVDATA decodes to 1, then no AC coefficients are present for the left top field block; otherwise, there are non-zero AC coefficients for the left top field block. Similarly, the above rules apply to bit position 2 for BOTMVDATA and the left bottom field block.
The encoder and decoder use code table selection for VLC tables for MVDATA, TOPMVDATA, BOTMVDATA, and CBPCY, respectively.
3. Problems with Previous Signaling of Macroblock Information
In summary, various information for macroblocks of progressive P-frames and interlaced P-frames is signaled with separate codes (or combinations of codes) at the frame and macroblock layers. This separately signaled information includes number of motion vectors, macroblock intra/inter status, whether CBPCY is present or absent (e.g., with the “last” value for 1 MV and frame-coded macroblocks), and whether motion vector data is present or absent (e.g., with CBPCY for 4 MV and field-coded macroblocks). While this signaling provides good overall performance in many cases, it does not adequately exploit statistical dependencies between different signaled information in various common cases. Further, it does not allow and address various useful configurations such as presence/absence of CBPCY for 4 MV macroblocks, or presence/absence of motion vector data for 1 MV macroblocks.
Moreover, to the extent presence/absence of motion vector data is signaled (e.g., with CBPCY for 4 MV and field-coded macroblocks), it requires a confusing redefinition of the conventional role of the CBPCY element. This in turn requires signaling of the conventional CBPCY information with different elements (e.g., BLKMVDATA, TOPMVDATA, BOTMVDATA) not conventionally used for that purpose. And, the signaling does not allow and address various useful configurations such as presence of coefficient information when motion vector data is absent.
C. Motion Vector Prediction
For a motion vector for a macroblock (or block, or field of a macroblock, etc.) in an interlaced or progressive P-frame, the encoder encodes the motion vector by computing a motion vector predictor based on neighboring motion vectors, computing a differential between the motion vector and the motion vector predictor, and encoding the differential. The decoder reconstructs the motion vector by computing the motion vector predictor (again based on neighboring motion vectors), decoding the motion vector differential, and adding the motion vector differential to the motion vector predictor.
FIGS. 5A and 5B show the locations of macroblocks considered for candidate motion vector predictors for a 1 MV macroblock in a 1 MV progressive P-frame. The candidate predictors are taken from the left, top and top-right macroblocks, except in the case where the macroblock is the last macroblock in the row. In this case, Predictor B is taken from the top-left macroblock instead of the top-right. For the special case where the frame is one macroblock wide, the predictor is always Predictor A (the top predictor). When Predictor A is out of bounds because the macroblock is in the top row, the predictor is Predictor C. Various other rules address other special cases such as intra-coded predictors.
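The candidate locations described above for a 1 MV macroblock can be sketched as follows. The sketch only computes which neighboring macroblock positions supply Predictors A, B, and C; it does not implement the special cases (one-macroblock-wide frames, top row, intra-coded predictors), so a returned position may lie outside the frame. Positions are (column, row) in macroblock units, and the function name is hypothetical.

```python
def candidate_positions(mb_x, mb_y, width_in_macroblocks):
    """Return the macroblock positions supplying Predictors A, B, and C."""
    last_in_row = mb_x == width_in_macroblocks - 1
    predictor_a = (mb_x, mb_y - 1)          # top
    if last_in_row:
        predictor_b = (mb_x - 1, mb_y - 1)  # top-left instead of top-right
    else:
        predictor_b = (mb_x + 1, mb_y - 1)  # top-right
    predictor_c = (mb_x - 1, mb_y)          # left
    return {"A": predictor_a, "B": predictor_b, "C": predictor_c}


print(candidate_positions(2, 3, 5))  # {'A': (2, 2), 'B': (3, 2), 'C': (1, 3)}
print(candidate_positions(4, 3, 5))  # {'A': (4, 2), 'B': (3, 2), 'C': (3, 3)}
```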
FIGS. 6A-10 show the locations of the blocks or macroblocks considered for the up-to-three candidate motion vectors for a motion vector for a 1 MV or 4 MV macroblock in a mixed-MV progressive P-frame. In the figures, the larger squares are macroblock boundaries and the smaller squares are block boundaries. For the special case where the frame is one macroblock wide, the predictor is always Predictor A (the top predictor). Various other rules address other special cases such as top row blocks for top row 4 MV macroblocks, top row 1 MV macroblocks, and intra-coded predictors.
Specifically, FIGS. 6A and 6B show locations of blocks considered for candidate motion vector predictors for a 1 MV current macroblock in a mixed-MV progressive P-frame. The neighboring macroblocks may be 1 MV or 4 MV macroblocks. FIGS. 6A and 6B show the locations for the candidate motion vectors assuming the neighbors are 4 MV (i.e., predictor A is the motion vector for block 2 in the macroblock above the current macroblock, and predictor C is the motion vector for block 1 in the macroblock immediately to the left of the current macroblock). If any of the neighbors is a 1 MV macroblock, then the motion vector predictor shown in FIGS. 5A and 5B is taken to be the motion vector predictor for the entire macroblock. As FIG. 6B shows, if the macroblock is the last macroblock in the row, then Predictor B is from block 3 of the top-left macroblock instead of from block 2 in the top-right macroblock as is the case otherwise.
FIGS. 7A-10 show the locations of blocks considered for candidate motion vector predictors for each of the 4 luminance blocks in a 4 MV macroblock of a mixed-MV progressive P-frame. FIGS. 7A and 7B show the locations of blocks considered for candidate motion vector predictors for a block at position 0; FIGS. 8A and 8B show the locations of blocks considered for candidate motion vector predictors for a block at position 1; FIG. 9 shows the locations of blocks considered for candidate motion vector predictors for a block at position 2; and FIG. 10 shows the locations of blocks considered for candidate motion vector predictors for a block at position 3. Again, if a neighbor is a 1 MV macroblock, the motion vector predictor for the macroblock is used for the blocks of the macroblock.
For the case where the macroblock is the first macroblock in the row, Predictor B for block 0 is handled differently than block 0 for the remaining macroblocks in the row (see FIGS. 7A and 7B). In this case, Predictor B is taken from block 3 in the macroblock immediately above the current macroblock instead of from block 3 in the macroblock above and to the left of the current macroblock, as is the case otherwise. Similarly, for the case where the macroblock is the last macroblock in the row, Predictor B for block 1 is handled differently (FIGS. 8A and 8B). In this case, the predictor is taken from block 2 in the macroblock immediately above the current macroblock instead of from block 2 in the macroblock above and to the right of the current macroblock, as is the case otherwise. In general, if the macroblock is in the first macroblock column, then Predictor C for blocks 0 and 2 is set equal to 0.
If a macroblock of a progressive P-frame is coded as skipped, the motion vector predictor for it is used as the motion vector for the macroblock (or the predictors for its blocks are used for the blocks, etc.). A single bit may still be present to indicate which predictor to use in hybrid motion vector prediction.
FIGS. 11 and 12A-B show examples of candidate predictors for motion vector prediction for frame-coded macroblocks and field-coded macroblocks, respectively, in interlaced P-frames. FIG. 11 shows candidate predictors A, B and C for a current frame-coded macroblock in an interior position in an interlaced P-frame (not the first or last macroblock in a macroblock row, not in the top row). Predictors can be obtained from different candidate directions other than those labeled A, B, and C (e.g., in special cases such as when the current macroblock is the first macroblock or last macroblock in a row, or in the top row, since certain predictors are unavailable for such cases). For a current frame-coded macroblock, predictor candidates are calculated differently depending on whether the neighboring macroblocks are field-coded or frame-coded. For a neighboring frame-coded macroblock, the motion vector for it is simply taken as the predictor candidate. For a neighboring field-coded macroblock, the candidate motion vector is determined by averaging the top and bottom field motion vectors.
FIGS. 12A-B show candidate predictors A, B and C for a current field in a field-coded macroblock in an interior position in the field. In FIG. 12A, the current field is a bottom field, and the bottom field motion vectors in the neighboring macroblocks are used as candidate predictors. In FIG. 12B, the current field is a top field, and the top field motion vectors in the neighboring macroblocks are used as candidate predictors. For each field in a current field-coded macroblock, the number of motion vector predictor candidates for each field is at most three, with each candidate coming from the same field type (e.g., top or bottom) as the current field. If a neighboring macroblock is frame-coded, the motion vector for it is used as its top field predictor and bottom field predictor. Again, various special cases (not shown) apply when the current macroblock is the first macroblock or last macroblock in a row, or in the top row, since certain predictors are unavailable for such cases. If the frame is one macroblock wide, the motion vector predictor is Predictor A. If a neighboring macroblock is intra, the motion vector predictor for it is 0.
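The way a predictor candidate is obtained from a neighboring macroblock in an interlaced P-frame can be sketched as follows. The dictionary representation of a neighbor and the function names are hypothetical. The description says only that the field motion vectors are averaged, so the rounding used here (floor division) is an assumption. An intra neighbor of a frame-coded macroblock is left to the other rules mentioned above and is rejected by this sketch.

```python
def candidate_for_frame_macroblock(neighbor):
    """Candidate from a neighbor when the current macroblock is frame-coded."""
    if neighbor["kind"] == "frame":
        return neighbor["mv"]
    if neighbor["kind"] == "field":
        top, bottom = neighbor["top"], neighbor["bottom"]
        # Average the top and bottom field motion vectors.
        # Floor division is an assumed rounding rule.
        return ((top[0] + bottom[0]) // 2, (top[1] + bottom[1]) // 2)
    raise ValueError("neighbor kind not covered by this sketch")


def candidate_for_field(neighbor, field):
    """Candidate from a neighbor for the 'top' or 'bottom' field of a field-coded macroblock."""
    if neighbor["kind"] == "intra":
        return (0, 0)          # predictor for an intra neighbor is 0
    if neighbor["kind"] == "frame":
        return neighbor["mv"]  # used as both top and bottom field predictor
    return neighbor[field]     # same field type as the current field


frame_neighbor = {"kind": "frame", "mv": (4, -2)}
field_neighbor = {"kind": "field", "top": (2, 6), "bottom": (4, 2)}
print(candidate_for_frame_macroblock(field_neighbor))  # (3, 4)
print(candidate_for_field(field_neighbor, "top"))      # (2, 6)
print(candidate_for_field(frame_neighbor, "bottom"))   # (4, -2)
```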
FIGS. 13A and 13B show pseudocode for calculating motion vector predictors given a set of Predictors A, B, and C. To select a predictor from a set of predictor candidates, the encoder and decoder use a selection algorithm such as the median-of-three algorithm shown in FIG. 13C.
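A median-of-three selection can be sketched as follows. This is a generic illustration and is not a transcription of the pseudocode in the figures; applying the median separately to the horizontal and vertical components is an assumption, and the function names are hypothetical.

```python
def median3(a, b, c):
    """Return the middle value of three numbers."""
    return sorted((a, b, c))[1]


def median_predictor(predictor_a, predictor_b, predictor_c):
    """Component-wise median of three candidate motion vectors."""
    return (
        median3(predictor_a[0], predictor_b[0], predictor_c[0]),
        median3(predictor_a[1], predictor_b[1], predictor_c[1]),
    )


print(median_predictor((1, 5), (3, -2), (2, 9)))  # (2, 5)
```

Note that a component-wise median can produce a vector that matches none of the three candidates exactly, as in the example above.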
D. Hybrid Motion Vector Prediction for Progressive P-frames
Hybrid motion vector prediction is allowed for motion vectors of progressive P-frames. For a motion vector of a macroblock or block, whether the progressive P-frame is 1 MV or mixed-MV, the motion vector predictor calculated in the previous section is tested relative to the A and C predictors to determine if a predictor selection is explicitly coded in the bitstream. If so, then a bit is decoded that indicates whether to use predictor A or predictor C as the motion vector predictor for the motion vector (instead of using the motion vector predictor computed in section C, above). Hybrid motion vector prediction is not used in motion vector prediction for interlaced P-frames or any representation of interlaced video.
The pseudocode in FIGS. 14A and 14B illustrates hybrid motion vector prediction for motion vectors of progressive P-frames. In the pseudocode, the variables predictor_pre_x and predictor_pre_y are the horizontal and vertical motion vector predictors, respectively, as calculated in the previous section. The variables predictor_post_x and predictor_post_y are the horizontal and vertical motion vector predictors, respectively, after checking for hybrid motion vector prediction.
E. Decoding Motion Vector Differentials
For macroblocks or blocks of progressive P-frames, the MVDATA or BLKMVDATA elements signal motion vector differential information. A 1 MV macroblock has a single MVDATA. A 4 MV macroblock has between zero and four BLKMVDATA elements (whose presence is indicated by CBPCY).
An MVDATA or BLKMVDATA element jointly encodes three things: (1) the horizontal motion vector differential component; (2) the vertical motion vector differential component; and (3) a binary “last” flag that generally indicates whether transform coefficients are present. Whether the macroblock (or block, for 4 MV) is intra or inter-coded is signaled as one of the motion vector differential possibilities. The pseudocode in FIGS. 15A and 15B illustrates how the motion vector differential information, inter/intra type, and last flag information are decoded for MVDATA or BLKMVDATA. In the pseudocode, the variable last_flag is a binary flag whose use is described in the section on signaling macroblock information. The variable intra_flag is a binary flag indicating whether the block or macroblock is intra. The variables dmv_x and dmv_y are differential horizontal and vertical motion vector components, respectively. The variables k_x and k_y are fixed lengths for extended range motion vectors, whose values vary as shown in the table in FIG. 15C. The variable halfpel_flag is a binary value indicating whether half-pixel or quarter-pixel precision is used for the motion vector, and whose value is set based on picture layer syntax elements. Finally, the tables size_table and offset_table are arrays defined as follows:
size_table[6]={0, 2, 3, 4, 5, 8}, and
offset_table[6]={0, 1, 3, 7, 15, 31}.
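One plausible reading of the two tables is that size_table gives the number of bits read for a component and offset_table gives the base magnitude to which those bits are added. The Python sketch below follows that reading. Treating the lowest bit as the sign is an assumption, the function name is hypothetical, and the handling of halfpel_flag, extended range, and the intra case is omitted.

```python
SIZE_TABLE = [0, 2, 3, 4, 5, 8]
OFFSET_TABLE = [0, 1, 3, 7, 15, 31]


def decode_component(table_index, raw_bits):
    """Map a table index and its fixed-length bits to a signed differential.

    raw_bits is the unsigned value of SIZE_TABLE[table_index] bits read
    from the bitstream. Assumption: the lowest bit carries the sign and
    the remaining bits are added to OFFSET_TABLE[table_index].
    """
    if raw_bits >> SIZE_TABLE[table_index]:
        raise ValueError("raw_bits does not fit in the table's bit count")
    magnitude = (raw_bits >> 1) + OFFSET_TABLE[table_index]
    return -magnitude if raw_bits & 1 else magnitude
```

Under this reading the table entries tile the magnitudes without gaps: indices 1 through 4 cover magnitudes 1 to 2, 3 to 6, 7 to 14, and 15 to 30, and index 5 covers 31 and above.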
For frame-coded or field-coded macroblocks of interlaced P-frames, the MVDATA, TOPMVDATA, and BOTMVDATA elements are decoded the same way.
F. Reconstructing and Deriving Motion Vectors
Luminance motion vectors are reconstructed from encoded motion vector differential information and motion vector predictors, and chrominance motion vectors are derived from the reconstructed luminance motion vectors.
For 1 MV and 4 MV macroblocks of progressive P-frames, a luminance motion vector is reconstructed by adding the differential to the motion vector predictor as follows:
 mv_x=(dmv_x+predictor_x) smod range_x,
 mv_y=(dmv_y+predictor_y) smod range_y,
where smod is a signed modulus operation defined as follows:
 A smod b=((A+b) % (2*b))−b,
which ensures that the reconstructed vectors are valid.
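The reconstruction can be sketched in Python as follows. The function names and the range values used in the usage note are hypothetical. The sketch relies on Python's % operator returning a non-negative result for a positive modulus; languages such as C behave differently for negative operands and would need an adjustment.

```python
def smod(a, b):
    """Signed modulus: wrap a into the interval [-b, b - 1]."""
    return ((a + b) % (2 * b)) - b


def reconstruct_mv(dmv, predictor, mv_range):
    """Add the differential to the predictor and wrap each component.

    dmv, predictor, and mv_range are (x, y) tuples.
    """
    mv_x = smod(dmv[0] + predictor[0], mv_range[0])
    mv_y = smod(dmv[1] + predictor[1], mv_range[1])
    return (mv_x, mv_y)
```

For example, with an assumed range of (64, 32), a predictor of (60, -30) and a differential of (10, -5) give a raw sum of (70, -35), which wraps to (-58, 29). A zero differential leaves an in-range predictor unchanged.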
In a 1 MV macroblock, there is a single motion vector for the four blocks that make up the luminance component of the macroblock. If the macroblock is intra, then no motion vectors are associated with the macroblock. If the macroblock is skipped then dmv_x=0 and dmv_y=0, so mv_x=predictor_x and mv_y=predictor_y.
Each inter luminance block in a 4 MV macroblock has its own motion vector. Therefore, there will be between 0 and 4 luminance motion vectors in a 4 MV macroblock. A non-coded block in a 4 MV macroblock can occur if the 4 MV macroblock is skipped or if CBPCY for the 4 MV macroblock indicates that the block is non-coded. If a block is not coded then dmv_x=0 and dmv_y=0, so mv_x=predictor_x and mv_y=predictor_y.
For progressive P-frames, the chroma motion vectors are derived from the luminance motion vectors. Also, for 4 MV macroblocks, the decision of whether to code chroma blocks as inter or intra is made based on the status of the luminance blocks. The chroma vectors are reconstructed in two steps.
In the first step, a nominal chroma motion vector is obtained by combining and scaling luminance motion vectors appropriately. The scaling is performed in such a way that half-pixel offsets are preferred over quarter-pixel offsets. FIG. 16A shows pseudocode for scaling when deriving a chroma motion vector from a luminance motion vector for a 1 MV macroblock. FIG. 16B shows pseudocode for combining up to four luminance motion vectors and scaling when deriving a chroma motion vector for a 4 MV macroblock. FIG. 13C shows pseudocode for the median3( ) function, and FIG. 16C shows pseudocode for the median4( ) function.
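The combining part of the first step for a 4 MV macroblock might look like the Python sketch below. It is an illustration under stated assumptions rather than the pseudocode of the figures: the rule for choosing between median4( ), median3( ), and an average by the number of inter-coded luminance blocks, the use of floor division, and the return of None when fewer than two blocks are inter are all assumptions. The subsequent scaling is omitted.

```python
def median3(a, b, c):
    """Middle value of three integers."""
    return sorted((a, b, c))[1]


def median4(a, b, c, d):
    """Mean of the two middle values of four integers (floor division)."""
    s = sorted((a, b, c, d))
    return (s[1] + s[2]) // 2


def combine_luma_mvs(inter_mvs):
    """Combine the (x, y) vectors of the inter-coded luminance blocks.

    Assumed rule: four vectors use median4, three use median3, two are
    averaged, and fewer than two yield None (no chroma vector derived).
    """
    n = len(inter_mvs)
    xs = [mv[0] for mv in inter_mvs]
    ys = [mv[1] for mv in inter_mvs]
    if n == 4:
        return (median4(*xs), median4(*ys))
    if n == 3:
        return (median3(*xs), median3(*ys))
    if n == 2:
        return ((xs[0] + xs[1]) // 2, (ys[0] + ys[1]) // 2)
    return None
```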
In the second step, a sequence level one-bit element is used to determine if further rounding of chroma motion vectors is necessary. If so, the chroma motion vectors that are at quarter-pixel offsets are rounded to the nearest full-pixel positions.
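The rounding of the second step can be sketched as follows, assuming that chroma motion vector components are stored as integers in quarter-pixel units (an assumption of this example, as is the function name). Components at one-quarter or three-quarter offsets move to the nearest full-pixel position, while half-pixel and full-pixel components are left unchanged.

```python
def round_quarter_to_full(cmv):
    """Round a quarter-pixel offset to the nearest full-pixel position.

    cmv is one motion vector component in quarter-pixel units (assumed).
    The bitwise AND also works for negative values in Python, because
    integers behave as two's complement for this operation.
    """
    frac = cmv & 3
    if frac == 1:      # x.25 -> x.0
        return cmv - 1
    if frac == 3:      # x.75 -> (x + 1).0
        return cmv + 1
    return cmv         # full-pixel or half-pixel: unchanged
```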
For frame-coded and field-coded macroblocks of interlaced P-frames, a luminance motion vector is reconstructed as done for progressive P-frames. In a frame-coded macroblock, there is a single motion vector for the four blocks that make up the luminance component of the macroblock. If the macroblock is intra, then no motion vectors are associated with the macroblock. If the macroblock is skipped then dmv_x=0 and dmv_y=0, so mv_x=predictor_x and mv_y=predictor_y. In a field-coded macroblock, each field may have its own motion vector. Therefore, there will be between 0 and 2 luminance motion vectors in a field-coded macroblock. A non-coded field in a field-coded macroblock can occur if the field-coded macroblock is skipped or if CBPCY for the field-coded macroblock indicates that the field is non-coded. If a field is not coded then dmv_x=0 and dmv_y=0, so mv_x=predictor_x and mv_y=predictor_y.
For interlaced P-frames, chroma motion vectors are derived from the luminance motion vectors. For a frame-coded macroblock, there is one chrominance motion vector corresponding to the single luminance motion vector. For a field-coded macroblock, there are two chrominance motion vectors. One is for the top field and one is for the bottom field, corresponding to the top and bottom field luminance motion vectors. The rules for deriving a chroma motion vector are the same for both field-coded and frame-coded macroblocks. They depend on the luminance motion vector, not the type of macroblock. FIG. 17 shows pseudocode for deriving a chroma motion vector from a luminance motion vector for a frame-coded or field-coded macroblock of an interlaced P-frame. Basically, the x component of the chrominance motion vector is scaled down by a factor of four while the y component of the chrominance motion vector remains the same (because of 4:1:1 macroblock chroma sub-sampling). The scaled x component of the chrominance motion vector is also rounded to a neighboring quarter-pixel location. If cmv_x or cmv_y is out of bounds, it is pulled back to a valid range.
G. Intensity Compensation
For a progressive P-frame, the picture layer contains syntax elements that control the motion compensation mode and intensity compensation for the frame. If intensity compensation is signaled, then the LUMSCALE and LUMSHIFT elements follow in the picture layer. LUMSCALE and LUMSHIFT are six-bit values that specify parameters used in the intensity compensation process.
When intensity compensation is used for the progressive P-frame, the pixels in the reference frame are remapped prior to using them in motion-compensated prediction for the P-frame. The pseudocode in FIG. 18 illustrates how the LUMSCALE and LUMSHIFT elements are used to build the lookup table used to remap the reference frame pixels. The Y component of the reference frame is remapped using the LUTY[] table, and the U and V components are remapped using the LUTUV[] table, as follows:
 p′Y=LUTY[pY], and
 p′UV=LUTUV[pUV],
where pY is the original luminance pixel value in the reference frame, p′Y is the remapped luminance pixel value in the reference frame, pUV is the original U or V pixel value in the reference frame, and p′UV is the remapped U or V pixel value in the reference frame.
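A table-driven remapping of this kind can be sketched in Python as follows. The formulas that turn LUMSCALE and LUMSHIFT into table entries are assumptions made for illustration (a linear map with six fractional bits, a signed reading of LUMSHIFT, and chroma scaled about the mid-level 128); they are not taken from the pseudocode of FIG. 18, and the function names are hypothetical.

```python
def _clamp(v):
    """Clamp to the eight-bit sample range."""
    return max(0, min(255, v))


def build_luts(lumscale, lumshift):
    """Build LUTY and LUTUV from six-bit LUMSCALE and LUMSHIFT values.

    Assumptions: scale has six fractional bits (64 means 1.0), LUMSHIFT
    values above 31 are read as negative, and chroma is scaled about 128.
    """
    scale = lumscale + 32
    shift = lumshift - 64 if lumshift > 31 else lumshift
    luty = [_clamp((scale * i + shift * 64 + 32) >> 6) for i in range(256)]
    lutuv = [_clamp((scale * (i - 128) + 128 * 64 + 32) >> 6)
             for i in range(256)]
    return luty, lutuv


def remap(samples, lut):
    """Remap reference-frame samples through a lookup table."""
    return [lut[p] for p in samples]
```

Because the tables have only 256 entries, they are built once per frame and each reference pixel is then remapped with a single lookup. Under the assumed formulas, LUMSCALE=32 and LUMSHIFT=0 yield identity tables.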
For an interlaced P-frame, a one-bit picture-layer INTCOMP value signals whether intensity compensation is used for the frame. If intensity compensation is used, then the LUMSCALE and LUMSHIFT elements follow in the picture layer, where LUMSCALE and LUMSHIFT are six-bit values which specify parameters used in the intensity compensation process for the whole interlaced P-frame. The intensity compensation itself is the same as for progressive P-frames.
VI. Standards for Video Compression and Decompression
Aside from previous WMV encoders and decoders, several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another name for MPEG 2), H.263, and H.264 standards from the International Telecommunication Union [“ITU”]. An encoder and decoder complying with one of these standards typically use motion estimation and compensation to reduce the temporal redundancy between pictures.
A. Reference Pictures for Motion Compensation
For several standards, motion compensation for a forward-predicted frame is relative to a single reference frame, which is the previously reconstructed I- or P-frame that immediately precedes the current forward-predicted frame. Since the reference frame for the current forward-predicted frame is known and only one reference frame is possible, information used to select between multiple reference frames is not needed. See, e.g., the H.261 and MPEG 1 standards. In certain encoding/decoding scenarios (e.g., high bit rate interlaced video with lots of motion), limiting motion compensation for forward prediction to be relative to a single reference can hurt overall compression efficiency.
The H.262 standard allows an interlaced video frame to be encoded as a single frame or as two fields, where the frame encoding or field encoding can be adaptively selected on a frame-by-frame basis. For field-based prediction of a current field, the motion compensation uses a previously reconstructed top field or bottom field. [H.262 standard, sections 7.6.1 and 7.6.2.1.] The H.262 standard describes selecting between the two reference fields to use for motion compensation with a motion vector for a current field. [H.262 standard, sections 6.2.5.2, 6.3.17.2, and 7.6.4.] For a given motion vector for a 16×16 macroblock (or top 16×8 half of the macroblock, or bottom 16×8 half of the macroblock), a single bit is signaled to indicate whether to apply the motion vector to the top reference field or the bottom reference field. [Id.] For additional detail, see the H.262 standard.
While such reference field selection provides some flexibility and prediction improvement in motion compensation in some cases, it has several disadvantages relating to bit rate. The reference field selection signals for the motion vectors can consume a lot of bits. For example, for a single 720×288 field with 810 macroblocks, each macroblock having 0, 1, or 2 motion vectors, the reference field selection bits for the motion vectors consume up to 1620 bits. No attempt is made to reduce the bit rate of reference field selection information by predicting which reference fields will be selected for the respective motion vectors. The signaling of reference field selection information is inefficient in terms of pure coding efficiency. Moreover, for some scenarios, however the information is encoded, the reference field selection information may consume so many bits that the benefits of prediction improvements from having multiple available references in motion compensation are outweighed. No option is given to disable reference field selection to address such scenarios.
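The figures in the example above can be checked with a short calculation. The Python sketch below assumes 16×16 macroblocks and one selection bit per motion vector, as described in the text; the function name is hypothetical.

```python
MB_SIZE = 16  # macroblock width and height in pixels


def max_selection_bits(width, height, max_mvs_per_mb=2):
    """Return (macroblock count, worst-case reference field selection bits)."""
    macroblocks = (width // MB_SIZE) * (height // MB_SIZE)
    # One selection bit is signaled per motion vector.
    return macroblocks, macroblocks * max_mvs_per_mb
```

For a 720×288 field this gives 45×18=810 macroblocks and, at two motion vectors per macroblock, 1620 selection bits.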
The H.262 standard also describes dual-prime prediction, which is a prediction mode in which two forward field-based predictions are averaged for a 16×16 block in an interlaced P-picture. [H.262 standard, section 7.6.3.6.]
The MPEG-4 standard allows macroblocks of an interlaced video frame to be frame-coded or field-coded. [MPEG-4 standard, section 6.1.3.8.] For field-based prediction of top or bottom field lines of a field-coded macroblock, the motion compensation uses a previously reconstructed top field or bottom field. [MPEG-4 standard, sections 6.3.7.3 and 7.6.2.] The MPEG-4 standard describes selecting between the two reference fields to use for motion compensation. [MPEG-4 standard, section 6.3.7.3.] For a given motion vector for top field lines or bottom field lines of a macroblock, a single bit is signaled to indicate whether to apply the motion vector to the top reference field or the bottom reference field. [Id.] For additional detail, see the MPEG-4 standard. Such signaling of reference field selection information has problems similar to those described above for H.262.
The H.263 standard describes motion compensation for progressive P-frames, including an optional reference picture selection mode. [H.263 standard, section 3.4.12, Annex N.] Normally, the most recent temporally previous anchor picture is used for motion compensation. When reference picture selection mode is used, however, temporal prediction is allowed from pictures other than the most recent reference picture. [Id.] This can improve the performance of real-time video communication over error-prone channels by allowing the encoder to optimize its video encoding for the conditions of the channel (e.g., to stop error propagation due to loss of information needed for reference in inter-frame coding). [Id.] When used, for a given group of blocks or slice within a picture, a 10-bit value indicates the reference used for prediction of the group of blocks or slice. [Id.] The reference picture selection mechanism described in H.263 is for progressive video and is adapted to address the problem of error propagation in error-prone channels, not to improve compression efficiency per se.
In draft JVT-D157 of the H.264 standard, the inter prediction process for motion-compensated prediction of a block can involve selection of the reference picture from a number of stored, previously decoded pictures. [JVT-D157, section 0.4.3.] At the picture level, one or more parameters specify the number of reference pictures that are used to decode the picture. [JVT-D157, sections 7.3.2.2 and 7.4.2.2.] At the slice level, the number of reference pictures available may be changed, and additional parameters may be received to reorder and manage which reference pictures are in a list. [JVT-D157, sections 7.3.3 and 7.4.3.] For a given motion vector (for a macroblock or sub-macroblock part), a reference index when present indicates the reference picture to be used for prediction. [JVT-D157, sections 7.3.5.1 and 7.4.5.1.] The reference index indicates the first, second, third, etc. frame or field in the list. [Id.] If there is only one active reference picture in the list, the reference index is not present. [Id.] If there are only two active reference pictures in the list, a single encoded bit is used to represent the reference index. [Id.] For additional detail, see draft JVT-D157 of the H.264 standard.
The reference picture selection of JVT-D157 provides flexibility and thereby can improve prediction for motion compensation. However, the processes of managing reference picture lists and signaling reference picture selections are complex and consume an inefficient number of bits in some scenarios.
B. Signaling Macroblock Modes
The various standards use different mechanisms to signal macroblock information. In the H.261 standard, for example, a macroblock header for a macroblock includes a macroblock type MTYPE element, which is signaled as a VLC. [H.261 standard, section 4.2.3.] An MTYPE element indicates a prediction mode (intra, inter, inter+MC, inter+MC+loop filtering), whether a quantizer MQUANT element is present for the macroblock, whether a motion vector data MVD element is present for the macroblock, whether a coded block pattern CBP element is present for the macroblock, and whether transform coefficient TCOEFF elements are present for blocks of the macroblock. [Id.] An MVD element is present for every motion-compensated macroblock. [Id.]
In the MPEG-1 standard, a macroblock has a macroblock_type element, which is signaled as a VLC. [MPEG-1 standard, section 2.4.3.6, Tables B.2a through B.2d, D.6.4.2.] For a macroblock in a forward-predicted picture, the macroblock_type element indicates whether a quantizer scale element is present for the macroblock, whether forward motion vector data is present for the macroblock, whether a coded block pattern element is present for the macroblock, and whether the macroblock is intra. [Id.] Forward motion vector data is always present if the macroblock uses forward motion compensation. [Id.]
In the H.262 standard, a macroblock has a macroblock_type element, which is signaled as a VLC. [H.262 standard, sections 6.2.5.1 and 6.3.17.1, and Tables B.2 through B.8.] For a macroblock in a forward-predicted picture, the macroblock_type element indicates whether a quantizer_scale_code element is present for the macroblock, whether forward motion vector data is present for the macroblock, whether a coded block pattern element is present for the macroblock, whether the macroblock is intra, and scalability options for the macroblock. [Id.] Forward motion vector data is always present if the macroblock uses forward motion compensation. [Id.] A separate code (frame_motion_type or field_motion_type) may further indicate the macroblock prediction type, including the count of motion vectors and motion vector format for the macroblock. [Id.]
In the H.263 standard, a macroblock has a macroblock type and coded block pattern for chrominance (MCBPC) element, which is signaled as a VLC. [H.263 standard, section 5.3.2, Tables 8 and 9, and F.2.] The macroblock type gives information about the macroblock (e.g., inter, inter4V, intra). [Id.] For a coded macroblock in an inter-coded picture, MCBPC and coded block pattern for luminance are always present, and the macroblock type indicates whether a quantizer information element is present for the macroblock. A forward motion-compensated macroblock always has motion vector data for the macroblock (or blocks for inter4V type) present. [Id.] The MPEG-4 standard similarly specifies an MCBPC element that is signaled as a VLC. [MPEG-4 standard, sections 6.2.7, 6.3.7, 11.1.1.]
In JVT-D157, the mb_type element is part of the macroblock layer. [JVT-D157, sections 7.3.5 and 7.4.5.] The mb_type indicates the macroblock type and various associated information. [Id.] For example, for a P-slice, the mb_type element indicates the type of prediction (intra or forward), various intra mode coding parameters if the macroblock is intra coded, the macroblock partitions (e.g., 16×16, 16×8, 8×16, or 8×8) and hence the number of motion vectors if the macroblock is forward predicted, and whether reference picture selection information is present (if the partitions are 8×8). [Id.] The type of prediction and mb_type also collectively indicate whether a coded block pattern element is present for the macroblock. [Id.] For each 16×16, 16×8, or 8×16 partition in a forward motion-compensated macroblock, motion vector data is signaled. [Id.] For a forward-predicted macroblock with 8×8 partitions, a sub_mb_type element per 8×8 partition indicates the type of prediction (intra or forward) for it. [Id.] If the 8×8 partition is forward predicted, sub_mb_type indicates the sub-partitions (e.g., 8×8, 8×4, 4×8, or 4×4), and hence the number of motion vectors, for the 8×8 partition. [Id.] For each sub-partition in a forward motion-compensated 8×8 partition, motion vector data is signaled. [Id.]
The various standards use a large variety of signaling mechanisms for macroblock information. Whatever advantages these signaling mechanisms may have, they also have the following disadvantages. First, they at times do not efficiently signal macroblock type, presence/absence of coded block pattern information, and presence/absence of motion vector differential information for motion-compensated macroblocks. In fact, the standards typically do not signal presence/absence of motion vector differential information for motion-compensated macroblocks (or blocks or fields thereof) at all, instead assuming that the motion vector differential information is signaled if motion compensation is used. Finally, the standards are inflexible in their decisions of which code tables to use for macroblock mode information.
C. Motion Vector Prediction
Each of H.261, H.262, H.263, MPEG-1, MPEG-4, and JVT-D157 specifies some form of motion vector prediction, although the details of the motion vector prediction vary widely between the standards. Motion vector prediction is simplest in the H.261 standard, for example, in which the motion vector predictor for the motion vector of a current macroblock is the motion vector of the previously coded/decoded macroblock. [H.261 standard, section 4.2.3.4.] The motion vector predictor is 0 for various special cases (e.g., the current macroblock is the first in a row). Motion vector prediction is similar in the MPEG-1 standard. [MPEG-1 standard, sections 2.4.4.2 and D.6.2.3.]
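The simplest form of prediction described above, in which the previous macroblock's motion vector serves as the predictor, can be sketched in Python as follows. The sketch assumes every macroblock in the row is motion-compensated and models only the first-in-row special case; the function name and the (x, y) tuple layout are hypothetical.

```python
def differentials_for_row(row_mvs):
    """Return the motion vector differentials for one row of macroblocks.

    The predictor is the previous macroblock's motion vector, or (0, 0)
    for the first macroblock in the row. Other special cases are omitted.
    """
    predictor = (0, 0)
    differentials = []
    for mv in row_mvs:
        differentials.append((mv[0] - predictor[0], mv[1] - predictor[1]))
        predictor = mv  # the current vector predicts the next macroblock
    return differentials
```

With uniform motion, every differential after the first is (0, 0), which is cheap to encode; a row such as (3, 1), (3, 1), (4, 0) produces (3, 1), (0, 0), (1, -1).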
Other standards (such as H.262) specify much more complex motion vector prediction, but still typically determine a motion vector predictor from a single neighbor. [H.262 standard, section 7.6.3.] Determining a motion vector predictor from a single neighbor suffices when motion is uniform, but is inefficient in many other cases.
So, still other standards (such as H.263, MPEG-4, JVT-D157) determine a motion vector predictor from multiple different neighbors with different candidate motion vector predictors. [H.263 standard, sections 6.1.1 and F.2; MPEG-4 standard, sections 7.5.5 and 7.6.2; JVT-D157, section 8.4.1.] These are efficient for more kinds of motion, but still do not adequately address scenarios in which there is a high degree of variance between the different candidate motion vector predictors, indicating discontinuity in motion patterns.
For additional detail, see the respective standards.
D. Decoding Motion Vector Differentials
Each of H.261, H.262, H.263, MPEG-1, MPEG-4, and JVT-D157 specifies some form of differential motion vector coding and decoding, although the details of the coding and decoding vary widely between the standards. Motion vector coding and decoding is simplest in the H.261 standard, for example, in which one VLC represents the horizontal differential component, and another VLC represents the vertical differential component. [H.261 standard, section 4.2.3.4.] Other standards specify more complex coding and decoding for motion vector differential information. For additional detail, see the respective standards.
E. Reconstructing and Deriving Motion Vectors
In general, a motion vector in H.261, H.262, H.263, MPEG-1, MPEG-4, or JVT-D157 is reconstructed by combining a motion vector predictor and a motion vector differential. Again, the details of the reconstruction vary from standard to standard.
Chrominance motion vectors (which are not signaled) are typically derived from luminance motion vectors (which are signaled). For example, in the H.261 standard, luminance motion vectors are halved and truncated towards zero to derive chrominance motion vectors. [H.261 standard, section 3.2.2.] Similarly, luminance motion vectors are halved to derive chrominance motion vectors in the MPEG-1 standard and JVT-D157. [MPEG-1 standard, section 2.4.4.2; JVT-D157, section 8.4.1.4.]
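Halving with truncation toward zero differs from floor division for negative components, which the following Python sketch makes explicit. The function names are hypothetical, and motion vector components are assumed to be integers.

```python
def halve_toward_zero(v):
    """Halve an integer and truncate the result toward zero."""
    # Python's // rounds toward negative infinity, so negative values
    # are negated before and after the division.
    return -((-v) // 2) if v < 0 else v // 2


def derive_chroma_mv(luma_mv):
    """Derive an (x, y) chrominance vector from a luminance vector."""
    return (halve_toward_zero(luma_mv[0]), halve_toward_zero(luma_mv[1]))
```

For a luminance vector of (5, -3), truncation toward zero gives (2, -1), whereas floor division would give (2, -2).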
In the H.262 standard, luminance motion vectors are scaled down to chroma motion vectors by factors that depend on the chrominance sub-sampling mode (e.g., 4:2:0, 4:2:2, or 4:4:4). [H.262 standard, section 7.6.3.7.]
In the H.263 standard, for a macroblock with a single luminance motion vector for all four luminance blocks, a chrominance motion vector is derived by dividing the luminance motion vector by two and rounding to a half-pixel position. [H.263 standard, section 6.1.1.] For a macroblock with four luminance motion vectors (one per block), a chrominance motion vector is derived by summing the four luminance motion vectors, dividing by eight, and rounding to a half-pixel position. [H.263 standard, section F.2.] Chrominance motion vectors are similarly derived in the MPEG-4 standard. [MPEG-4 standard, sections 7.5.5 and 7.6.2.]
F. Weighted Prediction
Draft JVT-D157 of the H.264 standard describes weighted prediction. A weighted prediction flag for a picture indicates whether or not weighted prediction is used for predicted slices in the picture. [JVT-D157, sections 7.3.2.2 and 7.4.2.2.] If weighted prediction is used for a picture, each predicted slice in the picture has a table of prediction weights. [JVT-D157, sections 7.3.3, 7.3.3.2, 7.4.3.3, and 10.4.1.] For the table, a denominator for luma weight parameters and a denominator for chroma weight parameters are signaled. [Id.] Then, for each reference picture available for the slice, a luma weight flag indicates whether luma weight and luma offset numerator parameters are signaled for the picture (followed by the parameters, when signaled), and a chroma weight flag indicates whether chroma weight and chroma offset numerator parameters are signaled for the picture (followed by the parameters, when signaled). [Id.] Numerator weight parameters that are not signaled are given default values relating to the signaled denominator values. [Id.] While JVT-D157 provides some flexibility in signaling weighted prediction parameters, the signaling mechanism is inefficient in various scenarios.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.