Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 pictures per second. Each picture can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
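As a worked illustration of the bit rate arithmetic above, the following minimal sketch multiplies picture size, bits per pixel, and picture rate. The 176×144 picture size and the function name raw_bit_rate are assumptions chosen for illustration; they are not taken from any particular encoder.

```python
def raw_bit_rate(width, height, bits_per_pixel, pictures_per_second):
    """Return the raw (uncompressed) bit rate in bits per second."""
    pixels_per_picture = width * height
    return pixels_per_picture * bits_per_pixel * pictures_per_second


# Assumed example values: 176x144 pixels, 24 bits per pixel, 15 pictures/second.
example_rate = raw_bit_rate(176, 144, 24, 15)
```

For these assumed values the result is 9,123,840 bits/second, which is consistent with the figure of 5 million bits/second or more given above.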
Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which case the quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Alternatively, compression can be lossy, in which case the quality of the video suffers but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. Intra compression techniques compress individual pictures, typically called I-frames or key frames. Inter compression techniques compress frames with reference to preceding and/or following frames, and inter-compressed frames are typically called predicted frames, P-frames, or B-frames.
I. Inter Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
Inter compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 1 and 2 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 1 illustrates motion estimation for a predicted frame 110 and FIG. 2 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 1, the WMV8 encoder computes a motion vector for a macroblock 115 in the predicted frame 110. To compute the motion vector, the encoder searches in a search area 135 of a reference frame 130. Within the search area 135, the encoder compares the macroblock 115 from the predicted frame 110 to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock.
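The search described above can be sketched as an exhaustive block-matching search. This is a minimal sketch under stated assumptions: the sum of absolute differences (SAD) matching criterion, the exhaustive search order, and the names sad and full_search are illustrative choices, and the actual search strategy of the WMV8 encoder is not described here. Frames are assumed to be lists of rows of luminance samples.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Return ((dx, dy), cost) for the best-matching candidate block.

    Every displacement within +/- search_range is tested; candidates that
    fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate block lies outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

A practical encoder would typically restrict or reorder the candidates to reduce computation; the exhaustive form is shown only because it is the simplest correct instance of the idea.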
Since a motion vector value is often correlated with the values of spatially surrounding motion vectors, compression of the data used to transmit the motion vector information can be achieved by selecting a motion vector predictor based upon motion vectors of neighboring macroblocks and predicting the motion vector for the current macroblock using the motion vector predictor. The encoder can encode the differential between the motion vector and the predictor. After reconstructing the motion vector by adding the differential to the predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 115 using information from the reference frame 130, which is a previously reconstructed frame available at the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 115 itself.
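The differential coding of motion vectors and the computation of a residual block can be sketched as follows. The function names and the representation of a motion vector as an (x, y) tuple are assumptions made for illustration; entropy coding of the differential is omitted.

```python
def mv_differential(mv, predictor):
    """Encoder side: differential between a motion vector and its predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


def reconstruct_mv(differential, predictor):
    """Decoder side: add the differential back to the same predictor."""
    return (differential[0] + predictor[0], differential[1] + predictor[1])


def residual_block(current, prediction):
    """Pixel differences between the current block and its prediction."""
    return [
        [c - p for c, p in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current, prediction)
    ]
```

Because the encoder and decoder derive the same predictor from already-reconstructed neighbors, only the (usually small) differential needs to be transmitted.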
FIG. 2 illustrates an example of computation and encoding of an error block 235 in the WMV8 encoder. The error block 235 is the difference between the predicted block 215 and the original current block 225. The encoder applies a discrete cosine transform [“DCT”] 240 to the error block 235, resulting in an 8×8 block 245 of coefficients. The encoder then quantizes 250 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 255. The encoder scans 260 the 8×8 block 255 into a one-dimensional array 265 such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding 270. The encoder selects an entropy code from one or more run/level/last tables 275 and outputs the entropy code.
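The quantization, scanning, and run/level/last steps above can be sketched as follows. This is an illustrative sketch only: the uniform quantizer that truncates toward zero, the classic zigzag scan order, and all function names are assumptions, and the actual quantizer and scan patterns of the WMV8 encoder may differ. The DCT itself and the table lookup that maps each symbol to an entropy code are omitted.

```python
def quantize(coeffs, step):
    """Uniform quantization of a block of transform coefficients
    (truncation toward zero is an assumption of this sketch)."""
    return [[int(c / step) for c in row] for row in coeffs]


def zigzag_order(n=8):
    """(row, col) positions ordered roughly from lowest to highest frequency."""
    return sorted(
        ((y, x) for y in range(n) for x in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )


def scan(block, order):
    """Scan a two-dimensional block into a one-dimensional array."""
    return [block[y][x] for (y, x) in order]


def run_level_last(scanned):
    """Convert a scanned array into (run, level, last) symbols.

    run   = number of zeros preceding the nonzero coefficient
    level = value of the nonzero coefficient
    last  = 1 for the final nonzero coefficient in the block, else 0
    """
    nonzero = [i for i, v in enumerate(scanned) if v != 0]
    symbols = []
    previous = -1
    for k, i in enumerate(nonzero):
        is_last = 1 if k == len(nonzero) - 1 else 0
        symbols.append((i - previous - 1, scanned[i], is_last))
        previous = i
    return symbols
```

Ordering coefficients from low to high frequency tends to group the zero-valued high-frequency coefficients at the end of the array, so the trailing zeros are implied by the last flag rather than coded individually.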
FIG. 3 shows an example of a corresponding decoding process 300 for an inter-coded block. In summary of FIG. 3, a decoder decodes (310, 320) entropy-coded information representing a prediction residual using variable length decoding 310 with one or more run/level/last tables 315 and run length decoding 320. The decoder inverse scans 330 a one-dimensional array 325 storing the entropy-decoded information into a two-dimensional block 335. The decoder inverse quantizes and inverse discrete cosine transforms (together, 340) the data, resulting in a reconstructed error block 345. In a separate motion compensation path, the decoder computes a predicted block 365 using motion vector information 355 for displacement from a reference frame. The decoder combines 370 the predicted block 365 with the reconstructed error block 345 to form the reconstructed block 375.
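The decoding path above can be sketched end to end for one block. This sketch assumes that variable length decoding has already produced (run, level, last) symbols, and it uses a uniform inverse quantizer, a zigzag scan order, an orthonormal floating-point inverse DCT, and clamping to the 8-bit range; all of these, along with the function names, are illustrative assumptions rather than the decoder's actual arithmetic.

```python
import math


def zigzag_order(n=8):
    """(row, col) positions in a zigzag scan order (assumed for this sketch)."""
    return sorted(
        ((y, x) for y in range(n) for x in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )


def idct_2d(coeffs):
    """Orthonormal two-dimensional inverse DCT of an n x n block."""
    n = len(coeffs)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = 0.0
            for v in range(n):
                for u in range(n):
                    total += (
                        scale(u) * scale(v) * coeffs[v][u]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[y][x] = total
    return out


def decode_block(symbols, order, step, predicted):
    """Reconstruct a block from run/level/last symbols and a predicted block."""
    n = len(predicted)
    # Run length decoding into a one-dimensional array.
    scanned = []
    for run, level, _last in symbols:
        scanned.extend([0] * run)
        scanned.append(level)
    scanned.extend([0] * (n * n - len(scanned)))
    # Inverse scan and inverse quantization into a two-dimensional block.
    block = [[0] * n for _ in range(n)]
    for value, (y, x) in zip(scanned, order):
        block[y][x] = value * step
    # Inverse transform gives the reconstructed error block.
    error = idct_2d(block)
    # Combine the prediction with the error and clamp to the 8-bit range.
    return [
        [max(0, min(255, predicted[y][x] + round(error[y][x]))) for x in range(n)]
        for y in range(n)
    ]
```

The motion compensation path is represented here only by the predicted argument, which is assumed to have been fetched from the reference frame using the reconstructed motion vector.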
The amount of change between the original and reconstructed frames is the distortion, and the number of bits required to code the frame indicates the rate for the frame. The amount of distortion is roughly inversely proportional to the rate.
II. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 4, an interlaced video frame 400 includes top field 410 and bottom field 420. Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). This timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present because the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
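The rearrangement between the frame structure and the field structure can be sketched as follows. The frame is assumed to be a list of lines with line 0 at the top, and the function names are illustrative.

```python
def split_fields(frame):
    """Group even-numbered lines into the top field and odd-numbered
    lines into the bottom field (field structure)."""
    top_field = frame[0::2]
    bottom_field = frame[1::2]
    return top_field, bottom_field


def merge_fields(top_field, bottom_field):
    """Restore the original alternating line arrangement (frame structure)."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame
```

Within each field, adjacent lines were scanned starting at the same time, which is why grouping them reduces the jagged edges that appear between alternating lines in moving regions.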
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. P-Frame Coding and Decoding in a Previous WMV Encoder and Decoder
The encoder and decoder use progressive and interlace coding and decoding in P-frames. In interlaced and progressive P-frames, a motion vector is encoded in the encoder by computing a differential between the motion vector and a motion vector predictor, which is computed based on neighboring motion vectors. And, in the decoder, the motion vector is reconstructed by adding the motion vector differential to the motion vector predictor, which is again computed (this time in the decoder) based on neighboring motion vectors. Thus, a motion vector predictor for the current macroblock or field of the current macroblock is selected based on the candidates, and a motion vector differential is calculated based on the motion vector predictor. The motion vector can be reconstructed by adding the motion vector differential to the selected motion vector predictor at either the encoder or the decoder side. Typically, luminance motion vectors are reconstructed from the encoded motion information, and chrominance motion vectors are derived from the reconstructed luminance motion vectors.
A. Progressive P-Frame Coding and Decoding
For example, in the encoder and decoder, progressive P-frames can contain macroblocks encoded in one motion vector (1MV) mode or in four motion vector (4MV) mode, or skipped macroblocks, with a decision generally made on a macroblock-by-macroblock basis. P-frames with only 1MV macroblocks (and, potentially, skipped macroblocks) are referred to as 1MV P-frames, and P-frames with both 1MV and 4MV macroblocks (and, potentially, skipped macroblocks) are referred to as Mixed-MV P-frames. One luma motion vector is associated with each 1MV macroblock, and up to four luma motion vectors are associated with each 4MV macroblock (one for each block).
FIGS. 5A and 5B are diagrams showing the locations of macroblocks considered for candidate motion vector predictors for a macroblock in a 1MV progressive P-frame. The candidate predictors are taken from the left, top and top-right macroblocks, except in the case where the macroblock is the last macroblock in the row. In this case, Predictor B is taken from the top-left macroblock instead of the top-right. For the special case where the frame is one macroblock wide, the predictor is always Predictor A (the top predictor). When Predictor A is out of bounds because the macroblock is in the top row, the predictor is Predictor C. Various other rules address other special cases such as intra-coded predictors.
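The position rules above for a 1MV progressive P-frame can be sketched as follows. Macroblock positions are given as (column, row) pairs, and the function names are hypothetical. Checking the top-row case before the one-macroblock-wide case is an assumption of this sketch, and the other special cases mentioned above (such as intra-coded predictors) are not modeled.

```python
def candidate_positions(mb_x, mb_y, mbs_wide):
    """Positions of the macroblocks supplying Predictors A, B and C."""
    last_in_row = mb_x == mbs_wide - 1
    predictor_a = (mb_x, mb_y - 1)  # top
    if last_in_row:
        predictor_b = (mb_x - 1, mb_y - 1)  # top-left instead of top-right
    else:
        predictor_b = (mb_x + 1, mb_y - 1)  # top-right
    predictor_c = (mb_x - 1, mb_y)  # left
    return {"A": predictor_a, "B": predictor_b, "C": predictor_c}


def forced_predictor(mb_x, mb_y, mbs_wide):
    """Return 'A' or 'C' when a special case fixes the predictor, else None."""
    if mb_y == 0:
        return "C"  # Predictor A is out of bounds in the top row
    if mbs_wide == 1:
        return "A"  # frame is one macroblock wide
    return None
```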
FIGS. 6A-10 show the locations of the blocks or macroblocks considered for the up-to-three candidate motion vectors for a motion vector for a 1MV or 4MV macroblock in a Mixed-MV frame. In the following figures, the larger squares are macroblock boundaries and the smaller squares are block boundaries. For the special case where the frame is one macroblock wide, the predictor is always Predictor A (the top predictor). Various other rules address other special cases such as top row blocks for top row 4MV macroblocks, top row 1MV macroblocks, and intra-coded predictors.
FIGS. 6A and 6B are diagrams showing locations of blocks considered for candidate motion vector predictors for a 1MV current macroblock in a Mixed-MV frame. The neighboring macroblocks may be 1MV or 4MV macroblocks. FIGS. 6A and 6B show the locations for the candidate motion vectors assuming the neighbors are 4MV (i.e., predictor A is the motion vector for block 2 in the macroblock above the current macroblock, and predictor C is the motion vector for block 1 in the macroblock immediately to the left of the current macroblock). If any of the neighbors is a 1MV macroblock, then the motion vector predictor shown in FIGS. 5A and 5B is taken to be the motion vector predictor for the entire macroblock. As FIG. 6B shows, if the macroblock is the last macroblock in the row, then Predictor B is from block 3 of the top-left macroblock instead of from block 2 in the top-right macroblock as is the case otherwise.
FIGS. 7A-10 show the locations of blocks considered for candidate motion vector predictors for each of the 4 luminance blocks in a 4MV macroblock. FIGS. 7A and 7B are diagrams showing the locations of blocks considered for candidate motion vector predictors for a block at position 0; FIGS. 8A and 8B are diagrams showing the locations of blocks considered for candidate motion vector predictors for a block at position 1; FIG. 9 is a diagram showing the locations of blocks considered for candidate motion vector predictors for a block at position 2; and FIG. 10 is a diagram showing the locations of blocks considered for candidate motion vector predictors for a block at position 3. Again, if a neighbor is a 1MV macroblock, the motion vector predictor for the macroblock is used for the blocks of the macroblock.
For the case where the macroblock is the first macroblock in the row, Predictor B for block 0 is handled differently than for block 0 of the remaining macroblocks in the row (see FIGS. 7A and 7B). In this case, Predictor B is taken from block 3 in the macroblock immediately above the current macroblock instead of from block 3 in the macroblock above and to the left of the current macroblock, as is the case otherwise. Similarly, for the case where the macroblock is the last macroblock in the row, Predictor B for block 1 is handled differently (FIGS. 8A and 8B). In this case, the predictor is taken from block 2 in the macroblock immediately above the current macroblock instead of from block 2 in the macroblock above and to the right of the current macroblock, as is the case otherwise. In general, if the macroblock is in the first macroblock column, then Predictor C for blocks 0 and 2 is set equal to 0.
B. Interlaced P-Frame Coding and Decoding
The encoder and decoder use a 4:1:1 macroblock format for interlaced P-frames, which can contain macroblocks encoded in field mode or in frame mode, or skipped macroblocks, with a decision generally made on a macroblock-by-macroblock basis. Two motion vectors are associated with each field-coded macroblock (one motion vector per field), and one motion vector is associated with each frame-coded macroblock. An encoder jointly encodes motion information, including horizontal and vertical motion vector differential components, potentially along with other signaling information.
FIGS. 11, 12 and 13 show examples of candidate predictors for motion vector prediction for frame-coded 4:1:1 macroblocks and field-coded 4:1:1 macroblocks, respectively, in interlaced P-frames in the encoder and decoder. FIG. 11 shows candidate predictors A, B and C for a current frame-coded 4:1:1 macroblock in an interior position in an interlaced P-frame (not the first or last macroblock in a macroblock row, not in the top row). Predictors can be obtained from different candidate directions other than those labeled A, B, and C (e.g., in special cases such as when the current macroblock is the first macroblock or last macroblock in a row, or in the top row, since certain predictors are unavailable for such cases). For a current frame-coded macroblock, predictor candidates are calculated differently depending on whether the neighboring macroblocks are field-coded or frame-coded. For a neighboring frame-coded macroblock, the motion vector is simply taken as the predictor candidate. For a neighboring field-coded macroblock, the candidate motion vector is determined by averaging the top and bottom field motion vectors.
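The derivation of a predictor candidate for a current frame-coded macroblock can be sketched as follows. The dictionary layout of a neighboring macroblock, the function name, and the use of floor division for the average are assumptions made for illustration; the exact rounding used by the encoder and decoder is not stated here.

```python
def frame_candidate(neighbor):
    """Predictor candidate contributed by one neighboring macroblock.

    `neighbor` is a dict with key "mode" ("frame" or "field") and either
    "mv" (frame-coded) or "top_mv" and "bottom_mv" (field-coded), where
    each motion vector is an (x, y) tuple.
    """
    if neighbor["mode"] == "frame":
        # Frame-coded neighbor: the motion vector is taken as is.
        return neighbor["mv"]
    # Field-coded neighbor: average the top and bottom field motion vectors
    # (floor division is an assumption of this sketch).
    top, bottom = neighbor["top_mv"], neighbor["bottom_mv"]
    return ((top[0] + bottom[0]) // 2, (top[1] + bottom[1]) // 2)
```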
FIGS. 12 and 13 show candidate predictors A, B and C for a current field in a field-coded 4:1:1 macroblock in an interior position in the field. In FIG. 12, the current field is a bottom field, and the bottom field motion vectors in the neighboring macroblocks are used as candidate predictors. In FIG. 13, the current field is a top field, and the top field motion vectors in the neighboring macroblocks are used as candidate predictors. Thus, for each field in a current field-coded macroblock, there are at most three motion vector predictor candidates, with each candidate coming from the same field type (e.g., top or bottom) as the current field. Again, various special cases (not shown) apply when the current macroblock is the first macroblock or last macroblock in a row, or in the top row, since certain predictors are unavailable for such cases.
To select a predictor from a set of predictor candidates, the encoder and decoder use different selection algorithms, such as a median-of-three algorithm. A procedure for median-of-three prediction is described in pseudo-code 1400 in FIG. 14.
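A component-wise median of three candidates can be sketched as follows. This is a generic illustration of median-of-three selection and is not a reproduction of pseudo-code 1400; the function names and the (x, y) tuple representation are assumptions.

```python
def median3(a, b, c):
    """Median of three values without sorting."""
    return max(min(a, b), min(max(a, b), c))


def median_predictor(pred_a, pred_b, pred_c):
    """Apply the median separately to the horizontal and vertical components."""
    return (
        median3(pred_a[0], pred_b[0], pred_c[0]),
        median3(pred_a[1], pred_b[1], pred_c[1]),
    )
```

Because the median is taken per component, the selected predictor need not equal any single candidate; its horizontal and vertical components can come from different neighbors.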
IV. B-Frame Coding and Decoding in a Previous WMV Encoder and Decoder
The encoder and decoder use progressive and interlaced B-frames. B-frames use two frames from the source video as reference (or anchor) frames rather than the one anchor used in P-frames. Among anchor frames for a typical B-frame, one anchor frame is from the temporal past and one anchor frame is from the temporal future. Referring to FIG. 15, a B-frame 1510 in a video sequence has a temporally previous reference frame 1520 and a temporally future reference frame 1530. Encoded bit streams with B-frames typically use fewer bits than encoded bit streams with no B-frames, while providing similar visual quality. A decoder also can accommodate space and time restrictions by opting not to decode or display B-frames, since B-frames are not generally used as reference frames.
While macroblocks in forward-predicted frames (e.g., P-frames) have only one directional mode of prediction (forward, from previous I- or P-frames), macroblocks in B-frames can be predicted using five different prediction modes: forward, backward, direct, interpolated and intra. The encoder selects and signals different prediction modes in the bit stream. Forward mode is similar to conventional P-frame prediction. In forward mode, a macroblock is derived from a temporally previous anchor. In backward mode, a macroblock is derived from a temporally subsequent anchor. Macroblocks predicted in direct or interpolated modes use both forward and backward anchors for prediction.
V. Signaling Macroblock Information in a Previous WMV Encoder and Decoder
In the encoder and decoder, macroblocks in interlaced P-frames can be one of three possible types: frame-coded, field-coded and skipped. The macroblock type is indicated by a multi-element combination of frame-level and macroblock-level syntax elements.
For interlaced P-frames, the frame-level element INTRLCF indicates the mode used to code the macroblocks in that frame. If INTRLCF=0, all macroblocks in the frame are frame-coded. If INTRLCF=1, the macroblocks may be field-coded or frame-coded. The INTRLCMB element is present in the frame layer when INTRLCF=1. INTRLCMB is a bitplane-coded array that indicates the field/frame coding status for each macroblock in the picture. The decoded bitplane represents the interlaced status for each macroblock as an array of 1-bit values. A value of 0 for a particular bit indicates that a corresponding macroblock is coded in frame mode. A value of 1 indicates that the corresponding macroblock is coded in field mode.
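The interpretation of INTRLCF and the decoded INTRLCMB bitplane can be sketched as follows. The sketch assumes the bitplane has already been decoded into a flat list of 1-bit values, one per macroblock; the function name and the error handling are illustrative assumptions.

```python
def macroblock_coding_modes(intrlcf, num_macroblocks, intrlcmb_bits=None):
    """Return "frame" or "field" for each macroblock of an interlaced P-frame."""
    if intrlcf == 0:
        # INTRLCMB is not present; every macroblock is frame-coded.
        return ["frame"] * num_macroblocks
    if intrlcmb_bits is None or len(intrlcmb_bits) != num_macroblocks:
        raise ValueError("INTRLCMB must supply one bit per macroblock")
    # 0 means frame mode, 1 means field mode.
    return ["field" if bit == 1 else "frame" for bit in intrlcmb_bits]
```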
For frame-coded macroblocks, the macroblock-level MVDATA element is associated with all blocks in the macroblock. MVDATA signals whether the blocks in the macroblocks are intra-coded or inter-coded. If they are inter-coded, MVDATA also indicates the motion vector differential.
For field-coded macroblocks, a TOPMVDATA element is associated with the top field blocks in the macroblock and a BOTMVDATA element is associated with the bottom field blocks in the macroblock. TOPMVDATA and BOTMVDATA are sent at the first block of each field. TOPMVDATA indicates whether the top field blocks are intra-coded or inter-coded. Likewise, BOTMVDATA indicates whether the bottom field blocks are intra-coded or inter-coded. For inter-coded blocks, TOPMVDATA and BOTMVDATA also indicate motion vector differential information.
The CBPCY element indicates coded block pattern (CBP) information for luminance and chrominance components in a macroblock. The CBPCY element also indicates which fields have motion vector data elements present in the bitstream. CBPCY and the motion vector data elements are used to specify whether blocks have AC coefficients. CBPCY is present for a frame-coded macroblock of an interlaced P-frame if the “last” value decoded from MVDATA indicates that there are data following the motion vector to decode. If CBPCY is present, it decodes to a 6-bit field, one bit for each of the four Y blocks, one bit for both U blocks (top field and bottom field), and one bit for both V blocks (top field and bottom field).
CBPCY is always present for a field-coded macroblock. CBPCY and the two field motion vector data elements are used to determine the presence of AC coefficients in the blocks of the macroblock. The meaning of CBPCY is the same as for frame-coded macroblocks for bits 1, 3, 4 and 5. That is, they indicate the presence or absence of AC coefficients in the right top field Y block, right bottom field Y block, top/bottom U blocks, and top/bottom V blocks, respectively. For bit positions 0 and 2, the meaning is slightly different. A 0 in bit position 0 indicates that TOPMVDATA is not present and the motion vector predictor is used as the motion vector for the top field blocks. It also indicates that the left top field block does not contain any nonzero coefficients. A 1 in bit position 0 indicates that TOPMVDATA is present. TOPMVDATA indicates whether the top field blocks are inter or intra and, if they are inter, also indicates the motion vector differential. If the “last” value decoded from TOPMVDATA decodes to 1, then no AC coefficients are present for the left top field block; otherwise, there are nonzero AC coefficients for the left top field block. Similarly, the above rules apply to bit position 2 for BOTMVDATA and the left bottom field block.
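The rules above for a field-coded macroblock can be sketched as follows. To avoid assuming a bit ordering within the coded element, CBPCY is represented as a list of six 0/1 values indexed by bit position. The function name, the dictionary keys, and the top_last and bot_last parameters (the “last” values decoded from TOPMVDATA and BOTMVDATA) are hypothetical, and handling specific to intra-coded field blocks is not modeled.

```python
def interpret_field_cbpcy(bits, top_last=None, bot_last=None):
    """Interpret CBPCY for a field-coded macroblock.

    bits     : six 0/1 values indexed by bit position 0..5
    top_last : "last" value from TOPMVDATA (needed only if bits[0] == 1)
    bot_last : "last" value from BOTMVDATA (needed only if bits[2] == 1)
    """
    if len(bits) != 6:
        raise ValueError("CBPCY decodes to a 6-bit field")

    def left_block(bit, last, name):
        if bit == 0:
            # Element absent: predictor is used as the motion vector and
            # the left block of the field has no nonzero coefficients.
            return False, False
        if last is None:
            raise ValueError(name + " is present, so its 'last' value is needed")
        # Element present: last == 1 means no AC coefficients.
        return True, last != 1

    top_present, left_top_ac = left_block(bits[0], top_last, "TOPMVDATA")
    bot_present, left_bottom_ac = left_block(bits[2], bot_last, "BOTMVDATA")
    return {
        "topmvdata_present": top_present,
        "botmvdata_present": bot_present,
        "left_top_y_ac": left_top_ac,
        "right_top_y_ac": bits[1] == 1,
        "left_bottom_y_ac": left_bottom_ac,
        "right_bottom_y_ac": bits[3] == 1,
        "u_ac": bits[4] == 1,
        "v_ac": bits[5] == 1,
    }
```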
VI. Skipped Macroblocks in a Previous WMV Encoder and Decoder
The encoder and decoder use skipped macroblocks to reduce bitrate. For example, the encoder signals skipped macroblocks in the bitstream. When the decoder receives information (e.g., a skipped macroblock flag) in the bitstream indicating that a macroblock is skipped, the decoder skips decoding residual block information for the macroblock. Instead, the decoder uses corresponding pixel data from a co-located or motion compensated (with a motion vector predictor) macroblock in a reference frame to reconstruct the macroblock. The encoder and decoder select between multiple coding/decoding modes for encoding and decoding the skipped macroblock information. For example, skipped macroblock information is signaled at frame level of the bitstream (e.g., in a compressed bitplane) or at macroblock level (e.g., with one “skip” bit per macroblock). For bitplane coding, the encoder and decoder select between different bitplane coding modes.
One previous encoder and decoder define a skipped macroblock as a predicted macroblock whose motion is equal to its causally predicted motion and which has zero residual error. Another previous encoder and decoder define a skipped macroblock as a predicted macroblock with zero motion and zero residual error.
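The two definitions of a skipped macroblock given above can be contrasted in a short sketch. The function names and the representation of the residual as a list of rows are assumptions made for illustration.

```python
def has_zero_residual(residual):
    """True when every residual value in the macroblock is zero."""
    return all(value == 0 for row in residual for value in row)


def is_skipped_predicted_motion(mv, predictor, residual):
    """First definition: motion equals the causally predicted motion
    and the residual error is zero."""
    return mv == predictor and has_zero_residual(residual)


def is_skipped_zero_motion(mv, residual):
    """Second definition: zero motion and zero residual error."""
    return mv == (0, 0) and has_zero_residual(residual)
```

A macroblock that moves exactly as its neighbors predict satisfies the first definition but, unless that predicted motion happens to be zero, not the second.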
For more information on skipped macroblocks and bitplane coding, see U.S. patent application Ser. No. 10/321,415, entitled “Skip Macroblock Coding,” filed Dec. 16, 2002.
VII. Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another title for MPEG-2), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union [“ITU”]. These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. These standards use (or support the use of) different combinations of intraframe and interframe decompression and compression.
A. Signaling Field- or Frame-Coded Macroblocks in the Standards
Some international standards describe signaling of field/frame coding type (e.g., field-coding or frame-coding) for macroblocks in interlaced pictures.
Draft JVT-d157 of the JVT/AVC standard describes the mb_field_decoding_flag syntax element, which is used to signal whether a macroblock pair is decoded in frame mode or field mode in interlaced P-frames. Section 7.3.4 describes a bitstream syntax where mb_field_decoding_flag is sent as an element of slice data in cases where a sequence parameter (mb_frame_field_adaptive_flag) indicates switching between frame and field decoding in macroblocks and a slice header element (pic_structure) identifies the picture structure as a progressive picture or an interlaced frame picture.
The May 28, 1998 committee draft of MPEG-4 describes the dct_type syntax element, which is used to signal whether a macroblock is frame DCT coded or field DCT coded. According to Sections 6.2.7.3 and 6.3.7.3, dct_type is a macroblock-layer element that is only present in the MPEG-4 bitstream in interlaced content where the macroblock has a non-zero coded block pattern or is intra-coded.
In MPEG-2, the dct_type element indicates whether a macroblock is frame DCT coded or field DCT coded. MPEG-2 also describes a picture coding extension element frame_pred_frame_dct. When frame_pred_frame_dct is set to ‘1’, only frame DCT coding is used in interlaced frames. The condition dct_type=0 is “derived” when frame_pred_frame_dct=1 and the dct_type element is not present in the bitstream.
B. Skipped Macroblocks in the Standards
Some international standards use skipped macroblocks. For example, draft JVT-d157 of the JVT/AVC standard defines a skipped macroblock as “a macroblock for which no data is coded other than an indication that the macroblock is to be decoded as ‘skipped.’” Similarly, the committee draft of MPEG-4 states, “A skipped macroblock is one for which no information is transmitted.”
C. Limitations of the Standards
These international standards are limited in several important ways. For example, although the standards provide for signaling of macroblock types, field/frame coding type information is signaled separately from motion compensation types (e.g., field prediction or frame prediction, one motion vector or multiple motion vectors, etc.). As another example, although some international standards allow for bitrate savings by skipping certain macroblocks, the skipped macroblock condition in these standards only indicates that no further information for the macroblock is encoded, and fails to provide other potentially valuable information about the macroblock.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.