Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 pictures per second. Each picture can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
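As a rough illustration of the arithmetic above, the following Python sketch computes the raw bit rate from picture size, bits per pixel, and picture rate. The 160×120 picture size is an assumed example value chosen for illustration, and the function name is hypothetical.

```python
def raw_bit_rate(width, height, bits_per_pixel, pictures_per_second):
    """Return the bit rate, in bits per second, of uncompressed video."""
    pixels_per_picture = width * height
    return pixels_per_picture * bits_per_pixel * pictures_per_second

# Assumed example: 160x120 pixels, 24 bits per pixel, 15 pictures per second.
rate = raw_bit_rate(160, 120, 24, 15)
print(rate)  # 6912000 bits/second, which exceeds 5 million bits/second
```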
Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Or, compression can be lossy, in which quality of the video suffers but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. For video frames, intra compression techniques compress individual frames, typically called I-frames or key frames. Inter compression techniques compress frames with reference to preceding and/or following frames, and inter-compressed frames are typically called predicted frames, P-frames, or B-frames.
I. Inter and Intra Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
A. Intra Compression
FIG. 1A illustrates block-based intra compression 100 of a block 105 of pixels in a key frame in the WMV8 encoder. A block is a set of pixels, for example, an 8×8 arrangement of pixels. The WMV8 encoder splits a key video frame into 8×8 blocks of pixels and applies an 8×8 Discrete Cosine Transform [“DCT”] 110 to individual blocks such as the block 105. A DCT is a type of frequency transform that converts the 8×8 block of pixels (spatial information) into an 8×8 block of DCT coefficients 115, which are frequency information. The DCT operation itself is lossless or nearly lossless. Compared to the original pixel values, however, the DCT coefficients are more efficient for the encoder to compress since most of the significant information is concentrated in low frequency coefficients (conventionally, the upper left of the block 115) and many of the high frequency coefficients (conventionally, the lower right of the block 115) have values of zero or close to zero.
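The energy compaction described above can be seen in a small experiment. The following Python sketch applies a textbook orthonormal 8×8 DCT-II to a smooth block. It is an illustration only: the exact transform implementation used in the WMV8 encoder is not described here, and the function name and sample block are assumptions.

```python
import math

N = 8

def dct_2d(block):
    """Forward 8x8 DCT-II (orthonormal) of a block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

# Assumed sample block: a smooth horizontal ramp, identical in every row.
block = [[10 * x for x in range(N)] for _ in range(N)]
coeffs = dct_2d(block)
# coeffs[0][0] (the DC coefficient, upper left) is 8 times the mean pixel value;
# every coefficient outside the top row is zero up to rounding error.
```

For this block, all of the significant information ends up in the top row of the coefficient block, which is the kind of concentration that makes the coefficients easier to compress than the pixel values.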
The encoder then quantizes 120 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 125. For example, the encoder applies a uniform, scalar quantization step size to each coefficient. Quantization is lossy. Since low frequency DCT coefficients tend to have higher values, quantization results in loss of precision but not complete loss of the information for the coefficients. On the other hand, since high frequency DCT coefficients tend to have values of zero or close to zero, quantization of the high frequency coefficients typically results in contiguous regions of zero values. In addition, in some cases high frequency DCT coefficients are quantized more coarsely than low frequency DCT coefficients, resulting in greater loss of precision/information for the high frequency DCT coefficients.
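A minimal sketch of uniform scalar quantization follows. The step size of 8, the rounding rule, and the function names are assumptions made for illustration, not details of the WMV8 encoder.

```python
def quantize(coeffs, step):
    """Uniform scalar quantization: divide by the step size and round."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

def dequantize(levels, step):
    """Inverse quantization: multiply each level by the step size."""
    return [[level * step for level in row] for row in levels]

# Assumed sample values: two large low-frequency coefficients followed by
# two small high-frequency coefficients.
row = [[280.0, -182.2, 3.1, -1.9]]
levels = quantize(row, 8)        # [[35, -23, 0, 0]]
restored = dequantize(levels, 8) # [[280, -184, 0, 0]]
```

The large coefficients lose some precision (-182.2 becomes -184 after reconstruction), while the small coefficients become zero, which is the behavior described above.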
The encoder then prepares the 8×8 block of quantized DCT coefficients 125 for entropy encoding, which is a form of lossless compression. The exact type of entropy encoding can vary depending on whether a coefficient is a DC coefficient (lowest frequency), an AC coefficient (other frequencies) in the top row or left column, or another AC coefficient.
The encoder encodes the DC coefficient 126 as a differential from the DC coefficient 136 of a neighboring 8×8 block, which is a previously encoded neighbor (e.g., top or left) of the block being encoded. (FIG. 1A shows a neighbor block 135 that is situated to the left of the block being encoded in the frame.) The encoder entropy encodes 140 the differential.
The entropy encoder can encode the left column or top row of AC coefficients as a differential from a corresponding left column or top row of the neighboring 8×8 block. This is an example of AC coefficient prediction. FIG. 1A shows the left column 127 of AC coefficients encoded as a differential 147 from the left column 137 of the neighboring block 135, which is situated to the left of the block being encoded. The differential coding increases the chance that the differential coefficients have zero values. The remaining AC coefficients are from the block 125 of quantized DCT coefficients.
FIG. 1B shows AC prediction candidates for an 8×8 block in an I-frame. For top prediction, the top row 177 of AC coefficients in the top neighboring block 175 is used as the predictor for the top row 129 of AC coefficients in the block 125 of quantized DCT coefficients. For left prediction, the leftmost column 137 of AC coefficients in the left neighboring block 135 is used as the predictor for the leftmost column of AC coefficients in the block 125.
In some modes, the AC coefficient predictors are scaled or otherwise processed before computation of or combination with differential values.
If a neighboring block does not exist in the specified prediction direction, the predicted values for all seven AC coefficients in the leftmost column or top row are set to zero. For example, if the prediction direction is up and the current block is in the top row, each of the predicted AC coefficients in the top row of the current block is set to zero because there is no adjacent block in the up direction. The AC coefficients in the predicted row or column are added to the corresponding decoded AC coefficients (which are differential values) in the current block to produce the fully reconstructed quantized transform coefficient block.
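The reconstruction step described above can be sketched as follows. This is a simplified illustration under stated assumptions: blocks are 8×8 lists of rows, DC prediction is handled elsewhere, no scaling of the predictors is applied, and the function and parameter names are hypothetical.

```python
def reconstruct_with_ac_prediction(diff_block, neighbor, direction):
    """Add predictor AC coefficients to decoded differential values.

    diff_block: 8x8 list of rows of decoded quantized coefficients, where the
                top row or left column holds differential values.
    neighbor:   reconstructed 8x8 quantized block in the prediction
                direction, or None if no such block exists.
    direction:  "top" or "left".
    """
    block = [row[:] for row in diff_block]
    for i in range(1, 8):  # the seven AC coefficients; index 0 is the DC position
        if direction == "top":
            predictor = neighbor[0][i] if neighbor is not None else 0
            block[0][i] += predictor
        else:
            predictor = neighbor[i][0] if neighbor is not None else 0
            block[i][0] += predictor
    return block

# Assumed sample data.
diff = [[0] * 8 for _ in range(8)]
diff[0][0], diff[0][1] = 7, 2
top_neighbor = [[0] * 8 for _ in range(8)]
top_neighbor[0] = [50, 5, 4, 3, 2, 1, 0, 0]

predicted = reconstruct_with_ac_prediction(diff, top_neighbor, "top")
no_neighbor = reconstruct_with_ac_prediction(diff, None, "top")
# predicted[0] is [7, 7, 4, 3, 2, 1, 0, 0]; no_neighbor equals diff.
```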
The encoder scans 150 the 8×8 block 145 of quantized AC DCT coefficients into a one-dimensional array 155 and then entropy encodes the scanned AC coefficients using a variation of run length coding 160. The encoder selects an entropy code from one or more run/level/last tables 165 and outputs the entropy code.
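The scan and run length steps can be illustrated with a short Python sketch. It assumes a conventional zigzag scan pattern and the common meaning of run/level/last symbols (run of zeros, non-zero level, flag for the last non-zero coefficient). The actual WMV8 scan patterns and code tables are not given here, so the function names and sample block are illustrative only.

```python
def zigzag_order(n=8):
    """Return (row, col) positions in a conventional zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Alternate the traversal direction on successive anti-diagonals.
        order.extend(diag if s % 2 else reversed(diag))
    return order

def run_level_last(scanned):
    """Convert a one-dimensional coefficient list into (run, level, last) triples."""
    triples = []
    run = 0
    for value in scanned:
        if value == 0:
            run += 1
        else:
            triples.append([run, value, 0])
            run = 0
    if triples:
        triples[-1][2] = 1  # mark the last non-zero coefficient
    return [tuple(t) for t in triples]

# Assumed sample block with two non-zero AC coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][1] = 3
block[1][1] = -2

# Skip position (0, 0), since the DC coefficient is coded separately.
scanned = [block[r][c] for r, c in zigzag_order()[1:]]
symbols = run_level_last(scanned)  # [(0, 3, 0), (2, -2, 1)]
```

Each triple would then be mapped to an entropy code from a run/level/last table.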
B. Inter Compression
Inter compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 2 and 3 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 2 illustrates motion estimation for a predicted frame 210 and FIG. 3 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 2, the WMV8 encoder computes a motion vector for a macroblock 215 in the predicted frame 210. To compute the motion vector, the encoder searches in a search area 235 of a reference frame 230. Within the search area 235, the encoder compares the macroblock 215 from the predicted frame 210 to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock. The motion vector is differentially coded with respect to a motion vector predictor.
After reconstructing the motion vector by adding the differential to the motion vector predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 215 using information from the reference frame 230, which is a previously reconstructed frame available at the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 215 itself.
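A minimal sketch of block-matching motion estimation follows. It assumes an exhaustive search with a sum-of-absolute-differences matching criterion, a 2×2 block instead of a full macroblock, and a made-up motion vector predictor. None of these choices is stated above, and all names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current block and a candidate block."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total

def motion_search(cur, ref, cx, cy, size, search_range):
    """Exhaustive search; returns ((dx, dy), cost) for the best candidate."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best

# Assumed sample frames: the current frame content equals the reference
# content displaced by 2 pixels horizontally and 1 pixel vertically.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0
        for x in range(8)] for y in range(8)]

mv, cost = motion_search(cur, ref, cx=2, cy=2, size=2, search_range=2)
predictor = (1, 1)  # hypothetical motion vector predictor
mv_differential = (mv[0] - predictor[0], mv[1] - predictor[1])
# mv is (2, 1) with cost 0, so the differential is (1, 0).
```

In this example the match is exact, so the residual is zero. In practice the match is rarely exact and the remaining differences are encoded as the residual.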
FIG. 3 illustrates an example of computation and encoding of an error block 335 in the WMV8 encoder. The error block 335 is the difference between the predicted block 315 and the original current block 325. The encoder applies a DCT 340 to the error block 335, resulting in an 8×8 block 345 of coefficients. The encoder then quantizes 350 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 355. The encoder scans 360 the 8×8 block 355 into a one-dimensional array 365 such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding 370. The encoder selects an entropy code from one or more run/level/last tables 375 and outputs the entropy code.
FIG. 4 shows an example of a corresponding decoding process 400 for an inter-coded block. In summary of FIG. 4, a decoder decodes (410, 420) entropy-coded information representing a prediction residual using variable length decoding 410 with one or more run/level/last tables 415 and run length decoding 420. The decoder inverse scans 430 a one-dimensional array 425 storing the entropy-decoded information into a two-dimensional block 435. The decoder inverse quantizes and inverse DCTs (together, 440) the data, resulting in a reconstructed error block 445. In a separate motion compensation path, the decoder computes a predicted block 465 using motion vector information 455 for displacement from a reference frame. The decoder combines 470 the predicted block 465 with the reconstructed error block 445 to form the reconstructed block 475.
In software for a previous WMV encoder and software for a previous WMV decoder, AC prediction information is signaled on a one bit per macroblock basis at macroblock level in the bitstream.
The ACPRED field is a one-bit, macroblock-level bitstream element that specifies whether AC prediction is used to decode the AC coefficients for all the blocks in a macroblock. ACPRED is present in I-frames and in 1 MV intra macroblocks in predicted frames. ACPRED=0 generally indicates that AC prediction is not used in the macroblock, and ACPRED=1 generally indicates that AC prediction is used in the macroblock. The predictor block is either the block immediately above or to the left of the current block. However, in a predicted frame (e.g., a P-frame or B-frame), if the top predictor block and left predictor block are not Intra-coded, AC prediction is not used even if ACPRED=1.
The encoder and decoder also use signaling of AC prediction for interlaced frames. The ACPREDMB flag is a one-bit value present at macroblock level for frame-coded macroblocks that specifies whether AC prediction is used for all the blocks in the macroblock. The ACPREDTFIELD and ACPREDBFIELD flags are one-bit values present at macroblock level for field-coded macroblocks that specify whether AC prediction is used for blocks in the top and the bottom field of a current macroblock, respectively.
II. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 5, an interlaced video frame 500 includes top field 510 and bottom field 520. Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, this timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
Software for a previous WMV encoder and software for a previous decoder use macroblocks that are arranged according to a field structure (field-coded macroblocks) or a frame structure (frame-coded macroblocks) in interlaced video frames. FIG. 6 shows a structure for field-coded macroblocks in the encoder and decoder. An interlaced macroblock 610 is permuted such that all the top field lines (e.g., even-numbered lines 0, 2, . . . 14) are placed in the top half of the field-coded macroblock 620, and all the bottom field lines (e.g., odd-numbered lines 1, 3, . . . 15) are placed in the bottom half of the field-coded macroblock. For a frame-coded macroblock, the top field lines and bottom field lines alternate throughout the macroblock, as in interlaced macroblock 610.
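The permutation between the two macroblock structures can be sketched in Python as follows, treating a macroblock as a list of 16 rows. The function names are hypothetical.

```python
def to_field_coded(macroblock_rows):
    """Place top field (even) rows in the top half and bottom field (odd) rows in the bottom half."""
    return macroblock_rows[0::2] + macroblock_rows[1::2]

def to_frame_coded(field_rows):
    """Inverse permutation: interleave the top half and bottom half row by row."""
    half = len(field_rows) // 2
    rows = []
    for top, bottom in zip(field_rows[:half], field_rows[half:]):
        rows.extend([top, bottom])
    return rows

# Use row indices as stand-ins for rows of pixel data.
rows = list(range(16))
field = to_field_coded(rows)
# field is [0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15]
```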
The previous encoder and decoder use a 4:1:1 macroblock format in interlaced frames. A 4:1:1 macroblock is composed of four 8×8 luminance blocks and two 4×8 blocks of each chrominance channel. In a field-coded 4:1:1 macroblock, the permuted macroblock is subdivided such that the top two 8×8 luminance blocks and the top 4×8 chrominance block in each chrominance channel contain only top field lines, while the bottom two 8×8 luminance blocks and the bottom 4×8 chrominance block in each chrominance channel contain only bottom field lines.
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. Signaling Frame/Field Mode for Interlaced Macroblocks
In software for a previous WMV encoder and decoder, the INTRLCF field is a one-bit, frame layer element used to signal whether macroblocks are coded in frame mode only, or in field or frame mode. If INTRLCF=0, all macroblocks in the frame are coded in frame mode. If INTRLCF=1, the macroblocks in the frame may be coded in field or frame mode, and the INTRLCMB field follows in the bitstream to indicate the frame/field coding status for each macroblock. INTRLCMB is a bitplane present in progressive I-frames, interlaced I-frames, interlaced P-frames and interlaced B-frames. The decoded INTRLCMB bitplane represents the interlaced status for each macroblock as a field of one-bit values in raster scan order from upper left to lower right. A value of 0 indicates that the corresponding macroblock is coded in frame mode. A value of 1 indicates that the corresponding macroblock is coded in field mode.
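The following sketch shows how a decoder might interpret these two elements once the INTRLCMB bitplane has been decoded into a flat list of bits in raster scan order. The function name, the data layout, and the sample frame dimensions are assumptions for illustration.

```python
def macroblock_coding_modes(intrlcf, intrlcmb_bits, mb_rows, mb_cols):
    """Return a two-dimensional list of "frame" or "field" labels, one per macroblock."""
    if intrlcf == 0:
        # All macroblocks are frame-coded and no bitplane is present.
        return [["frame"] * mb_cols for _ in range(mb_rows)]
    modes = []
    for r in range(mb_rows):
        row_bits = intrlcmb_bits[r * mb_cols:(r + 1) * mb_cols]
        modes.append(["field" if bit else "frame" for bit in row_bits])
    return modes

# Assumed example: a frame that is 2 macroblocks high and 3 macroblocks wide.
modes = macroblock_coding_modes(1, [0, 1, 0, 1, 1, 0], mb_rows=2, mb_cols=3)
# modes is [["frame", "field", "frame"], ["field", "field", "frame"]]
```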
Thus, the field/frame coding mode is signaled for each macroblock in progressive I-frames, interlaced I-frames, interlaced P-frames and interlaced B-frames, but only at frame level, by a bitplane. No macroblock layer signaling option is available to signal field/frame coding mode, which limits the flexibility of the signaling.
IV. Bitplane Coding in Software for a Previous WMV Encoder and Decoder
In software for a previous WMV encoder and decoder, certain binary information for macroblocks in a frame is coded as a two-dimensional array in one of seven bitplane coding modes, and transmitted in a frame header.
The encoder and decoder use bitplane coding to signal four different kinds of binary information at frame level for macroblocks in a frame: (1) skipped/not skipped macroblocks, (2) field or frame coding mode in interlaced pictures, (3) one motion vector [“1 MV”] or four motion vector [“4MV”] coding mode, and (4) direct/not direct prediction mode in B-frames. The following syntax elements are used in the bitplane coding scheme.
INVERT
The INVERT field is a one-bit code that indicates whether the bitplane has more bits equal to 0 or more bits equal to 1. Depending on INVERT and the bitplane coding mode, the decoder may invert the decoded bitplane to recreate the original.
IMODE
The IMODE field is a variable-length code [“VLC”] representing the bitplane coding mode. In general, shorter codes are used to encode more frequently occurring coding modes.
DATABITS
The DATABITS field is an entropy-coded stream of symbols based on the coding mode signaled in the IMODE field. The size of each two-dimensional array is rowMB×colMB, where rowMB and colMB are the number of macroblock rows and columns, respectively, in the frame. Within the bitstream, each array is coded as a set of consecutive bits in one of seven bitplane coding modes. The seven bitplane coding modes are described below.
1. Raw Mode
In Raw mode, the bitplane is encoded as one bit per pixel of the bitplane (that is, one bit per macroblock), scanned in the natural scan order. DATABITS is rowMB×colMB bits in length.
2. Row-Skip Mode
In Row-skip mode, the ROWSKIP field indicates whether the ROWBITS field is present for each row in the bitplane. If an entire row of values in the bitplane is zero, ROWSKIP=0 and ROWBITS is skipped. If at least one value in the row is non-zero, ROWSKIP=1 and ROWBITS contains one bit for each value in the row. Rows are scanned from the top to the bottom of the frame.
3. Column-Skip Mode
In Column-skip mode, the COLUMNSKIP field indicates whether the COLUMNBITS field is present for each column in the bitplane. If an entire column of values in the bitplane is zero, COLUMNSKIP=0 and COLUMNBITS is skipped. If at least one value in the column is non-zero, COLUMNSKIP=1 and COLUMNBITS contains one bit for each value in the column. Columns are scanned from the left to the right of the frame.
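The Row-skip and Column-skip modes can be sketched as follows. The sketch represents the bitplane as a list of rows and the coded output as a flat list of bits. It ignores the INVERT and IMODE fields, and the function names are hypothetical.

```python
def encode_row_skip(bitplane):
    """Emit ROWSKIP for each row, followed by ROWBITS when the row is not all zero."""
    bits = []
    for row in bitplane:
        if any(row):
            bits.append(1)     # ROWSKIP = 1: ROWBITS follows
            bits.extend(row)   # ROWBITS: one bit per value in the row
        else:
            bits.append(0)     # ROWSKIP = 0: ROWBITS is skipped
    return bits

def decode_row_skip(bits, num_rows, num_cols):
    """Rebuild the bitplane from Row-skip coded bits."""
    pos = 0
    bitplane = []
    for _ in range(num_rows):
        rowskip = bits[pos]
        pos += 1
        if rowskip:
            bitplane.append(list(bits[pos:pos + num_cols]))
            pos += num_cols
        else:
            bitplane.append([0] * num_cols)
    return bitplane

def encode_column_skip(bitplane):
    """Column-skip is the same procedure applied to columns, left to right."""
    columns = [list(col) for col in zip(*bitplane)]
    return encode_row_skip(columns)

# Assumed sample bitplane: 3 rows by 3 columns with a single non-zero value.
plane = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
row_coded = encode_row_skip(plane)  # [0, 1, 0, 1, 0, 0]
```

For this sparse bitplane, the coded form takes 6 bits instead of the 9 bits needed in Raw mode.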
4. Normal-2 Mode
In Normal-2 mode, if rowMB×colMB is odd, the first symbol is simply represented with one bit matching its value, and subsequent symbols are encoded in pairs in natural scan order using a binary VLC table.
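The grouping of symbols in Normal-2 mode can be sketched as follows. The sketch stops at forming the pairs, because the VLC table that maps each pair to a code is not reproduced here. The function name and the flat list representation are assumptions.

```python
def normal2_symbols(bitplane_bits):
    """Split bits in natural scan order into an optional leading bit and pairs.

    Returns (leading_bit, pairs), where leading_bit is None when the number
    of bits is even.
    """
    bits = list(bitplane_bits)
    leading_bit = None
    if len(bits) % 2 == 1:
        leading_bit = bits[0]  # coded directly with one bit
        bits = bits[1:]
    pairs = [(bits[i], bits[i + 1]) for i in range(0, len(bits), 2)]
    return leading_bit, pairs

# Assumed example: a 3 by 3 bitplane (9 values, an odd count) in scan order.
leading, pairs = normal2_symbols([1, 0, 0, 0, 1, 1, 1, 0, 0])
# leading is 1; pairs is [(0, 0), (0, 1), (1, 1), (0, 0)]
```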
5. Normal-6 Mode
In Normal-6 mode, the bitplane is encoded in groups of six pixels. These pixels are grouped into either 2×3 or 3×2 tiles. The bitplane is tiled maximally using a set of tiling rules, and the remaining pixels are encoded using a variant of the Row-skip and Column-skip modes. 3×2 “vertical” tiles are used if and only if rowMB is a multiple of 3 and colMB is not. Otherwise, 2×3 “horizontal” tiles are used.
The six-element tiles are encoded first, followed by the Column-skip and Row-skip encoded linear tiles. If the array size is a multiple of 3×2 or of 2×3, the latter linear tiles do not exist and the bitplane is tiled with only six-element rectangular tiles.
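The tile orientation rule can be expressed in a few lines of Python. The sketch reads a 3×2 "vertical" tile as 3 rows by 2 columns and a 2×3 "horizontal" tile as 2 rows by 3 columns, and it counts leftover rows and columns with a simple remainder. Both the reading of the dimensions and the remainder calculation are assumptions, since the detailed tiling rules are not given above. The function names are hypothetical.

```python
def normal6_tile_shape(row_mb, col_mb):
    """Return (tile_rows, tile_cols) for the six-element tiles."""
    if row_mb % 3 == 0 and col_mb % 3 != 0:
        return (3, 2)  # "vertical" tiles
    return (2, 3)      # "horizontal" tiles

def normal6_leftovers(row_mb, col_mb):
    """Return (leftover_rows, leftover_cols) not covered by six-element tiles."""
    tile_rows, tile_cols = normal6_tile_shape(row_mb, col_mb)
    return (row_mb % tile_rows, col_mb % tile_cols)

shape_a = normal6_tile_shape(6, 5)   # (3, 2): rowMB is a multiple of 3, colMB is not
shape_b = normal6_tile_shape(4, 6)   # (2, 3)
left_a = normal6_leftovers(6, 5)     # (0, 1): one leftover column
left_b = normal6_leftovers(4, 6)     # (0, 0): no linear tiles needed
```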
6, 7. Diff-2 and Diff-6 Modes
If either differential mode (Diff-2 or Diff-6) is used, a bitplane of “differential bits” is decoded using the corresponding normal modes (Normal-2 or Normal-6, respectively). The differential bits are used to regenerate the original bitplane.
For more information on bitplane coding, see U.S. patent application Ser. No. 10/321,415, entitled “Skip Macroblock Coding,” filed Dec. 16, 2002.
V. Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another title for MPEG-2), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union [“ITU”]. These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. These standards use (or support the use of) different combinations of intraframe and interframe decompression and compression.
A. Signaling Field- or Frame-Coded Macroblocks in the Standards
Some international standards describe signaling of field coding or frame coding for macroblocks in interlaced pictures.
Draft JVT-d157 of the JVT/AVC standard describes the mb_field_decoding_flag syntax element, which is used to signal whether a macroblock pair is decoded in frame mode or field mode in interlaced P-frames. Section 7.3.4 describes a bitstream syntax where mb_field_decoding_flag is sent as an element of slice data in cases where a sequence parameter (mb_frame_field_adaptive_flag) indicates switching between frame and field decoding in macroblocks and a slice header element (pic_structure) identifies the picture structure as an interlaced frame picture.
The May 28, 1998 committee draft of MPEG-4 describes the dct_type syntax element, which is used to signal whether a macroblock is frame DCT coded or field DCT coded. According to Sections 6.2.7.3 and 6.3.7.3, dct_type is a macroblock-layer element that is only present in the MPEG-4 bitstream in interlaced content where the macroblock has a non-zero coded block pattern or is intra-coded.
In MPEG-2, the dct_type element is also a macroblock-layer element that indicates whether a macroblock is frame DCT coded or field DCT coded. MPEG-2 also describes a picture coding extension element frame_pred_frame_dct. When frame_pred_frame_dct is set to ‘1’, only frame DCT coding is used in interlaced frames. The condition dct_type=0 is “derived” when frame_pred_frame_dct=1 and the dct_type element is not present in the bitstream.
B. Signaling AC Coefficient Prediction in the Standards
Some international standards describe signaling of different spatial AC coefficient prediction modes for macroblocks.
The May 28, 1998 committee draft of MPEG-4 describes the ac_pred_flag syntax element, which is a one-bit flag for signaling whether AC coefficients in the first row or column of an intra macroblock are differentially coded. In the MPEG-4 bitstream, ac_pred_flag is sent on a one bit per macroblock basis in a data partitioning data structure of a video object plane (e.g., data_partitioned_I_VOP( ), data_partitioned_P_VOP( )) or in a macroblock layer data structure (macroblock( )).
In the H.263 standard, Annex I describes an advanced intra coding mode that optionally uses AC prediction. The macroblock layer element INTRA_MODE is a variable length code that signals whether a macroblock is encoded in a mode that uses AC prediction.
C. Limitations of the Standards
These international standards are limited in several important ways. For example, although the standards provide for signaling of field/frame type information and AC prediction, the signaling is typically performed on a one bit per macroblock basis.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.