Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an 8-bit luminance sample (also called a luma sample) that defines the grayscale component of the pixel and two 8-bit chrominance samples (also called chroma samples) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
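The bit rate arithmetic above can be illustrated with a short calculation. The following is a minimal sketch; the 176×144 frame size and the function name raw_bit_rate are assumptions chosen for illustration and are not taken from the description above.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the raw (uncompressed) bit rate in bits per second."""
    pixels_per_frame = width * height
    return pixels_per_frame * bits_per_pixel * frames_per_second


# Assumed example: a 176x144 frame (25,344 pixels) at 15 frames per second,
# with one 8-bit luma sample and two 8-bit chroma samples per pixel.
print(raw_bit_rate(176, 144, 15))  # 9123840 bits per second
```

Even at this small assumed frame size and the lower frame rate, the raw bit rate exceeds 9 million bits per second.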
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. For progressive frames, intra-picture compression techniques compress individual frames (typically called I-frames or key frames), and inter-picture compression techniques compress frames (typically called predicted frames, P-frames, or B-frames) with reference to preceding and/or following frames (typically called reference or anchor frames).
Inter-picture compression techniques often use motion estimation and motion compensation. For motion estimation, for example, an encoder divides a current predicted field or frame into 8×8 or 16×16 pixel units. For a unit of the current field or frame, a similar unit in a reference field or frame is found for use as a predictor. A motion vector indicates the location of the predictor in the reference field or frame. The encoder computes the sample-by-sample difference between the current unit and the predictor to determine a residual (also called error signal). If the current unit size is 16×16, the residual is divided into four 8×8 blocks. To each 8×8 residual, the encoder applies a reversible frequency transform operation, which generates a set of frequency domain (i.e., spectral) coefficients. A discrete cosine transform [“DCT”] is a type of frequency transform. The resulting blocks of spectral coefficients are quantized and entropy encoded. If the predicted field or frame is used as a reference for subsequent motion compensation, the encoder reconstructs the predicted field or frame. When reconstructing residuals, the encoder reconstructs transform coefficients (e.g., DCT coefficients) that were quantized and performs an inverse frequency transform such as an inverse DCT [“IDCT”]. The encoder performs motion compensation to compute the predictors, and combines the predictors with the residuals. During decoding, a decoder typically entropy decodes information and performs analogous operations to reconstruct residuals, perform motion compensation, and combine the predictors with the residuals.
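The motion estimation, residual, and reconstruction steps described above can be sketched as follows. This is an illustrative sketch only: the exhaustive search, the sum-of-absolute-differences matching cost, the ±2 sample search range, and all function names are assumptions, and the transform, quantization, and entropy coding steps are omitted.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n unit of cur at
    (bx, by) and the n x n candidate unit of ref at (rx, ry)."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def estimate_motion(cur, ref, bx, by, n=8, search=2):
    """Exhaustive search (assumed strategy) for the motion vector (dx, dy)
    whose predictor in ref best matches the current unit."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate predictor falls outside the reference
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def predictor(ref, bx, by, mv, n=8):
    """Motion compensation: fetch the predictor that mv points to."""
    dx, dy = mv
    return [[ref[by + dy + j][bx + dx + i] for i in range(n)]
            for j in range(n)]


def residual(cur, pred, bx, by, n=8):
    """Sample-by-sample difference between the current unit and predictor."""
    return [[cur[by + j][bx + i] - pred[j][i] for i in range(n)]
            for j in range(n)]


def reconstruct(pred, res):
    """Combine the predictor with the residual."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(pred, res)]


# Assumed test data: the current frame is the reference shifted right by one.
ref = [[(7 * x + 13 * y) % 256 for x in range(16)] for y in range(16)]
cur = [[ref[y][max(x - 1, 0)] for x in range(16)] for y in range(16)]

mv = estimate_motion(cur, ref, 4, 4)
pred = predictor(ref, 4, 4, mv)
res = residual(cur, pred, 4, 4)
print(mv)  # (-1, 0)
```

Because the assumed current frame is an exact shift of the reference, the residual here is all zeros; for real video the residual is generally nonzero and is what the frequency transform operates on.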
I. Inter-Frame Compression in Windows Media Video, Version 8 [“WMV8”]
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra-frame and inter-frame compression, and the WMV8 decoder uses intra-frame and inter-frame decompression. When processing 8×8 blocks of motion compensation prediction residuals, the WMV8 encoder/decoder may switch between different sizes of DCT/IDCT. In particular, the encoder/decoder may use one of an 8×8 DCT/IDCT, two 4×8 DCT/IDCTs, or two 8×4 DCT/IDCTs for a prediction residual block.
For example, FIG. 1 shows transform coding and compression of an 8×8 prediction error block (110) using two 8×4 DCTs (140). A video encoder computes (108) an error block (110) as the difference between a predicted block (102) and a current 8×8 block (104). The video encoder applies either an 8×8 DCT (not shown), two 8×4 DCTs (140), or two 4×8 DCTs (not shown) to the error block (110). For the 8×4 DCT (140), the error block (110) becomes two 8×4 blocks of DCT coefficients (142, 144), one for the top half of the error block (110) and one for the bottom half. The encoder quantizes (146) the data, which typically results in many of the coefficients being remapped to zero. The encoder scans (150) the blocks of quantized coefficients (147, 148) into one-dimensional arrays (152, 154) with 32 elements each, such that coefficients are generally ordered from lowest frequency to highest frequency in each array. In the scanning, the encoder uses a scan pattern for the 8×4 DCT. (For other size transforms, the encoder uses different scan patterns.) The encoder entropy codes the data in the one-dimensional arrays (152, 154) using a combination of run length coding (180) and variable length encoding (190) with one or more run/level/last tables (185).
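The scanning and run length steps for one 8×4 block of quantized coefficients can be sketched as follows. This is a hedged illustration: the diagonal scan order below is a placeholder assumption (the actual WMV8 scan pattern for the 8×4 DCT is not given here), the function names are hypothetical, and the variable length coding with run/level/last tables is omitted.

```python
def scan_order(rows, cols):
    """Placeholder low-to-high frequency ordering (an assumption, not the
    WMV8 scan pattern): positions sorted by diagonal, then by row."""
    return sorted(((r, c) for r in range(rows) for c in range(cols)),
                  key=lambda rc: (rc[0] + rc[1], rc[0]))


def scan(block, order):
    """Scan a two-dimensional block into a one-dimensional array."""
    return [block[r][c] for r, c in order]


def run_level_last(coeffs):
    """Convert a one-dimensional array into (run, level, last) triples:
    run  = number of zeros preceding a nonzero coefficient,
    level = the nonzero coefficient value,
    last = True for the final nonzero coefficient in the array."""
    triples = []
    run = 0
    for value in coeffs:
        if value == 0:
            run += 1
        else:
            triples.append([run, value, False])
            run = 0
    if triples:
        triples[-1][2] = True
    return [tuple(t) for t in triples]


# An 8x4 block (8 wide, 4 high) in which quantization left three nonzero values.
block = [[0] * 8 for _ in range(4)]
block[0][0], block[0][1], block[2][0] = 12, -3, 2

array = scan(block, scan_order(4, 8))
print(len(array))             # 32
print(run_level_last(array))  # [(0, 12, False), (0, -3, False), (3, 2, True)]
```

The 32-element array collapses to three triples, which is why quantizing many coefficients to zero makes the subsequent entropy coding effective.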
FIG. 2 shows decompression and inverse transform coding of an 8×8 prediction error block (210) using two 8×4 IDCTs (240). The decoder may also perform inverse transform coding using a 4×8 IDCT or 8×8 IDCT (not shown). The decoder entropy decodes data into one-dimensional arrays (252, 254) of quantized coefficients using a combination of variable length decoding (290) and run length decoding (280) with one or more run/level/last tables (285). The decoder scans (250) the data into blocks of quantized DCT coefficients (247, 248) using the scan pattern for the 8×4 DCT. (The decoder uses other scan patterns for an 8×8 or 4×8 DCT.) The decoder inverse quantizes (246) the data and applies (240) an 8×4 IDCT to the coefficients, resulting in an 8×4 block (212) for the top half of the error block (210) and an 8×4 block (214) for the bottom half of the error block (210). The decoder combines the error block (210) with a predicted block (202) (from motion compensation) to form a reconstructed 8×8 block (204).
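The run length decoding and inverse scanning steps on the decoder side can be sketched in the same way. As before, this is an illustration under assumptions: the diagonal scan order is a placeholder rather than the WMV8 scan pattern, the function names are hypothetical, and variable length decoding, inverse quantization, and the IDCT are omitted.

```python
def scan_order(rows, cols):
    """Placeholder low-to-high frequency ordering (an assumption, not the
    WMV8 scan pattern): positions sorted by diagonal, then by row."""
    return sorted(((r, c) for r in range(rows) for c in range(cols)),
                  key=lambda rc: (rc[0] + rc[1], rc[0]))


def run_level_decode(triples, length):
    """Expand (run, level, last) triples into a one-dimensional array of
    quantized coefficients with the given length."""
    coeffs = [0] * length
    pos = 0
    for run, level, last in triples:
        pos += run           # skip the run of zeros
        coeffs[pos] = level  # place the nonzero coefficient
        pos += 1
        if last:
            break
    return coeffs


def inverse_scan(coeffs, order, rows, cols):
    """Place a one-dimensional array back into a two-dimensional block."""
    block = [[0] * cols for _ in range(rows)]
    for value, (r, c) in zip(coeffs, order):
        block[r][c] = value
    return block


triples = [(0, 12, False), (0, -3, False), (3, 2, True)]
coeffs = run_level_decode(triples, 32)
block = inverse_scan(coeffs, scan_order(4, 8), 4, 8)
print(block[0][:3], block[2][0])  # [12, -3, 0] 2
```

The decoder must use the same scan order as the encoder for the given transform size; otherwise the coefficients land at the wrong frequency positions.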
The WMV8 encoder and decoder can adaptively change the transform size used for residuals at frame level, macroblock level, or block level. Basically, a frame-level flag (0/1 decision bit) indicates whether one transform type is used for all coded blocks in the frame. If so, the transform type is signaled at frame level. If not (i.e., if different transform types are used within the frame), a macroblock-level flag present for each coded macroblock indicates whether a single transform type is used for all coded blocks in the macroblock. If so, the transform type is signaled at macroblock level. If not (i.e., if different transform types are used within the macroblock), the transform types for the respective coded blocks are signaled at block level. Table 1 shows variable length codes [“VLCs”] for transform types in WMV8.
TABLE 1
VLCs for transform types in WMV8

VLC    Transform Type
0      8×8 DCT
10     8×4 DCT
11     4×8 DCT
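The frame-level, macroblock-level, and block-level signaling described above can be sketched as a small decoder. This is a simplified illustration: the BitReader class, the function names, the flag polarity (1 meaning a single transform type at that level), and the assumption that the signaling bits are contiguous are all hypothetical. In an actual bitstream these elements are interleaved with other syntax elements.

```python
# VLCs for transform types, as listed in Table 1.
TRANSFORM_VLC = {"0": "8x8", "10": "8x4", "11": "4x8"}


class BitReader:
    """Hypothetical helper that reads bits from a string of '0'/'1'."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def read_bit(self):
        bit = self.bits[self.pos]
        self.pos += 1
        return bit


def read_transform_type(reader):
    """Read one transform type VLC: '0', '10', or '11'."""
    code = reader.read_bit()
    if code != "0":
        code += reader.read_bit()
    return TRANSFORM_VLC[code]


def decode_transform_types(reader, coded_blocks):
    """coded_blocks[i] is the number of coded blocks in coded macroblock i.
    Returns one list of transform types per macroblock."""
    if reader.read_bit() == "1":          # frame-level flag (assumed polarity)
        t = read_transform_type(reader)   # type signaled at frame level
        return [[t] * n for n in coded_blocks]
    result = []
    for n in coded_blocks:
        if reader.read_bit() == "1":      # macroblock-level flag
            t = read_transform_type(reader)
            result.append([t] * n)
        else:                             # types signaled at block level
            result.append([read_transform_type(reader) for _ in range(n)])
    return result


# Mixed frame: macroblock 1 uses 8x4 throughout; macroblock 2 signals per block.
reader = BitReader("01100011")
print(decode_transform_types(reader, [2, 2]))
# [['8x4', '8x4'], ['8x8', '4x8']]
```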
If the transform size is a subblock size, the WMV8 encoder outputs a subblock pattern code for the subblocks of a block. The subblock pattern indicates which subblocks of the block have additional coefficient information signaled and which do not. For example, for a block with 8×4 subblocks, the subblock pattern indicates the presence of additional signaled coefficient information for only the bottom, only the top, or both the top and bottom 8×4 subblocks. For a block with 4×8 subblocks, the subblock pattern indicates the presence of additional signaled coefficient information for only the left, only the right, or both the left and right 4×8 subblocks. Table 2 shows VLCs for subblock patterns in WMV8.
TABLE 2
VLCs for subblock patterns in WMV8

SUBBLKPAT    8×4 Subblock Pattern    4×8 Subblock Pattern
VLC          Top      Bottom         Left     Right
0            -        X              -        X
10           X        X              X        X
11           X        -              X        -
In WMV8, subblock pattern codes are used at block level, and only when the block uses a subblock transform size. The WMV8 decoder receives subblock pattern codes and determines whether additional coefficient information is present for particular subblocks of a block.
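How an encoder might derive a subblock pattern from quantized coefficients can be sketched as follows. This sketch assumes that a subblock "has additional coefficient information" exactly when it contains at least one nonzero quantized coefficient; that equivalence, and the function names, are assumptions for illustration. Mapping the resulting pattern to a VLC is omitted.

```python
# Names of the two subblocks for each subblock transform size.
SUBBLOCK_NAMES = {"8x4": ("top", "bottom"), "4x8": ("left", "right")}


def has_coefficients(subblock):
    """Assumed criterion: any nonzero quantized coefficient."""
    return any(value != 0 for row in subblock for value in row)


def subblock_pattern(transform, first, second):
    """Return which subblocks of a block carry coefficient information.
    first/second are the quantized coefficient subblocks (top and bottom
    for 8x4, left and right for 4x8)."""
    first_name, second_name = SUBBLOCK_NAMES[transform]
    return {first_name: has_coefficients(first),
            second_name: has_coefficients(second)}


# 8x4 case: only the bottom subblock has a nonzero coefficient.
top = [[0] * 8 for _ in range(4)]
bottom = [[0] * 8 for _ in range(4)]
bottom[0][0] = 5
print(subblock_pattern("8x4", top, bottom))  # {'top': False, 'bottom': True}
```

The three patterns described above each have at least one subblock with coefficient information; how a block with no coefficient information at all is handled is outside this sketch.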
While transform size switching in WMV8 helps overall performance in many scenarios, there are opportunities for improvement. At an extreme, every block in a frame has a transform size specified for it. This requires a great deal of signaling overhead, which can negate the benefits provided by adaptive transform sizes.
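The signaling overhead can be made concrete with a rough bit count. The sketch below counts only the switching flags and transform type VLCs (code lengths of 1, 2, and 2 bits per Table 1) and ignores subblock pattern codes. The frame of 99 macroblocks with 6 coded blocks each is an assumed example, and the function name is hypothetical.

```python
# Lengths in bits of the transform type VLCs in Table 1.
CODE_LENGTH = {"8x8": 1, "8x4": 2, "4x8": 2}


def signaling_bits(frame):
    """frame is a list of macroblocks; each macroblock is a list of the
    transform types of its coded blocks. Returns the number of bits spent
    on switching flags and transform type codes."""
    bits = 1  # frame-level flag
    frame_types = {t for mb in frame for t in mb}
    if len(frame_types) == 1:
        return bits + CODE_LENGTH[next(iter(frame_types))]
    for mb in frame:
        bits += 1  # macroblock-level flag
        if len(set(mb)) == 1:
            bits += CODE_LENGTH[mb[0]]                # one code per macroblock
        else:
            bits += sum(CODE_LENGTH[t] for t in mb)   # one code per block
    return bits


uniform = [["8x8"] * 6 for _ in range(99)]
mixed = [["8x4", "4x8"] * 3 for _ in range(99)]
print(signaling_bits(uniform))  # 2
print(signaling_bits(mixed))    # 1288
```

Under these assumptions, per-block signaling costs several hundred times as many bits as frame-level signaling for the same number of blocks.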
In addition, in WMV8, switching level flags, transform types, and subblock patterns are independently signaled in the bitstream. At the macroblock level, for example, one bit indicates whether the transform signaling is at the macroblock level or block level, and if macroblock-level signaling is used, a VLC signals which of three transform types to use: 8×8, 8×4, or 4×8. At the block level, separate VLCs are used for transform types and subblock patterns. Signaling of switching levels, transform types, and subblock patterns in WMV8 is inefficient in some cases and thus provides an opportunity for improvement in performance.
II. Video Codec Standards
Various standards specify aspects of video decoders as well as formats for compressed video information. These standards include H.261, MPEG-1, H.262 (also called MPEG-2), H.263, and MPEG-4. Directly or by implication, these standards may specify certain encoder details, but other encoder details are not specified. Different standards incorporate different techniques, but each standard typically specifies some kind of motion compensation and decompression of prediction residuals. For information, see the respective standard documents.
According to draft JVT-C167 of the H.264 standard, an encoder and decoder may use variable-size transforms. This feature is called adaptive block size transforms [“ABT”], which indicates adaptation of transform size to the block size used for motion compensation in inter coding. For intra coding, the transform size is adapted to the properties of the intra prediction signal. For ABT inter coding, the existing syntax is used. For ABT intra coding, a new symbol is introduced into the macroblock-layer syntax to signal intra prediction mode. For additional information about ABT, see, e.g., section 14 of draft JVT-C167.
To signal the presence of coefficient information when ABT is used, an encoder uses the coded block pattern [“CBP”] syntax element that is also used in non-ABT coding. See, e.g., sections 7.3.18, 8.5.7, and 14.2.4 of draft JVT-C167. A CBP may provide coefficient signaling information on a block-by-block basis for a macroblock, but does not provide such information for specific subblocks. This can be inefficient, for example, if only one subblock for a block has signaled coefficient information.