Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
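As a rough illustration of the arithmetic above, the following sketch computes a raw bit rate. The 176×144 frame size and the function name are assumptions chosen for illustration; only the 24 bits per pixel and the 15 or 30 frames per second come from the description.

```python
def raw_bit_rate(width, height, bits_per_pixel=24, frames_per_second=15):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

# Hypothetical 176x144 frame: 25,344 pixels ("tens of thousands").
low_rate = raw_bit_rate(176, 144, frames_per_second=15)
high_rate = raw_bit_rate(176, 144, frames_per_second=30)
print(low_rate, high_rate)  # 9123840 18247680
```

Even at this small frame size, the raw rate exceeds 5 million bits per second.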
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
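The entropy limit on lossless compression can be made concrete with a small sketch. It assumes a simple model in which samples are independent and identically distributed, so the Shannon entropy of the empirical distribution gives a lower bound on the average bits per sample; real video coders use richer models, and the function name is illustrative only.

```python
import math
from collections import Counter

def entropy_bits_per_sample(samples):
    """Shannon entropy of the empirical sample distribution, in bits."""
    counts = Counter(samples)
    total = len(samples)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

flat_region = [128] * 16            # no variability
noisy_region = [0, 1, 2, 3] * 4     # four equally likely values
print(entropy_bits_per_sample(flat_region))   # 0.0
print(entropy_bits_per_sample(noisy_region))  # 2.0
```

Under this model, the flat region needs almost no bits, while the varied region cannot be represented losslessly in fewer than about 2 bits per sample on average.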
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression, where a picture is, for example, a progressively scanned video frame, an interlaced video frame (having alternating lines for video fields), or an interlaced video field. Intra-picture compression techniques compress individual pictures (typically called I-pictures or key pictures), and inter-picture compression techniques compress pictures (typically called predicted pictures, P-pictures, or B-pictures) with reference to a preceding and/or following picture (typically called a reference or anchor picture) or pictures (for B-pictures).
I. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing in raster scan fashion (left to right, top to bottom) through successive, non-alternating lines to the bottom of the frame.
The primary aspect of interlaced video is that the raster scan of an entire video frame is performed in two passes by scanning alternate lines in each pass. For example, the first scan is made up of the even lines of the frame and the second scan is made up of the odd lines of the frame. This results in each frame containing two fields representing two different time epochs. FIG. 1 shows an interlaced video frame (100) that includes top field (110) and bottom field (120). In the frame (100), the even-numbered lines (top field) are scanned starting at one time (e.g., time t), and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). This timing can create jagged tooth-like features in regions of an interlaced video frame where motion is present when the two fields are scanned starting at different times. For this reason, interlaced video frames can be rearranged according to a field structure for coding, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without such a rearrangement. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames, in which the original alternating field line arrangement is preserved.
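The rearrangement into a field structure can be sketched as follows. The sketch assumes a frame stored as a list of lines with an even line count, with line 0 belonging to the top field; the function names are illustrative.

```python
def split_fields(frame):
    """Group even lines into the top field and odd lines into the bottom field."""
    top_field = frame[0::2]
    bottom_field = frame[1::2]
    return top_field, bottom_field

def interleave_fields(top_field, bottom_field):
    """Restore the original alternating line arrangement (frame structure)."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame

# Eight lines of four samples; each sample holds its own line number.
frame = [[line] * 4 for line in range(8)]
top, bottom = split_fields(frame)
print([line[0] for line in top])     # [0, 2, 4, 6]
print([line[0] for line in bottom])  # [1, 3, 5, 7]
```

Interleaving the two fields again reproduces the frame arrangement used for frame coding.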
II. Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Versions 8 [“WMV8”] and 9 [“WMV9”] each include a video encoder and a video decoder. The encoders use intra-frame and inter-frame compression, and the decoders use intra-frame and inter-frame decompression.
A. Intra-frame Compression and Decompression
FIG. 2 illustrates block-based intra compression in the encoder, which reduces bit rate by removing spatial redundancy in a picture. In particular, FIG. 2 illustrates compression of an 8×8 block (205) of samples of an intra frame by the encoder. The encoder splits the frame into 8×8 blocks of samples and applies an 8×8 frequency transform (210) (such as a discrete cosine transform [“DCT”]) to individual blocks such as the block (205). The encoder quantizes (220) the transform coefficients (215), resulting in an 8×8 block of quantized transform coefficients (225).
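The transform and quantization steps can be sketched as below. This is not the WMV8 or WMV9 implementation: it assumes a floating-point orthonormal DCT-II and a uniform quantizer with a single step size, both chosen only to illustrate the idea.

```python
import math

N = 8  # block dimension

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

def quantize(coeffs, step):
    """Uniform quantization (assumed here): divide by the step and round."""
    return [[int(round(value / step)) for value in row] for row in coeffs]

flat_block = [[100] * N for _ in range(N)]   # block with no spatial detail
quantized = quantize(dct_2d(flat_block), step=16)
print(quantized[0][0])  # 50
```

For a block with no spatial detail, all of the energy lands in the DC coefficient and every AC coefficient quantizes to zero, which is the spatial redundancy the transform exposes.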
Further encoding varies depending on whether a coefficient is a DC coefficient (the top left coefficient), an AC coefficient in the top row or left column, or another AC coefficient. The encoder typically encodes the DC coefficient (226) as a differential from the DC coefficient (236) of a neighboring 8×8 block, which is a previously encoded and decoded/reconstructed top or left neighbor block. The encoder entropy encodes (240) the differential.
The entropy encoder can encode the left column or top row of AC coefficients as differentials from AC coefficients of a corresponding left column or top row of a neighboring 8×8 block. FIG. 2 shows the left column (227) of AC coefficients encoded as differentials (247) from the left column (237) of the neighboring (actually situated to the left) block (235).
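The differential coding of the DC coefficient and a left column of AC coefficients can be sketched as follows. For simplicity the sketch assumes that both the DC coefficient and the left column are predicted from the left neighbor; the function names are illustrative, and the selection between top and left prediction is not shown.

```python
def to_differentials(current, left):
    """Replace column 0 (DC plus left-column AC) with differences from `left`."""
    diff = [row[:] for row in current]
    for r in range(len(current)):
        diff[r][0] = current[r][0] - left[r][0]
    return diff

def from_differentials(diff, left):
    """Decoder side: add the neighbor's column 0 back to the differentials."""
    block = [row[:] for row in diff]
    for r in range(len(diff)):
        block[r][0] = diff[r][0] + left[r][0]
    return block

left = [[0] * 8 for _ in range(8)]
left[0][0], left[1][0], left[2][0] = 50, 6, -3
current = [[0] * 8 for _ in range(8)]
current[0][0], current[1][0], current[2][0], current[0][1] = 52, 5, -3, 4

diff = to_differentials(current, left)
print([diff[r][0] for r in range(3)])  # [2, -1, 0]
```

When neighboring blocks are similar, the differentials are smaller in magnitude than the original coefficients, and the decoder recovers the original values exactly by adding the neighbor's coefficients back.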
The encoder scans (250) the 8×8 block (245) of predicted, quantized AC coefficients into a one-dimensional array (255). For the scanning, the encoder uses a scan pattern that depends on the DC/AC prediction direction, as described below. The encoder then entropy encodes the scanned coefficients using a variation of run/level coding (260). The encoder selects variable length codes [“VLCs”] from run/level/last tables (265) and outputs the VLCs.
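Scanning followed by run/level/last coding can be sketched as below. The sketch assumes a 4×4 block and a row-by-row scan order purely to keep the example short; the variable length code tables are not modeled, and each non-zero coefficient is simply reported as a (run, level, last) triplet.

```python
def scan_ac(block, order):
    """Scan a 2-D block into a 1-D list, skipping the first (DC) position."""
    return [block[r][c] for r, c in order[1:]]

def run_level_last(scanned):
    """Return (run of zeros, non-zero level, last flag) for each non-zero value."""
    triplets = []
    run = 0
    for value in scanned:
        if value == 0:
            run += 1
        else:
            triplets.append([run, value, 0])
            run = 0
    if triplets:
        triplets[-1][2] = 1   # mark the final non-zero coefficient
    return [tuple(t) for t in triplets]

# Hypothetical row-by-row order for a 4x4 block (illustration only).
order = [(r, c) for r in range(4) for c in range(4)]
block = [[7, 3, 0, 0],
         [-1, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
print(run_level_last(scan_ac(block, order)))  # [(0, 3, 0), (2, -1, 1)]
```

The trailing zeros after the last non-zero coefficient cost nothing beyond the last flag, which is why the position of non-zero coefficients in the scanned array matters.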
FIG. 3 shows an example of corresponding decoding (300) for an intra-coded block by the decoder. In particular, FIG. 3 illustrates decompression of an 8×8 block of samples of an intra frame by the decoder to produce a reconstructed version (305) of the original 8×8 block (205).
The decoder receives and decodes (370) VLCs with run/level/last tables (365). The decoder run/level decodes (360) AC coefficients and puts the results into a one-dimensional array (355), from which the AC coefficients are inverse scanned (350) into a two-dimensional block (345). (The scan patterns are described below.)
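The run/level decoding and inverse scan can be sketched as follows, assuming the variable length codes have already been decoded into (run, level, last) triplets. The 2×2 block and its scan order are hypothetical and serve only to keep the example small.

```python
def run_level_decode(triplets, length):
    """Expand (run, level, last) triplets into a 1-D array of `length` values."""
    array = [0] * length
    position = 0
    for run, level, last in triplets:
        position += run          # skip the run of zeros
        array[position] = level
        position += 1
        if last:
            break
    return array

def inverse_scan(array, order, rows, cols):
    """Place each scanned value back at its (row, column) position."""
    block = [[0] * cols for _ in range(rows)]
    for value, (r, c) in zip(array, order):
        block[r][c] = value
    return block

order = [(0, 0), (0, 1), (1, 0), (1, 1)]   # hypothetical 2x2 scan order
array = run_level_decode([(0, 5, 0), (1, -2, 1)], length=4)
print(array)                             # [5, 0, -2, 0]
print(inverse_scan(array, order, 2, 2))  # [[5, 0], [-2, 0]]
```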
The AC coefficients of the left column or top row of the block (345) may be differentials, in which case the decoder combines them with corresponding AC coefficients from a neighboring 8×8 block. In FIG. 3, the left column (347) of AC coefficients are differentials, and they are combined with AC coefficients of the left column (337) of a neighboring (actually situated to the left) block (335) to produce a left column (327) of AC coefficients in a block (325) of quantized transform coefficients.
To decode the DC coefficient (326), the decoder decodes (340) a DC differential. The decoder combines the DC differential with a DC coefficient (336) of a neighboring 8×8 block to produce the DC coefficient (326) of the block (325) of quantized transform coefficients.
The decoder inverse quantizes (320) the quantized transform coefficients of the block (325), resulting in a block (315) of transform coefficients. The decoder applies an inverse frequency transform (310) to the block (315) of transform coefficients, producing the reconstructed version (305) of the original 8×8 block (205).
B. Inter-frame Compression and Decompression
FIG. 4 illustrates block-based inter compression in an encoder, and FIG. 5 illustrates corresponding decompression in a decoder. Inter-frame compression techniques often use motion estimation and motion compensation, which reduce bit rate by removing temporal redundancy in a video sequence. Residual information after motion compensation is further compressed by removing spatial redundancy in it.
For example, for motion estimation an encoder divides a current predicted frame into 8×8 or 16×16 pixel units. For a unit of the current frame, a similar unit in a reference frame is found for use as a predictor. A motion vector indicates the location of the predictor in the reference frame. The encoder computes the sample-by-sample difference between the current unit and the predictor to determine a residual (also called error signal). If the current unit size is 16×16, the residual is divided into four 8×8 blocks. To each 8×8 residual, the encoder applies a reversible frequency transform operation, which generates a set of frequency domain (i.e., spectral) coefficients. The resulting blocks of transform coefficients are quantized and entropy encoded. If the predicted frame is used as a reference for subsequent motion compensation, the encoder reconstructs the predicted frame. When reconstructing residuals, the encoder reconstructs transform coefficients (e.g., DCT coefficients) that were quantized and performs an inverse frequency transform such as an inverse DCT [“IDCT”]. The encoder performs motion compensation to compute the predictors, and combines the predictors with the residuals. During decoding, a decoder typically entropy decodes information and performs analogous operations to reconstruct residuals, perform motion compensation, and combine the predictors with the residuals.
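Motion estimation and residual computation can be sketched as below. The sketch assumes an exhaustive integer-pixel search that minimizes the sum of absolute differences (SAD); the search strategy, the cost measure, and the search range are assumptions, since the description does not specify how the similar unit is found.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current unit and a candidate."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))

def motion_search(cur, ref, cx, cy, size=8, search_range=4):
    """Return the motion vector (dx, dy) of the lowest-SAD candidate."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]

def residual(cur, ref, cx, cy, mv, size=8):
    """Sample-by-sample difference between the current unit and its predictor."""
    dx, dy = mv
    return [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
             for i in range(size)] for j in range(size)]

# Synthetic 16x16 frames: the current frame is the reference content
# displaced so that current (x, y) matches reference (x + 2, y - 1).
ref = [[x * x + 17 * y * y for x in range(16)] for y in range(16)]
cur = [[(x + 2) ** 2 + 17 * (y - 1) ** 2 for x in range(16)] for y in range(16)]
mv = motion_search(cur, ref, 4, 4)
print(mv)  # (2, -1)
```

With a perfect match the residual is all zeros, so only the motion vector needs to be conveyed for that unit; in practice the residual is non-zero and is transformed, quantized, and entropy coded.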
When processing 8×8 blocks of motion compensation prediction residuals, the WMV8 encoder/decoder may switch between different sizes of DCT/IDCT. In particular, the WMV8 encoder/decoder may use one of an 8×8 DCT/IDCT, two 4×8 DCT/IDCTs, or two 8×4 DCT/IDCTs for an 8×8 prediction residual block. The WMV9 encoder/decoder may also use four 4×4 block size transforms for an 8×8 prediction residual block, but uses a transform other than DCT. FIG. 6 illustrates different transform block sizes for an 8×8 prediction residual block. Variable-size transform blocks allow the encoder to choose the block partition that leads to the lowest bit rate representation for a block.
In particular, FIG. 4 shows transform coding and compression of an 8×8 prediction error block (410) using two 8×4 DCTs (440). A video encoder computes (408) an error block (410) as the difference between a predicted block (402) and a current 8×8 block (404). The video encoder applies either an 8×8 DCT (not shown), two 8×4 DCTs (440), or two 4×8 DCTs (not shown) to the error block (410). For the 8×4 DCT (440), the error block (410) becomes two 8×4 blocks of DCT coefficients (442, 444), one for the top half of the error block (410) and one for the bottom half. The encoder quantizes (446) the data, which typically results in many of the coefficients being remapped to zero. The encoder scans (450) the blocks of quantized coefficients (447, 448) into one-dimensional arrays (452, 454) with 32 elements each, such that coefficients are generally ordered from lowest frequency to highest frequency in each array. In the scanning, the encoder uses a scan pattern for the 8×4 DCT, as described below. (For other size transforms, the encoder uses different scan patterns, as described below.) The encoder entropy codes the data in the one-dimensional arrays (452, 454) using a combination of run length coding (480) and variable length encoding (490) with one or more run/level/last tables (485).
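The partitioning of an 8×8 error block before the smaller transforms can be sketched as follows. Consistent with the top-half and bottom-half description above, 8×4 is taken to mean 8 samples wide by 4 lines high; treating 4×8 as left and right halves is an assumption by symmetry, and the function names are illustrative.

```python
def split_8x4(block):
    """Top and bottom halves, each 8 samples wide and 4 lines high."""
    return block[:4], block[4:]

def split_4x8(block):
    """Left and right halves, each 4 samples wide and 8 lines high."""
    return [row[:4] for row in block], [row[4:] for row in block]

def sample_count(sub_block):
    """Number of samples, which is also the length of the scanned 1-D array."""
    return sum(len(row) for row in sub_block)

# Each sample holds its raster index within the 8x8 error block.
error_block = [[8 * r + c for c in range(8)] for r in range(8)]
top, bottom = split_8x4(error_block)
left, right = split_4x8(error_block)
print(sample_count(top), sample_count(left))  # 32 32
```

Each half holds 32 samples, matching the 32-element one-dimensional arrays produced by scanning.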
FIG. 5 shows decompression and inverse transform coding of an 8×8 prediction error block (510) using two 8×4 IDCTs (540). The decoder may also perform inverse transform coding using a 4×8 IDCT or 8×8 IDCT (not shown). The decoder entropy decodes data into one-dimensional arrays (552, 554) of quantized coefficients using a combination of variable length decoding (590) and run length decoding (580) with one or more run/level/last tables (585). The decoder scans (550) the data into blocks of quantized DCT coefficients (547, 548) using the scan pattern for the 8×4 DCT. (The decoder uses other scan patterns for an 8×8 or 4×8 DCT.) The decoder inverse quantizes (546) the data and applies (540) an 8×4 IDCT to the coefficients, resulting in an 8×4 block (512) for the top half of the error block (510) and an 8×4 block (514) for the bottom half of the error block (510). The decoder combines the error block (510) with a predicted block (502) (from motion compensation) to form a reconstructed 8×8 block (504).
C. Scan Patterns in WMV8 and WMV9
During encoding, it is common for most of the transform coefficients of a transform block to have a value of zero after quantization. A good scan pattern gives higher priority to coefficients that are more likely to have non-zero values. In other words, such coefficients are scanned earlier in the scan pattern. In this way, the non-zero coefficients are more likely to be bunched together, followed by one or more long groups of zero value coefficients. In particular, this leads to more efficient run/level/last coding, but other forms of entropy coding also benefit from the reordering.
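The effect of the scan order on the grouping of non-zero coefficients can be illustrated with a small sketch. The 4×4 block and the two scan orders (row by row and column by column) are hypothetical and are not any of the scan patterns of the figures.

```python
def scan(block, order):
    """Reorder a 2-D block into a 1-D list following `order`."""
    return [block[r][c] for r, c in order]

def last_nonzero_index(scanned):
    """Index of the final non-zero value, or -1 if all values are zero."""
    index = -1
    for i, value in enumerate(scanned):
        if value != 0:
            index = i
    return index

# Non-zero coefficients concentrated in the left column.
block = [[9, 0, 0, 0],
         [4, 0, 0, 0],
         [2, 0, 0, 0],
         [1, 0, 0, 0]]
row_order = [(r, c) for r in range(4) for c in range(4)]
column_order = [(r, c) for c in range(4) for r in range(4)]

print(scan(block, row_order))     # [9, 0, 0, 0, 4, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0]
print(scan(block, column_order))  # [9, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With the column-first order the non-zero values are bunched at the start and followed by a single long group of zeros, whereas the row-first order interleaves them with three separate runs of zeros that each must be coded.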
A WMV8 encoder and decoder use different scan patterns for different size transform blocks. FIGS. 7A through 7F show scan patterns for different block sizes and intra or inter compression types according to WMV8. In general, the same scan patterns are used for progressive frames and interlaced frames.
FIGS. 7A through 7C show scan patterns for intra-coded blocks of I-pictures. In general, one of the three scan arrays is used for a given intra-coded block depending on the AC prediction status for the block. If AC prediction is from the top, the horizontal scan pattern shown in FIG. 7B is used. If AC prediction is from the left, the vertical scan pattern shown in FIG. 7C is used. And if no AC prediction is used, the normal scan pattern shown in FIG. 7A is used.
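The selection among the three intra scan arrays can be sketched as a simple mapping. The function name and the string values used to represent the AC prediction status are hypothetical; only the mapping itself comes from the description.

```python
def select_intra_scan(ac_prediction):
    """Map AC prediction status (None, "top", or "left") to a scan pattern."""
    if ac_prediction is None:
        return "normal"       # FIG. 7A
    if ac_prediction == "top":
        return "horizontal"   # FIG. 7B
    if ac_prediction == "left":
        return "vertical"     # FIG. 7C
    raise ValueError("unknown AC prediction status: %r" % (ac_prediction,))

for status in (None, "top", "left"):
    print(status, "->", select_intra_scan(status))
```

Because the decoder already knows the AC prediction status for a block, it can make the same selection without any additional signaling for the scan pattern.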
FIGS. 7D through 7F show scan patterns for blocks of P-pictures. FIG. 7D shows a scan pattern for an intra-coded block or 8×8 inter-coded block in a P-picture. FIGS. 7E and 7F show scan patterns for inter-coded 8×4 and 4×8 blocks, respectively, in a P-picture.
A WMV9 encoder and decoder also use the scan patterns shown in FIGS. 7A through 7F. FIG. 7G shows a scan pattern for 4×4 inter-coded block size, which is another block size option according to WMV9. Again, the same scan patterns are used for progressive frames and interlaced frames (with the exception of scanning chrominance transform coefficients, which are in 4×8 blocks due to the chrominance sampling pattern in WMV9).
While the scan patterns in WMV8 and WMV9 help overall performance in many scenarios, there are opportunities for improvement. In particular, the 8×4 and 4×8 inter scan patterns are not particularly well suited for many common configurations of non-zero transform coefficients for progressive video in 8×4 and 4×8 inter-coded blocks. As a result, the scan patterns often provide sub-optimal re-ordering of transform coefficients for progressive video in 8×4 and 4×8 inter-coded blocks, which hurts the efficiency of subsequent entropy coding. Since the scan pattern affects block-level operations, and since each video picture can include hundreds or thousands of blocks, even a small change in efficiency can dramatically affect overall compression results.
Similarly, the scan patterns are not particularly well suited for common configurations of non-zero transform coefficients in inter-coded blocks of interlaced video. Again, this hurts the efficiency of subsequent entropy coding and can dramatically affect overall compression results.
III. Video Codec Standards
Various standards specify aspects of video decoders as well as formats for compressed video information. These standards include H.261, MPEG-1, H.262 (also called MPEG-2), H.263, and MPEG-4. Directly or by implication, these standards may specify certain encoder details, but other encoder details are not specified. Different standards incorporate different techniques, but each standard typically specifies one or more scan patterns for transform coefficients as briefly discussed below. For additional detail, see the respective standards.
A. Scan Patterns in the H.261 and MPEG Standards
The H.261 standard describes a transmission order for transform coefficients in an 8×8 block (compressed using intra-compression or motion compensation). The transform coefficients are run/level coded in the transmission order shown in FIG. 8. The transmission order proceeds from the low frequency coefficient at the upper left of the block in a neutral, zigzag pattern down to the highest frequency coefficient at the bottom right of the block.
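A neutral zigzag order of this kind can be generated programmatically, as in the sketch below. It produces the conventional zigzag traversal of anti-diagonals; it is offered as an illustration and has not been checked entry by entry against FIG. 8.

```python
def zigzag_order(rows, cols):
    """Return (row, col) positions in a conventional zigzag order."""
    order = []
    for s in range(rows + cols - 1):   # s = row + col indexes one anti-diagonal
        diagonal = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        # Alternate the traversal direction on successive anti-diagonals.
        order.extend(diagonal if s % 2 else reversed(diagonal))
    return order

order = zigzag_order(8, 8)
print(order[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(order[-1])   # (7, 7)
```

The order starts at the upper left (lowest frequency) position, ends at the bottom right (highest frequency) position, and treats rows and columns symmetrically.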
The scan pattern described in the MPEG standard is basically the same as the scan pattern shown in FIG. 8. AC coefficients of intra-coded blocks are processed according to the scan pattern. The DC coefficient and AC coefficients of inter-coded blocks are processed according to the scan pattern.
The scan pattern shown in FIG. 8 is neutral in that neither rows nor columns are favored over the other in the ordering. The effectiveness of the scan pattern is limited in that it does not work for block sizes other than 8×8 and is not well suited for interlaced video.
B. Scan Patterns in the H.262 Standard
The H.262 standard describes two different 8×8 scan patterns. The syntax element “alternate_scan” (signaled at picture layer in the bitstream) indicates which of the two scan patterns to use for a picture.
The first H.262 scan pattern is the scan pattern described in the MPEG-1 standard, which is a neutral, zigzag scan pattern. FIG. 9 shows the other H.262 scan pattern. The scan pattern shown in FIG. 9 is biased in the vertical direction in that columns are favored over rows in the ordering. For example, columns are scanned earlier in the order such that the first column finishes before the first row, the second column finishes before the second row, etc. The scan patterns still do not work for block sizes other than 8×8, however. Moreover, the use of an additional bit per picture just for this purpose adds to bit rate and encoder/decoder complexity.
C. Scan Patterns in the H.263 and MPEG-4 Standards
The H.263 standard describes three different 8×8 scan patterns. The type of prediction used for DC and AC coefficients for a block indicates which scan pattern to use. If no AC prediction is used for an intra-coded block, and for all non-intra-coded blocks, the neutral, zigzag scan pattern shown in FIG. 8 is chosen. If the vertically adjacent block is used to predict the DC coefficient and top row of AC coefficients of the current intra-coded block, the scanning pattern shown in FIG. 10 (H.263 alternate horizontal scan) is chosen to scan the stronger, horizontal frequencies prior to the vertical ones. On the other hand, if the horizontally adjacent block is used to predict the DC coefficient and left column of AC coefficients of the current intra-coded block, a scanning pattern like the one shown in FIG. 9 is chosen to scan the stronger, vertical frequencies prior to the horizontal ones. (The H.263 alternate vertical scan has the same pattern as the alternate scan in H.262, with the numbers increased by 1 throughout.)
Similarly, the MPEG-4 standard describes a neutral scan pattern, an alternate horizontal scan pattern, and an alternate vertical scan pattern, where the scan pattern used depends on whether or not the block is intra-coded and the prediction direction.
As with various previous standards, the H.263 and MPEG-4 scan patterns do not work for block sizes other than 8×8.
D. Scan Patterns in the Drafts of the H.264 Standard
Draft JVT-D157 of the H.264 standard describes two sets of four scan patterns—a first set of four neutral, zigzag scan patterns and a second set of four field scan patterns. By default, the 4×4 scan pattern shown in FIG. 11A is used for 4×4 partitions of blocks. The 4×4 scan pattern is a neutral, zigzag pattern that biases neither the horizontal nor vertical direction.
If adaptive block size transforms are used, however, additional scan patterns are used for different block sizes. With adaptive block size transforms, transform block sizes of 4×8, 8×4, and 8×8 (in addition to 4×4) are available for luminance motion compensation residual information. Draft JVT-D157 describes decoding of either progressive frames or interlaced frames, which may be mixed together in the same video sequence. For blocks encoded/decoded in frame mode (e.g., for progressive frames), the zigzag scan patterns shown in FIGS. 8 and 11A through 11C are used for the respective block sizes. For blocks encoded/decoded in field mode (e.g., for interlaced fields), the field scan patterns shown in FIGS. 11D through 11G are used for the respective block sizes.
While the scan patterns in JVT-D157 provide good performance in many scenarios, there are opportunities for improvement. In particular, the 8×4 and 4×8 zigzag scan patterns are not particularly well suited for many common configurations of non-zero transform coefficients for progressive video in 8×4 and 4×8 inter-coded blocks, which hurts the efficiency of subsequent entropy coding. Moreover, the 8×8, 8×4, and 4×4 field scan patterns, while somewhat vertically biased, are not biased aggressively enough in the vertical direction to be effective for inter-coded blocks of interlaced video fields. The 4×8 field scan pattern is vertically biased, but fails to account for common configurations of non-zero transform coefficients in 4×8 inter-coded blocks of interlaced video fields. Each of these things hurts the efficiency of subsequent entropy coding and adversely affects overall compression results.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.