Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 pictures per second. Each picture can include tens or hundreds of thousands of pixels (also called pels). Each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel with 24 bits or more. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence can be 5 million bits/second or more.
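The bit-rate arithmetic above can be illustrated with a short calculation. The frame size and rates below are illustrative assumptions, not figures from the text:

```python
def raw_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Bits per second for an uncompressed digital video sequence."""
    return width * height * bits_per_pixel * frames_per_second

# Illustrative: a modest 320x240 sequence at 15 pictures/second,
# 24 bits/pixel.
# 320 * 240 = 76,800 pixels; 76,800 * 24 = 1,843,200 bits/picture;
# 1,843,200 * 15 = 27,648,000 bits/second -- well over 5 million bits/second.
rate = raw_bit_rate(320, 240, 24, 15)
```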
Most computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression can be lossless, in which case the quality of the video does not suffer, but decreases in bit rate are limited by the complexity of the video. Or, compression can be lossy, in which case the quality of the video suffers, but decreases in bit rate are more dramatic. Decompression reverses compression.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. For progressively scanned video frames, intra compression techniques compress individual pictures, typically called I-frames or key frames. Inter compression techniques compress frames with reference to preceding and/or following frames, and inter-compressed frames are typically called predicted frames, P-frames, or B-frames.
I. Inter and Intra Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
A. Intra Compression
FIG. 1 illustrates block-based intra compression 100 of a block 105 of pixels in a key frame in the WMV8 encoder. A block is a set of pixels, for example, an 8×8 arrangement of pixels. The WMV8 encoder splits a key video frame into 8×8 blocks of pixels and applies an 8×8 Discrete Cosine Transform [“DCT”] 110 to individual blocks such as the block 105. A DCT is a type of frequency transform that converts the 8×8 block of pixels (spatial information) into an 8×8 block of DCT coefficients 115, which are frequency information. The DCT operation itself is lossless or nearly lossless. Compared to the original pixel values, however, the DCT coefficients are more efficient for the encoder to compress since most of the significant information is concentrated in low frequency coefficients (conventionally, the upper left of the block 115) and many of the high frequency coefficients (conventionally, the lower right of the block 115) have values of zero or close to zero.
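A textbook 2-D DCT-II can be sketched as follows. This floating-point form is for illustration only; it is not the exact transform used by WMV8:

```python
import math

def dct_2d(block):
    # Textbook 2-D DCT-II of an NxN block of samples: spatial information
    # in, frequency information out. Not the exact WMV8 transform.
    n = len(block)

    def alpha(k):
        # Normalization factor for the DC (k == 0) vs. AC basis functions.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

# A flat (constant) block concentrates all of its energy in the DC
# coefficient out[0][0]; every AC coefficient is numerically zero.
coeffs = dct_2d([[10] * 4 for _ in range(4)])
```

The constant-block example shows why the DCT helps compression: spatially smooth content collapses into a few significant low-frequency coefficients.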
The encoder then quantizes 120 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 125. For example, the encoder applies a uniform, scalar quantization step size to each coefficient. Quantization is lossy. Since low frequency DCT coefficients tend to have higher values, quantization results in loss of precision but not complete loss of the information for the coefficients. On the other hand, since high frequency DCT coefficients tend to have values of zero or close to zero, quantization of the high frequency coefficients typically results in contiguous regions of zero values. In addition, in some cases high frequency DCT coefficients are quantized more coarsely than low frequency DCT coefficients, resulting in greater loss of precision/information for the high frequency DCT coefficients.
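Uniform scalar quantization can be sketched as follows. The truncate-toward-zero model is an assumption for illustration and does not reproduce the actual WMV8 quantizer:

```python
def quantize(block, step):
    # Uniform scalar quantization: divide each coefficient by the step
    # size, truncating toward zero. Small (typically high-frequency)
    # coefficients collapse to zero, which is what makes the block easy
    # to entropy code afterward.
    return [[int(c / step) for c in row] for row in block]

def dequantize(block, step):
    # Reconstruction multiplies back by the step size; the precision lost
    # in quantize() is what makes quantization lossy.
    return [[c * step for c in row] for row in block]

# quantize([[100, 3], [-2, 40]], 8) -> [[12, 0], [0, 5]]
# Dequantizing gives [[96, 0], [0, 40]], not the original values.
```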
The encoder then prepares the 8×8 block of quantized DCT coefficients 125 for entropy encoding, which is a form of lossless compression. The exact type of entropy encoding can vary depending on whether a coefficient is a DC coefficient (lowest frequency), an AC coefficient (other frequencies) in the top row or left column, or another AC coefficient.
The encoder encodes the DC coefficient 126 as a differential from the DC coefficient 136 of a neighboring 8×8 block, which is a previously encoded neighbor (e.g., top or left) of the block being encoded. (FIG. 1 shows a neighbor block 135 that is situated to the left of the block being encoded in the frame.) The encoder entropy encodes 140 the differential.
The entropy encoder can encode the left column or top row of AC coefficients as a differential from a corresponding column or row of the neighboring 8×8 block. FIG. 1 shows the left column 127 of AC coefficients encoded as a differential 147 from the left column 137 of the neighboring (to the left) block 135. The differential coding increases the chance that the differential coefficients have zero values. The remaining AC coefficients are from the block 125 of quantized DCT coefficients.
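The DC and left-column differential prediction can be sketched as follows (8×8 blocks represented as nested lists; the helper names are assumptions):

```python
def dc_differential(dc_current, dc_neighbor):
    # DC coefficient coded as a differential from a previously encoded
    # neighboring (e.g., left or top) block's DC coefficient.
    return dc_current - dc_neighbor

def left_column_differentials(block, left_neighbor):
    # AC coefficients in the left column (rows 1..7; row 0 holds the DC
    # coefficient) coded as differentials from the corresponding column
    # of the left neighbor. Similar columns yield near-zero differentials,
    # which increases the chance of zero values and compresses well.
    return [block[r][0] - left_neighbor[r][0] for r in range(1, 8)]
```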
The encoder scans 150 the 8×8 block 145 of predicted, quantized AC DCT coefficients into a one-dimensional array 155 and then entropy encodes the scanned AC coefficients using a variation of run length coding 160. The encoder selects an entropy code from one or more run/level/last tables 165 and outputs the entropy code.
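The scan and run/level/last stages can be sketched as follows. This is a simplified model; the actual WMV8 scan patterns and entropy code tables are not reproduced here:

```python
def zigzag_scan(block):
    # Order coefficients roughly from lowest to highest frequency by
    # walking anti-diagonals of the block (a classic zigzag pattern).
    n = len(block)
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1],
                                   rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))
    return [block[r][c] for r, c in order]

def run_level_last(scanned):
    # Emit (run-of-zeros, nonzero level, is-last) triples. Trailing zeros
    # after the final nonzero coefficient produce no triple at all, which
    # is the point of the "last" flag.
    nonzero_positions = [i for i, v in enumerate(scanned) if v != 0]
    triples, run = [], 0
    for i, v in enumerate(scanned):
        if v == 0:
            run += 1
        else:
            triples.append((run, v, i == nonzero_positions[-1]))
            run = 0
    return triples
```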
B. Inter Compression
Inter compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 2 and 3 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 2 illustrates motion estimation for a predicted frame 210 and FIG. 3 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 2, the WMV8 encoder computes a motion vector for a macroblock 215 in the predicted frame 210. To compute the motion vector, the encoder searches in a search area 235 of a reference frame 230. Within the search area 235, the encoder compares the macroblock 215 from the predicted frame 210 to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock.
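Full-search motion estimation over a search window can be sketched as follows. Using the sum of absolute differences (SAD) as the matching metric is an assumption; the text does not specify the cost measure:

```python
def sad(block_a, block_b):
    # Sum of absolute differences between two equal-sized blocks.
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def motion_search(cur_block, ref_frame, top, left, search_range, size):
    # Full search over a square window of the reference frame centered at
    # (top, left); returns the (dy, dx) displacement with the lowest SAD.
    best, best_cost = (0, 0), float('inf')
    h, w = len(ref_frame), len(ref_frame[0])
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + size <= h and 0 <= x and x + size <= w:
                candidate = [row[x:x + size] for row in ref_frame[y:y + size]]
                cost = sad(cur_block, candidate)
                if cost < best_cost:
                    best, best_cost = (dy, dx), cost
    return best, best_cost
```

A real encoder typically prunes this search (e.g., with coarse-to-fine or early-exit strategies) rather than evaluating every candidate.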
The encoder can encode the motion vector as a differential relative to a motion vector predictor. After reconstructing the motion vector by adding the differential to the predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 215 using information from the reference frame 230, which is a previously reconstructed frame available at both the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 215 itself.
FIG. 3 illustrates an example of computation and encoding of an error block 335 in the WMV8 encoder. The error block 335 is the difference between the predicted block 315 and the original current block 325. The encoder applies a DCT 340 to the error block 335, resulting in an 8×8 block 345 of coefficients. The encoder then quantizes 350 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 355. The encoder scans 360 the 8×8 block 355 into a one-dimensional array 365 such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding 370. The encoder selects an entropy code from one or more run/level/last tables 375 and outputs the entropy code.
FIG. 4 shows an example of a corresponding decoding process 400 for an inter-coded block. In summary of FIG. 4, a decoder decodes (410, 420) entropy-coded information representing a prediction residual using variable length decoding 410 with one or more run/level/last tables 415 and run length decoding 420. The decoder inverse scans 430 a one-dimensional array 425 storing the entropy-decoded information into a two-dimensional block 435. The decoder inverse quantizes and inverse discrete cosine transforms (together, 440) the data, resulting in a reconstructed error block 445. In a separate motion compensation path, the decoder computes a predicted block 465 using motion vector information 455 for displacement from a reference frame. The decoder combines 470 the predicted block 465 with the reconstructed error block 445 to form the reconstructed block 475.
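The final combination step (470) can be sketched as follows; the clamp to the 8-bit sample range is an assumption:

```python
def reconstruct_block(predicted, error):
    # Add the reconstructed error block to the motion-compensated
    # prediction, clamping each sample to the 8-bit range [0, 255].
    return [[max(0, min(255, p + e)) for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(predicted, error)]
```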
The amount of change between the original and reconstructed frames is the distortion, and the number of bits required to code the frame indicates the rate for the frame. The amount of distortion is roughly inversely proportional to the rate.
II. Interlaced Video and Progressive Video
A video frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. A progressive I-frame is an intra-coded progressive video frame. A progressive P-frame is a progressive video frame coded using forward prediction, and a progressive B-frame is a progressive video frame coded using bi-directional prediction.
A typical interlaced video frame consists of two fields scanned starting at different times. For example, referring to FIG. 5, an interlaced video frame 500 includes top field 510 and bottom field 520. Typically, the even-numbered lines (top field) are scanned starting at one time (e.g., time t) and the odd-numbered lines (bottom field) are scanned starting at a different (typically later) time (e.g., time t+1). Because the two fields are scanned starting at different times, jagged tooth-like features can appear in regions of an interlaced video frame where motion is present. For this reason, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures for reduction of such jagged edge artifacts. On the other hand, in stationary regions, image detail in the interlaced video frame may be preserved more efficiently without such a rearrangement. Accordingly, frame coding, in which the original alternating field line arrangement is preserved, is often used for stationary or low-motion interlaced video frames.
A previous WMV encoder and decoder use macroblocks that are arranged according to a field structure (field-coded macroblocks) or a frame structure (frame-coded macroblocks) in interlaced video frames. FIG. 6 shows how field permuting is used to produce field-coded macroblocks in the encoder and decoder. An interlaced macroblock 610 is permuted such that all the top field lines (e.g., even-numbered lines 0, 2, . . . 14) are placed in the top half of the field-coded macroblock 620, and all the bottom field lines (e.g., odd-numbered lines 1, 3, . . . 15) are placed in the bottom half of the field-coded macroblock. For a frame-coded macroblock, the top field lines and bottom field lines alternate throughout the macroblock, as in interlaced macroblock 610.
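The field permutation of FIG. 6 can be sketched as follows, with each line of the 16-line macroblock represented as a list of samples:

```python
def field_permute(macroblock):
    # Gather the even-numbered (top-field) lines into the top half and the
    # odd-numbered (bottom-field) lines into the bottom half, producing a
    # field-coded macroblock.
    return macroblock[0::2] + macroblock[1::2]

def field_unpermute(permuted):
    # Inverse permutation: re-interleave top-field and bottom-field lines
    # to restore the original alternating (frame) arrangement.
    half = len(permuted) // 2
    out = []
    for top_line, bottom_line in zip(permuted[:half], permuted[half:]):
        out.append(top_line)
        out.append(bottom_line)
    return out
```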
The encoder and decoder use a 4:1:1 macroblock format in interlaced frames. A 4:1:1 macroblock is composed of four 8×8 luminance blocks and two 4×8 blocks of each chrominance channel. In a field-coded 4:1:1 macroblock, the permuted macroblock is subdivided such that the top two 8×8 luminance blocks and the top 4×8 chrominance block in each chrominance channel contain only top field lines, while the bottom two 8×8 luminance blocks and the bottom 4×8 chrominance block in each chrominance channel contain only bottom field lines.
A typical progressive video frame consists of one frame of content with non-alternating lines. In contrast to interlaced video, progressive video does not divide video frames into separate fields, and an entire frame is scanned left to right, top to bottom starting at a single time.
III. Loop Filtering in a Previous WMV Encoder and Decoder
Quantization and other lossy processing of prediction residuals can cause blocking artifacts at block boundaries. Blocking artifacts can be especially troublesome in reference frames that are used for motion estimation and compensation of subsequent predicted frames. To reduce blocking artifacts, a previous WMV video encoder and decoder use a deblocking filter to smooth boundary discontinuities between 8×8 blocks in motion estimation/compensation loops. For example, a video encoder processes a reconstructed reference frame to reduce blocking artifacts prior to motion estimation/compensation using the reference frame, and a video decoder processes a reconstructed reference frame to reduce blocking artifacts prior to motion compensation using the reference frame. The deblocking filter improves the quality of motion estimation/compensation, resulting in better prediction and lower bitrate for prediction residuals.
A. In-loop Deblocking Filtering for Progressive Frames
The encoder and decoder perform in-loop deblocking filtering for progressive frames prior to using a reconstructed frame as a reference for motion estimation/compensation. The filtering process operates on pixels (or more precisely, on samples at pixel locations) that border neighboring blocks. The locations of block boundaries depend on the size of the inverse transform used. For progressive P-frames the block boundaries may occur at every 4th or 8th pixel row or column depending on whether an 8×8, 8×4 or 4×8 inverse transform is used. For progressive I-frames, where an 8×8 transform is used, block boundaries occur at every 8th pixel row and column.
1. Progressive I-Frame In-Loop Deblocking Filtering
For progressive I-frames, deblocking filtering is performed adaptively at all 8×8 block boundaries. FIGS. 7 and 8 show the pixels that are filtered along the horizontal and vertical border regions in the upper left corner of a component (luma, Cb or Cr) plane. FIG. 7 shows filtered vertical block boundary pixels in an I-frame. FIG. 8 shows filtered horizontal block boundary pixels in an I-frame.
In FIGS. 7 and 8, crosses represent pixels (actually samples for pixels) and circled crosses represent filtered pixels. As these figures show, the top horizontal line and first vertical line in the frame are not filtered, even though they lie on a block boundary, because these lines lie on the border of the frame. Although not depicted, the bottom horizontal line and last vertical line in the frame also are not filtered for the same reason. In more formal terms, the following lines are filtered:

Horizontal lines: (7, 8), (15, 16) . . . ((N−1)*8−1, (N−1)*8)
Vertical lines: (7, 8), (15, 16) . . . ((M−1)*8−1, (M−1)*8)
(N=number of horizontal 8×8 blocks in the plane (N*8=horizontal frame size))
(M=number of vertical 8×8 blocks in the frame (M*8=vertical frame size))

For progressive I-frames, all horizontal boundary lines in the frame are filtered first, followed by the vertical boundary lines.
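The filtered line pairs listed above can be generated as:

```python
def filtered_line_pairs(num_blocks):
    # Pairs of adjacent line indices straddling each internal 8x8 block
    # boundary. Frame-border lines are excluded, so the pairs run from
    # (7, 8) up to ((num_blocks - 1) * 8 - 1, (num_blocks - 1) * 8).
    return [(8 * k - 1, 8 * k) for k in range(1, num_blocks)]
```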
2. Progressive P-frame In-loop Deblocking Filtering
For progressive P-frames, blocks can be intra-coded or inter-coded. The encoder and decoder use an 8×8 transform to transform the samples in intra-coded blocks, and the 8×8 block boundaries are always adaptively filtered. The encoder and decoder use an 8×8, 8×4, 4×8 or 4×4 transform for inter-coded blocks and a corresponding inverse transform to construct the samples that represent the residual error. Depending on the status of the neighboring blocks, the boundary between the current and neighboring blocks may or may not be adaptively filtered. The boundaries between coded (at least one non-zero coefficient) subblocks (8×4, 4×8 or 4×4) within an 8×8 block are always adaptively filtered. The boundary between a block or subblock and a neighboring block or subblock is not filtered only if both blocks are inter-coded, have the same motion vector, and have no residual error (no transform coefficients); otherwise, the boundary is filtered.
FIG. 9 shows examples of when filtering between neighboring blocks does and does not occur in progressive P-frames. In FIG. 9, it is assumed that the motion vectors for both blocks are the same (if the motion vectors are different, the boundary is always filtered). The shaded blocks or subblocks represent the cases where at least one nonzero coefficient is present. Clear blocks or subblocks represent cases where no transform coefficients are present. Thick lines represent the boundaries that are adaptively filtered. Thin lines represent the boundaries that are not filtered. FIG. 9 illustrates only horizontal macroblock neighbors, but a previous WMV encoder and decoder apply similar rules to vertical neighbors.
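The boundary-filtering decision described above can be sketched as a predicate; the dictionary representation of a block's coding state is an assumption for illustration:

```python
def filter_p_frame_boundary(cur, neighbor):
    # Each block is a dict with keys: 'intra' (True if intra-coded),
    # 'mv' (motion vector tuple, ignored for intra blocks), and 'coded'
    # (True if the block has at least one nonzero transform coefficient).
    # The boundary is NOT filtered only when both blocks are inter-coded,
    # share the same motion vector, and have no residual coefficients.
    skip = (not cur['intra'] and not neighbor['intra']
            and cur['mv'] == neighbor['mv']
            and not cur['coded'] and not neighbor['coded'])
    return not skip
```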
FIGS. 10 and 11 show an example of pixels that may be filtered in a progressive P-frame. The crosses represent pixel locations and the circled crosses represent the boundary pixels that are adaptively filtered if the conditions specified above are met. FIG. 10 shows pixels filtered along horizontal boundaries. As FIG. 10 shows, the pixels on either side of the block or subblock boundary are candidates to be filtered. For the horizontal boundaries, this could be every 4th and 5th, 8th and 9th, 12th and 13th, etc., pixel row in the frame. FIG. 11 shows candidate pixels to be filtered along vertical boundaries. For the vertical boundaries, every 4th and 5th, 8th and 9th, 12th and 13th, etc., pixel column in the frame may be adaptively filtered as these are the 8×8 and 4×8 vertical boundaries. The first and last row and the first and last column in the frame are not filtered.
All the 8×8 block horizontal boundary lines in the frame are adaptively filtered first, starting from the top line. Next, all 8×4 block horizontal boundary lines in the frame are adaptively filtered starting from the top line. Next, all 8×8 block vertical boundary lines are adaptively filtered starting from the leftmost line. Lastly, all 4×8 block vertical boundary lines are adaptively filtered starting with the leftmost line. The rules specified above are used to determine whether the boundary pixels are actually filtered for each block or subblock.
3. Filtering Operations
For progressive P-frames the decision criteria described above determine which vertical and horizontal boundaries are adaptively filtered. Since the minimum number of consecutive pixels that are filtered in a row or column is four and the total number of pixels in a row or column is always a multiple of four, the filtering operation is performed on segments of four pixels.
For example, if the eight pixel pairs that make up the vertical boundary between two blocks are adaptively filtered, then the eight pixels are divided into two 4-pixel segments as shown in FIG. 12. In each 4-pixel segment, the third pixel pair is adaptively filtered first as indicated by the Xs in FIG. 12. The result of this adaptive filter operation determines whether the other three pixels in the segment are also filtered.
FIG. 13 shows the pixels that are used in the adaptive filtering operation performed on the 3rd pixel pair. In FIG. 13, pixels P4 and P5 are the pixel pair that may be changed in the filter operation.
The pseudo-code 1400 of FIG. 14 shows the adaptive filtering operation performed on the 3rd pixel pair in each segment. The value filter_other_3_pixels indicates whether the remaining three pixel pairs in the segment are also filtered. If filter_other_3_pixels=TRUE, then the other three pixel pairs are adaptively filtered. If filter_other_3_pixels=FALSE, then they are not filtered, and the adaptive filtering operation proceeds to the next 4-pixel segment. The pseudo-code 1500 of FIG. 15 shows the adaptive filtering operation that is performed on the 1st, 2nd and 4th pixel pairs if filter_other_3_pixels=TRUE. In pseudo-code 1400 and pseudo-code 1500, the variable PQUANT represents a quantization step size.
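The segment-wise control flow of FIGS. 12, 14, and 15 can be sketched as follows. The actual filter arithmetic (which depends on PQUANT) lives in the pseudo-code figures and is not reproduced here; filter_pair and decide_rest below are placeholder callables standing in for those operations:

```python
def filter_segments(boundary_pairs, filter_pair, decide_rest):
    # Control flow only: process a boundary in 4-pixel segments. The third
    # pair of each segment is adaptively filtered first; decide_rest then
    # inspects the result and returns whether the remaining three pairs of
    # that segment are also filtered (the filter_other_3_pixels decision).
    out = list(boundary_pairs)
    for start in range(0, len(out), 4):
        third = start + 2
        out[third] = filter_pair(out[third])
        if decide_rest(out[third]):
            for i in (start, start + 1, start + 3):
                out[i] = filter_pair(out[i])
    return out
```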
The filtering operations described above are similarly used for filtering horizontal boundary pixels.
B. In-loop Deblocking Filtering for Interlaced Frames
The encoder and decoder perform in-loop deblocking filtering across vertical boundaries in interlaced frames having a 4:1:1 macroblock format. For interlaced I- and P-frames, adaptive filtering can occur for pixels located immediately on the left and right of a vertical block boundary except for those located on the picture boundaries (i.e., the first and last column of the luminance and chrominance components). In FIG. 16, pixels (more precisely, samples) that are candidates for filtering in a typical 4:1:1 macroblock in the encoder and decoder are marked M or B, where M denotes boundary pixels located across macroblock boundaries and B denotes boundary pixels located within the macroblock.
The decision on whether to filter across a vertical boundary is made on a block-by-block basis. In a 4:1:1 frame-coded macroblock, each block contains eight consecutive alternating lines of the top and bottom fields in the macroblock. In a 4:1:1 field-coded macroblock, a block contains either eight top field lines or eight bottom field lines. The filtering decision is made eight lines at a time.
The decision to filter across a vertical block boundary depends on whether the current block and the left neighboring block are frame-coded or field-coded (field/frame type), whether they are intra-coded or inter-coded, and whether they have nonzero transform coefficients. In general, the vertical block boundary pixels are adaptively filtered unless the current block's field/frame type is the same as the left neighboring block's field/frame type, neither block is intra-coded, and neither block has nonzero transform coefficients, in which case the block boundary is not filtered. Chroma block boundaries are adaptively filtered if the corresponding luminance block boundaries are adaptively filtered. Horizontal boundaries are not filtered.
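The vertical-boundary decision for interlaced frames can likewise be sketched as a predicate (the dictionary representation of a block's coding state is an assumption):

```python
def filter_interlaced_vertical_boundary(cur, left):
    # Each block is a dict with keys: 'field_coded' (field/frame type),
    # 'intra' (True if intra-coded), and 'coded' (True if the block has
    # nonzero transform coefficients). Filtering is skipped only when both
    # blocks have the same field/frame type, neither is intra-coded, and
    # neither has nonzero coefficients.
    skip = (cur['field_coded'] == left['field_coded']
            and not cur['intra'] and not left['intra']
            and not cur['coded'] and not left['coded'])
    return not skip
```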
Although the encoder and decoder adaptively filter block boundaries depending in part on the field/frame type of the neighboring blocks, they do not take transform size into account when making filtering decisions in interlaced frames.
IV. Standards for Video Compression and Decompression
Several international standards relate to video compression and decompression. These standards include the Moving Picture Experts Group [“MPEG”] 1, 2, and 4 standards and the H.261, H.262 (another designation for MPEG 2 video), H.263 and H.264 (also called JVT/AVC) standards from the International Telecommunication Union [“ITU”]. These standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, they also specify certain encoder details, but other encoder details are not specified. These standards use (or support the use of) different combinations of intraframe and interframe compression and decompression.
A. Loop Filtering in the Standards
As in the previous WMV encoders and decoders, some international standards use deblocking filters to reduce the effect of blocking artifacts in reconstructed frames. The H.263 standard includes an optional deblocking filter mode in which a filter is applied across 8×8 block edge boundaries of decoded I- and P-frames (but not B-frames) to reduce blocking artifacts. Annex J of the H.263 standard describes an optional block edge filter within the coding loop in which filtering is performed on 8×8 block edges (referred to in H.263 as a deblocking edge filter). This filter affects the reconstructed pictures used for prediction of other pictures. The deblocking edge filter operates using a set of four clipped pixel values on a horizontal and/or vertical line, where two of the four values are in one block (e.g., the top block among neighboring top and bottom blocks) and the other two values are in another block (e.g., the bottom block among neighboring top and bottom blocks). Filtering across horizontal edges is performed before filtering across vertical edges to reduce rounding effects. This optional filtering mode can be signaled in the bitstream with a single bit in a field of a picture header.
According to draft JVT-d157 of the JVT/AVC video standard, deblocking filtering is performed on a macroblock basis. In interlaced frames, macroblocks are grouped into macroblock pairs (top and bottom). Macroblock pairs can be field-coded or frame-coded. In a frame-coded macroblock pair, the macroblock pair is decoded as two frame-coded macroblocks. In a field-coded macroblock pair, the top macroblock consists of the top-field lines in the macroblock pair, and the bottom macroblock consists of the bottom-field lines in the macroblock pair.
Sections 8.7 and 12.4.4 of draft JVT-d157 describe deblocking filtering. For frame-coded macroblock pairs, deblocking is performed on the frame samples, and if a neighboring macroblock pair is a field macroblock pair, the neighboring field macroblock pair is converted into a frame macroblock pair before deblocking. For field-coded macroblock pairs, deblocking is performed on the field samples of the same field parity, and if a neighboring macroblock pair is a frame macroblock pair, it is converted into a field macroblock pair before deblocking. For field-coded pictures, all decoding operations for the deblocking filter are based solely on samples within the current field. For luma filtering in a 16×16 macroblock with 16 4×4 blocks, the 16 samples of the four vertical edges of the 4×4 raster scan pattern are filtered beginning with the left edge, and the four horizontal edges are filtered beginning with the top edge. For chroma filtering, two edges of eight samples each are filtered in each direction. For additional detail, see JVT-d157.
B. Limitations of the Standards
These international standards are limited in several important ways. For example, H.263 does not describe loop filtering for interlaced video. Draft JVT-d157 of the JVT/AVC video standard describes loop filtering only for macroblock pairs in interlaced video, and does not describe, for example, loop filtering for an individual field-coded macroblock having a top field and a bottom field within the same macroblock, or loop filtering decisions for blocks or sub-blocks larger than 4×4.
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.