With the increased popularity of DVDs, music delivery over the Internet, and digital cameras, digital media have become commonplace. Engineers use a variety of techniques to process digital audio, video, and images efficiently while still maintaining quality. To understand these techniques, it helps to understand how the audio, video, and image information is represented and processed in a computer.
I. Representation of Media Information in a Computer
A computer processes media information as a series of numbers representing that information. For example, a single number may represent the intensity of brightness or the intensity of a color component such as red, green or blue for each elementary small region of a picture, so that the digital representation of the picture consists of one or more arrays of such numbers. Each such number may be referred to as a sample. For a color image, it is conventional to use more than one sample to represent the color of each elemental region, and typically three samples are used. The set of these samples for an elemental region may be referred to as a pixel, where the word “pixel” is a contraction referring to the concept of a “picture element.” For example, one pixel may consist of three samples that represent the intensity of red, green and blue light necessary to represent the elemental region. Such a pixel type is referred to as an RGB pixel. Several factors affect quality of media information, including sample depth, resolution, and frame rate (for video).
Sample depth is a property normally measured in bits that indicates the range of numbers that can be used to represent a sample. When more values are possible for the sample, quality can be higher because the number can capture more subtle variations in intensity and/or a greater range of values. Resolution generally refers to the number of samples over some duration of time (for audio) or space (for images or individual video pictures). Images with higher resolution tend to look crisper than other images and contain more discernible useful details. Frame rate is a common term for temporal resolution for video. Video with higher frame rate tends to mimic the smooth motion of natural objects better than other video, and can similarly be considered to contain more detail in the temporal dimension. For all of these factors, the tradeoff for high quality is the cost of storing and transmitting the information in terms of the bit rate necessary to represent the sample depth, resolution and frame rate, as Table 1 shows.
TABLE 1. Bit rates for different quality levels of raw video

| Bits Per Pixel (sample depth times samples per pixel) | Resolution (in pixels, Width × Height) | Frame Rate (in frames per second) | Bit Rate (in millions of bits per second) |
|---|---|---|---|
| 8 (value 0-255, monochrome) | 160 × 120 | 7.5 | 1.2 |
| 24 (value 0-255, RGB) | 320 × 240 | 15 | 27.6 |
| 24 (value 0-255, RGB) | 640 × 480 | 30 | 221.2 |
| 24 (value 0-255, RGB) | 1280 × 720 | 60 | 1327.1 |
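The bit rates in Table 1 follow from multiplying the number of pixels per frame by the bits per pixel and the frame rate. The following minimal sketch reproduces that arithmetic; the function name raw_video_bit_rate is hypothetical and chosen only for illustration.

```python
def raw_video_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

# The four rows of Table 1: (bits per pixel, width, height, frame rate).
rows = [
    (8, 160, 120, 7.5),
    (24, 320, 240, 15),
    (24, 640, 480, 30),
    (24, 1280, 720, 60),
]
for bpp, w, h, fps in rows:
    mbps = raw_video_bit_rate(w, h, bpp, fps) / 1e6
    print(f"{w}x{h} @ {fps} fps, {bpp} bpp: {mbps:.1f} Mbit/s")
```

The printed values are 1.2, 27.6, 221.2 and 1327.1 million bits per second, matching the table.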
Despite the high bit rate necessary for storing and sending high quality video (such as HDTV), companies and consumers increasingly depend on computers to create, distribute, and play back high quality content. For this reason, engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital media. Compression decreases the cost of storing and transmitting the information by converting the information into a lower bit rate form. Compression can be lossless, in which case the quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Alternatively, compression can be lossy, in which case the quality of the video suffers but decreases in bit rate are more dramatic. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. For video frames, intra compression techniques compress individual frames, typically called I-frames or key frames. Inter compression techniques compress frames with reference to preceding and/or following frames, and inter-compressed frames are typically called predicted frames, P-frames, or B-frames.
II. Inter and Intra Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
A. Intra Compression
FIG. 1 illustrates block-based intra compression 100 of a block 105 of samples in a key frame in the WMV8 encoder. A block is a set of samples, for example, an 8×8 arrangement of samples. The WMV8 encoder splits a key video frame into 8×8 blocks and applies an 8×8 Discrete Cosine Transform [“DCT”] 110 to individual blocks such as the block 105. A DCT is a type of frequency transform that converts the 8×8 block of samples (spatial information) into an 8×8 block of DCT coefficients 115, which are frequency information. The DCT operation itself is lossless or nearly lossless. Compared to the original sample values, however, the DCT coefficients are more efficient for the encoder to compress since most of the significant information is concentrated in low frequency coefficients (conventionally, the upper left of the block 115) and many of the high frequency coefficients (conventionally, the lower right of the block 115) have values of zero or close to zero.
The encoder then quantizes 120 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 125. Quantization is lossy. Since low frequency DCT coefficients tend to have higher values, quantization typically results in loss of precision but not complete loss of the information for the coefficients. On the other hand, since high frequency DCT coefficients tend to have values of zero or close to zero, quantization of the high frequency coefficients typically results in contiguous regions of zero values. In addition, in some cases high frequency DCT coefficients are quantized more coarsely than low frequency DCT coefficients, resulting in greater loss of precision/information for the high frequency DCT coefficients.
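The effect of the frequency transform and quantization on a smooth block can be illustrated with a small sketch. It assumes a textbook floating-point orthonormal DCT-II and a single uniform step size with rounding; the text does not give the exact transform arithmetic or quantizer used by the WMV8 encoder, so dct_2d, quantize and the step value 8 are illustrative assumptions only.

```python
import math

N = 8

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (list of lists)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * total
    return out

def quantize(coeffs, step):
    """Uniform quantization: divide by the step size and round to an integer."""
    return [[int(round(value / step)) for value in row] for row in coeffs]

# A smooth block: a gentle horizontal ramp, typical of un-textured content.
block = [[100 + 2 * x for x in range(N)] for _ in range(N)]
quantized = quantize(dct_2d(block), step=8)
for row in quantized:
    print(row)
```

For this ramp, the DC coefficient quantizes to 107 and every row below the first quantizes to zero, because the block has no vertical variation. This is the kind of contiguous run of zeros that the later entropy coding stage exploits.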
The encoder then prepares the 8×8 block of quantized DCT coefficients 125 for entropy encoding, which is a form of lossless compression. The exact type of entropy encoding can vary depending on whether a coefficient is a DC coefficient (lowest frequency), an AC coefficient (other frequencies) in the top row or left column, or another AC coefficient.
The encoder encodes the DC coefficient 126 as a differential from the DC coefficient 136 of a neighboring 8×8 block, which is a previously encoded neighbor (e.g., top or left) of the block being encoded. (FIG. 1 shows a neighbor block 135 that is situated to the left of the block being encoded in the frame.) The encoder entropy encodes 140 the differential.
The entropy encoder can encode the left column or top row of AC coefficients as a differential from a corresponding left column or top row of the neighboring 8×8 block. This is an example of AC coefficient prediction. FIG. 1 shows the left column 127 of AC coefficients encoded as a differential 147 from the left column 137 of the neighboring block 135, which in this example is the block to the left. The differential coding increases the chance that the differential coefficients have zero values. The remaining AC coefficients are from the block 125 of quantized DCT coefficients.
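The DC and AC prediction described above can be sketched as a subtraction of the neighbor's first column, followed by the matching addition in the decoder. The sketch assumes prediction from the left neighbor only, as in FIG. 1; the function names and the sample coefficient values are hypothetical.

```python
def predict_from_left(current, left):
    """Return a copy of `current` (8x8 quantized coefficients) in which the DC
    coefficient (row 0, column 0) and the left-column AC coefficients (rows 1-7
    of column 0) are replaced by differences from the left neighbor."""
    diff = [row[:] for row in current]
    for r in range(8):
        diff[r][0] = current[r][0] - left[r][0]
    return diff

def undo_prediction(diff, left):
    """Decoder side: add the neighbor's first column back."""
    rec = [row[:] for row in diff]
    for r in range(8):
        rec[r][0] = diff[r][0] + left[r][0]
    return rec

# Hypothetical neighboring blocks with similar content.
left = [[0] * 8 for _ in range(8)]
left[0][0], left[1][0], left[2][0] = 107, -6, 2
current = [[0] * 8 for _ in range(8)]
current[0][0], current[1][0], current[2][0], current[0][1] = 109, -6, 2, 3

diff = predict_from_left(current, left)
print([diff[r][0] for r in range(8)])   # first column after prediction
```

In this example the first column shrinks from (109, -6, 2, 0, ...) to (2, 0, 0, 0, ...), while the coefficient at row 0, column 1 is left unchanged.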
The encoder scans 150 the 8×8 block 145 of quantized AC DCT coefficients into a one-dimensional array 155 and then entropy encodes the scanned AC coefficients using a variation of run length coding 160. The encoder selects an entropy code from one or more run/level/last tables 165 and outputs the entropy code.
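The scan and run length steps can be sketched as follows. The text does not name the scan pattern, so the classic zigzag order is assumed here, and the selection of an actual code from a run/level/last table is omitted; the sketch only forms the (run, level, last) triples that such a table would be indexed by. All function names are hypothetical.

```python
def zigzag_order(n=8):
    """Positions (row, col) of an n x n block in zigzag order (an assumption;
    other scan patterns are possible)."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )

def scan(block):
    """Flatten a 2-D block into a 1-D list following the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

def run_level_last(coeffs):
    """Turn a 1-D coefficient list into (run, level, last) triples, where
    run = number of zeros preceding a nonzero coefficient,
    level = the value of that coefficient,
    last = True only for the final nonzero coefficient."""
    nonzero = [i for i, v in enumerate(coeffs) if v != 0]
    triples = []
    previous = -1
    for i in nonzero:
        triples.append((i - previous - 1, coeffs[i], i == nonzero[-1]))
        previous = i
    return triples

block = [[0] * 8 for _ in range(8)]
block[0][1], block[1][0], block[2][2] = 3, -2, 1

ac = scan(block)[1:]          # skip the DC coefficient, which is coded separately
triples = run_level_last(ac)
print(triples)                # [(0, 3, False), (0, -2, False), (9, 1, True)]
```

Three triples replace 63 AC coefficients here, which is why long runs of zeros after quantization matter for bit rate.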
B. Inter Compression
Inter compression in the WMV8 encoder uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIGS. 2 and 3 illustrate the block-based inter compression for a predicted frame in the WMV8 encoder. In particular, FIG. 2 illustrates motion estimation for a predicted frame 210 and FIG. 3 illustrates compression of a prediction residual for a motion-compensated block of a predicted frame.
For example, in FIG. 2, the WMV8 encoder computes a motion vector for a macroblock 215 in the predicted frame 210. To compute the motion vector, the encoder searches in a search area 235 of a reference frame 230. Within the search area 235, the encoder compares the macroblock 215 from the predicted frame 210 to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock. The motion vector is differentially coded with respect to a motion vector predictor.
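The search described above can be sketched as an exhaustive block-matching loop. The text does not say how the encoder measures a "good match" or how it orders the candidates, so this sketch assumes the sum of absolute differences (SAD) as the matching cost, integer-pixel displacements, and a full search; the 4×4 block size and the synthetic frames are chosen only to keep the example small.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur` at
    (cx, cy) and the block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size) for i in range(size)
    )

def motion_search(cur, ref, cx, cy, size, search_range):
    """Test every integer displacement within +/- search_range and return
    (best_dx, best_dy, best_cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best

# Synthetic 16x16 reference frame, and a current frame whose content has moved
# 2 samples right and 1 sample down relative to the reference.
ref = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(16)]
       for y in range(16)]

mv = motion_search(cur, ref, cx=6, cy=6, size=4, search_range=3)
print(mv)   # (-2, -1, 0): the matching block lies 2 left and 1 up in the reference
```

A cost of zero indicates a perfect match, which is unusual in real video; in practice the remaining difference is what the encoder codes as the residual.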
After reconstructing the motion vector by adding the differential to the motion vector predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 215 using information from the reference frame 230, which is a previously reconstructed frame available at the encoder and the decoder. The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 215 itself.
FIG. 3 illustrates an example of computation and encoding of an error block 335 in the WMV8 encoder. The error block 335 is the difference between the predicted block 315 and the original current block 325. The encoder applies a DCT 340 to the error block 335, resulting in an 8×8 block 345 of coefficients. The encoder then quantizes 350 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 355. The encoder scans 360 the 8×8 block 355 into a one-dimensional array 365 such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy encodes the scanned coefficients using a variation of run length coding 370. The encoder selects an entropy code from one or more run/level/last tables 375 and outputs the entropy code.
FIG. 4 shows an example of a corresponding decoding process 400 for an inter-coded block. In summary of FIG. 4, a decoder decodes (410, 420) entropy-coded information representing a prediction residual using variable length decoding 410 with one or more run/level/last tables 415 and run length decoding 420. The decoder inverse scans 430 a one-dimensional array 425, storing the entropy-decoded information into a two-dimensional block 435. The decoder inverse quantizes and inverse DCTs (together, 440) the data, resulting in a reconstructed error block 445. In a separate motion compensation path, the decoder computes a predicted block 465 using motion vector information 455 for displacement from a reference frame. The decoder combines 470 the predicted block 465 with the reconstructed error block 445 to form the reconstructed block 475. An encoder also performs the inverse quantization, inverse DCT, motion compensation and combining to reconstruct frames for use as reference frames.
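The residual path of FIG. 4 after entropy decoding can be sketched as inverse scan, inverse quantization, inverse DCT, and combination with the predicted block. The sketch assumes a zigzag scan, inverse quantization by simple multiplication with a step size, a floating-point orthonormal inverse DCT, and clamping to an 8-bit sample range; none of these details are specified in the text, and the function names are hypothetical.

```python
import math

N = 8

def zigzag_order(n=N):
    """Assumed scan order: classic zigzag over an n x n block."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )

def inverse_scan(coeffs):
    """Place a 1-D list of 64 values back into a 2-D block."""
    block = [[0] * N for _ in range(N)]
    for value, (r, c) in zip(coeffs, zigzag_order()):
        block[r][c] = value
    return block

def inverse_quantize(block, step):
    """Assumed reconstruction rule: index times step size."""
    return [[v * step for v in row] for row in block]

def idct_2d(coeffs):
    """Orthonormal 2-D inverse DCT (DCT-III) of an N x N coefficient block."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            total = 0.0
            for u in range(N):
                for v in range(N):
                    total += (c(u) * c(v) * coeffs[u][v]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[y][x] = total
    return out

def reconstruct(predicted, quantized_1d, step):
    """Combine the predicted block with the reconstructed error block."""
    error = idct_2d(inverse_quantize(inverse_scan(quantized_1d), step))
    return [[max(0, min(255, int(round(p + e)))) for p, e in zip(prow, erow)]
            for prow, erow in zip(predicted, error)]

predicted = [[100] * N for _ in range(N)]
quantized_1d = [2] + [0] * 63          # only a DC residual of index 2
block = reconstruct(predicted, quantized_1d, step=8)
print(block[0])                        # [102, 102, 102, 102, 102, 102, 102, 102]
```

With a step size of 8, the DC index 2 reconstructs to a coefficient of 16, which the inverse DCT spreads as +2 over every sample of the block.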
III. Lossy Compression and Quantization
The preceding section mentioned quantization, a mechanism for lossy compression, and entropy coding, a form of lossless compression. Lossless compression reduces the bit rate of information by removing redundancy from the information without any reduction in fidelity. For example, a series of ten consecutive pixels that are all exactly the same shade of red could be represented as a code for the particular shade of red and the number ten as a “run length” of consecutive pixels, and this series can be perfectly reconstructed by decompression from the code for the shade of red and the indicated number (ten) of consecutive pixels having that shade of red. Lossless compression techniques reduce bit rate at no cost to quality, but can only reduce bit rate up to a certain point. Decreases in bit rate are limited by the inherent amount of variability in the statistical characterization of the input data, which is referred to as the source entropy.
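The run length example above can be sketched directly. The pixel value (200, 0, 0) stands in for "a particular shade of red" and is an arbitrary choice; the function names are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

red = (200, 0, 0)
pixels = [red] * 10

encoded = run_length_encode(pixels)
print(encoded)                                # [((200, 0, 0), 10)]
print(run_length_decode(encoded) == pixels)   # True: perfect reconstruction
```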
In contrast, with lossy compression, the quality suffers somewhat but the achievable decrease in bit rate is more dramatic. For example, a series of ten pixels, each being a slightly different shade of red, can be approximated as ten pixels with exactly the same particular approximate red color. Lossy compression techniques can be used to reduce bit rate more than lossless compression techniques, but some of the reduction in bit rate is achieved by reducing quality, and the lost quality cannot be completely recovered. Lossy compression is often used in conjunction with lossless compression, in a system design in which the lossy compression establishes an approximation of the information and lossless compression techniques are applied to represent the approximation. For example, the series of ten pixels, each a slightly different shade of red, can be represented as a code for one particular shade of red and the number ten as a run-length of consecutive pixels. In decompression, the original series would then be reconstructed as ten pixels with the same approximated red color. In general, an encoder varies quantization to trade off quality and bit rate. Coarser quantization results in greater quality reduction but allows for greater bit rate reduction.
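The combination of a lossy approximation followed by lossless run length coding can be sketched as follows. The ten red-channel values and the step sizes are hypothetical; approximation by rounding to the nearest multiple of a step is one simple choice among many.

```python
from itertools import groupby

def approximate(values, step):
    """Lossy stage: replace each value with the nearest multiple of `step`."""
    return [step * round(v / step) for v in values]

def run_lengths(values):
    """Lossless stage: (value, run_length) pairs for consecutive equal values."""
    return [(v, len(list(group))) for v, group in groupby(values)]

# Ten slightly different shades of red (red channel only).
reds = [200, 201, 199, 202, 200, 198, 201, 200, 199, 202]

fine = run_lengths(approximate(reds, step=1))     # no approximation
coarse = run_lengths(approximate(reds, step=8))   # coarse approximation

print(len(fine))    # 10 runs: nothing to merge, and nothing lost
print(coarse)       # [(200, 10)]: one run, but the original shades are gone
```

Decoding the coarse result yields ten pixels of value 200, which approximates the original series without being able to recover it.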
According to one possible definition, quantization is a term used for an approximating non-reversible mapping function commonly used for lossy compression, in which there is a specified set of possible output values, and each member of the set of possible output values has an associated set of input values that result in the selection of that particular output value. A variety of quantization techniques have been developed, including scalar or vector, uniform or non-uniform, and adaptive or non-adaptive quantization.
A. Scalar Quantizers
According to one possible definition, a scalar quantizer is an approximating functional mapping x→Q[x] of an input value x to a quantized value Q[x], sometimes called a reconstructed value. FIG. 5 shows a “staircase” I/O function 500 for a scalar quantizer. The horizontal axis is a number line for a real number input variable x, and the vertical axis indicates the corresponding quantized values Q[x]. The number line is partitioned by thresholds such as the threshold 510. Each value of x within a given range between a pair of adjacent thresholds is assigned the same quantized value Q[x]. For example, each value of x within the range 520 is assigned the same quantized value 530. (At a threshold, one of the two possible quantized values is assigned to an input x, depending on the system.) Overall, the quantized values Q[x] exhibit a discontinuous, staircase pattern. The distance the mapping continues along the number line depends on the system, typically ending after a finite number of thresholds. The placement of the thresholds on the number line may be uniformly spaced (as shown in FIG. 5) or non-uniformly spaced.
A scalar quantizer can be decomposed into two distinct stages. The first stage is the classifier stage, in which a classifier function mapping x→A[x] maps an input x to a quantization index A[x], which is often integer-valued. In essence, the classifier segments an input number line or data set. FIG. 6A shows a generalized classifier 600 and thresholds for a scalar quantizer. As in FIG. 5, a number line for a real number variable x is segmented by thresholds such as the threshold 610. Each value of x within a given range such as the range 620 is assigned the same quantized value Q[x]. FIG. 6B shows a numerical example of a classifier 650 and thresholds for a scalar quantizer.
In the second stage, a reconstructor functional mapping k→β[k] maps each quantization index k to a reconstruction value β[k]. In essence, the reconstructor places steps having a particular height relative to the input number line segments (or selects a subset of data set values) for reconstruction of each region determined by the classifier. The reconstructor functional mapping may be implemented, for example, using a lookup table. Overall, the classifier relates to the reconstructor as follows:

Q[x]=β[A[x]]  (1).
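The two-stage decomposition of equation (1) can be sketched with a threshold list for the classifier and a lookup table for the reconstructor. The threshold and reconstruction values are hypothetical, the indices are numbered from 0 rather than centered on zero, and an input exactly at a threshold is assigned to the upper region, which is only one of the two possible conventions.

```python
import bisect

# Hypothetical, uniformly spaced thresholds partitioning the number line
# into five regions, and one reconstruction value beta[k] per region.
thresholds = [-1.5, -0.5, 0.5, 1.5]
reconstruction = [-2.0, -1.0, 0.0, 1.0, 2.0]

def classify(x):
    """Classifier A[x]: index of the region that contains x."""
    return bisect.bisect_right(thresholds, x)

def reconstruct(k):
    """Reconstructor beta[k]: lookup table from index to reconstruction value."""
    return reconstruction[k]

def quantize(x):
    """Q[x] = beta[A[x]], as in equation (1)."""
    return reconstruct(classify(x))

for x in (-7.0, 0.3, 0.7, 1.2):
    print(x, classify(x), quantize(x))
```

An encoder would transmit only the index classify(x); a decoder needs only the reconstruction table to recover quantize(x).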
In common usage, the term “quantization” is often used to describe the classifier stage, which is performed during encoding. The term “inverse quantization” is similarly used to describe the reconstructor stage, whether performed during encoding or decoding.
The distortion introduced by using such a quantizer may be computed with a difference-based distortion measure d(x−Q[x]). Typically, such a distortion measure has the property that d(x−Q[x]) increases as x−Q[x] deviates from zero; and typically each reconstruction value lies within the range of the corresponding classification region, so that the straight line that would be formed by the functional equation Q[x]=x will pass through every step of the staircase diagram (as shown in FIG. 5) and therefore Q[Q[x]] will typically be equal to Q[x]. In general, a quantizer is considered better in rate-distortion terms if the quantizer results in a lower average value of distortion than other quantizers for a given bit rate of output. More formally, a quantizer is considered better if, for a source random variable X, the expected (i.e., the average or statistical mean) value of the distortion measure D=E_X{d(X−Q[X])} is lower for an equal or lower entropy H of A[X]. The most commonly-used distortion measure is the squared error distortion measure, for which d(|x−y|)=|x−y|^2. When the squared error distortion measure is used, the expected value of the distortion measure (D) is referred to as the mean squared error.
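The mean squared error and the property Q[Q[x]]=Q[x] can be checked numerically for a simple case. The sketch assumes a uniform quantizer with mid-point reconstruction and evenly spaced test inputs standing in for a uniformly distributed source; both are illustrative assumptions, as are the function names.

```python
import math

def uniform_quantize(x, step):
    """Uniform quantizer with mid-point reconstruction: nearest multiple of step."""
    return step * math.floor(x / step + 0.5)

def mean_squared_error(samples, quantizer):
    """Average of the squared error distortion d = (x - Q[x])^2 over the samples."""
    return sum((x - quantizer(x)) ** 2 for x in samples) / len(samples)

# Evenly spaced inputs on [-4, 4).
samples = [i / 100 for i in range(-400, 400)]

mse_fine = mean_squared_error(samples, lambda x: uniform_quantize(x, 1.0))
mse_coarse = mean_squared_error(samples, lambda x: uniform_quantize(x, 2.0))
print(mse_fine, mse_coarse)

# Each reconstruction value lies inside its own region, so re-quantizing a
# reconstructed value leaves it unchanged.
idempotent = all(
    uniform_quantize(uniform_quantize(x, 1.0), 1.0) == uniform_quantize(x, 1.0)
    for x in samples
)
print(idempotent)   # True
```

For such inputs the mean squared error is close to the step size squared divided by 12, so doubling the step size roughly quadruples the distortion.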
B. Dead Zone+Uniform Threshold Quantizers
A non-uniform quantizer has threshold values that are not uniformly spaced for all classifier regions. According to one possible definition, a dead zone plus uniform threshold quantizer [“DZ+UTQ”] is a quantizer with uniformly spaced threshold values for all classifier regions except the one containing the zero input value (which is called the dead zone [“DZ”]). In a general sense, a DZ+UTQ is a non-uniform quantizer, since the DZ size is different from the size of the other classifier regions.
A DZ+UTQ has a classifier index mapping rule x→A[x] that can be expressed based on two parameters. FIG. 7 shows a staircase I/O function 700 for a DZ+UTQ, and FIG. 8A shows a generalized classifier 800 and thresholds for a DZ+UTQ. The parameter s, which is greater than 0, indicates the step size for all steps other than the DZ. Mathematically, all si are equal to s for i≠0. The parameter z, which is greater than or equal to 0, indicates the ratio of the DZ size to the size of the other steps. Mathematically, s0=z·s. In FIG. 8A, z is 2, so the DZ is twice as wide as the other classification zones. The index mapping rule x→A[x] for a DZ+UTQ can be expressed as:
$$A[x] = \operatorname{sign}(x) \cdot \max\left(0, \left\lfloor \frac{|x|}{s} - \frac{z}{2} + 1 \right\rfloor\right), \qquad (2)$$

where ⌊·⌋ denotes the largest integer less than or equal to the argument and where sign(x) is the function defined as:

$$\operatorname{sign}(x) = \begin{cases} +1, & \text{for } x \ge 0, \\ -1, & \text{for } x < 0. \end{cases} \qquad (3)$$
FIG. 8B shows a numerical example of a classifier 850 and thresholds for a DZ+UTQ with s=1 and z=2. FIGS. 5, 6A, and 6B show a special case DZ+UTQ with z=1. Quantizers of the UTQ form have good performance for a variety of statistical sources. In particular, the DZ+UTQ form is optimal for the statistical random variable source known as the Laplacian source.
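The index mapping rule of equation (2) translates directly into code. The sketch below evaluates it for the s=1, z=2 case and for the z=1 special case; the function names sign and dz_utq_index are hypothetical.

```python
import math

def sign(x):
    """Equation (3): +1 for x >= 0, -1 for x < 0."""
    return 1 if x >= 0 else -1

def dz_utq_index(x, s, z):
    """Equation (2): classifier index for a DZ+UTQ with step size s and
    dead zone ratio z."""
    return sign(x) * max(0, math.floor(abs(x) / s - z / 2 + 1))

inputs = (-2.3, -1.0, -0.99, 0, 0.99, 1.0, 2.3)

# s = 1, z = 2: the dead zone spans (-1, 1), twice the width of other regions.
print([dz_utq_index(x, 1, 2) for x in inputs])   # [-2, -1, 0, 0, 0, 1, 2]

# s = 1, z = 1: the zero region spans (-0.5, 0.5), the same width as the others.
print([dz_utq_index(x, 1, 1) for x in (0.49, 0.5, -0.6)])   # [0, 1, -1]
```

The dead zone thresholds fall at ±z·s/2, so increasing z maps more small-magnitude inputs to index 0.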
In some system designs (not shown), an additional consideration may be necessary to fully characterize a DZ+UTQ classification rule. For practical reasons there may be a need to limit the range of values that can result from the classification function A[x] to some reasonable finite range. This limitation is referred to as clipping. For example, in some such systems the classification rule could more precisely be defined as:
$$A[x] = \operatorname{sign}(x) \cdot \min\left[g, \max\left(0, \left\lfloor \frac{|x|}{s} - \frac{z}{2} + 1 \right\rfloor\right)\right], \qquad (4)$$

where g is a limit on the absolute value of A[x].
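The clipped classification rule of equation (4) differs from the unclipped rule only by the outer minimum. In the sketch below, the limit g=255 is a hypothetical value chosen for illustration, as is the function name.

```python
import math

def dz_utq_index_clipped(x, s, z, g):
    """Equation (4): DZ+UTQ classifier index with its magnitude limited to g."""
    magnitude = max(0, math.floor(abs(x) / s - z / 2 + 1))
    sign = 1 if x >= 0 else -1
    return sign * min(g, magnitude)

print(dz_utq_index_clipped(3.7, 1, 2, 255))      # 3: within range, unaffected
print(dz_utq_index_clipped(1000.0, 1, 2, 255))   # 255: clipped
print(dz_utq_index_clipped(-1000.0, 1, 2, 255))  # -255: clipped, sign preserved
```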
Different reconstruction rules may be used to determine the reconstruction value for each quantization index. Standards and product specifications that focus only on achieving interoperability will often specify reconstruction values without necessarily specifying the classification rule. In other words, some specifications may define the functional mapping k→β[k] without defining the functional mapping x→A[x]. This allows a decoder built to comply with the standard/specification to reconstruct information correctly. In contrast, encoders are often given the freedom to change the classifier in any way that they wish, while still complying with the standard/specification.
Numerous systems for adjusting quantization thresholds have been developed. Many standards and products specify reconstruction values that correspond to a typical mid-point reconstruction rule (e.g., for a typical simple classification rule) for the sake of simplicity. For classification, however, the thresholds can in fact be adjusted so that certain input values will be mapped to more common (and hence, lower bit rate) indices, which makes the reconstruction values closer to optimal.
In many systems, the extent of quantization is measured in terms of quantization step size. Coarser quantization uses larger quantization step sizes, corresponding to wider ranges of input values. Finer quantization uses smaller quantization step sizes. Often, for purposes of signaling and reconstruction, quantization step sizes are parameterized as multiples of a smallest quantization step size.
C. Quantization Artifacts
As mentioned above, lossy compression tends to cause a decrease in quality. For example, a series of ten samples of slightly different values can be approximated using quantization as ten samples with exactly the same particular approximate value. This kind of quantization can reduce the bit rate of encoding the series of ten samples, but at the cost of lost detail in the original ten samples.
In some cases, quantization produces visible artifacts that tend to be more artificial-looking and visually distracting than simple loss of fine detail. For example, smooth, un-textured content is susceptible to contouring artifacts—artifacts that appear between regions of two different quantization output values—because the human visual system is sensitive to subtle variations (particularly luma differences) in smooth content. Using the above example, consider a case where the luma values of the series of ten samples change gradually and consistently from the first sample to the tenth sample. Quantization may approximate the first five sample values as one value and the last five sample values as another value. While this kind of quantization may not create visible artifacts in textured areas due to masking effects, in smooth regions it can create a visible line or step in the reconstructed image between the two sets of five samples.
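The contouring example above can be reproduced numerically. The luma values 100 through 109, the step size of 5, and the quantizer (classification by integer division, reconstruction near the middle of each region) are hypothetical choices that happen to split the ten samples into two groups of five.

```python
def quantize_luma(value, step):
    """Classify by integer division, then reconstruct at the integer
    middle of the region."""
    return (value // step) * step + step // 2

luma = list(range(100, 110))                      # gradual ramp: 100, 101, ..., 109
reconstructed = [quantize_luma(v, 5) for v in luma]

original_steps = [b - a for a, b in zip(luma, luma[1:])]
recon_steps = [b - a for a, b in zip(reconstructed, reconstructed[1:])]

print(reconstructed)    # [102, 102, 102, 102, 102, 107, 107, 107, 107, 107]
print(original_steps)   # [1, 1, 1, 1, 1, 1, 1, 1, 1]
print(recon_steps)      # [0, 0, 0, 0, 5, 0, 0, 0, 0]
```

The smooth ramp of unit increments becomes two flat regions separated by a single jump of 5, which is the kind of step that can appear as a visible contour in smooth content.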
IV. Differential Quantization in VC-1
In differential quantization, an encoder varies quantization step sizes (also referred to herein as quantization parameters or QPs in some implementations) for different parts of a picture. Typically, this involves varying QPs on a macroblock level or other sub-picture level. The encoder makes decisions on how to vary the QPs, and signals those decisions, as appropriate, to a decoder.
For example, a VC-1 encoder optionally chooses differential quantization for compression. The encoder sends a bitstream element (DQUANT) at a syntax level above picture level to indicate whether or not the QP can vary among the macroblocks in individual pictures. The encoder sends a picture-level bitstream element, PQINDEX, to indicate a picture QP. If DQUANT=0, the QP indicated by PQINDEX is used for all macroblocks in the picture. If DQUANT=1 or 2, different macroblocks in the same picture can use different QPs.
The VC-1 encoder can use more than one approach to differential quantization. In one approach, only two different QPs are used for a picture. This is referred to as bi-level differential quantization. For example, one QP is used for macroblocks at picture edges and another QP is used for macroblocks in the rest of the picture. This can be useful for saving bits at picture edges, where fine detail is less important for maintaining overall visual quality. Or, a 1-bit value signaled per macroblock indicates which of two available QP values to use for the macroblock. In another approach, referred to as multi-level differential quantization, a larger number of different QPs can be used for individual macroblocks in a picture.
The encoder sends a picture-level bitstream element, VOPDQUANT, when DQUANT is non-zero. VOPDQUANT is composed of other elements, potentially including DQPROFILE, which indicates which parts of the picture can use QPs other than the picture QP. When DQPROFILE indicates that arbitrary, different macroblocks can use QPs other than the picture QP, the bitstream element DQBILEVEL is present. If DQBILEVEL=1, each macroblock uses one of two QPs (bi-level quantization). If DQBILEVEL=0, each macroblock can use any QP (multi-level quantization).
The bitstream element MQDIFF is sent at macroblock level to signal a 1-bit selector for a macroblock for bi-level quantization. For multi-level quantization, MQDIFF indicates a differential between the picture QP and the macroblock QP or escape-coded absolute QP for a macroblock.
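The signaling described above can be summarized as a decision function that a decoder might apply per macroblock once the bitstream elements have been parsed. This is only a sketch of the logic as described in the text, for the case in which DQPROFILE allows arbitrary macroblocks to vary, and it is not the VC-1 syntax itself. The function name, the parameters alternate_qp and absolute_qp, the assumption that the two bi-level QPs are the picture QP and one alternate QP, and the assumption that the differential is added to the picture QP are all hypothetical.

```python
def macroblock_qp(dquant, picture_qp, dqbilevel=0, mqdiff=0,
                  alternate_qp=None, absolute_qp=None):
    """Return the QP for one macroblock.

    dquant       -- value of DQUANT (0 means the QP cannot vary in a picture)
    picture_qp   -- QP indicated by PQINDEX
    dqbilevel    -- value of DQBILEVEL (1 = bi-level, 0 = multi-level)
    mqdiff       -- 1-bit selector (bi-level) or differential (multi-level)
    alternate_qp -- second QP for bi-level quantization (assumed parameter)
    absolute_qp  -- escape-coded absolute QP, if one was signaled (assumed)
    """
    if dquant == 0:
        return picture_qp                 # same QP for every macroblock
    if dqbilevel == 1:
        # Bi-level: the selector picks one of two QPs.
        return alternate_qp if mqdiff == 1 else picture_qp
    if absolute_qp is not None:
        return absolute_qp                # escape-coded absolute QP
    return picture_qp + mqdiff            # assumed sign convention

print(macroblock_qp(0, 8))                                   # 8
print(macroblock_qp(1, 8, dqbilevel=1, mqdiff=1, alternate_qp=12))   # 12
print(macroblock_qp(1, 8, dqbilevel=0, mqdiff=3))            # 11
print(macroblock_qp(1, 8, dqbilevel=0, absolute_qp=20))      # 20
```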
V. Other Standards and Products
Numerous international standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, these standards also specify certain encoder details, but other encoder details are not specified. Some standards address still image compression/decompression, and other standards address audio compression/decompression. Numerous companies have produced encoders and decoders for audio, still images, and video. Various other kinds of signals (for example, hyperspectral imagery, graphics, text, financial information, etc.) are also commonly represented and stored or transmitted using compression techniques.
Various video standards allow the use of different quantization step sizes for different picture types, and allow variation of quantization step sizes for rate and quality control.
Standards typically do not fully specify the quantizer design. Most allow some variation in the encoder classification rule x→A[x] and/or the decoder reconstruction rule k→β[k]. The use of a DZ ratio z=2 or greater has been implicit in a number of encoding designs. For example, the spacing of reconstruction values for predicted regions in some standards implies use of z≥2. Reconstruction values in these examples from standards are spaced appropriately for use of DZ+UTQ classification with z=2. Designs based on z=1 (or at least z<2) have been used for quantization in several standards. In these cases, reconstruction values are equally spaced around zero and away from zero.
Given the critical importance of video compression to digital video, it is not surprising that video compression is a richly developed field. Whatever the benefits of previous video compression techniques, however, they do not have the advantages of the following techniques and tools.