With the increased popularity of DVDs, music delivery over the Internet, and digital cameras, digital media have become commonplace. Engineers use a variety of techniques to process digital audio, video, and images efficiently while still maintaining quality. To understand these techniques, it helps to understand how the audio, video, and image information is represented and processed in a computer.
I. Representation of Media Information in a Computer
A computer processes media information as a series of numbers representing that information. For example, a single number may represent the intensity of brightness or the intensity of a color component such as red, green or blue for each elementary small region of a picture, so that the digital representation of the picture consists of one or more arrays of such numbers. Each such number may be referred to as a sample. For a color image, it is conventional to use more than one sample to represent the color of each elemental region, and typically three samples are used. The set of these samples for an elemental region may be referred to as a pixel, where the word “pixel” is a contraction referring to the concept of a “picture element.” For example, one pixel may consist of three samples that represent the intensity of red, green and blue light necessary to represent the elemental region. Such a pixel type is referred to as an RGB pixel. Several factors affect quality of media information, including sample depth, resolution, and frame rate (for video).
Sample depth is a property normally measured in bits that indicates the range of numbers that can be used to represent a sample. When more values are possible for the sample, quality can be higher because the number can capture more subtle variations in intensity and/or a greater range of values. Resolution generally refers to the number of samples over some duration of time (for audio) or space (for images or individual video pictures). Images with higher resolution tend to look crisper than other images and contain more discernable useful details. Frame rate is a common term for temporal resolution for video. Video with higher frame rate tends to mimic the smooth motion of natural objects better than other video, and can similarly be considered to contain more detail in the temporal dimension. For all of these factors, the tradeoff for high quality is the cost of storing and transmitting the information in terms of the bit rate necessary to represent the sample depth, resolution and frame rate, as Table 1 shows.
TABLE 1
Bit rates for different quality levels of raw video

Bits Per Pixel                  Resolution          Frame Rate     Bit Rate
(sample depth times             (in pixels,         (in frames     (in millions of
samples per pixel)              Width × Height)     per second)    bits per second)
------------------------------------------------------------------------------------
8 (value 0-255, monochrome)     160 × 120           7.5            1.2
24 (value 0-255, RGB)           320 × 240           15             27.6
24 (value 0-255, RGB)           640 × 480           30             221.2
24 (value 0-255, RGB)           1280 × 720          60             1327.1
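The bit rates in Table 1 follow from multiplying bits per pixel, picture dimensions, and frame rate. The following is a minimal sketch of that arithmetic; the function name `raw_bit_rate` is introduced here for illustration only.

```python
def raw_bit_rate(bits_per_pixel, width, height, frames_per_second):
    """Return the raw (uncompressed) bit rate in bits per second."""
    return bits_per_pixel * width * height * frames_per_second

# The four quality levels listed in Table 1.
rows = [
    (8, 160, 120, 7.5),
    (24, 320, 240, 15),
    (24, 640, 480, 30),
    (24, 1280, 720, 60),
]

for bpp, w, h, fps in rows:
    mbps = raw_bit_rate(bpp, w, h, fps) / 1e6
    print(f"{bpp:2d} bits/pixel, {w}x{h} at {fps} frames/s: {mbps:.1f} Mbit/s")
```

The first row evaluates to 1.152 million bits per second, which the table rounds to 1.2.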
Despite the high bit rate necessary for storing and sending high quality video (such as HDTV), companies and consumers increasingly depend on computers to create, distribute, and play back high quality content. For this reason, engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital media. Compression decreases the cost of storing and transmitting the information by converting the information into a lower bit rate form. Compression can be lossless, in which quality of the video does not suffer but decreases in bit rate are limited by the complexity of the video. Or, compression can be lossy, in which quality of the video suffers but decreases in bit rate are more dramatic. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
In general, video compression techniques include “intra” compression and “inter” or predictive compression. For video frames, intra compression techniques compress individual frames, typically called I-frames or key frames. Inter compression techniques compress frames with reference to preceding and/or following frames, and inter-compressed frames are typically called predicted frames, P-frames, or B-frames.
II. Intra and Inter Compression in Windows Media Video, Versions 8 and 9
Microsoft Corporation's Windows Media Video, Version 8 [“WMV8”] includes a video encoder and a video decoder. The WMV8 encoder uses intra and inter compression, and the WMV8 decoder uses intra and inter decompression. Windows Media Video, Version 9 [“WMV9”] uses a similar architecture for many operations.
A. Intra Compression
FIG. 1 illustrates block-based intra compression 100 of a block 105 of samples in a key frame in the WMV8 encoder. A block is a set of samples, for example, an 8×8 arrangement of samples. The WMV8 encoder splits a key video frame into 8×8 blocks and applies an 8×8 Discrete Cosine Transform [“DCT”] 110 to individual blocks such as the block 105. A DCT is a type of frequency transform that converts the 8×8 block of samples (spatial information) into an 8×8 block of DCT coefficients 115, which are frequency information. The DCT operation itself is lossless or nearly lossless. Compared to the original sample values, however, the DCT coefficients are more efficient for the encoder to compress since most of the significant information is concentrated in low frequency coefficients (conventionally, the upper left of the block 115) and many of the high frequency coefficients (conventionally, the lower right of the block 115) have values of zero or close to zero.
The encoder then quantizes 120 the DCT coefficients, resulting in an 8×8 block of quantized DCT coefficients 125. Quantization is lossy. Since low frequency DCT coefficients tend to have higher values, quantization results in loss of precision but not complete loss of the information for the coefficients. On the other hand, since high frequency DCT coefficients tend to have values of zero or close to zero, quantization of the high frequency coefficients typically results in contiguous regions of zero values. In addition, in some cases high frequency DCT coefficients are quantized more coarsely than low frequency DCT coefficients, resulting in greater loss of precision/information for the high frequency DCT coefficients.
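The effect of the transform and quantization steps can be illustrated with a small sketch. It assumes an orthonormal 8×8 DCT-II and a single uniform step size for all coefficients; these are illustrative choices, not the exact transform or quantizer of the WMV8 encoder, and the names `dct_2d` and `quantize` are hypothetical.

```python
import math

N = 8  # block size

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency
        for v in range(N):      # horizontal frequency
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out

def quantize(coeffs, step):
    """Assumed uniform quantization: divide by the step size and round."""
    return [[int(round(value / step)) for value in row] for row in coeffs]

# A smooth gradient block: intensity rises gently to the right and downward.
block = [[100 + 2 * x + y for x in range(N)] for y in range(N)]
coeffs = dct_2d(block)
levels = quantize(coeffs, step=8)

nonzero = sum(1 for row in levels for value in row if value != 0)
print("DC coefficient:", round(coeffs[0][0], 3))
print("nonzero quantized coefficients:", nonzero, "of", N * N)
```

For smooth content such as this gradient, the energy lands in a handful of low frequency coefficients near the upper left of the block, and the remaining quantized coefficients are zero.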
The encoder then prepares the 8×8 block of quantized DCT coefficients 125 for entropy encoding, which is a form of lossless compression. The exact type of entropy encoding can vary depending on whether a coefficient is a DC coefficient (lowest frequency), an AC coefficient (other frequencies) in the top row or left column, or another AC coefficient.
The encoder encodes the DC coefficient 126 as a differential from the DC coefficient 136 of a neighboring 8×8 block, which is a previously encoded neighbor (e.g., top or left) of the block being encoded. (FIG. 1 shows a neighbor block 135 that is situated to the left of the block being encoded in the frame.) The encoder entropy encodes 140 the differential.
The entropy encoder can encode the left column or top row of AC coefficients as a differential from a corresponding left column or top row of the neighboring 8×8 block. This is an example of AC coefficient prediction. FIG. 1 shows the left column 127 of AC coefficients encoded as a differential 147 from the left column 137 of the neighboring block 135 (which, in the frame, lies to the left of the block being encoded). The differential coding increases the chance that the differential coefficients have zero values. The remaining AC coefficients are from the block 125 of quantized DCT coefficients.
The encoder scans 150 the 8×8 block 145 of quantized AC DCT coefficients into a one-dimensional array 155 and then entropy encodes the scanned AC coefficients using a variation of run length coding 160. The encoder selects an entropy code from one or more run/level/last tables 165 and outputs the entropy code.
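The scan and run length steps can be sketched as follows. The sketch assumes a conventional zigzag scan order and a simple (run, level, last) representation; the actual scan patterns and code tables of the WMV8 encoder are not reproduced here, and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in a conventional zigzag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk anti-diagonals in turn, alternating direction on each one.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def scan_ac(block):
    """Scan a block into a one-dimensional list, skipping the DC position."""
    order = zigzag_order(len(block))
    return [block[r][c] for r, c in order[1:]]

def run_level_last(scanned):
    """Convert scanned coefficients into (run, level, last) triples.

    run   = number of zeros preceding the nonzero coefficient
    level = value of the nonzero coefficient
    last  = 1 if no further nonzero coefficients follow, else 0
    """
    positions = [i for i, value in enumerate(scanned) if value != 0]
    triples = []
    previous = -1
    for n, pos in enumerate(positions):
        run = pos - previous - 1
        last = int(n == len(positions) - 1)
        triples.append((run, scanned[pos], last))
        previous = pos
    return triples

# Hypothetical quantized block with three nonzero AC coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][1] = 5
block[1][0] = -2
block[2][1] = 1

triples = run_level_last(scan_ac(block))
print(triples)  # [(0, 5, 0), (0, -2, 0), (5, 1, 1)]
```

In an encoder, each triple would then be mapped to a variable length code from a table; that lookup is omitted here.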
B. Inter Compression
Inter compression uses block-based motion compensated prediction coding followed by transform coding of the residual error. FIG. 2 illustrates motion estimation for a predicted frame 210.
In FIG. 2, the encoder computes a motion vector for a macroblock 215 in the predicted frame 210. To compute the motion vector, the encoder searches in a search area 235 of a reference frame 230. Within the search area 235, the encoder compares the macroblock 215 from the predicted frame 210 to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector (entropy coded) for the matching macroblock. The motion vector is differentially coded with respect to a motion vector predictor.
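A motion search of this kind can be sketched as an exhaustive comparison over a small search area. The sketch assumes the sum of absolute differences (SAD) as the matching criterion and a motion vector that points from the current block position to the matching position in the reference frame; the block size, search range, and function names are illustrative assumptions.

```python
def sad(current, reference, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(current[cy + j][cx + i] - reference[ry + j][rx + i])
        for j in range(size) for i in range(size)
    )

def find_motion_vector(current, reference, cx, cy, size=4, search_range=2):
    """Exhaustively search the reference frame around (cx, cy)."""
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(current, reference, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    cost, dx, dy = best
    return (dx, dy), cost

# Synthetic reference frame with non-repeating content.
SIZE = 12
reference = [[(x * x + 3 * y * y + x * y) % 251 for x in range(SIZE)]
             for y in range(SIZE)]
# Current frame: the reference content moved right by 1 and down by 2 samples.
current = [[reference[y - 2][x - 1] if y >= 2 and x >= 1 else 0
            for x in range(SIZE)] for y in range(SIZE)]

motion_vector, cost = find_motion_vector(current, reference, cx=4, cy=4)
print(motion_vector, cost)  # (-1, -2) 0
```

Practical encoders typically use larger blocks, larger search ranges, and faster search strategies than this full search.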
After reconstructing the motion vector by adding the differential to the motion vector predictor, a decoder uses the motion vector to compute a prediction macroblock for the macroblock 215 using information from the reference frame 230, which is a previously reconstructed frame available at the encoder and the decoder.
The prediction is rarely perfect, so the encoder usually encodes blocks of pixel differences (also called the error or residual blocks) between the prediction macroblock and the macroblock 215 itself. The encoder encodes the residual blocks by performing a DCT on the residual blocks, quantizing the DCT coefficients and entropy encoding the quantized DCT coefficients.
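The relationship between the motion vector differential, the prediction, and the residual can be sketched with a few lines of arithmetic. All values below are hypothetical, and the sketch omits the DCT, quantization, and entropy coding of the residual, so the reconstruction shown here is exact.

```python
def subtract_blocks(current_block, prediction_block):
    """Residual = current block minus the motion-compensated prediction."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current_block, prediction_block)]

def add_blocks(prediction_block, residual):
    """Reconstruction = prediction plus residual."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction_block, residual)]

# Encoder side: code the motion vector as a differential from its predictor.
mv_predictor = (3, -1)     # hypothetical predictor
motion_vector = (4, -1)    # hypothetical result of motion estimation
mv_differential = (motion_vector[0] - mv_predictor[0],
                   motion_vector[1] - mv_predictor[1])

# Decoder side: add the differential back to the same predictor.
decoded_mv = (mv_predictor[0] + mv_differential[0],
              mv_predictor[1] + mv_differential[1])

# Hypothetical 2x2 blocks to keep the example short.
current_block = [[52, 55], [61, 59]]
prediction_block = [[50, 55], [60, 60]]
residual = subtract_blocks(current_block, prediction_block)
reconstructed = add_blocks(prediction_block, residual)

print(mv_differential, residual)  # (1, 0) [[2, 0], [1, -1]]
```

When the residual is quantized, as in the encoder described above, the reconstructed block only approximates the original.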
III. Lossy Compression and Quantization
The preceding section mentioned quantization, a mechanism for lossy compression, and entropy coding, also called lossless compression. Lossless compression reduces the bit rate of information by removing redundancy from the information without any reduction in fidelity. For example, a series of ten consecutive pixels that are all exactly the same shade of red could be represented as a code for the particular shade of red and the number ten as a “run length” of consecutive pixels, and this series can be perfectly reconstructed by decompression from the code for the shade of red and the indicated number (ten) of consecutive pixels having that shade of red. Lossless compression techniques reduce bit rate at no cost to quality, but can only reduce bit rate up to a certain point. Decreases in bit rate are limited by the inherent amount of variability in the statistical characterization of the input data, which is referred to as the source entropy.
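The run length example can be sketched as follows, assuming each pixel is an RGB tuple and each run is stored as a (value, count) pair. The function names are introduced for illustration.

```python
from itertools import groupby

def run_length_encode(samples):
    """Collapse each run of identical values into a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(samples)]

def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

red = (200, 0, 0)            # one particular shade of red
pixels = [red] * 10          # ten consecutive identical pixels

encoded = run_length_encode(pixels)
decoded = run_length_decode(encoded)

print(encoded)               # [((200, 0, 0), 10)]
print(decoded == pixels)     # True
```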
In contrast, with lossy compression, the quality suffers somewhat but the achievable decrease in bit rate is more dramatic. For example, a series of ten pixels, each being a slightly different shade of red, can be approximated as ten pixels with exactly the same particular approximate red color. Lossy compression techniques can be used to reduce bit rate more than lossless compression techniques, but some of the reduction in bit rate is achieved by reducing quality, and the lost quality cannot be completely recovered. Lossy compression is often used in conjunction with lossless compression, in a system design in which the lossy compression establishes an approximation of the information and lossless compression techniques are applied to represent the approximation. For example, the series of ten pixels, each a slightly different shade of red, can be represented as a code for one particular shade of red and the number ten as a run-length of consecutive pixels. In decompression, the original series would then be reconstructed as ten pixels with the same approximated red color. In general, an encoder varies quantization to trade off quality and bit rate. Coarser quantization results in greater quality reduction but allows for greater bit rate reduction.
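The combination of lossy approximation and lossless run length coding can be sketched as follows. The sketch assumes a single red intensity sample per pixel and a hypothetical quantization step of 16; neither value comes from any particular codec.

```python
from itertools import groupby

STEP = 16  # hypothetical quantization step size

def approximate(value, step=STEP):
    """Lossy stage: map a value to the nearest multiple of the step size."""
    return (value + step // 2) // step * step

def run_length_encode(samples):
    """Lossless stage: collapse runs of identical values."""
    return [(value, len(list(group))) for value, group in groupby(samples)]

def run_length_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

# Ten slightly different red intensities.
original = [201, 202, 203, 204, 205, 206, 207, 208, 209, 210]

approximated = [approximate(v) for v in original]
encoded = run_length_encode(approximated)
decoded = run_length_decode(encoded)

print(encoded)              # [(208, 10)]
print(decoded == original)  # False: the slight variations are not recovered
```

A larger step size would merge a wider range of shades into one run, which lowers the bit rate further at a greater cost in fidelity.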
According to one possible definition, quantization is a term used for an approximating non-reversible mapping function commonly used for lossy compression, in which there is a specified set of possible output values, and each member of the set of possible output values has an associated set of input values that result in the selection of that particular output value. A variety of quantization techniques have been developed, including scalar or vector, uniform or non-uniform, and adaptive or non-adaptive quantization.
A. Scalar Quantizers
According to one possible definition, a scalar quantizer is an approximating functional mapping x→Q[x] of an input value x to a quantized value Q[x]. FIG. 3 shows a “staircase” I/O function (300) for a scalar quantizer. The horizontal axis is a number line for a real number input variable x, and the vertical axis indicates the corresponding quantized values Q[x]. The number line is partitioned by thresholds such as the threshold (310). Each value of x within a given range between a pair of adjacent thresholds is assigned the same quantized value Q[x]. For example, each value of x within the range (320) is assigned the same quantized value (330). (At a threshold, one of the two possible quantized values is assigned to an input x, depending on the system.) Overall, the quantized values Q[x] exhibit a discontinuous, staircase pattern. The distance the mapping continues along the number line depends on the system, typically ending after a finite number of thresholds. The placement of the thresholds on the number line may be uniformly spaced (as shown in FIG. 3) or non-uniformly spaced.
A scalar quantizer can be decomposed into two distinct stages. The first stage is the classifier stage, in which a classifier function mapping x→A[x] maps an input x to a quantization index A[x], which is often integer-valued. In essence, the classifier segments an input number line or data set. FIG. 4a shows a generalized classifier (400) and thresholds for a scalar quantizer. As in FIG. 3, a number line for a real number variable x is segmented by thresholds such as the threshold (410). Each value of x within a given range such as the range (420) is assigned the same quantized value Q[x]. FIG. 4b shows a numerical example of a classifier (450) and thresholds for a scalar quantizer.
In the second stage, a reconstructor functional mapping k→β[k] maps each quantization index k to a reconstruction value β[k]. In essence, the reconstructor places steps having a particular height relative to the input number line segments (or selects a subset of data set values) for reconstruction of each region determined by the classifier. The reconstructor functional mapping may be implemented, for example, using a lookup table. Overall, the classifier relates to the reconstructor as follows:

$$Q[x] = \beta[A[x]] \qquad (1)$$
In common usage, the term “quantization” is often used to describe the classifier stage, which is performed during encoding. The term “inverse quantization” is similarly used to describe the reconstructor stage, whether performed during encoding or decoding.
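The two-stage decomposition Q[x]=β[A[x]] can be sketched as follows. The sketch assumes uniformly spaced thresholds, reconstruction values at multiples of a hypothetical step size of 4, and a lookup table covering a limited index range; all names and values are illustrative.

```python
import math

STEP = 4.0  # hypothetical step size

def classify(x, step=STEP):
    """Classifier stage A[x]: index of the interval that contains x.

    Thresholds sit halfway between adjacent reconstruction values.
    """
    return math.floor(x / step + 0.5)

# Reconstructor stage beta[k], implemented as a lookup table.
RECONSTRUCTION_TABLE = {k: k * STEP for k in range(-8, 9)}

def reconstruct(k):
    return RECONSTRUCTION_TABLE[k]

def quantize(x):
    """Overall mapping Q[x] = beta[A[x]]."""
    return reconstruct(classify(x))

for x in (5.0, 6.1, -5.0):
    print(x, "-> index", classify(x), "-> value", quantize(x))
```

In a codec, only the index produced by the classifier is transmitted; the decoder applies the reconstructor to recover the approximate value.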
The distortion introduced by using such a quantizer may be computed with a difference-based distortion measure d(x−Q[x]). Typically, such a distortion measure has the property that d(x−Q[x]) increases as x−Q[x] deviates from zero; and typically each reconstruction value lies within the range of the corresponding classification region, so that the straight line that would be formed by the functional equation Q[x]=x will pass through every step of the staircase diagram (as shown in FIG. 3) and therefore Q[Q[x]] will typically be equal to Q[x]. In general, a quantizer is considered better in rate-distortion terms if the quantizer results in a lower average value of distortion than other quantizers for a given bit rate of output. More formally, a quantizer is considered better if, for a source random variable X, the expected (i.e., the average or statistical mean) value of the distortion measure $\bar{D} = E_X\{d(X - Q[X])\}$ is lower for an equal or lower entropy H of A[X]. The most commonly-used distortion measure is the squared error distortion measure, for which $d(|x-y|) = |x-y|^2$. When the squared error distortion measure is used, the expected value of the distortion measure ($\bar{D}$) is referred to as the mean squared error.
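The rate-distortion comparison can be illustrated empirically. The sketch below assumes a uniform quantizer with mid-point reconstruction and estimates the mean squared error and the entropy of the quantization indices from a small, deterministic set of sample values; the sample set and step sizes are arbitrary choices for illustration.

```python
import math
from collections import Counter

def classify(x, step):
    """Uniform classifier A[x] with thresholds halfway between reconstruction values."""
    return math.floor(x / step + 0.5)

def reconstruct(k, step):
    """Mid-point reconstruction beta[k]."""
    return k * step

def mean_squared_error(samples, step):
    """Average of the squared error |x - Q[x]|^2 over the samples."""
    return sum((x - reconstruct(classify(x, step), step)) ** 2
               for x in samples) / len(samples)

def index_entropy(samples, step):
    """Empirical entropy, in bits, of the quantization indices A[x]."""
    counts = Counter(classify(x, step) for x in samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

samples = [0.37 * i - 9.0 for i in range(50)]

for step in (1.0, 4.0):
    print(f"step {step}: MSE = {mean_squared_error(samples, step):.3f}, "
          f"entropy = {index_entropy(samples, step):.3f} bits")
```

For these samples the coarser step size yields a lower index entropy at the cost of a higher mean squared error, which is the tradeoff a rate-distortion comparison weighs.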
B. Non-uniform Quantizers
A non-uniform quantizer has threshold values that are not uniformly spaced for all classifier regions. According to one possible definition, a dead zone plus uniform threshold quantizer [“DZ+UTQ”] is a quantizer with uniformly spaced threshold values for all classifier regions except the one containing the zero input value (which is called the dead zone [“DZ”]). In a general sense, a DZ+UTQ is a non-uniform quantizer, since the DZ size is different than the other classifier regions.
A DZ+UTQ has a classifier index mapping rule x→A[x] that can be expressed based on two parameters. FIG. 5 shows a staircase I/O function (500) for a DZ+UTQ, and FIG. 6a shows a generalized classifier (600) and thresholds for a DZ+UTQ. The parameter s, which is greater than 0, indicates the step size for all steps other than the DZ. Mathematically, all $s_i$ are equal to s for i≠0. The parameter z, which is greater than or equal to 0, indicates the ratio of the DZ size to the size of the other steps. Mathematically, $s_0 = z \cdot s$. In FIG. 6a, z is 2, so the DZ is twice as wide as the other classification zones. The index mapping rule x→A[x] for a DZ+UTQ can be expressed as:
$$A[x] = \operatorname{sign}(x) \cdot \max\left(0, \left\lfloor \frac{|x|}{s} - \frac{z}{2} + 1 \right\rfloor\right), \qquad (2)$$

where ⌊·⌋ denotes the largest integer less than or equal to the argument and where sign(x) is the function defined as:

$$\operatorname{sign}(x) = \begin{cases} +1, & \text{for } x \ge 0, \\ -1, & \text{for } x < 0. \end{cases} \qquad (3)$$
FIG. 6b shows a numerical example of a classifier (650) and thresholds for a DZ+UTQ with s=1 and z=2. FIGS. 3, 4a, and 4b show a special case DZ+UTQ with z=1. Quantizers of the UTQ form have good performance for a variety of statistical sources. In particular, the DZ+UTQ form is optimal for the statistical random variable source known as the Laplacian source.
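The index mapping rule of equation (2), A[x] = sign(x)·max(0, ⌊|x|/s − z/2 + 1⌋), can be written directly in code. The following minimal sketch evaluates it for s=1 with z=2 and z=1; the function names are introduced for illustration.

```python
import math

def sign(x):
    """Sign function of equation (3): +1 for x >= 0, -1 for x < 0."""
    return 1 if x >= 0 else -1

def dz_utq_index(x, s, z):
    """Classifier of equation (2) for a dead zone plus uniform threshold quantizer.

    s is the step size of all zones other than the dead zone (s > 0).
    z is the ratio of the dead zone size to the other step sizes (z >= 0).
    """
    return sign(x) * max(0, math.floor(abs(x) / s - z / 2 + 1))

inputs = (-1.5, -0.9, 0.4, 0.6, 0.9, 1.0, 2.7)

# z = 2: the dead zone spans (-1, 1), twice the width of the other zones.
print([dz_utq_index(x, s=1, z=2) for x in inputs])  # [-1, 0, 0, 0, 0, 1, 2]

# z = 1: the dead zone has the same width as every other zone.
print([dz_utq_index(x, s=1, z=1) for x in inputs])  # [-2, -1, 0, 1, 1, 1, 3]
```

With z=2, inputs of magnitude below 1 all map to index 0, whereas with z=1 only inputs of magnitude below 0.5 do.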
C. Reconstruction Rules
Different reconstruction rules may be used to determine the reconstruction value for each quantization index. Standards and product specifications that focus only on achieving interoperability will often specify reconstruction values without necessarily specifying the classification rule. In other words, some specifications may define the functional mapping k→β[k] without defining the functional mapping x→A[x]. This allows a decoder built to comply with the standard/specification to reconstruct information correctly. In contrast, encoders are often given the freedom to change the classifier in any way that they wish, while still complying with the standard/specification.
Numerous systems for adjusting quantization thresholds have been developed. Many standards and products specify reconstruction values that correspond to a typical mid-point reconstruction rule (e.g., for a typical simple classification rule) for the sake of simplicity. For classification, however, the thresholds can in fact be adjusted so that certain input values will be mapped to more common (and hence, lower bit rate) indices, which makes the reconstruction values closer to optimal.
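One way to picture threshold adjustment is a classifier with a tunable rounding offset, paired with fixed reconstruction values at multiples of the step size. This is an illustrative assumption and not a rule taken from any particular standard; the parameter name `rounding_offset` is hypothetical. An offset f corresponds to a dead zone ratio of z = 2(1 − f), so f = 0.5 gives z = 1 and smaller offsets widen the dead zone.

```python
import math

def classify_with_offset(x, step, rounding_offset):
    """Classifier whose thresholds move as the rounding offset changes."""
    magnitude = math.floor(abs(x) / step + rounding_offset)
    return magnitude if x >= 0 else -magnitude

def reconstruct(index, step):
    """Reconstruction values stay fixed at multiples of the step size."""
    return index * step

step = 10
for offset in (0.5, 1 / 3):
    indices = [classify_with_offset(x, step, offset) for x in (4, 6, 7, -6)]
    print(f"offset {offset:.3f}: indices {indices}")
```

With the smaller offset, the input 6 maps to index 0 rather than index 1, trading some distortion for a more common index, while a decoder applying `reconstruct` is unaffected by the choice.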
In many systems, the extent of quantization is measured in terms of quantization step size. Coarser quantization uses larger quantization step sizes, corresponding to wider ranges of input values. Finer quantization uses smaller quantization step sizes. Often, for purposes of signaling and reconstruction, quantization step sizes are parameterized as multiples of a smallest quantization step size.
D. Perceptual Effects of Quantization
As mentioned above, lossy compression tends to cause a decrease in quality. For example, a series of ten samples of slightly different values can be approximated using quantization as ten samples with exactly the same particular approximate value. This kind of quantization can reduce the bit rate of encoding the series of ten samples, but at the cost of lost detail in the original ten samples. In some cases, quantization also can produce visible artifacts that tend to be more artificial-looking and visually distracting than simple loss of fine detail. For example, smooth, un-textured content is susceptible to contouring artifacts—artifacts that appear between regions of two different quantization output values—because the human visual system is sensitive to subtle differences (particularly luma differences) between adjacent areas of flat color.
Another perceptual effect of quantization occurs when average quantization step sizes are varied between frames in a sequence. Although the flexibility to change quantization step sizes can help control bit rate, an unpleasant “flicker” effect can occur when average quantization step sizes vary too much from frame to frame and the difference in quality between frames becomes noticeable.
IV. Signaling Quantization Parameters in VC-1
In some systems, an encoder can use different quantizers and different quantization step size parameters (“QPs”) for different sequences, different frames, and different parts of frames.
For example, a VC-1 encoder specifies a quantizer used for a video sequence. The encoder sends a 2-bit bitstream element (“QUANTIZER”) at sequence level in a bitstream syntax to indicate a quantizer type for the sequence. QUANTIZER indicates that the quantizer for the sequence is specified as being uniform or non-uniform at frame level, that the encoder uses a non-uniform quantizer for all frames, or that the encoder uses a uniform quantizer for all frames. Whether the encoder uses a uniform quantizer or non-uniform quantizer, the encoder sends a frame-level bitstream element, PQINDEX, to indicate a default frame QP (“PQUANT”). If QUANTIZER indicates an implicitly specified quantizer, PQINDEX also indicates whether the quantizer used is uniform or non-uniform. If QUANTIZER indicates an explicitly specified quantizer, the frame-level bitstream element PQUANTIZER is sent to indicate whether the quantizer for the frame is uniform or non-uniform. PQINDEX is present, and PQUANTIZER is present if required, in all frame types.
Table 2 shows how PQINDEX is translated to PQUANT for the case where QUANTIZER=0 (indicating the quantizer is implicit and hence specified by PQINDEX).
TABLE 2
Implicit quantizer translation in a WMV encoder

PQINDEX   PQUANT     Quantizer        PQINDEX   PQUANT   Quantizer
-------------------------------------------------------------------
0         Reserved   NA               16        13       Non-uniform
1         1          Uniform          17        14       Non-uniform
2         2          Uniform          18        15       Non-uniform
3         3          Uniform          19        16       Non-uniform
4         4          Uniform          20        17       Non-uniform
5         5          Uniform          21        18       Non-uniform
6         6          Uniform          22        19       Non-uniform
7         7          Uniform          23        20       Non-uniform
8         8          Uniform          24        21       Non-uniform
9         6          Non-uniform      25        22       Non-uniform
10        7          Non-uniform      26        23       Non-uniform
11        8          Non-uniform      27        24       Non-uniform
12        9          Non-uniform      28        25       Non-uniform
13        10         Non-uniform      29        27       Non-uniform
14        11         Non-uniform      30        29       Non-uniform
15        12         Non-uniform      31        31       Non-uniform
If the quantizer is signaled explicitly at the sequence or frame level (signaled by syntax element QUANTIZER=01, 10 or 11), then PQUANT is equal to PQINDEX for all nonzero values of PQINDEX.
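The translation from PQINDEX to PQUANT described above can be sketched as a table lookup. The function and variable names are hypothetical, and bitstream parsing is omitted; the `implicit` flag stands in for the case QUANTIZER=0.

```python
# PQUANT values for PQINDEX 0..31 in the implicit case (Table 2).
# Index 0 is reserved.
IMPLICIT_PQUANT = (
    [None]
    + list(range(1, 9))      # PQINDEX 1..8   -> PQUANT 1..8   (uniform)
    + list(range(6, 13))     # PQINDEX 9..15  -> PQUANT 6..12  (non-uniform)
    + list(range(13, 26))    # PQINDEX 16..28 -> PQUANT 13..25 (non-uniform)
    + [27, 29, 31]           # PQINDEX 29..31                  (non-uniform)
)

def translate_pqindex(pqindex, implicit):
    """Return (PQUANT, quantizer type) for a frame-level PQINDEX value.

    In the explicit case the quantizer type is signaled by other syntax
    elements, so None is returned for it here.
    """
    if not 1 <= pqindex <= 31:
        raise ValueError("PQINDEX must be in the range 1..31 (0 is reserved)")
    if implicit:
        quantizer = "uniform" if pqindex <= 8 else "non-uniform"
        return IMPLICIT_PQUANT[pqindex], quantizer
    return pqindex, None

print(translate_pqindex(8, implicit=True))    # (8, 'uniform')
print(translate_pqindex(9, implicit=True))    # (6, 'non-uniform')
print(translate_pqindex(29, implicit=True))   # (27, 'non-uniform')
print(translate_pqindex(29, implicit=False))  # (29, None)
```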
V. Other Standards and Products
Numerous international standards specify aspects of video decoders and formats for compressed video information. Directly or by implication, these standards also specify certain encoder details, but other encoder details are not specified. Some standards address still image compression/decompression, and other standards address audio compression/decompression. Numerous companies have produced encoders and decoders for audio, still images, and video. Various other kinds of signals (for example, hyperspectral imagery, graphics, text, financial information, etc.) are also commonly represented and stored or transmitted using compression techniques.
Standards typically do not fully specify the quantizer design. Most allow some variation in the encoder classification rule x→A[x] and/or the decoder reconstruction rule k→β[k].
The use of a DZ ratio z=2 or greater has been implicit in a number of encoding designs. For example, the spacing of reconstruction values for predicted regions in some standards implies use of z≥2. Reconstruction values in these examples from standards are spaced appropriately for use of DZ+UTQ classification with z=2 and mid-point reconstruction. Altering thresholds to increase optimality for the specified reconstruction values (as described above) results in an even larger DZ ratio (since the DZ requires fewer bits to select than the other levels).
Designs based on z=1 (or at least z<2) have been used for quantization in several standards. In these cases, reconstruction values are equally spaced around zero and away from zero.
Given the critical importance of video compression to digital video, it is not surprising that video compression is a richly developed field. Whatever the benefits of previous video compression techniques, however, they do not have the advantages of the following techniques and tools.