Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
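As a rough check on these figures, the arithmetic can be sketched as follows. The 176×144 frame size and the function name are assumptions chosen for illustration; neither appears in the description above.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the bit rate, in bits per second, of uncompressed video."""
    return width * height * frames_per_second * bits_per_pixel


# Assumed example: 176x144 frames (about 25,000 pixels) at 15 frames per second.
rate = raw_bit_rate(176, 144, 15)
print(rate)  # 9123840
```

Even this small assumed frame size exceeds 5 million bits per second at 15 frames per second.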
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which case the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which case the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression: the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
A basic goal of lossy compression is to provide good rate-distortion performance. So, for a particular bit rate, an encoder attempts to provide the highest quality of video. Or, for a particular level of quality/fidelity to the original video, an encoder attempts to provide the lowest bit rate encoded video. In practice, considerations such as encoding time, encoding complexity, encoding resources, decoding time, decoding complexity, decoding resources, overall delay, and/or smoothness in quality/bit rate changes also affect decisions made in codec design as well as decisions made during actual encoding.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression. Intra-picture compression techniques compress individual pictures, and inter-picture compression techniques compress pictures with reference to a preceding and/or following picture (often called a reference or anchor picture) or pictures.
I. Intra Compression
FIG. 1 illustrates block-based intra compression in an example encoder. In particular, FIG. 1 illustrates intra compression of an 8×8 block (105) of samples by the encoder. The encoder splits a picture into 8×8 blocks of samples and applies a forward 8×8 frequency transform (110) (such as a discrete cosine transform (“DCT”)) to individual blocks such as the block (105). The encoder quantizes (120) the transform coefficients (115), resulting in an 8×8 block of quantized transform coefficients (125).
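The forward transform step can be sketched as below. This is a minimal illustration that assumes an orthonormal, floating-point DCT; practical encoders typically use fast integer approximations, and the helper names are hypothetical.

```python
import math


def dct_matrix(n):
    """Build the n x n orthonormal DCT-II basis matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def forward_dct(block):
    """Apply a separable 2-D DCT: coefficients = C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


# A flat 8x8 block has all of its energy in the DC (top-left) coefficient.
flat = [[10] * 8 for _ in range(8)]
coefficients = forward_dct(flat)
print(round(coefficients[0][0], 6))  # 80.0
```

With this normalization, the DC coefficient of an 8×8 block equals the sum of its samples divided by 8.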
With quantization, the encoder essentially trades off quality and bit rate. More specifically, quantization can affect the fidelity with which the transform coefficients are encoded, which in turn can affect bit rate. Coarser quantization tends to decrease fidelity to the original transform coefficients as the coefficients are more coarsely approximated. Bit rate also decreases, however, when decreased complexity can be exploited with lossless compression. Conversely, finer quantization tends to preserve fidelity and quality but result in higher bit rates.
Different encoders use different parameters for quantization. In most encoders, a level or step size of quantization is set for a block, picture, or other unit of video. In some encoders, the encoder can also adjust the “dead zone,” which is the range of values around zero that are approximated as zero. Some encoders quantize coefficients differently within a given block, so as to apply relatively coarser quantization to perceptually less important coefficients, and a quantization matrix can be used to indicate the relative weights. Or, apart from the rules used to reconstruct quantized values, some encoders vary the thresholds according to which values are quantized so as to quantize certain values more aggressively than others.
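The effect of step size and dead zone can be sketched with a simple scalar quantizer. This is an assumed, simplified model rather than the quantizer of any particular encoder; the function names and the `dead_zone` parameter (taken here as the half-width of the zeroed range) are hypothetical.

```python
def quantize(coefficient, step, dead_zone=None):
    """Map a transform coefficient to an integer quantization level.

    dead_zone is the half-width of the range around zero that is forced to
    zero. When omitted, plain rounding applies, which zeroes |c| < step / 2.
    """
    magnitude = abs(coefficient)
    if dead_zone is not None and magnitude < dead_zone:
        return 0
    level = int(magnitude / step + 0.5)
    return level if coefficient >= 0 else -level


def dequantize(level, step):
    """Reconstruct an approximate coefficient from its quantization level."""
    return level * step


print(quantize(13, 4), dequantize(quantize(13, 4), 4))  # 3 12
print(quantize(3, 4))                                   # 1
print(quantize(3, 4, dead_zone=4))                      # 0
```

Widening the dead zone sends more small coefficients to zero, which lowers bit rate at the cost of fidelity. Note that the reconstruction rule in `dequantize` is unchanged; only the encoder-side threshold moves.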
Returning to FIG. 1, further encoding varies depending on whether a coefficient is a DC coefficient (the lowest frequency coefficient shown as the top left coefficient in the block (125)), an AC coefficient in the top row or left column in the block (125), or another AC coefficient. The encoder typically encodes the DC coefficient (126) as a differential from the reconstructed DC coefficient (136) of a neighboring 8×8 block. The encoder entropy encodes (140) the differential. The entropy encoder can encode the left column or top row of AC coefficients as differentials from AC coefficients of a corresponding left column or top row of a neighboring 8×8 block. The encoder scans (150) the 8×8 block (145) of predicted, quantized AC coefficients into a one-dimensional array (155). The encoder then entropy encodes the scanned coefficients using a variation of run/level coding (160).
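DC differential coding and run/level coding can be sketched as follows. The description refers only to "a variation of run/level coding," so the basic (run of zeros, non-zero level) form shown here is an assumption, and the function names are hypothetical.

```python
def dc_differential(current_dc, neighbor_reconstructed_dc):
    """Return the value that is entropy coded in place of the DC coefficient."""
    return current_dc - neighbor_reconstructed_dc


def run_level_encode(scanned):
    """Convert a 1-D coefficient array into (zero run, non-zero level) pairs.

    Trailing zeros produce no pair; they are implied by the end of the block.
    """
    pairs = []
    run = 0
    for value in scanned:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs


def run_level_decode(pairs, length):
    """Rebuild the 1-D array, padding with zeros up to the given length."""
    out = []
    for run, level in pairs:
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * (length - len(out)))
    return out


scanned = [5, 0, 0, -2, 1, 0, 0, 0]
pairs = run_level_encode(scanned)
print(pairs)  # [(0, 5), (2, -2), (0, 1)]
```

Because quantization leaves many coefficients at zero, the pair list is usually much shorter than the scanned array.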
In corresponding decoding, a decoder produces a reconstructed version of the original 8×8 block. The decoder entropy decodes the quantized transform coefficients, scans the quantized coefficients into a two-dimensional block, and performs AC prediction and/or DC prediction as needed. The decoder inverse quantizes the quantized transform coefficients of the block and applies an inverse frequency transform (such as an inverse DCT (“IDCT”)) to the de-quantized transform coefficients, producing the reconstructed version of the original 8×8 block. When a picture is used as a reference picture in subsequent motion compensation (see below), an encoder also reconstructs the picture.
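The inverse quantization and inverse transform steps can be sketched as below. The sketch assumes an orthonormal floating-point IDCT, uniform reconstruction (level times step size), and 8-bit samples clipped to 0..255; entropy decoding and AC/DC prediction are omitted, and the helper names are hypothetical.

```python
import math


def dct_matrix(n):
    """Build the n x n orthonormal DCT-II basis matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_dct(coefficients):
    """Invert the separable 2-D DCT: block = C^T * coefficients * C."""
    c = dct_matrix(len(coefficients))
    return matmul(matmul(transpose(c), coefficients), c)


def reconstruct_block(levels, step):
    """De-quantize the levels, inverse transform, then round and clip samples."""
    dequantized = [[level * step for level in row] for row in levels]
    samples = inverse_dct(dequantized)
    return [[min(255, max(0, int(math.floor(v + 0.5)))) for v in row]
            for row in samples]


# Only the DC level is non-zero, so the reconstructed block is flat.
levels = [[0] * 8 for _ in range(8)]
levels[0][0] = 10
block = reconstruct_block(levels, step=8)
print(block[0])  # [10, 10, 10, 10, 10, 10, 10, 10]
```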
II. Inter Compression
Inter-picture compression techniques often use motion estimation and motion compensation to reduce bit rate by exploiting temporal redundancy in a video sequence. Motion estimation is a process for estimating motion between pictures. In one common technique, an encoder using motion estimation attempts to match a block of samples in a current picture with a block of samples in a search area in another picture, called the reference picture. When the encoder finds an exact or “close enough” match in the search area in the reference picture, the encoder parameterizes the change in position of the blocks as motion data (such as a motion vector). In general, motion compensation is a process of reconstructing pictures from reference picture(s) using motion data.
FIG. 2 illustrates motion estimation for part of a predicted picture in an example encoder. For an 8×8 block of samples, 16×16 block (often called a “macroblock”), or other unit of the current picture, the encoder finds a similar unit in a reference picture for use as a predictor. In FIG. 2, the encoder computes a motion vector for a 16×16 macroblock (215) in the current, predicted picture (210). The encoder searches in a search area (235) of a reference picture (230). Within the search area (235), the encoder compares the macroblock (215) from the predicted picture (210) to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector to the predictor macroblock.
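A block-matching search of this kind can be sketched as below. The sketch assumes an exhaustive search over a small window and uses the sum of absolute differences (SAD) as the matching criterion; the description above does not name a criterion, and the function names, toy pictures, and 4×4 block size are hypothetical.

```python
def sad(current, reference, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(current[cy + j][cx + i] - reference[ry + j][rx + i])
    return total


def full_search(current, reference, cx, cy, size, search_range):
    """Try every displacement in the window and keep the lowest-cost one.

    Returns ((dx, dy), cost), where (dx, dy) points from the current block's
    position to the position of the predictor in the reference picture.
    """
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the reference picture
            cost = sad(current, reference, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Toy pictures: the current picture is the reference shifted by (2, 1).
reference = [[y * 16 + x for x in range(16)] for y in range(16)]
current = [[reference[min(y + 1, 15)][min(x + 2, 15)] for x in range(16)]
           for y in range(16)]
mv, cost = full_search(current, reference, cx=4, cy=4, size=4, search_range=3)
print(mv, cost)  # (2, 1) 0
```

An exhaustive search is the simplest form to show. Its cost grows with the square of the search range, which is why practical encoders fall back on search patterns and heuristics.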
The encoder computes the sample-by-sample difference between the current unit and the predictor to determine a residual (also called error signal). The residual is frequency transformed, quantized, and entropy encoded. The overall bit rate of a predicted picture depends in large part on the bit rate of residuals. The bit rate of residuals is low if the residuals are simple (e.g., due to motion estimation that finds exact or good matches) or lossy compression drastically reduces the complexity of the residuals. Bits saved with successful motion estimation can be used to improve quality elsewhere or reduce overall bit rate. On the other hand, the bit rate of complex residuals can be higher, depending on the degree of lossy compression applied to reduce the complexity of the residuals.
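The residual computation, and the matching step that adds a residual back to the predictor, can be sketched as below. The function names and the 2×2 sample values are hypothetical, and 8-bit samples clipped to 0..255 are assumed.

```python
def compute_residual(current_block, predictor_block):
    """Sample-by-sample difference between the current unit and its predictor."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current_block, predictor_block)]


def add_residual(predictor_block, residual):
    """Combine the predictor with a residual, clipping to the 8-bit range."""
    return [[min(255, max(0, p + r)) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(predictor_block, residual)]


current_block = [[10, 12], [14, 16]]
predictor_block = [[9, 12], [15, 20]]
residual = compute_residual(current_block, predictor_block)
print(residual)  # [[1, 0], [-1, -4]]
```

When the residual is not quantized, adding it back to the predictor recovers the current block exactly; with lossy compression, only an approximation is recovered.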
Encoders typically spend a large proportion of encoding time performing motion estimation, attempting to find good matches and thereby improve rate-distortion performance. In most scenarios, however, an encoder lacks the time or resources to check every possible motion vector for every block or macroblock to be encoded. The encoder therefore uses motion vector search patterns and matching heuristics deemed likely to find a good match in an acceptable amount of time.
The number of motion vectors used to represent a picture can also affect rate-distortion performance. Using four motion vectors for four different 8×8 blocks of a 16×16 macroblock (instead of one motion vector for the macroblock) allows an encoder to capture different motion for the different blocks, potentially resulting in better matches. On the other hand, motion vector information for four motion vectors (instead of one) is signaled, increasing bit rate of motion data.
FIG. 3 illustrates compression of a prediction residual for a motion-compensated block of a predicted picture in an example encoder. The encoder computes an 8×8 prediction error block (335) as the difference between a predicted block (315) and a current 8×8 block (325).
The encoder applies a frequency transform (340) to the residual (335), producing a block of transform coefficients (345). Some encoders switch between different sizes of transforms, e.g., an 8×8 transform, two 4×8 transforms, two 8×4 transforms, or four 4×4 transforms for an 8×8 prediction residual block. Smaller transform sizes allow for greater isolation of transform coefficients having non-zero values, but generally require more signaling overhead. FIG. 3 shows the encoder using one 8×8 transform.
The encoder quantizes (350) the transform coefficients (345) and scans (360) the quantized coefficients (355) into a one-dimensional array (365) such that coefficients are generally ordered from lowest frequency to highest frequency. The encoder entropy codes the data in the array (365).
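One scan that orders coefficients roughly from lowest to highest frequency is the familiar zigzag pattern, sketched below. The description does not specify which scan pattern the encoder uses, so the zigzag order here is an assumption, and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + column indexes one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run from bottom-left upward
        for r in rows:
            order.append((r, s - r))
    return order


def scan(block):
    """Flatten a square 2-D block into a 1-D array using the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


print(zigzag_order(8)[:6])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Placing low-frequency positions first tends to group the non-zero quantized coefficients at the start of the array and the zeros at the end, which suits run/level coding.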
If a predicted picture is used as a reference picture for subsequent motion compensation, the encoder reconstructs the predicted picture. When reconstructing residuals, the encoder reconstructs transform coefficients that were quantized and performs an inverse frequency transform. The encoder performs motion compensation to compute the motion-compensated predictors, and combines the predictors with the residuals. During decoding, a decoder typically entropy decodes information and performs analogous operations to reconstruct residuals, perform motion compensation, and combine the predictors with the reconstructed residuals.
III. Computing Pixel-domain Distortion When Making Encoding Decisions
The previous two sections mention some of the decisions that an encoder can make during encoding. When encoding a block of a predicted picture, an encoder can evaluate and set a number of coding parameters, including: (1) whether the block should be encoded as intra or inter; (2) the number of motion vectors; (3) the value(s) of motion vector(s); (4) the type of frequency transform; (5) the size of frequency transform (e.g., 8×8, 4×8, 8×4, or 4×4); (6) the quantization step size; (7) the quantization thresholds to apply; (8) the dead zone size; and (9) the quantization matrix. Or, for a block of an intra-coded picture, the encoder can evaluate and set various quantization-related parameters. Depending on implementation, an encoder may finalize certain parameter decisions before starting to evaluate other parameters. Or, the encoder may jointly explore different combinations of coding parameters, which makes the decision-making process even more complex given the number of permutations to evaluate.
In making encoding decisions, an encoder often evaluates the distortion and rate associated with the different choices. In particular, for a block to be encoded, pixel-domain distortion of the block encoded according to different coding choices is an important criterion in encoder mode decisions. There are several approaches to determining pixel-domain distortion.
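One common way to weigh distortion against rate is a Lagrangian cost, sketched below. This formulation is not stated in the description above and is offered only as an assumed illustration; the candidate names, distortion values, bit counts, and multiplier values are all made up.

```python
def rd_cost(distortion, rate_bits, lagrange_multiplier):
    """Combined cost J = D + lambda * R (an assumed formulation)."""
    return distortion + lagrange_multiplier * rate_bits


def choose_best(candidates, lagrange_multiplier):
    """Pick the coding choice with the lowest combined cost."""
    return min(candidates,
               key=lambda c: rd_cost(c["distortion"], c["rate_bits"],
                                     lagrange_multiplier))


# Hypothetical measurements for three ways of coding one block.
candidates = [
    {"name": "intra", "distortion": 40.0, "rate_bits": 120},
    {"name": "inter_1mv", "distortion": 55.0, "rate_bits": 60},
    {"name": "inter_4mv", "distortion": 35.0, "rate_bits": 95},
]
print(choose_best(candidates, 0.5)["name"])  # inter_4mv
print(choose_best(candidates, 2.0)["name"])  # inter_1mv
```

A larger multiplier penalizes bits more heavily, so the cheaper single-motion-vector choice wins even though its distortion is higher.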
In one approach, an encoder performs inverse quantization to reconstruct transform coefficients for a block and performs an inverse frequency transform on the de-quantized transform coefficients. The encoder directly measures pixel-domain distortion by comparing the reconstructed pixel-domain values for the block to the original pixel-domain values for the block. While this approach yields accurate pixel-domain distortion measurements, it is expensive in terms of encoding time and resources. Performing an inverse frequency transform for every evaluated coding choice greatly increases the computational complexity of the encoding task. As a result, encoding time increases or more encoding resources are required. Or, to handle practical time or resource constraints, an encoder evaluates fewer coding options, which can result in the encoder missing efficient options.
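This direct measurement can be sketched as below. The sketch assumes an orthonormal floating-point DCT, a uniform quantizer, 8-bit samples, and sum of squared differences as the distortion measure; none of these specifics is stated above, and the helper names are hypothetical.

```python
import math


def dct_matrix(n):
    """Build the n x n orthonormal DCT-II basis matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def quantize(c, step):
    level = int(abs(c) / step + 0.5)
    return level if c >= 0 else -level


def pixel_domain_distortion(block, step):
    """Quantize, de-quantize, inverse transform, then compare sample values."""
    c = dct_matrix(len(block))
    ct = transpose(c)
    coefficients = matmul(matmul(c, block), ct)
    dequantized = [[quantize(v, step) * step for v in row] for row in coefficients]
    reconstructed = matmul(matmul(ct, dequantized), c)  # the costly inverse transform
    total = 0
    for orig_row, recon_row in zip(block, reconstructed):
        for o, r in zip(orig_row, recon_row):
            sample = min(255, max(0, int(math.floor(r + 0.5))))
            total += (o - sample) ** 2
    return total


block = [[8 * x + 4 * y for x in range(8)] for y in range(8)]
fine = pixel_domain_distortion(block, step=0.001)
coarse = pixel_domain_distortion(block, step=64)
print(fine, coarse > fine)  # 0 True
```

Each call repeats the inverse transform, so evaluating many candidate step sizes or transform sizes multiplies that cost.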
In another approach, an encoder performs inverse quantization to reconstruct transform coefficients for a block but measures distortion in the transform domain. The encoder measures transform-domain distortion by comparing the de-quantized transform coefficients for the block to the original transform coefficients for the block. To estimate pixel-domain distortion for the block, the encoder can multiply the transform-domain distortion by a scale factor that depends on the frequency transform used. If the transform is orthogonal, the encoder multiplies the transform-domain distortion by a non-zero scale factor so that the energy in the transform domain is roughly equivalent to the energy in the pixel domain. In this approach, the encoder does not perform an inverse frequency transform for every evaluated coding choice, so computational complexity is lowered. The pixel-domain distortion estimated by this approach is often inaccurate, however, particularly when only the DC coefficient of a block has a significant value. This inaccuracy in pixel-domain distortion estimation can lead to inefficient choices of coding parameters and poor rate-distortion performance.
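The transform-domain estimate, and one way it can diverge from the directly measured value, can be sketched as below. The sketch assumes an orthonormal floating-point DCT (for which the scale factor is 1), a uniform quantizer, and reconstructed samples rounded to integers. The rounding is an assumption of this illustration, not a cause identified in the description above, and the helper names are hypothetical.

```python
import math


def dct_matrix(n):
    """Build the n x n orthonormal DCT-II basis matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def quantize(c, step):
    level = int(abs(c) / step + 0.5)
    return level if c >= 0 else -level


def transform_domain_estimate(coefficients, dequantized, scale_factor=1.0):
    """Scaled sum of squared coefficient errors; no inverse transform needed."""
    total = 0.0
    for c_row, d_row in zip(coefficients, dequantized):
        for c, d in zip(c_row, d_row):
            total += (c - d) ** 2
    return scale_factor * total


# Flat block: only the DC coefficient (80) is significant.
block = [[10] * 8 for _ in range(8)]
step = 6
c = dct_matrix(8)
ct = transpose(c)
coefficients = matmul(matmul(c, block), ct)
dequantized = [[quantize(v, step) * step for v in row] for row in coefficients]
estimate = transform_domain_estimate(coefficients, dequantized)

# Direct measurement for comparison: inverse transform, round, compare samples.
reconstructed = matmul(matmul(ct, dequantized), c)
measured = sum((o - int(math.floor(r + 0.5))) ** 2
               for o_row, r_row in zip(block, reconstructed)
               for o, r in zip(o_row, r_row))
print(round(estimate, 6), measured)  # 4.0 0
```

Here the DC coefficient is reconstructed as 78 instead of 80, so the estimate is 4. Every reconstructed sample is 9.75, which rounds back to the original value of 10, so the measured distortion is 0.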
Given the critical importance of video compression to digital video, it is not surprising that video compression is a richly developed field. Whatever the benefits of previous video compression techniques, however, they do not have the advantages of the following techniques and tools.