Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. Thus, the number of bits per second, or bit rate, of a raw digital video sequence may be 5 million bits per second or more.
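As a rough check on the figure above, the raw bit rate arithmetic can be sketched as follows. The 176×144 frame size is an assumption, chosen only because it falls in the "tens of thousands of pixels" range; the function name is illustrative.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the bit rate, in bits per second, of uncompressed video."""
    return width * height * bits_per_pixel * frames_per_second

# Assumed example: a 176x144 frame (25,344 pixels) at 24 bits per pixel.
print(raw_bit_rate(176, 144, 15))  # 9123840
print(raw_bit_rate(176, 144, 30))  # 18247680
```

Even at this small assumed frame size, the result exceeds 5 million bits per second at either frame rate.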
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Or, compression can be lossy, in which the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
A basic goal of lossy compression is to provide good rate-distortion performance. So, for a particular bit rate, an encoder attempts to provide the highest quality of video. Or, for a particular level of quality/fidelity to the original video, an encoder attempts to provide the lowest bit rate encoded video. In practice, considerations such as encoding time, encoding complexity, encoding resources, decoding time, decoding complexity, decoding resources, overall delay, and/or smoothness in quality/bit rate changes also affect decisions made in codec design as well as decisions made during actual encoding.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression. Intra-picture compression techniques compress individual pictures, and inter-picture compression techniques compress pictures with reference to a preceding and/or following picture (often called a reference or anchor picture) or pictures.
Inter-picture compression techniques often use motion estimation and motion compensation to reduce bit rate by exploiting temporal redundancy in a video sequence. Motion estimation is a process for estimating motion between pictures. In one common technique, an encoder using motion estimation attempts to match a current block of samples in a current picture with a candidate block of the same size in a search area in another picture, the reference picture. When the encoder finds an exact or “close enough” match in the search area in the reference picture, the encoder parameterizes the change in position between the current and candidate blocks as motion data (such as a motion vector (“MV”)). A motion vector is conventionally a two-dimensional value, having a horizontal component that indicates left or right spatial displacement and a vertical component that indicates up or down spatial displacement. In general, motion compensation is a process of reconstructing pictures from reference picture(s) using motion data.
FIG. 1 illustrates motion estimation for part of a predicted picture in an example encoder. For an 8×8 block of samples, 16×16 block (often called a “macroblock”), or other unit of the current picture, the encoder finds a similar unit in a reference picture for use as a predictor. In FIG. 1, the encoder computes a motion vector for a 16×16 macroblock (115) in the current, predicted picture (110). The encoder searches in a search area (135) of a reference picture (130). Within the search area (135), the encoder compares the macroblock (115) from the predicted picture (110) to various candidate macroblocks in order to find a candidate macroblock that is a good match. The encoder outputs information specifying the motion vector to the predictor macroblock.
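The block-matching search described above can be sketched as follows. This is a minimal illustration, not the encoder of FIG. 1: pictures are assumed to be lists of rows of integer samples, the match criterion is assumed to be a sum of absolute differences, and the names `sad` and `full_search` are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, cx, cy, n, search_range):
    """Check every displacement within +/- search_range and return
    ((dx, dy), cost) for the candidate block with the lowest cost."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate falls outside the reference picture
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Synthetic test pictures (an assumption for illustration): the current
# picture shows the reference content displaced by (+1, -2).
ref = [[3 * x + 5 * y for x in range(16)] for y in range(16)]
cur = [[3 * (x + 1) + 5 * (y - 2) for x in range(16)] for y in range(16)]
print(full_search(cur, ref, 6, 6, 4, 2))  # ((1, -2), 0)
```

The returned pair is the motion vector and its matching cost. The number of candidates checked grows with the square of `search_range`.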
The encoder computes the sample-by-sample difference between the current unit and its motion-compensated prediction to determine a residual (also called error signal). The residual is frequency transformed, quantized, and entropy encoded. As a linear energy-compacting transform, the frequency transform tends to produce transform coefficients with energy concentrated in lower frequency coefficients. The overall bit rate of a predicted picture depends in large part on the bit rate of residuals. The bit rate of residuals is low if the residuals are simple (i.e., due to motion estimation that finds exact or good matches) or lossy compression drastically reduces the complexity of the residuals. Bits saved with successful motion estimation can be used to improve quality elsewhere or reduce overall bit rate. On the other hand, the bit rate of complex residuals can be higher, depending on the degree of lossy compression applied to reduce the complexity of the residuals.
If a predicted picture is used as a reference picture for subsequent motion compensation, the encoder reconstructs the predicted picture. To reconstruct residuals, the encoder applies inverse quantization to the quantized transform coefficients and then performs an inverse frequency transform. The encoder performs motion compensation to compute the motion-compensated predictors, and combines the predictors with the reconstructed residuals.
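The residual coding and reconstruction steps of the two preceding paragraphs can be sketched as follows. The sketch assumes an orthonormal DCT as the frequency transform and a uniform quantizer with a single step size; neither is specified above, entropy coding is omitted, and all function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    return [[math.sqrt((1 if k == 0 else 2) / n)
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def encode_residual(current, predictor, step):
    """Residual -> frequency transform -> quantized levels."""
    n = len(current)
    c = dct_matrix(n)
    residual = [[current[y][x] - predictor[y][x] for x in range(n)]
                for y in range(n)]
    coeffs = matmul(matmul(c, residual), transpose(c))
    return [[round(v / step) for v in row] for row in coeffs]

def reconstruct(levels, predictor, step):
    """Inverse quantization -> inverse transform -> add the predictor."""
    n = len(levels)
    c = dct_matrix(n)
    coeffs = [[level * step for level in row] for row in levels]
    residual = matmul(matmul(transpose(c), coeffs), c)
    return [[predictor[y][x] + round(residual[y][x]) for x in range(n)]
            for y in range(n)]

# Assumed example: the current block is uniformly 8 brighter than its predictor.
predictor = [[100] * 4 for _ in range(4)]
current = [[108] * 4 for _ in range(4)]
levels = encode_residual(current, predictor, step=4)
print(levels[0])  # [8, 0, 0, 0]
print(reconstruct(levels, predictor, step=4) == current)  # True
```

In this example the uniform residual is captured by a single non-zero level. The encoder reconstructs the same block a decoder would, so later pictures are predicted from matching references.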
Encoders typically spend a large proportion of encoding time performing motion estimation, attempting to find good matches and thereby improve rate-distortion performance. Generally, using a large search range in a reference picture improves the chances of an encoder finding a good match. With a large search range, however, the encoder potentially compares a current block against every spatially displaced block in that range. In most scenarios, an encoder lacks the time or resources to check every possible motion vector in a large search range for every block or macroblock to be encoded. In particular, when a codec allows motion vectors for large displacements, the computational cost of searching through a large search range for the best motion vector can be prohibitive, especially when the content to be encoded is high definition video. Various techniques help encoders speed up motion estimation.
With one type of technique, a user setting, profile setting, or level setting directly sets motion vector range to be a particular size. Motion vector range indicates the allowed sizes of motion vectors. For an encoder that otherwise performs a full search across a reference picture, the motion vector range in effect constrains the search range by excluding motion vectors outside the motion vector range. A user sets the motion vector range with a command-line parameter, user interface control, etc., to override a default value. For example, for high-quality, off-line encoding, a large motion vector range (and hence large search range) is used. Or, for lower-quality, real-time encoding, a smaller motion vector range (and hence smaller search range) is used. While these settings address concerns about encoding time and resources, they are inflexible in that they do not adapt motion vector range or search range to changes in motion characteristics of the video content being encoded. As a result, in some scenarios, a large motion vector range and search range are unneeded for a series of low-motion pictures. Or, a small motion vector range and search range are inadequate for a series of high-motion pictures.
In hierarchical motion estimation, an encoder finds one or more motion vectors at a low resolution (e.g., using 4:1 downsampled pictures), scales up the motion vector(s) to a higher resolution (e.g., integer-pixel), finds one or more motion vectors at the higher resolution in neighborhood(s) around the scaled up motion vector(s), and so on. While this allows the encoder to skip exhaustive searches at the higher resolutions, it can result in wasteful long searches at the low resolution when there is little or no motion to justify such searches. Such hierarchical motion estimation also fails to adapt motion vector range and search range to changes in motion characteristics in the video content being encoded.
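A two-level version of the hierarchical search described above can be sketched as follows. The sketch assumes 2×2 averaging for the 4:1 downsampling, SAD as the match criterion, and a one-sample refinement neighborhood; all names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def search(cur, ref, cx, cy, n, center, radius):
    """Best ((dx, dy), cost) among displacements within `radius` of `center`."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate falls outside the reference picture
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

def downsample(pic):
    """Average each 2x2 group of samples (4:1 reduction in sample count)."""
    return [[(pic[y][x] + pic[y][x + 1] + pic[y + 1][x] + pic[y + 1][x + 1]) // 4
             for x in range(0, len(pic[0]) - 1, 2)]
            for y in range(0, len(pic) - 1, 2)]

def hierarchical_search(cur, ref, cx, cy, n, coarse_radius, refine_radius=1):
    # Stage 1: search the downsampled pictures around zero motion.
    coarse_mv, _ = search(downsample(cur), downsample(ref),
                          cx // 2, cy // 2, n // 2, (0, 0), coarse_radius)
    # Stage 2: scale the vector up and refine at full resolution.
    scaled = (coarse_mv[0] * 2, coarse_mv[1] * 2)
    return search(cur, ref, cx, cy, n, scaled, refine_radius)

def picture_with_square(size, left, top, side=4, value=100):
    """Synthetic test picture: a bright square on a black background."""
    return [[value if left <= x < left + side and top <= y < top + side else 0
             for x in range(size)] for y in range(size)]

ref = picture_with_square(16, 8, 4)
cur = picture_with_square(16, 3, 6)  # same square, so the true vector is (5, -2)
print(hierarchical_search(cur, ref, 0, 4, 8, coarse_radius=3))  # ((5, -2), 0)
```

In this example the low-resolution stage can only land on an even displacement once scaled up, and the full-resolution refinement recovers the odd horizontal component. The low-resolution stage still searches its full radius even when the content has little or no motion.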
Other encoders dynamically adjust search range when performing motion estimation for a current block or macroblock of a picture by considering the motion vectors of immediately spatially adjacent blocks in the same picture. Such encoders dramatically speed up motion estimation by tightly focusing the motion vector search process for the current block or macroblock. However, in certain scenarios (e.g., strong localized motion, discontinuous motion or other complex motion), such motion estimation can fail to provide adequate performance.
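One way such neighbor-based adjustment might work is sketched below. The specific rule (center the window on the component-wise median of the neighboring motion vectors and size it by their spread plus a margin) is an assumption made for illustration and is not taken from any particular encoder; all names and default values are hypothetical.

```python
from statistics import median_low

def window_from_neighbors(neighbor_mvs, margin=2, default_radius=16):
    """Return (center, radius) of a search window derived from the motion
    vectors of adjacent, already-encoded blocks (hypothetical rule)."""
    if not neighbor_mvs:
        return (0, 0), default_radius
    center = (median_low([mv[0] for mv in neighbor_mvs]),
              median_low([mv[1] for mv in neighbor_mvs]))
    spread = max(max(abs(mv[0] - center[0]), abs(mv[1] - center[1]))
                 for mv in neighbor_mvs)
    return center, spread + margin

def in_window(mv, center, radius):
    return abs(mv[0] - center[0]) <= radius and abs(mv[1] - center[1]) <= radius

center, radius = window_from_neighbors([(4, -1), (5, -1), (4, 0)])
print(center, radius)                       # (4, -1) 3
print(in_window((5, -2), center, radius))   # True
print(in_window((-10, 6), center, radius))  # False
```

The last line shows the weakness noted above: if the current block moves very differently from its neighbors, the true motion vector lies outside the narrowed window and is never evaluated.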
In general, encoders use a distortion metric during motion estimation. A distortion metric helps an encoder evaluate the quality and rate costs associated with using a candidate block in a motion estimation choice.
One common distortion metric is sum of absolute differences (“SAD”). To compute the SAD for a candidate block in a reference picture, the encoder computes the sum of the absolute values of the residual between the current and candidate blocks, where the residual is the sample-by-sample difference between the current block and the candidate block. Low computational complexity is an advantage of SAD. SAD is a poor approximation of overall rate-distortion cost in some cases, however. In particular, when there are large but uniform sample differences between the current block and the candidate block, SAD poorly approximates actual distortion. SAD fails to account for the energy-compacting effects of the frequency transforms performed on residuals. Suppose a current block has significant but uniform differences in sample values compared to a candidate block. Most likely, a frequency transform during encoding will capture and isolate the uniform sample differences in a non-zero DC coefficient value. (The DC coefficient is the lowest frequency transform coefficient.) Because of the energy compaction effects, the overall rate-distortion cost of choosing the candidate block is likely small. SAD may incorrectly indicate a large cost, however.
Some video encoders therefore use sum of absolute Hadamard-transformed differences (“SAHD”) as a distortion metric or use another sum of absolute transformed differences (“SATD”) metric. To compute the SAHD for a candidate block in a reference picture, an encoder Hadamard transforms the current block and Hadamard transforms the candidate block, then computes the sum of the absolute values of the differences between the Hadamard-transformed blocks. Or, the encoder computes a residual, Hadamard transforms the residual, and computes the sum of absolute values of the Hadamard-transformed residual. The frequency transform used later in compression is often not a Hadamard transform. Rather, the Hadamard transform approximates the energy compaction of the frequency transform that the encoder later uses on residuals, but the Hadamard transform is simpler to compute. Using SAHD in motion estimation often results in better rate-distortion performance than using SAD, as SAHD accounts for uniform overall sample value shifts, but using SAHD also increases computational complexity. A single Hadamard transform is relatively simple, but performing a Hadamard transform when computing a distortion metric greatly increases the aggregate computational complexity of motion estimation, since encoders typically spend such a large proportion of encoding time evaluating different candidate blocks during motion estimation.
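The difference between SAD and SAHD can be sketched for a 4×4 block as follows. The sketch uses the second variant described above (transforming the residual), assumes a sequency-ordered 4×4 Hadamard matrix, and divides by 4 so that the transform is orthonormal; that scaling, the sample values, and the function names are illustrative assumptions.

```python
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def residual(cur, cand):
    return [[cur[y][x] - cand[y][x] for x in range(4)] for y in range(4)]

def hadamard_4x4(block):
    """Two-dimensional Hadamard transform: H4 * block * H4^T."""
    rows = [[sum(H4[i][k] * block[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
    return [[sum(rows[i][k] * H4[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]

def sad(cur, cand):
    return sum(abs(v) for row in residual(cur, cand) for v in row)

def sahd(cur, cand):
    coeffs = hadamard_4x4(residual(cur, cand))
    return sum(abs(v) for row in coeffs for v in row) / 4  # assumed scaling

cur = [[200] * 4 for _ in range(4)]
uniform = [[190] * 4 for _ in range(4)]   # every sample differs by 10
impulse = [[200] * 4 for _ in range(4)]
impulse[0][0] = 40                        # one sample differs by 160

print(sad(cur, uniform), sad(cur, impulse))    # 160 160
print(sahd(cur, uniform), sahd(cur, impulse))  # 40.0 640.0
print(hadamard_4x4(residual(cur, uniform))[0])  # [160, 0, 0, 0]
```

SAD rates the two candidates as equally costly. SAHD rates the uniformly shifted candidate as far cheaper, because its entire difference lands in the DC coefficient. The extra cost is two small matrix products per candidate, repeated for every candidate evaluated.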
Sum of squared errors (“SSE”), mean squared error (“MSE”), and mean variance are other distortion metrics. With SSE, an encoder squares the values of a residual then sums the squared values. With MSE, an encoder computes the mean of the squared values. One definition of mean variance is:
$$\frac{1}{I} \sum_{i} \left( x_i^r - \bar{x}_i^r \right)^2,$$
where $\bar{x}_i^r$ is the mean of the $I$ residual values in the residual $x_i^r$. Mean variance to some extent accounts for overall differences between a current block and candidate block. SSE, MSE and mean variance yield acceptable rate-distortion performance in some cases, but increase the computational complexity of measuring distortion.
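These three metrics can be sketched as follows, with the residual assumed to be given as a flat list of sample differences. The function names and example residuals are illustrative.

```python
def sse(residual):
    """Sum of squared errors."""
    return sum(v * v for v in residual)

def mse(residual):
    """Mean squared error."""
    return sse(residual) / len(residual)

def mean_variance(residual):
    """Mean of the squared deviations from the residual's own mean."""
    mean = sum(residual) / len(residual)
    return sum((v - mean) ** 2 for v in residual) / len(residual)

uniform = [10] * 16        # every sample differs by the same amount
alternating = [0, 20] * 8  # same mean, but the differences vary

print(sse(uniform), mse(uniform), mean_variance(uniform))              # 1600 100.0 0.0
print(sse(alternating), mse(alternating), mean_variance(alternating))  # 3200 200.0 100.0
```

For the uniform residual the mean variance is zero, which shows how subtracting the mean discounts an overall shift between the current block and the candidate block.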
Some encoders compute rate-distortion cost as a distortion metric during motion estimation. A rate-distortion cost has a distortion term and a rate term, with a factor (often called a Lagrangian multiplier) scaling the rate term relative to the distortion term. The rate term can be an estimated or actual bit rate cost for motion vector information and/or residual information. The distortion term can be based upon a comparison (e.g., SAD) of original samples to reconstructed samples (samples reconstructed following a frequency transform, quantization, inverse quantization, and an inverse frequency transform). Or, the distortion term can be some other distortion measure or estimate. Rate-distortion cost usually provides the most accurate assessment of rate-distortion performance of different motion estimation choices, but also has the highest computational complexity, especially if different quantization parameters are evaluated for each different motion estimation choice.
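A rate-distortion cost of this form can be sketched as J = D + λR. In the sketch below, the distortion values, bit counts, and multiplier value are made-up numbers for illustration, and the names are hypothetical.

```python
def rd_cost(distortion, mv_bits, residual_bits, lagrangian):
    """Distortion term plus the rate term scaled by the Lagrangian multiplier."""
    return distortion + lagrangian * (mv_bits + residual_bits)

# Assumed candidates: (label, distortion, motion vector bits, residual bits).
candidates = [
    ("A", 100, 10, 40),  # lower distortion, more bits
    ("B", 120, 4, 20),   # higher distortion, fewer bits
]

def best_candidate(candidates, lagrangian):
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], c[3], lagrangian))

print(rd_cost(100, 10, 40, 4), rd_cost(120, 4, 20, 4))  # 300 216
print(best_candidate(candidates, lagrangian=4)[0])      # B
print(best_candidate(candidates, lagrangian=0)[0])      # A
```

With a multiplier of 4, candidate B wins despite its higher distortion, and with a multiplier of 0 the choice reduces to lowest distortion. Obtaining actual rather than estimated values for each term requires transforming, quantizing, and reconstructing every candidate, which is the source of the high computational complexity.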
In most cases, an encoder uses the same distortion metric (e.g., only SAD, only SAHD) throughout motion estimation. This is inflexible and, depending on the metric used, can be computationally inefficient or result in poor rate-distortion performance.
Another approach is to use SAD to find the top x candidate motion vectors in motion estimation, then use rate-distortion cost to evaluate each of the top x candidate motion vectors. For example, the top 3 candidates are evaluated with rate-distortion cost. While this avoids the computational cost of using rate-distortion cost from the start of motion estimation, in some cases the encoder misses good candidates due to deficiencies of SAD, and settles instead on inferior candidates. If an encoder uses SAHD at the start, followed by rate-distortion cost on the top x candidates, the encoder is more likely to find good candidates, but computational complexity is dramatically increased.
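The two-stage selection described above can be sketched as follows. The candidate list and its cost values are made-up numbers chosen to show the failure case; in a real encoder the expensive cost would be computed on demand rather than stored, and all names are hypothetical.

```python
import heapq

def two_stage_select(candidates, cheap_cost, expensive_cost, top_x=3):
    """Shortlist the top_x candidates by the cheap metric, then pick the
    shortlisted candidate with the lowest expensive cost."""
    shortlist = heapq.nsmallest(top_x, candidates, key=cheap_cost)
    return min(shortlist, key=expensive_cost)

# Assumed candidates: (motion vector, SAD, rate-distortion cost).
candidates = [
    ((0, 0), 120, 400),
    ((1, 0), 130, 380),
    ((0, 1), 140, 390),
    ((2, 2), 300, 150),  # e.g., a large but uniform difference
]

chosen = two_stage_select(candidates, lambda c: c[1], lambda c: c[2])
best_overall = min(candidates, key=lambda c: c[2])
print(chosen[0])        # (1, 0)
print(best_overall[0])  # (2, 2)
```

Only three rate-distortion evaluations are performed instead of four, but the candidate with the lowest rate-distortion cost never reaches the second stage because SAD ranks it last.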
In still another approach, an encoder uses SAD at an integer-pixel stage of hierarchical motion estimation and uses SAHD at ½-pixel and ¼-pixel stages of the hierarchical motion estimation. Again, while this avoids the computational cost of using SAHD from the start of motion estimation, in many cases the encoder misses good candidates due to deficiencies of SAD.
Aside from these techniques, many encoders use specialized motion vector search patterns or other strategies deemed likely to find a good match in an acceptable amount of time. Various other techniques for speeding up or otherwise improving motion estimation have been developed. Given the critical importance of video compression to digital video, it is not surprising that motion estimation is a richly developed field. Whatever the benefits of previous motion estimation techniques, however, they do not have the advantages of the following techniques and tools.