Companies and consumers increasingly depend on computers to process, distribute, and play back high-quality video content. Engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video information by converting the information into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called source entropy) of the input video data. Or, compression can be lossy, in which the quality of the video suffers and the lost quality cannot be completely recovered, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression—lossy compression establishes an approximation of information, and the lossless compression is applied to a representation of the approximation.
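The limit that source entropy places on lossless compression can be illustrated with a short calculation. The following is a minimal sketch, assuming that samples are treated as independent symbols (a zero-order model); the function name source_entropy is hypothetical, and a practical codec would model the data far more elaborately.

```python
import math
from collections import Counter

def source_entropy(samples):
    """Empirical zero-order Shannon entropy, in bits per sample.

    Under the (simplifying) assumption that samples are independent,
    this is a lower bound on the average code length of a lossless code.
    """
    counts = Counter(samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A flat region carries no variability; a two-level region needs 1 bit/sample.
flat_bits = source_entropy([128] * 64)
two_level_bits = source_entropy([0, 0, 255, 255])
```

Under this model, the less variable the input, the fewer bits per sample a lossless coder needs, which is why quantization (which reduces variability) makes the subsequent lossless stage more effective.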
In general, video compression techniques include “intra-picture” (sometimes called “intra-frame” or simply “intra”) compression and “inter-picture” (sometimes called “inter-frame” or simply “inter”) compression. Intra-picture compression techniques compress a picture with reference to information within the picture, and inter-picture compression techniques compress a picture with reference to a preceding and/or following picture or pictures (often called reference or anchor pictures).
For intra-picture compression, for example, an encoder splits a picture into 8×8 blocks of samples. A sample is a number that represents the intensity of brightness or the intensity of a color component for a small, elementary region of the picture, and the samples of the picture are organized as arrays or planes. The encoder applies a frequency transform to individual blocks, converting each 8×8 block of samples into an 8×8 block of transform coefficients. The encoder quantizes the transform coefficients, which may result in lossy compression. As a lossless compression step, the encoder then entropy codes the quantized transform coefficients.
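The transform and quantization steps for one block can be sketched as follows. This is an illustration only: the text does not name a particular frequency transform or quantizer, so the sketch assumes an orthonormal floating-point DCT-II and a single uniform quantization step size, and all function names are hypothetical. Entropy coding is omitted.

```python
import math

N = 8  # block size used in the example above

def dct_matrix(n=N):
    """Orthonormal DCT-II basis; row k is the k-th basis function."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

C = dct_matrix()
CT = transpose(C)

def forward_transform(block):
    """8x8 samples -> 8x8 transform coefficients (C * X * C^T)."""
    return matmul(matmul(C, block), CT)

def inverse_transform(coeffs):
    """8x8 transform coefficients -> 8x8 samples (C^T * Y * C)."""
    return matmul(matmul(CT, coeffs), C)

def quantize(coeffs, step):
    """Uniform quantization; this is the step that can lose information."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

def dequantize(levels, step):
    return [[level * step for level in row] for row in levels]

# A flat block concentrates all of its energy in the DC coefficient.
flat = [[100] * N for _ in range(N)]
levels = quantize(forward_transform(flat), step=16)
reconstructed = inverse_transform(dequantize(levels, step=16))
```

For the flat block, only the top-left (DC) level is nonzero, which is the kind of sparse representation that entropy coding can then compress efficiently.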
Inter-picture compression techniques often use motion estimation and motion compensation to reduce bit rate by exploiting temporal redundancy in a video sequence. Motion estimation is a process for estimating motion between pictures. For example, for an 8×8 block of samples or other unit of the current picture, the encoder attempts to find a match of the same size in a search area in another picture, the reference picture. Within the search area, the encoder compares the current unit to various candidates in order to find a candidate that is a good match. When the encoder finds an exact or “close enough” match, the encoder parameterizes the change in position between the current and candidate units as motion data (such as a motion vector (“MV”)). In general, motion compensation is a process of reconstructing pictures from reference picture(s) using motion data.
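The search described above can be sketched as an exhaustive block-matching loop. The sketch assumes integer-sample motion vectors, a sum-of-absolute-differences (SAD) cost, and a square search window; the text does not prescribe any of these, and the function names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the current unit and a candidate."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total

def estimate_motion(cur, ref, cx, cy, size=8, search_range=4):
    """Full search around (cx, cy); returns ((dx, dy), cost) of the best match."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

A real encoder would typically replace the exhaustive loop with a faster search pattern and might stop early once a "close enough" cost is reached; the exhaustive version is shown only because it is the easiest to verify.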
The example encoder also computes the sample-by-sample difference between the original current unit and its motion-compensated prediction to determine a residual (also called a prediction residual or error signal). The encoder then applies a frequency transform to the residual, resulting in transform coefficients. The encoder quantizes the transform coefficients and entropy codes the quantized transform coefficients.
If an intra-compressed picture or motion-predicted picture is used as a reference picture for subsequent motion compensation, the encoder reconstructs the picture. A decoder also reconstructs pictures during decoding, and it uses some of the reconstructed pictures as reference pictures in motion compensation. For example, for an 8×8 block of samples of an intra-compressed picture, an example decoder reconstructs a block of quantized transform coefficients. The example decoder and encoder perform inverse quantization and an inverse frequency transform to produce a reconstructed version of the original 8×8 block of samples.
As another example, the example decoder or encoder reconstructs an 8×8 block from a prediction residual for the block. The decoder decodes entropy-coded information representing the prediction residual. The decoder/encoder inverse quantizes and inverse frequency transforms the data, resulting in a reconstructed residual. In a separate motion compensation path, the decoder/encoder computes an 8×8 predicted block using motion vector information for displacement from a reference picture. The decoder/encoder then combines the predicted block with the reconstructed residual to form the reconstructed 8×8 block.
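The two paths described above (prediction from a reference picture, plus a reconstructed residual) can be sketched as follows. The sketch assumes integer-sample motion vectors and 8-bit samples clipped to the range 0 to 255; the clipping and the function names are assumptions for illustration, not details taken from the text.

```python
def motion_compensate(ref, x, y, mv, size=8):
    """Fetch the predicted block from the reference picture, displaced by mv."""
    dx, dy = mv
    return [[ref[y + dy + r][x + dx + c] for c in range(size)]
            for r in range(size)]

def residual(cur_block, pred):
    """Encoder side: sample-by-sample difference between original and prediction."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur_block, pred)]

def reconstruct(pred, recon_residual, max_value=255):
    """Decoder/encoder side: prediction plus reconstructed residual, clipped."""
    return [[min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(pred, recon_residual)]
```

If the residual passes through quantization, the reconstructed block only approximates the original; with an unquantized residual, as in a simple round-trip check, the reconstruction is exact.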
Quantization and other lossy processing can result in visible lines at boundaries between blocks. This might occur, for example, if adjacent blocks in a smoothly changing region of a picture (such as a sky area in an outdoor scene) are quantized to different average levels. Blocking artifacts can be especially troublesome in reference pictures that are used for motion estimation and compensation. To reduce blocking artifacts, the example encoder and decoder use “deblock” filtering to smooth boundary discontinuities between blocks in reference pictures. The filtering is “in-loop” in that it occurs inside a motion-compensation loop: the encoder and decoder perform it on reference pictures used for subsequent encoding/decoding. Deblock filtering improves the quality of motion estimation/compensation, resulting in better motion-compensated prediction and lower bit rate for prediction residuals. In-loop deblock filtering is often referred to as “loop filtering.”
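The idea of smoothing a boundary discontinuity can be shown with a deliberately simple filter. This sketch is not the filter of any particular standard: it assumes that a small step across a block boundary is a quantization artifact and a large step is a genuine edge, and it adjusts only the two samples adjacent to the boundary. The threshold value and function names are hypothetical.

```python
def deblock_row(row, boundary, threshold=8):
    """Smooth one row of samples across a vertical block boundary.

    `boundary` is the index of the first sample of the right-hand block.
    Steps at or above `threshold` are treated as real edges and left alone.
    """
    p0, q0 = row[boundary - 1], row[boundary]
    step = q0 - p0
    out = list(row)
    if 0 < abs(step) < threshold:
        delta = step / 4.0
        out[boundary - 1] = int(round(p0 + delta))  # pull left sample toward right
        out[boundary] = int(round(q0 - delta))      # pull right sample toward left
    return out

def deblock_vertical_edges(picture, block_size=8, threshold=8):
    """Apply deblock_row at every vertical block boundary of every row."""
    filtered = []
    for row in picture:
        for boundary in range(block_size, len(row), block_size):
            row = deblock_row(row, boundary, threshold)
        filtered.append(row)
    return filtered
```

Because the encoder and decoder both apply the same filter to the same reconstructed pictures, their reference pictures stay identical, which is what allows the filtering to sit inside the motion-compensation loop.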
I. Organization of Video Frames
In some cases, the example encoder and example decoder process video frames organized as shown in FIGS. 1, 2A, 2B and 2C. For progressive video, lines of a video frame contain samples starting from one time instant and continuing through successive lines to the bottom of the frame. An interlaced video frame consists of two scans: one for the even lines of the frame (the top field) and the other for the odd lines of the frame (the bottom field).
A progressive video frame can be divided into 16×16 macroblocks such as the macroblock (100) shown in FIG. 1. The macroblock (100) includes four 8×8 blocks (Y0 through Y3) of luma (or brightness) samples and two 8×8 blocks (Cb, Cr) of chroma (or color component) samples, which are co-located with the four luma blocks but half resolution horizontally and vertically.
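The macroblock layout can be sketched as a function that extracts the six 8×8 blocks from three sample planes. The sketch assumes that the chroma planes are stored at half the luma resolution in each dimension (as stated above) and that Y0 through Y3 are arranged in raster order (top-left, top-right, bottom-left, bottom-right); the raster ordering and the function name are assumptions, since FIG. 1 is not reproduced here.

```python
def get_macroblock(luma, cb, cr, mb_x, mb_y):
    """Return the four luma blocks and two chroma blocks of one macroblock."""
    def block(plane, top, left):
        return [row[left:left + 8] for row in plane[top:top + 8]]

    ly, lx = mb_y * 16, mb_x * 16  # luma origin: 16x16 samples per macroblock
    cy, cx = mb_y * 8, mb_x * 8    # chroma origin: 8x8 samples per macroblock
    return {
        "Y0": block(luma, ly, lx),
        "Y1": block(luma, ly, lx + 8),
        "Y2": block(luma, ly + 8, lx),
        "Y3": block(luma, ly + 8, lx + 8),
        "Cb": block(cb, cy, cx),
        "Cr": block(cr, cy, cx),
    }
```

Each macroblock therefore carries 256 luma samples and 128 chroma samples in this arrangement.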
FIG. 2A shows part of an interlaced video frame (200), including the alternating lines of the top field and bottom field at the top left part of the interlaced video frame (200). The two fields may represent two different time periods or they may be from the same time period. When the two fields of a frame represent different time periods, this can create jagged tooth-like features in regions of the frame where motion is present.
Therefore, interlaced video frames can be rearranged according to a field structure, with the odd lines grouped together in one field, and the even lines grouped together in another field. This arrangement, known as field coding, is useful in high-motion pictures. FIG. 2C shows the interlaced video frame (200) of FIG. 2A organized for encoding/decoding as fields (260). Each of the two fields of the interlaced video frame (200) is partitioned into macroblocks. The top field is partitioned into macroblocks such as the macroblock (261), and the bottom field is partitioned into macroblocks such as the macroblock (262). (The macroblocks can use a format as shown in FIG. 1, and the organization and placement of luma blocks and chroma blocks within the macroblocks are not shown.) In the luma plane, the macroblock (261) includes 16 lines from the top field, the macroblock (262) includes 16 lines from the bottom field, and each line is 16 samples long.
On the other hand, in stationary regions, image detail in the interlaced video frame may be more efficiently preserved without rearrangement into separate fields. Accordingly, frame coding is often used in stationary or low-motion interlaced video frames. FIG. 2B shows the interlaced video frame (200) of FIG. 2A organized for encoding/decoding as a frame (230). The interlaced video frame (200) has been partitioned into macroblocks such as the macroblocks (231) and (232), which use a format as shown in FIG. 1. In the luma plane, each macroblock (231, 232) includes 8 lines from the top field alternating with 8 lines from the bottom field for 16 lines total, and each line is 16 samples long. (The actual organization and placement of luma blocks and chroma blocks within the macroblocks (231, 232) are not shown, and in fact may vary for different encoding decisions.) Within a given macroblock, the top-field information and bottom-field information may be coded jointly or separately at any of various phases—the macroblock itself may be field coded or frame coded.
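The difference between the field and frame organizations can be sketched with plain line indexing. The sketch represents a frame as a list of lines, numbered from zero so that even lines belong to the top field; the function names are hypothetical.

```python
def split_fields(frame):
    """Field organization: group even lines (top field) and odd lines (bottom field)."""
    top_field = frame[0::2]
    bottom_field = frame[1::2]
    return top_field, bottom_field

def interleave_fields(top_field, bottom_field):
    """Frame organization: alternate top-field and bottom-field lines."""
    frame = []
    for top_line, bottom_line in zip(top_field, bottom_field):
        frame.append(top_line)
        frame.append(bottom_line)
    return frame

# Each line is tagged with its frame line number so the grouping is visible.
frame = [[line] * 16 for line in range(32)]
top_field, bottom_field = split_fields(frame)

field_mb_lines = top_field[0:16]  # 16 lines, all from the top field
frame_mb_lines = frame[0:16]      # 8 top-field lines alternating with 8 bottom-field lines
```

In the field organization, a 16-line macroblock of the top field spans frame lines 0, 2, ..., 30, whereas in the frame organization a macroblock spans frame lines 0 through 15.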
II. Acceleration of Video Decoding and Encoding
While some video decoding and encoding operations are relatively simple, others are computationally complex. For example, inverse frequency transforms, fractional sample interpolation operations for motion compensation, in-loop deblock filtering, post-processing filtering, color conversion, and video re-sizing can require extensive computation. This computational complexity can be problematic in various scenarios, such as decoding of high-quality, high-bit rate video (e.g., compressed high-definition video).
Some decoders use video acceleration to offload selected computationally intensive operations to a graphics processor. For example, in some configurations, a computer system includes a primary central processing unit (“CPU”) as well as a graphics processing unit (“GPU”) or other hardware specially adapted for graphics processing. A decoder uses the primary CPU as a host to control overall decoding and uses the GPU to perform simple operations that collectively require extensive computation, accomplishing video acceleration.
FIG. 3 shows a simplified software architecture (300) for video acceleration during video decoding. A video decoder (310) controls overall decoding and performs some decoding operations using a host CPU. The decoder (310) signals control information (e.g., picture parameters, macroblock parameters) and other information to a device driver (330) for a video accelerator (e.g., with GPU) across an acceleration interface (320).
The acceleration interface (320) is exposed to the decoder (310) as an application programming interface (“API”). The device driver (330) associated with the video accelerator is exposed through a device driver interface (“DDI”). In an example interaction, the decoder (310) fills a buffer with instructions and information, then calls a method of an interface to alert the device driver (330) through the operating system. The buffered instructions and information, opaque to the operating system, are passed to the device driver (330) by reference, and video information is transferred to GPU memory if appropriate. While a particular implementation of the API and DDI may be tailored to a particular operating system or platform, in some cases, the API and/or DDI can be implemented for multiple different operating systems or platforms.
In some cases, the data structures and protocol used to parameterize acceleration information are conceptually separate from the mechanisms used to convey the information. In order to impose consistency in the format, organization and timing of the information passed between the decoder (310) and device driver (330), an interface specification can define a protocol for instructions and information for decoding according to a particular video decoding standard or product. The decoder (310) follows specified conventions when putting instructions and information in a buffer. The device driver (330) retrieves the buffered instructions and information according to the specified conventions and performs decoding appropriate to the standard or product. An interface specification for a specific standard or product is adapted to the particular bit stream syntax and semantics of the standard/product.
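The role of the specified conventions can be illustrated with a toy buffer format. Everything in this sketch is hypothetical: the field names, their order, and their widths are invented for illustration and do not correspond to any actual interface specification. The point is only that the decoder and the device driver must agree on one layout, while the mechanism that carries the bytes between them does not need to understand it.

```python
import struct

# Hypothetical picture-parameter layout shared by decoder and driver:
# little-endian, two 16-bit fields followed by two 8-bit fields.
PICTURE_PARAMS = struct.Struct("<HHBB")

def decoder_fill_buffer(width_in_mbs, height_in_mbs, picture_type, interlaced):
    """Decoder side: write parameters into an opaque byte buffer."""
    return PICTURE_PARAMS.pack(width_in_mbs, height_in_mbs,
                               picture_type, int(interlaced))

def driver_read_buffer(buffer):
    """Driver side: interpret the buffer according to the same convention."""
    width_in_mbs, height_in_mbs, picture_type, interlaced = \
        PICTURE_PARAMS.unpack(buffer)
    return {
        "width_in_mbs": width_in_mbs,
        "height_in_mbs": height_in_mbs,
        "picture_type": picture_type,
        "interlaced": bool(interlaced),
    }

# 1920x1088 luma samples correspond to 120x68 macroblocks of 16x16.
buffer = decoder_fill_buffer(120, 68, picture_type=1, interlaced=False)
params = driver_read_buffer(buffer)
```

If either side changed the layout without the other, the driver would misread the parameters, which is why an interface specification fixes the format, organization, and timing of what is exchanged.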
Although some prior designs have proposed mapping particular decoding operations to different processing units, such as by mapping particular decoding operations to GPUs, prior designs are limited in terms of flexibility and efficiency. For example, a design that statically determines which processing units will perform particular decoding operations is susceptible to long periods of inactivity when processing units are forced to wait for their assigned operations to begin.