A video codec is a device or software module that enables the use of data compression techniques for digital video data. A video sequence consists of a number of pictures (digital images), usually called frames. Subsequent frames are very similar, thus containing a lot of redundancy from one frame to the next. Before being efficiently transmitted over a channel or stored in memory, video data is compressed to conserve both bandwidth and memory. The goal of video compression is to remove the redundancy, both within frames (spatial redundancy) and between frames (temporal redundancy) to gain better compression ratios. There is a complex balance between the video quality, the quantity of the data needed to represent it (also known as the bit rate), the complexity of the encoding and decoding algorithms, their robustness to data losses and errors, ease of editing, random access, end-to-end delay, and a number of other factors.
A typical digital video codec design starts with the conversion of input video from an RGB color format to a YCbCr color format, often followed by chroma sub-sampling to produce a sampling grid pattern. Conversion to the YCbCr color format improves compressibility by de-correlating the color signals and by separating the perceptually more important luma signal from the perceptually less important chroma signals, which can be represented at lower resolution.
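The color conversion and chroma sub-sampling described above can be sketched as follows. This is a minimal illustration only: it assumes the full-range BT.601 conversion matrix and 4:2:0 sub-sampling by 2×2 averaging, neither of which is specified in the text, and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (assumed full-range BT.601 matrix).

    Results are left as unrounded, unclamped floats to keep the sketch short.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_420(plane):
    """Average each 2x2 group of samples; plane dimensions must be even."""
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def convert_frame(rgb):
    """rgb is a 2-D list of (r, g, b) tuples.

    Returns a full-resolution Y plane and Cb/Cr planes at half resolution
    in each dimension.
    """
    ycc = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in rgb]
    y_plane = [[p[0] for p in row] for row in ycc]
    cb_plane = subsample_420([[p[1] for p in row] for row in ycc])
    cr_plane = subsample_420([[p[2] for p in row] for row in ycc])
    return y_plane, cb_plane, cr_plane
```

With this layout, a frame of W×H pixels carries W×H luma samples but only (W/2)×(H/2) samples for each chroma channel, halving the raw sample count before any further compression.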
Some amount of spatial and temporal down-sampling may also be used to reduce the raw data rate before the basic encoding process. Down-sampling is the process of reducing the sampling rate of a signal. This is usually done to reduce the data rate or the size of the data. The down-sampling factor is typically an integer or a rational fraction greater than unity. This data is then transformed using a frequency transform to further de-correlate the spatial data. One such transform is a discrete cosine transform (DCT). The output of the transform is then quantized and entropy encoding is applied to the quantized values. Quantization is a compression technique where a range of values is compressed to a single quantum value.
The decoding process consists of essentially performing an inversion of each stage of the encoding process. The one stage that cannot be exactly inverted is the quantization stage. There, a best-effort approximation of inversion is performed. This part of the process is often called “inverse quantization” or “dequantization”, although quantization is an inherently non-invertible process.
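The transform, quantization, and best-effort inversion described above can be illustrated with a small round trip. The sketch below assumes a floating-point orthonormal 4×4 DCT-II and a single uniform quantization step; real codecs use integer transforms and more elaborate quantizers, and all names here are hypothetical.

```python
import math

N = 4  # block size chosen for this illustration


def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix C (rows are basis vectors)."""
    return [
        [
            math.sqrt((1 if k == 0 else 2) / n)
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ]
        for k in range(n)
    ]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


C = dct_matrix()


def forward_dct(block):
    """2-D transform: C * X * C^T."""
    return matmul(matmul(C, block), transpose(C))


def inverse_dct(coeffs):
    """Exact inverse of forward_dct (up to float rounding): C^T * Y * C."""
    return matmul(matmul(transpose(C), coeffs), C)


def quantize(coeffs, step):
    """Map each coefficient to the index of its nearest multiple of step."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Best-effort inversion: the rounding error cannot be recovered."""
    return [[level * step for level in row] for row in levels]
```

Calling `inverse_dct(forward_dct(block))` recovers the block up to floating-point error, whereas `inverse_dct(dequantize(quantize(forward_dct(block), step), step))` returns only an approximation whose error grows with `step`.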
A variety of codecs can be easily implemented on PCs and in consumer electronics equipment. Multiple codecs are often available in the same product, avoiding the need to choose a single dominant codec for compatibility reasons.
In general, video compression is performed according to many standards, including one or more standards for audio and video compression from the Moving Picture Experts Group (MPEG), such as MPEG-1, MPEG-2, and MPEG-4. Additional enhancements have been made as part of the MPEG-4 Part 10 standard, also referred to as H.264, or AVC (Advanced Video Coding). Under the MPEG standards, video data is first encoded (e.g. compressed) and then stored in an encoder buffer on an encoder side of a video system. Later, the encoded data is transmitted to a decoder side of the video system, where it is stored in a decoder buffer before being decoded so that the corresponding pictures can be viewed.
The intent of the H.264/AVC project was to develop a standard capable of providing good video quality at bit rates that are substantially lower than what previous standards would need (e.g. MPEG-2, H.263, or MPEG-4 Part 2). Furthermore, it was desired to make these improvements without such a large increase in complexity that the design is impractical to implement. An additional goal was to make these changes in a flexible way that would allow the standard to be applied to a wide variety of applications such that it could be used for both low and high bit rates and low and high resolution video. Another objective was that it would work well on a very wide variety of networks and systems.
H.264/AVC/MPEG-4 Part 10 contains many new features that allow it to compress video much more effectively than older standards and to provide more flexibility for application to a wide variety of network environments. Some key features include multi-picture motion compensation using previously-encoded pictures as references, variable block-size motion compensation (VBSMC) with block sizes as large as 16×16 pixels and as small as 4×4 pixels, six-tap filtering for derivation of half-pel luma sample predictions, macroblock pair structure, quarter-pixel precision for motion compensation, weighted prediction, an in-loop deblocking filter, an exact-match integer 4×4 spatial block transform, a secondary Hadamard transform performed on “DC” coefficients of the primary spatial transform wherein the Hadamard transform is similar to a fast Fourier transform, spatial prediction from the edges of neighboring blocks for “intra” coding, context-adaptive binary arithmetic coding (CABAC), context-adaptive variable-length coding (CAVLC), a simple and highly-structured variable length coding (VLC) technique for many of the syntax elements not coded by CABAC or CAVLC, referred to as Exponential-Golomb coding, a network abstraction layer (NAL) definition, switching slices, flexible macroblock ordering, redundant slices (RS), supplemental enhancement information (SEI) and video usability information (VUI), auxiliary pictures, frame numbering and picture order count. These techniques, and several others, allow H.264 to perform significantly better than prior standards, and under more circumstances and in more environments. H.264 usually performs better than MPEG-2 video by obtaining the same quality at half of the bit rate or even less.
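Of the features listed above, Exponential-Golomb coding is compact enough to sketch directly. The example below covers only the unsigned, order-0 form and represents the bit-stream as a string of '0' and '1' characters for readability; the function names are hypothetical.

```python
def exp_golomb_encode(value):
    """Encode a non-negative integer as an order-0 Exp-Golomb bit string.

    The codeword is the binary form of (value + 1), preceded by one
    zero bit for every bit after the leading one.
    """
    if value < 0:
        raise ValueError("unsigned Exp-Golomb codes require value >= 0")
    binary = bin(value + 1)[2:]
    return "0" * (len(binary) - 1) + binary


def exp_golomb_decode(bits, pos=0):
    """Decode one codeword starting at pos; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    value = int(bits[pos + zeros:end], 2) - 1
    return value, end
```

Small values receive short codewords (0 maps to a single bit), which suits syntax elements whose small values are the most frequent.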
MPEG is used for the generic coding of moving pictures and associated audio and creates a compressed video bit-stream made up of a series of three types of encoded data frames. The three types of data frames are an intra frame (called an I-frame or I-picture), a bi-directional predicted frame (called a B-frame or B-picture), and a forward predicted frame (called a P-frame or P-picture). These three types of frames can be arranged in a specified order called the GOP (Group Of Pictures) structure. I-frames contain all the information needed to reconstruct a picture. The I-frame is encoded as a normal image without motion compensation. On the other hand, P-frames use information from previous frames and B-frames use information from previous frames, a subsequent frame, or both to reconstruct a picture. Specifically, P-frames are predicted from a preceding I-frame or the immediately preceding P-frame.
Frames can also be predicted from the immediate subsequent frame. In order for the subsequent frame to be utilized in this way, the subsequent frame must be encoded before the predicted frame. Thus, the encoding order does not necessarily match the real frame order. Such frames are usually predicted from two directions, for example from the I- or P-frames that immediately precede or the P-frame that immediately follows the predicted frame. These bidirectionally predicted frames are called B-frames.
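The difference between display order and encoding order can be made concrete with a short sketch. It assumes that every B-frame is held back until the next I- or P-frame (its subsequent reference) has been encoded; the function name and the string representation of frame types are hypothetical.

```python
def coding_order(display_types):
    """Return display indices in the order the frames would be encoded.

    display_types is a string such as "IBBPBBP" giving the frame type at
    each display position. B-frames wait until the following I- or P-frame
    has been emitted. Trailing B-frames with no later reference in the
    input are simply appended, which is a simplification.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)
        else:
            order.append(index)      # reference frame goes first
            order.extend(pending_b)  # then the B-frames that depend on it
            pending_b = []
    order.extend(pending_b)
    return order
```

For the display sequence "IBBPBBP", this yields the encoding order 0, 3, 1, 2, 6, 4, 5: each P-frame is encoded ahead of the two B-frames that precede it on screen.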
There are many possible GOP structures. A common GOP structure is 15 frames long, and has the sequence I_BB_P_BB_P_BB_P_BB_P_BB_. A similar 12-frame sequence is also common. I-frames encode for spatial redundancy, P and B-frames for both temporal redundancy and spatial redundancy. Because adjacent frames in a video stream are often well-correlated, P-frames and B-frames are only a small percentage of the size of I-frames. However, there is a trade-off between the size to which a frame can be compressed versus the processing time and resources required to encode such a compressed frame. The ratio of I, P and B-frames in the GOP structure is determined by the nature of the video stream and the bandwidth constraints on the output stream, although encoding time may also be an issue. This is particularly true in live transmission and in real-time environments with limited computing resources, as a stream containing many B-frames can take much longer to encode than an I-frame-only file.
B-frames and P-frames require fewer bits to store picture data because they generally contain only difference bits describing the difference between the current frame and a previous frame, a subsequent frame, or both. B-frames and P-frames are thus used to reduce the redundant information contained across frames. In operation, a decoder receives an encoded B-frame or encoded P-frame and uses a previous or subsequent frame to reconstruct the original frame. This process is much easier and produces smoother scene transitions when sequential frames are substantially similar, since the difference between the frames is small.
Each video image is separated into one luminance (Y) and two chrominance channels (also called color difference signals Cb and Cr). Blocks of the luminance and chrominance arrays are organized into “macroblocks,” which are the basic unit of coding within a frame.
In the case of I-frames, the actual image data is passed through an encoding process. However, P-frames and B-frames are first subjected to a process of “motion compensation.” Motion compensation is a way of describing the difference between consecutive frames in terms of where each macroblock of the former frame has moved. Such a technique is often employed to reduce the temporal redundancy of a video sequence for video compression. Each macroblock in a P-frame or B-frame is associated with an area in the previous or next image with which it is well correlated; the encoder selects this area and identifies it by a “motion vector.” The motion vector that maps the macroblock to its correlated area is encoded, and then the difference between the two areas is passed through the encoding process.
Conventional video codecs use motion compensated prediction to efficiently encode a raw input video stream. The macroblock in the current frame is predicted from a displaced macroblock in the previous frame. The difference between the original macroblock and its prediction is compressed and transmitted along with the displacement (motion) vectors. This technique is referred to as inter-coding prediction, which is the approach used in the MPEG standards.
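Motion compensated prediction of a single block can be sketched as an exhaustive block-matching search. The example assumes a sum of absolute differences (SAD) matching criterion, integer-pixel displacements, and a small square search window; none of these choices comes from the text, and the function names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current and a reference block."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def find_motion_vector(cur, ref, cx, cy, size=4, search=2):
    """Search displacements within +/-search pixels for the lowest SAD.

    Returns (dx, dy, cost) for the block whose top-left corner is (cx, cy).
    Candidates that fall outside the reference frame are skipped.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def residual_block(cur, ref, cx, cy, mv, size=4):
    """Difference between the current block and its displaced prediction."""
    dx, dy = mv
    return [
        [cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i] for i in range(size)]
        for j in range(size)
    ]
```

Only the motion vector and the residual block need to be passed on to the rest of the encoding process; when the match is good, the residual is close to zero and compresses well.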
Within the H.264/AVC standard, macroblocks are encoded using a single transform algorithm, the discrete cosine transform (DCT), and a selected one of nine available intra-prediction algorithms. A mode selection algorithm is used to determine the best fit intra-prediction algorithm. The term “intra” refers to the fact that the various compression techniques are performed relative to data that is contained only within the current frame, and not relative to any other frame in the video sequence. In other words, no temporal processing is performed outside of the current picture or frame. Image data is received from an image data source. The coding process varies greatly depending on the type of encoder used, but the most common steps usually include: partitioning into macroblocks, transform, quantization, and entropy encoding.
FIG. 1 illustrates a schematic block diagram of an exemplary AVC-based encoder. The AVC-based encoder utilizes transform T, quantization Q, entropy coding E, and intra-prediction P to encode each macroblock. Although not included in FIG. 1, AVC-based encoders also utilize inter-frame prediction, also referred to as motion compensation. However, for purposes of this discussion, the AVC-based encoder is directed to intra-frame coding techniques. An image or frame to be encoded is partitioned into macroblocks, or blocks. Each block includes a set of pixels, for example a 4×4 block of pixels or an 8×8 block of pixels. The AVC-based encoder compresses the pixel data of each block using the intra-prediction P and the transform T. For each pixel block xi, one of the known intra-predictions, Pk, is used to determine a predicted block Pni for the pixel block xi. In many applications, there are nine intra-predictions available. This predicted block Pni is compared to the actual pixel block xi. The difference between the actual values and the predicted values is referred to as the residual block ei. The intra-prediction Pk generates the predicted block Pni based on similarities between the pixel block xi and the pixels adjacent to the pixel block xi. Specifically, the pixel values in the pixel block xi are predicted using pre-coded adjacent pixel values, referred to as reconstructed neighborhood pixels ni. A mode selection algorithm determines a best fit intra-prediction mode Pk used to generate the predicted block Pni. To determine the best fit intra-prediction mode Pk, the mode selection algorithm applies each of the k available intra-predictions to generate k preliminary prediction results. These k preliminary prediction results are compared using rate distortion measures to determine the best fit.
The rate distortion measure is a linear combination of the number of bits for encoding the block and the sum of squared differences between the original block and the encoded block, as in the VCEG JM software codec and the VCEG KTA software codec. A best fit intra-prediction is determined for each pixel block.
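The mode selection described above can be sketched as a search that minimizes a cost of the form J = D + λ·R, where D is the sum of squared differences and R is the bit count. The example is a simplified illustration under several assumptions: only three of the nine intra-predictions are implemented (vertical, horizontal, DC), the transform is omitted so that the residual is quantized directly, the bit count is a rough hypothetical proxy rather than real entropy coding, and the values of the quantization step and λ are arbitrary.

```python
def predict(mode, top, left, n=4):
    """Build an n x n predicted block from reconstructed neighbor pixels."""
    if mode == "vertical":      # copy the row above down every row
        return [list(top) for _ in range(n)]
    if mode == "horizontal":    # copy the left column across every column
        return [[left[j]] * n for j in range(n)]
    if mode == "dc":            # fill with the mean of all neighbors
        mean = round((sum(top) + sum(left)) / (2 * n))
        return [[mean] * n for _ in range(n)]
    raise ValueError(mode)


def rd_cost(block, pred, step, lam):
    """Return distortion + lam * rate for one candidate prediction."""
    ssd = 0
    bits = 0
    for j in range(len(block)):
        for i in range(len(block[0])):
            level = round((block[j][i] - pred[j][i]) / step)
            recon = pred[j][i] + level * step
            ssd += (block[j][i] - recon) ** 2
            # Hypothetical rate proxy: larger levels cost more bits.
            bits += 1 + 2 * abs(level).bit_length()
    return ssd + lam * bits


def select_mode(block, top, left, step=4, lam=10.0):
    """Evaluate every candidate mode and return (best_mode, all_costs)."""
    costs = {
        mode: rd_cost(block, predict(mode, top, left), step, lam)
        for mode in ("vertical", "horizontal", "dc")
    }
    return min(costs, key=costs.get), costs
```

For a block made of vertical stripes whose columns match the pixels directly above it, the vertical prediction leaves a zero residual and therefore the lowest cost, so it is selected.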
The residual block ei is further compressed using the transform T. In the AVC-based encoder, the transform T uses the discrete cosine transform (DCT) to transform the residual block ei from pixel data into its frequency components. All information contained in the original residual block ei is preserved during transformation, so the transformation is reversible, such as by the inverse transform V.
The transformed residual block is then quantized according to a defined quantization parameter (QP). The quantized results along with an identification of the intra-prediction Pk are coded by entropy coder E. Exemplary entropy coding techniques include, but are not limited to, VLC (variable length coding), CAVLC (context-adaptive variable length coding), and CABAC (context-adaptive binary arithmetic coding). A best effort approximation of inverting the frequency components is performed by the inverse quantization Q−1.
The Key Technical Area (KTA) expands on the H.264/AVC standard. In particular, the KTA includes a Mode Dependent Directional Transform (MDDT) in which exactly one corresponding transform Tk is defined for each of the intra-predictions Pk. For example, mode 1 refers to the pair of intra-prediction P1 and the transform T1. The intra-prediction/transform pairs for each mode k are used together. In contrast, the H.264/AVC standard specifies only a single transform, used irrespective of the intra-prediction Pk. The method used in the KTA-MDDT to determine the best fit intra-prediction Pk is the same as in the H.264/AVC standard. As each intra-prediction and transform pair is previously known and defined, once the best-fit intra-prediction Pk is determined, the transform Tk previously associated with the intra-prediction Pk is automatically known.