A video sequence consists of a number of pictures, usually called frames. Subsequent frames are very similar, thus containing a lot of redundancy from one frame to the next. Before being efficiently transmitted over a channel or stored in memory, video data is compressed to conserve both bandwidth and memory. The goal is to remove the redundancy to gain better compression ratios. A first video compression approach is to subtract a reference frame from a given frame to generate a relative difference. A compressed frame contains less information than the reference frame. The relative difference can be encoded at a lower bit-rate with the same quality. The decoder reconstructs the original frame by adding the relative difference to the reference frame.
A more sophisticated approach is to approximate the motion of the whole scene and the objects of a video sequence. The motion is described by parameters that are encoded in the bit-stream. Pixels of the predicted frame are approximated by appropriately translated pixels of the reference frame. This approach provides an improved predictive ability than a simple subtraction. However, the bit-rate occupied by the parameters of the motion model must not become too large.
In general, video compression is performed according to many standards, including one or more standards for audio and video compression from the Moving Picture Experts Group (MPEG), such as MPEG-1, MPEG-2, and MPEG-4. Additional enhancements have been made as part of the MPEG-4 part 10 standard, also referred to as H.264, or AVC (Advanced Video Coding). Under the MPEG standards, video data is first encoded (e.g. compressed) and then stored in an encoder buffer on an encoder side of a video system. Later, the encoded data is transmitted to a decoder side of the video system, where it is stored in a decoder buffer, before being decoded so that the corresponding pictures can be viewed.
The intent of the H.264/AVC project was to develop a standard capable of providing good video quality at bit rates that are substantially lower than what previous standards would need (e.g. MPEG-2, H.263, or MPEG-4 Part 2). Furthermore, it was desired to make these improvements without such a large increase in complexity that the design is impractical to implement. An additional goal was to make these changes in a flexible way that would allow the standard to be applied to a wide variety of applications such that it could be used for both low and high bit rates and low and high resolution video. Another objective was that it would work well on a very wide variety of networks and systems.
H.264/AVC/MPEG-4 Part 10 contains many new features that allow it to compress video much more effectively than older standards and to provide more flexibility for application to a wide variety of network environments. Some key features include multi-picture motion compensation using previously-encoded pictures as references, variable block-size motion compensation (VBSMC) with block sizes as large as 16×16 and as small as 4×4, six-tap filtering for derivation of half-pel luma sample predictions, macroblock pair structure, quarter-pixel precision for motion compensation, weighted prediction, an in-loop deblocking filter, an exact-match integer 4×4 spatial block transform, a secondary Hadamard transform performed on “DC” coefficients of the primary spatial transform wherein the Hadamard transform is similar to a fast Fourier transform, spatial prediction from the edges of neighboring blocks for “intra” coding, context-adaptive binary arithmetic coding (CABAC), context-adaptive variable-length coding (CAVLC), a simple and highly-structured variable length coding (VLC) technique for many of the syntax elements not coded by CABAC or CAVLC, referred to as Exponential-Golomb coding, a network abstraction layer (NAL) definition, switching slices, flexible macroblock ordering, redundant slices (RS), supplemental enhancement information (SEI) and video usability information (VUI), auxiliary pictures, frame numbering and picture order count. These techniques, and several others, allow H.264 to perform significantly better than prior standards, and under more circumstances and in more environments. H.264 usually performs better than MPEG-2 video by obtaining the same quality at half of the bit rate or even less.
MPEG is used for the generic coding of moving pictures and associated audio and creates a compressed video bit-stream made up of a series of three types of encoded data frames. The three types of data frames are an intra frame (called an I-frame or I-picture), a bi-directional predicted frame (called a B-frame or B-picture), and a forward predicted frame (called a P-frame or P-picture). These three types of frames can be arranged in a specified order called the GOP (Group Of Pictures) structure. I-frames contain all the information needed to reconstruct a picture. The I-frame is encoded as a normal image without motion compensation. On the other hand, P-frames use information from previous frames and B-frames use information from previous frames, a subsequent frame, or both to reconstruct a picture. Specifically, P-frames are predicted from a preceding I-frame or the immediately preceding P-frame.
Frames can also be predicted from the immediate subsequent frame. In order for the subsequent frame to be utilized in this way, the subsequent frame must be encoded before the predicted frame. Thus, the encoding order does not necessarily match the real frame order. Such frames are usually predicted from two directions, for example from the I- or P-frames that immediately precede or the P-frame that immediately follows the predicted frame. These bidirectionally predicted frames are called B-frames.
There are many possible GOP structures. A common GOP structure is 15 frames long, and has the sequence I_BB_P_BB_P_BB_P_BB_P_BB_. A similar 12-frame sequence is also common. I-frames encode for spatial redundancy, P and B-frames for temporal redundancy. Because adjacent frames in a video stream are often well-correlated, P-frames and B-frames are only a small percentage of the size of I-frames. However, there is a trade-off between the size to which a frame can be compressed versus the processing time and resources required to encode such a compressed frame. The ratio of I, P and B-frames in the GOP structure is determined by the nature of the video stream and the bandwidth constraints on the output stream, although encoding time may also be an issue. This is particularly true in live transmission and in real-time environments with limited computing resources, as a stream containing many B-frames can take much longer to encode than an I-frame-only file.
B-frames and P-frames require fewer bits to store picture data, generally containing difference bits for the difference between the current frame and a previous frame, subsequent frame, or both. B-frames and P-frames are thus used to reduce redundancy information contained across frames. In operation, a decoder receives an encoded B-frame or encoded P-frame and uses a previous or subsequent frame to reconstruct the original frame. This process is much easier and produces smoother scene transitions when sequential frames are substantially similar, since the difference in the frames is small.
Each video image is separated into one luminance (Y) and two chrominance channels (also called color difference signals Cb and Cr). Blocks of the luminance and chrominance arrays are organized into “macroblocks,” which are the basic unit of coding within a frame.
In the case of I-frames, the actual image data is passed through an encoding process. However, P-frames and B-frames are first subjected to a process of “motion compensation.” Motion compensation is a way of describing the difference between consecutive frames in terms of where each macroblock of the former frame has moved. Such a technique is often employed to reduce temporal redundancy of a video sequence for video compression. Each macroblock in the P-frames or B-frame is associated with an area in the previous or next image that it is well-correlated, as selected by the encoder using a “motion vector.” The motion vector that maps the macroblock to its correlated area is encoded, and then the difference between the two areas is passed through the encoding process.
Conventional video codecs use motion compensated prediction to efficiently encode a raw input video stream. The macroblock in the current frame is predicted from a displaced macroblock in the previous frame. The difference between the original macroblock and its prediction is compressed and transmitted along with the displacement (motion) vectors. This technique is referred to as inter-coding, which is the approach used in the MPEG standards.
The output bit-rate of an MPEG encoder can be constant or variable, with the maximum bit-rate determined by the playback media. To achieve a constant bit-rate, the degree of quantization is iteratively altered to achieve the output bit-rate requirement. Increasing quantization leads to visible artifacts when the stream is decoded. The discontinuities at the edges of macroblocks become more visible as the bit-rate is reduced.
When the bit rate is fixed, the effective bit allocation can obtain better visual quality in video encoding. Conventionally, each frame is divided into foreground and background. More bits are typically allocated to the foreground objects and fewer bit are allocated to the background area based on the reasoning that viewers focus more on the foreground than the background. Such reasoning is based on the assumption that the viewer may not see the difference in the background if they do not focus on it. However, this is not always the case. Moreover, due to the characteristics of the H.264 standard, less bits in the background often leads to blurring, and the intra refresh phenomenon is very obvious when the background quality is low. The refresh in the static area, usually the background, annoys the human eye significantly and thus influences the visual quality.
To improve the quality of the background, a simple method allocates more bits to the background. This strategy will reduce the bits allocated to the foreground area, which is not an acceptable trade-off. Also, to make the fine details observable, the quantization scale needs to be reduced considerably, which means the bit-rate budget will be exceeded.
Another disadvantage is that the assumption of repetition of image sequence content is not true for most of the sequence. In most cases, the motion is mostly going along in one direction within several seconds. There is a limited match in previous frames for uncovered objects in the current frame. Unfortunately, state of the art long term motion prediction methods focus on the earlier frames as the reference.
An objective of the H.264 standard is to enable quality video at bit-rates that are substantially lower than what the previous standards would need. An additional objective is to provide this functionality in a flexible manner that allows the standard to be applied to a very wide variety of applications and to work well on a wide variety of networks and systems. Unfortunately, conventional encoders employing the MPEG standards tend to blur the fine texture details even in a relative high bit-rate. Also, the I-frame refresh is very obvious when the low bit-rate is used. As such, whenever an I-frame is displayed, the quality is much greater than the previous, non I-frames, which produces a discontinuity whenever the I-frame is displayed. Such a discontinuity is noticeable to the user. Although the MPEG video coding standard specifies a general coding methodology and syntax for the creation of a legitimate MPEG bit-stream, there are many opportunities left open to improve the quality of MPEG bit-streams.
In H.264/AVC intra coding, two intra macroblock modes are supported. They include the 4×4 intra-prediction mode and the 16×16 intra-prediction mode. If a sub-block or macroblock is encoded in intra mode, a prediction block is formed based on previously encoded and reconstructed blocks. The prediction block is subtracted from the current block prior to encoding. There are a total of 9 prediction modes for 4×4 luma blocks and 4 prediction modes for 16×16 luma blocks. The prediction mode that has the least distortion is chosen as the best mode for the block.
There are many methods to find the distortion, but the sum of absolute differences (SAD) or the sum of absolute transformed distances (SATD) are commonly used in low-complexity mode decision. Usually the coding performance by selecting SATD is 0.2-0.5 dB better, but computing the SATD is more time-consuming than finding the SAD.
Text Description of Joint Model Reference Encoding Methods and Decoding Concealment Methods, <http://ftp3.itu.ch/av-arch/jvt-site/2005—01_HongKong/JVT-N046r1.doc> (February 2005) discloses a method for concealing errors and losses in a decoder for video data conforming to the H.264 standard. The article discloses a method of calculating SATD, and an application of Hadamard transforms for computing SATD.
U.S. Patent Application No. 2005/0069211 to Lee et al. discloses a prediction method for calculating an average of intra-prediction costs or an average of inter-prediction costs of macroblocks of a received picture by encoding the received picture using an intra-prediction and/or inter-prediction in consideration of a type of the received picture, calculating a threshold value using the calculated average of intra-predication costs and/or inter-predication costs, and determining whether to perform intra-prediction on a subsequent picture based on the calculated threshold value. The amount of computations is reduced by reducing the number of macroblocks that undergo intra-prediction.