The development of low bit-rate encoding techniques for reducing the storage and transmission requirements of digital audio and video, coupled with the availability of consumer broadband Internet connections, has made it possible and cost-effective to transmit large amounts of media over the Internet. This same technology makes it possible for consumers to record and share video content over the Internet.
There are many encoding technologies used for storing, transmitting, and distributing digital video. The international standards developed by the Moving Picture Experts Group (MPEG-1, MPEG-2, and MPEG-4) have achieved widespread adoption.
The first MPEG digital video and audio encoding standard, ISO/IEC 11172 (MPEG-1), was adopted as an international standard in 1992. The MPEG-1 standard provides VHS-quality digital video and audio at a bit rate of approximately 1.5 Mbps for CD-ROM playback. MPEG-1 video compression exploits the information redundancy within individual video frames as well as that between video frames to compress a video sequence.
Video frames are encoded using the YCbCr luminance (brightness) and chrominance (color difference) representation. This luminance/chrominance representation is used because the human eye distinguishes differences in brightness more readily than differences in color. As a result, the Cb and Cr chrominance components can be encoded at a lower resolution than the Y luminance components without significantly impacting the video quality. The YCbCr data is processed using the Discrete Cosine Transform (DCT) to compact the signal energy prior to quantization. MPEG-1 video encoding utilizes three frame types: Intra-coded frames, forward-Predicted frames, and Bidirectionally-predicted frames. Intra-coded frames, or I-frames, are self-contained and do not require information from previous or future frames to be decoded. Forward-predicted frames, or P-frames, are encoded as predictions relative to a previous I-frame or P-frame. Bidirectionally-predicted frames, or B-frames, are encoded as predictions relative to either a previous or future I-frame or P-frame, or both.
International Standard ISO/IEC 13818 (MPEG-2) was published as a standard in 1994 and provides higher quality and higher bitrate video encoding than MPEG-1. MPEG-2, which is backwards compatible with MPEG-1, was designed to be very scalable and flexible, supporting bitrates ranging from approximately 2 Mbps to more than 20 Mbps and video resolutions ranging from 352×240 pixels to 1920×1080 pixels. In addition, MPEG-2 added support for encoding interlaced video.
ISO/IEC International Standard 14496 (MPEG-4) is the most recent MPEG encoding standard and was ratified in 1999. MPEG-4 is a scalable standard supporting data rates from less than 64 Kbps for Internet streaming video to about 4 Mbps for higher-bandwidth applications. MPEG-4 differs from MPEG-2 and MPEG-1 in that it includes object recognition and encoding, as well as synchronized text and metadata tracks. MPEG-4 supports both progressive and interlaced video encoding and is object-based, coding multiple video object planes into images of arbitrary shape.
The popular DivX encoder commonly used for sharing of video files over the Internet is based on MPEG-4.
Models of Visual Perception
All video encoders rely on properties of the human visual system. This section discusses the visual perceptual model as it relates to video coding. Distortion sensitivity profiles for human perception are specified as functions of frequency, luminance, texture, and temporal parameters.
Frequency Sensitivity
The human visual system exhibits different levels of sensitivity to visual information at different frequencies. This characteristic is known as the Contrast Sensitivity Function (CSF), and is generally accepted to have a band-pass frequency shape. Video encoders can use the CSF response to guide the allocation of coding bits to shape the resulting distortion so that it is less visible.
Luminance Sensitivity
The human visual system's ability to detect objects against a background varies depending on the background luminance level. In general, the human visual system is most sensitive with a medium luminance background and least sensitive with either very dark or very bright backgrounds.
Texture Sensitivity
The human visual system is less able to detect objects in areas where an image exhibits significant variations in the background luminance. This effect is known as texture masking and can be exploited by a video codec to improve coding efficiency.
Temporal Sensitivity
The human visual system is less able to detect details in objects that are moving within a video sequence. This effect is known as temporal masking and can be exploited by a video codec to improve coding efficiency.
Just-Noticeable-Distortion (JND)
The just-noticeable distortion (JND) profile is the visibility threshold of distortion, below which distortions are imperceptible. The JND profile of an image depends on the image contents. All of the characteristics of the human visual system described earlier—frequency, luminance, texture and temporal sensitivity—should be taken into consideration in deriving the JND profile. FIG. 1 shows JND plotted as a function of inter-frame luminance difference.
Video Coding Technologies
The following sections present some of the common techniques which exploit the human visual properties and which form the basis of current video coding technologies.
The basic processing blocks of an exemplary video encoder applied on an intra-frame basis are: the video filter, discrete cosine transform, coefficient quantizer, and run-length coding/variable length coding. FIG. 2 shows an example of the process for intra-frame video coding.
Color Space Sub-sampling
Video encoders operate on a color space that takes advantage of the eye's different sensitivity to luminance and chrominance information. As such, video encoders use the YCbCr or YUV color space to allow the luminance and chrominance to be encoded at different resolutions. Typically the chrominance information is encoded at one-quarter or one-half the resolution of the luminance information. The chrominance signals need to be filtered to generate this format. The actual filtering technique is left to the system designer as one of several parameters that may be optimized on a cost versus performance basis.
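The sub-sampling step described above can be illustrated with a minimal sketch. Because the filtering technique is left to the system designer, the 2×2 averaging (box) filter used here is only one assumed choice, and the function name subsample_chroma_2x2 is hypothetical. Halving a chrominance plane in both dimensions yields one-quarter of the luminance resolution.

```python
def subsample_chroma_2x2(plane):
    """Halve a chroma plane in both dimensions by averaging each 2x2 block.

    `plane` is a list of equal-length rows with even height and width.
    The box filter is an assumption; other filters are equally valid.
    """
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("plane dimensions must be even")
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


# A 4x4 Cb plane reduced to 2x2 (one chroma sample per four luma samples).
cb = [
    [100, 102, 50, 50],
    [98, 100, 50, 50],
    [10, 10, 20, 40],
    [10, 10, 60, 80],
]
cb_quarter = subsample_chroma_2x2(cb)
```

A longer filter would reduce aliasing at a higher computational cost, which is the cost versus performance trade-off noted above.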
DCT Coefficients
Video encoders use an invertible transform to reduce the correlation between neighboring pixels within an image. The Discrete Cosine Transform (DCT) has been shown to be near optimal for a large class of images in terms of energy concentration and de-correlation. The DCT transforms a spatial 8×8 pixel block into a block of 8×8 DCT coefficients. Each coefficient represents a weighting value for each of the 64 orthogonal basis patterns shown in FIG. 3. The DCT coefficients toward the upper left-hand corner of the coefficient matrix correspond to smoother spatial contours, while the DCT coefficients toward the lower right-hand corner of the coefficient matrix correspond to finer spatial patterns.
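As an illustration of the energy concentration described above, the following sketch computes a direct (unoptimized) two-dimensional DCT of an 8×8 block. The orthonormal scaling and the function name dct_2d are assumptions made for the example; practical encoders use fast factorizations and the scaling and precision rules of the applicable standard.

```python
import math

N = 8  # block size used by the DCT described above


def dct_2d(block):
    """Direct orthonormal 2-D DCT-II of an N x N block (list of rows).

    coeffs[u][v]: u is the vertical and v the horizontal frequency index,
    so coeffs[0][0] (upper left) is the DC term.
    """
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# A perfectly flat block: all of the energy lands in the upper-left coefficient.
flat = [[128] * N for _ in range(N)]
coeffs = dct_2d(flat)
```

For the flat block, only the upper left-hand coefficient is non-zero, which is the extreme case of a smooth spatial contour.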
Variable-Length Coding
Video codecs take advantage of the human visual system's lower sensitivity to high frequency distortions by more coarsely quantizing or even omitting the high frequency DCT coefficients. When there are numerous zero-valued DCT coefficients, considerable coding efficiency can be achieved by representing these zero coefficients using a run-length coding scheme. But before that process is performed, more efficiency can be gained by reordering the DCT coefficients in a zigzag-scanning pattern as shown in FIG. 4.
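The reordering and run-length steps can be sketched as follows. The scan order generated here is the conventional zigzag pattern and is assumed to correspond to the figure. The (run, value) pair format, the "EOB" end-of-block marker, and the function names are illustrative assumptions rather than the syntax of any particular standard, which would typically also code the DC coefficient separately.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):
        # All positions on anti-diagonal s, listed with the row increasing.
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diagonal.reverse()  # even diagonals are walked bottom-left to top-right
        order.extend(diagonal)
    return order


def run_length_encode(values):
    """Encode a coefficient list as (zero_run, value) pairs plus an end marker."""
    pairs, run = [], 0
    for value in values:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs


# A quantized block whose only non-zero coefficients are low-frequency ones.
quantized = [[0] * 8 for _ in range(8)]
quantized[0][0], quantized[0][1], quantized[1][0], quantized[1][1] = 12, 5, -3, 2
scanned = [quantized[r][c] for r, c in zigzag_order()]
encoded = run_length_encode(scanned)
```

Because the zigzag scan visits the low-frequency coefficients first, the 60 trailing zeros of this block collapse into the single end-of-block marker.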
Predictive Coding
Intra-frame (or I frame) coding techniques are limited to processing the current video frame on a spatial basis. Considerably more compression efficiency can be obtained with inter-frame coding techniques which exploit the temporal or time-based redundancies. Inter-frame coding uses a technique known as block-based motion compensated prediction using motion estimation. Inter-frame coding techniques are used within P-frames or B-frames.
Forward-predicted frames or P-frames are predicted from a previous I or P-frame. Bi-directional interpolated prediction frames or B-frames are predicted and interpolated from a previous I or P-frame and/or a succeeding I or P-frame.
As an example of the usage of I, P, and B frames, consider a group of pictures that is six frames long, so that the sequence is I,B,P,B,P,B,I,B,P,B,P,B, . . . The I frames are coded spatially and the P frames are forward predicted based on previous I and P frames. B frames are coded based on forward prediction from a previous I or P frame, as well as backward prediction from a succeeding I or P frame. As such, the example sequence is processed by the encoder such that the first B frame is predicted from the first I frame and first P frame, the second B frame is predicted from the first and second P frames, and the third B frame is predicted from the second P frame and the first I frame of the next group of pictures. Most broadcast quality applications have tended to use two consecutive B frames as the ideal trade-off between compression efficiency and video quality as shown in FIG. 5.
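The reference relationships in this example can be derived mechanically. The following sketch, in which the function name reference_frames and the tuple representation are hypothetical, lists for each frame the indices of the I or P frames from which it is predicted.

```python
def reference_frames(frame_types):
    """Map each frame index to the indices of the frames it is predicted from.

    `frame_types` is a string of 'I', 'P' and 'B' in display order.
    I frames have no references, P frames use the nearest previous I or P
    frame, and B frames use the nearest previous and nearest succeeding
    I or P frame. A missing reference is reported as None.
    """
    anchors = [i for i, t in enumerate(frame_types) if t in "IP"]
    refs = {}
    for i, frame_type in enumerate(frame_types):
        if frame_type == "I":
            refs[i] = ()
            continue
        previous = max((a for a in anchors if a < i), default=None)
        if frame_type == "P":
            refs[i] = (previous,)
        else:
            following = min((a for a in anchors if a > i), default=None)
            refs[i] = (previous, following)
    return refs


# One six-frame group of pictures plus the I frame that opens the next group.
refs = reference_frames("IBPBPBI")
```

Frame 5, the third B frame, refers to frame 4 (the second P frame) and frame 6 (the I frame of the next group), matching the description above.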
The main advantage of using B frames is coding efficiency. In most cases, B frames will result in lower bit consumption. Use of B frames can also improve quality in the case of moving objects that reveal hidden areas within a video sequence. Backward prediction in this case allows the encoder to make more intelligent decisions on how to encode the video within these areas. Since B frames are not used to predict future frames, errors generated will not be propagated further within the sequence.
Motion Estimation and Compensation
The temporal prediction used in video encoders is based on motion estimation. The basic premise of motion estimation is that in most cases, consecutive video frames will be similar except for changes induced by objects moving within the frames. In the trivial case of zero motion between frames (and no other differences caused by noise), it is easy for the encoder to efficiently predict the current frame as a duplicate of the prediction frame. When this is done, the only information necessary to transmit to the decoder becomes the syntactic overhead necessary to reconstruct the picture from the original reference frame. When there is motion between frames, the situation is not as simple. The problem is to adequately represent the changes, or differences, between two video frames.
Motion estimation solves this problem by performing a comprehensive 2-dimensional spatial search for each luminance macroblock. Motion estimation is not calculated using chrominance, as it is assumed that the color motion can be adequately represented with the same motion information as the luminance. It should also be noted that video encoding standards do not define how this search should be performed. This is a detail that the system designer can choose to implement in one of many possible ways. It is well known that a full, exhaustive search over a wide 2-dimensional area yields the best matching results in most cases, but this performance comes at a high computational cost. As motion estimation is usually the most computationally expensive portion of the video encoder, some lower cost encoders choose to limit the pixel search range, or use other techniques such as telescopic searches, usually at some reduction in video quality.
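Since the standards do not define the search, the following is only one possible sketch: an exhaustive search that minimizes the sum of absolute differences (SAD) between a luminance macroblock and candidate positions in the reference frame. The SAD metric, the ±7 pixel search range, the helper make_frame, and all function names are assumptions chosen for illustration.

```python
def sad(current, reference, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size luminance blocks."""
    return sum(abs(current[cy + j][cx + i] - reference[ry + j][rx + i])
               for j in range(size) for i in range(size))


def full_search(current, reference, cx, cy, size=16, search_range=7):
    """Exhaustively search the reference frame for the best match.

    Returns (dx, dy, cost), where (dx, dy) is the displacement from the
    macroblock at (cx, cy) in the current frame to its best match in the
    reference frame. Candidates falling outside the frame are skipped.
    """
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(current, reference, cx, cy, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def make_frame(width, height, obj_x, obj_y, obj_size=8):
    """Synthetic test frame: black background with one textured square."""
    frame = [[0] * width for _ in range(height)]
    for j in range(obj_size):
        for i in range(obj_size):
            frame[obj_y + j][obj_x + i] = 50 + 8 * j + i
    return frame


# The object moves 3 pixels right and 2 pixels up between the two frames.
reference = make_frame(32, 32, obj_x=10, obj_y=12)
current = make_frame(32, 32, obj_x=13, obj_y=10)
vector = full_search(current, reference, cx=8, cy=8)
```

The search evaluates up to 225 candidate positions of 256 pixels each for a single macroblock, which illustrates why lower cost encoders restrict the search range or use faster search strategies.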
Motion estimation operates by matching at the macroblock level. When a relatively good match has been found, the encoder assigns motion vectors to the macroblock that indicate how far horizontally and vertically the macroblock must be moved so that a match is made. As such, each forward and backward predicted macroblock contains two motion vectors and true bi-directionally predicted macroblocks utilize four motion vectors. FIG. 6 illustrates the calculation of motion vectors.
Macroblock Coding
After motion estimation is complete, the predicted frame is subtracted from the desired frame, leaving a less complicated residual error frame that can then be encoded much more efficiently. It can be seen that the more accurately the motion is estimated and matched, the more likely it will be that the residual error will approach zero, and the coding efficiency will be highest. Further coding efficiency is accomplished by taking advantage of the fact that motion vectors tend to be highly correlated between macroblocks. Because of this, the horizontal component is compared to the previously valid horizontal motion vector and only the difference is coded. This same difference is calculated for the vertical component before coding. These difference codes are then described with a variable length code for maximum compression efficiency. Of course, not every macroblock search will result in an acceptable match. If the encoder decides that no acceptable match exists (again, the acceptance criterion is not defined by the video codec, and is up to the system designer) then it has the option of coding that particular macroblock as an intra macroblock, even though it may be in a P or B frame. In this manner, high quality video is maintained at a slight cost to coding efficiency. FIG. 7 illustrates an example of a general decision tree of macroblock coding models.
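The differential coding of motion vectors described above can be sketched as follows. The initial predictor of (0, 0) and the function names are assumptions for illustration; the variable length coding of the resulting differences is omitted.

```python
def differential_vectors(vectors):
    """Replace each (x, y) motion vector by its difference from the previous one.

    The predictor is assumed to start at (0, 0), so the first vector is
    passed through unchanged.
    """
    previous_x, previous_y = 0, 0
    differences = []
    for x, y in vectors:
        differences.append((x - previous_x, y - previous_y))
        previous_x, previous_y = x, y
    return differences


def reconstruct_vectors(differences):
    """Decoder side: accumulate the differences to recover the vectors."""
    x, y = 0, 0
    vectors = []
    for dx, dy in differences:
        x, y = x + dx, y + dy
        vectors.append((x, y))
    return vectors


# Vectors of neighboring macroblocks that follow the same moving object.
vectors = [(4, -2), (5, -2), (5, -1), (4, -1)]
differences = differential_vectors(vectors)
```

Because neighboring vectors are correlated, the differences cluster around zero, and small values can be assigned the shortest variable length codes.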
The macroblock may be encoded in any of the models, and each coding model consumes a different amount of the available bits. The macroblock coding model attack is designed to control the model decision resulting in the selection of a higher bit rate model.
Video Buffer and Rate Control
Most video encoders will encode video sequences to a specified bitrate. Meeting the desired bitrate is a rather complicated task as the encoder must deal with drastically different coding efficiencies for different regions within a video frame, different frames within a video sequence, and different coding methods for each video frame.
Because of these variations, it is necessary to buffer the encoded bit-stream before it is transmitted. Since the buffer must necessarily be limited in size (physical limitations and delay constraints), a feedback system must be used as a rate control mechanism to prevent underflow or overflow within the buffer.
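A feedback loop of this kind can be sketched under simplifying assumptions. In the sketch below, the quantizer scale is a linear function of buffer fullness, the range of 1 to 31 is assumed, and the bits produced by a frame are modeled as a hypothetical complexity value divided by the quantizer scale. None of these choices is prescribed by the description above, and practical rate control algorithms are considerably more elaborate.

```python
def quantizer_from_fullness(fullness, capacity, q_min=1, q_max=31):
    """Map buffer fullness to a quantizer scale: a fuller buffer quantizes more coarsely."""
    ratio = min(max(fullness / capacity, 0.0), 1.0)
    return round(q_min + ratio * (q_max - q_min))


def simulate_rate_control(frame_complexities, bits_out_per_frame, capacity):
    """Track buffer fullness while the quantizer is adjusted frame by frame.

    Assumed model: a frame produces complexity / quantizer bits, and the
    channel drains a fixed number of bits per frame period. Fullness is
    not clamped, so a value below 0 or above `capacity` in the returned
    history indicates underflow or overflow.
    """
    fullness = capacity / 2  # start half full (assumption)
    history = []
    for complexity in frame_complexities:
        quantizer = quantizer_from_fullness(fullness, capacity)
        produced = complexity / quantizer
        fullness = fullness + produced - bits_out_per_frame
        history.append((quantizer, fullness))
    return history


# Hypothetical complexities for an I, B, P, B, P, B group of pictures.
complexities = [2_400_000, 400_000, 800_000, 400_000, 800_000, 400_000]
history = simulate_rate_control(complexities, bits_out_per_frame=50_000,
                                capacity=400_000)
```

The costly I frame raises the buffer fullness, which in turn raises the quantizer scale applied to the frames that follow; this is the feedback that keeps the buffer from overflowing.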