MPEG Background
MPEG-2 and MPEG-4 are international video compression standards defining a video syntax that provides an efficient way to represent image sequences in the form of more compact coded data. The language of the coded bits is the “syntax.” For example, a few tokens can represent an entire block of samples (e.g., 64 samples for MPEG-2). Both MPEG standards also describe a decoding (reconstruction) process where the coded bits are mapped from the compact representation into an approximation of the original format of the image sequence. For example, a flag in the coded bitstream signals whether the following bits are to be preceded with a prediction algorithm prior to being decoded with a discrete cosine transform (DCT) algorithm. The algorithms comprising the decoding process are regulated by the semantics defined by these MPEG standards. This syntax can be applied to exploit common video characteristics such as spatial redundancy, temporal redundancy, uniform motion, spatial masking, etc. In effect, these MPEG standards define a programming language as well as a data format. An MPEG decoder must be able to parse and decode an incoming data stream, but so long as the data stream complies with the corresponding MPEG syntax, a wide variety of possible data structures and compression techniques can be used (although technically this deviates from the standard since the semantics are not conformant). It is also possible to carry the needed semantics within an alternative syntax.
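As a rough illustration of how a few tokens can stand in for an entire 64-sample block, the sketch below applies a textbook 8×8 two-dimensional DCT to a flat block. The function name `dct_8x8` and the sample value 100 are assumptions chosen for illustration. A real encoder also quantizes and entropy-codes the coefficients, which is not shown here.

```python
import math

N = 8  # MPEG-2 DCT blocks are 8x8 samples


def dct_8x8(block):
    """Forward 2-D DCT-II of an 8x8 block (textbook definition, not optimized)."""
    def c(k):
        # Normalization factor for the zero-frequency basis function.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = (2.0 / N) * c(u) * c(v) * total
    return coeffs


# A flat block: all 64 samples share one (hypothetical) value.
flat_block = [[100] * N for _ in range(N)]
coeffs = dct_8x8(flat_block)

# Collect the positions of coefficients that are not numerically zero.
significant = [(u, v) for u in range(N) for v in range(N)
               if abs(coeffs[u][v]) > 1e-6]
print(significant)
```

For this flat block only the coefficient at position (0, 0) is significant, so a single value describes all 64 samples. Blocks with more detail produce more non-zero coefficients.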
These MPEG standards use a variety of compression methods, including intraframe and interframe methods. In most video scenes, the background remains relatively stable while action takes place in the foreground. The background may move, but a great deal of the scene is redundant. These MPEG standards start compression by creating a reference frame called an “intra” frame or “I frame”. I frames are compressed without reference to other frames and thus contain an entire frame of video information. I frames provide entry points into a data bitstream for random access, but can only be moderately compressed. Typically, the data representing I frames is placed in the bitstream every 12 to 15 frames (although it is also useful in some circumstances to use much wider spacing between I frames). Thereafter, since only a small portion of each frame that falls between the reference I frames differs from the bracketing I frames, only the image differences are captured, compressed, and stored. Two types of frames are used for such differences: predicted or P frames, and bi-directionally interpolated or B frames.
P frames generally are encoded with reference to a past frame (either an I frame or a previous P frame), and, in general, are used as a reference for subsequent P frames. P frames receive a fairly high amount of compression. B frames provide the highest amount of compression but require both a past and a future reference frame in order to be encoded. Bi-directional frames are never used for reference frames in standard compression technologies.
Macroblocks are regions of image pixels. For MPEG-2, a macroblock is a 16×16 pixel grouping of four 8×8 DCT blocks, together with one motion vector for P frames, and one or two motion vectors for B frames. Macroblocks within P frames may be individually encoded using either intra-frame or inter-frame (predicted) coding. Macroblocks within B frames may be individually encoded using intra-frame coding, forward predicted coding, backward predicted coding, or both forward and backward (i.e., bi-directionally interpolated) predicted coding. A slightly different but similar structure is used in MPEG-4 video coding.
After coding, an MPEG data bitstream comprises a sequence of I, P, and B frames. A sequence may consist of almost any pattern of I, P, and B frames (there are a few minor semantic restrictions on their placement). However, it is common in industrial practice to have a fixed pattern (e.g., IBBPBBPBBPBBPBB).
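Because a B frame needs both a past and a future reference frame, frames are generally not coded in the order in which they are displayed. The following minimal sketch reorders a display-order pattern so that each reference frame (I or P) precedes the B frames that depend on it. The function name `coding_order` and the handling of trailing B frames are assumptions made for illustration and are not taken from the MPEG standards.

```python
def coding_order(display_types):
    """Return frame indices in a plausible coding order.

    display_types is a string of 'I', 'P', and 'B' characters in display order.
    Each I or P frame is emitted before the B frames that precede it in
    display order, since those B frames use it as their future reference.
    """
    order = []
    pending_b = []  # B frames waiting for their future reference frame
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)
        else:
            order.append(index)      # reference frame goes first
            order.extend(pending_b)  # then the B frames that needed it
            pending_b = []
    # Assumption: trailing B frames with no later reference are flushed as-is.
    order.extend(pending_b)
    return order


pattern = "IBBPBBP"
print(coding_order(pattern))  # [0, 3, 1, 2, 6, 4, 5]
```

In this example the P frame at display position 3 is coded before the B frames at positions 1 and 2, which lie between it and the I frame at position 0.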
MPEG Color Space Representation
MPEG-1, MPEG-2, and MPEG-4 all utilize a Y, U, V color space for compression. There is a choice of luminance equation, but a typical conversion transformation from RGB (red-green-blue) to a YUV representation is expressed as:
Y = 0.59G + 0.29R + 0.12B
U = R − Y
V = B − Y
The Y luminance factors for green range from 0.55 up to 0.75, depending upon the color system. The factors for red range from 0.2 to 0.3, and the factors for blue range from 0.05 to 0.15.
This transformation can be cast as a matrix transformation, which is a linear operator intended for use on linear signals. However, this simple transformation is performed in MPEG-1, MPEG-2, and MPEG-4 in the non-linear video space, yielding various artifacts and problems.
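The conversion and its matrix form can be sketched as follows, using the typical luminance weights quoted above. The function and constant names (`rgb_to_yuv`, `yuv_to_rgb`, `RGB_TO_YUV`) are hypothetical, and no scaling or offset of U and V is applied, since the text does not specify any.

```python
# Luminance weights from the typical equation in the text.
KR, KG, KB = 0.29, 0.59, 0.12


def rgb_to_yuv(r, g, b):
    """Convert RGB to YUV using Y = 0.59G + 0.29R + 0.12B, U = R - Y, V = B - Y."""
    y = KR * r + KG * g + KB * b
    return y, r - y, b - y


def yuv_to_rgb(y, u, v):
    """Invert rgb_to_yuv: recover R and B directly, then solve for G."""
    r = u + y
    b = v + y
    g = (y - KR * r - KB * b) / KG
    return r, g, b


# The same forward transformation written as a 3x3 matrix acting on (R, G, B).
RGB_TO_YUV = [
    [KR,        KG,  KB],        # Y
    [1.0 - KR, -KG, -KB],        # U = R - Y
    [-KR,      -KG,  1.0 - KB],  # V = B - Y
]


def apply_matrix(matrix, vector):
    """Multiply a 3x3 matrix by a 3-element vector."""
    return tuple(sum(m * x for m, x in zip(row, vector)) for row in matrix)


rgb = (0.8, 0.4, 0.2)
yuv = rgb_to_yuv(*rgb)
print(yuv, apply_matrix(RGB_TO_YUV, rgb), yuv_to_rgb(*yuv))
```

The code is indifferent to whether its inputs are linear or non-linear values. The concern raised in the text is that the MPEG standards apply this linear operator to non-linear video signals.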
It is typical in MPEG to reduce the resolution of the U and V chroma channels to achieve higher compression. The most commonly used reduction of resolution is to use half resolution both vertically and horizontally. MPEG-2 supports full resolution chroma, as well as half resolution horizontally. However, the most commonly used MPEG-2 profiles, Main Profile at Main Level (MP @ ML) and Main Profile at High Level (MP @ HL), use half resolution horizontally and vertically. MPEG-4 versions 1 and 2 use half resolution vertically and horizontally. Note that full chroma resolution is often called 4:4:4, half chroma horizontal resolution is often called 4:2:2, and half vertical and horizontal resolution is often called 4:2:0. (It should be noted that the 4:x:x nomenclature is flawed in its meaning and derivation, but it is common practice to use it to describe the chroma resolution relationship to luminance.)
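As a minimal sketch of the half-resolution (4:2:0) case, the code below halves a chroma plane both horizontally and vertically by averaging each 2×2 neighborhood. The box filter and the function name `subsample_420` are assumptions for illustration. The text does not state which filter is used, and practical encoders typically use longer filters.

```python
def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging 2x2 neighborhoods.

    plane is a list of rows of chroma samples. Even dimensions are assumed;
    a trailing odd row or column would be ignored.
    """
    height = len(plane)
    width = len(plane[0])
    out = []
    for row in range(0, height - 1, 2):
        out_row = []
        for col in range(0, width - 1, 2):
            total = (plane[row][col] + plane[row][col + 1]
                     + plane[row + 1][col] + plane[row + 1][col + 1])
            out_row.append(total / 4.0)
        out.append(out_row)
    return out


# A hypothetical 4x4 chroma plane (U or V), reduced to 2x2.
u_plane = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 2, 2],
    [4, 4, 6, 6],
]
print(subsample_420(u_plane))  # [[0.0, 8.0], [4.0, 4.0]]
```

The output holds one quarter as many chroma samples as the input, which corresponds to the 4:2:0 relationship between chroma and luminance.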
The filter which reduces the horizontal and vertical chroma resolution under the various MPEG standards is applied to non-linear video signals as transformed into the U and V color representation. When the inverse transformation is applied to recover RGB, the non-linear signals and the filters interact in such a way as to produce artifacts and problems. These problems can be generalized as “crosstalk” between the Y luminance and the U and V chroma channels, along with spatial aliasing.
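One ingredient of this problem can be shown with a small numerical sketch: averaging two values in a non-linear domain does not give the same result as averaging the corresponding linear light levels. The pure power-law transfer function with exponent 2.2 is an assumption made for illustration only. The sketch does not reproduce the specific luminance/chroma crosstalk described above.

```python
GAMMA = 2.2  # assumed power-law exponent, for illustration only


def to_linear(code_value):
    """Map a non-linear (gamma-encoded) value in [0, 1] to linear light."""
    return code_value ** GAMMA


def to_nonlinear(linear_value):
    """Map linear light in [0, 1] to a non-linear (gamma-encoded) value."""
    return linear_value ** (1.0 / GAMMA)


black, white = 0.0, 1.0  # two adjacent pixels, as non-linear code values

# Filtering (here, a two-tap average) applied directly to the code values.
avg_nonlinear_domain = (black + white) / 2.0
light_from_nonlinear_avg = to_linear(avg_nonlinear_domain)

# The same filter applied to linear light, then re-encoded.
avg_linear_domain = (to_linear(black) + to_linear(white)) / 2.0
code_from_linear_avg = to_nonlinear(avg_linear_domain)

print(round(light_from_nonlinear_avg, 3))  # 0.218
print(round(code_from_linear_avg, 3))      # 0.73
```

Under this assumed transfer function, averaging the code values yields about 22% of full linear light, whereas the physically averaged light level is 50%.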
Further information on linear versus non-linear representations and transformations may be found in “The Use of Logarithmic and Density Units for Pixels” by Gary Demos, presented at the October 1990 SMPTE conference, and published in the SMPTE Journal (October 1991, vol. 100, no. 10). See also “An Example Representation for Image Color and Dynamic Range which is Scalable, Interoperable, and Extensible” by Gary Demos, presented at the October 1993 SMPTE conference and published in the proceedings and preprints. These papers describe the benefits of logarithmic and linear spaces at various stages of the image compression processing pipeline, and are hereby incorporated by reference.
Chroma Sub-Sampling
The reason for reducing chroma resolution for U and V is that the human visual system is less sensitive to changes in U and V than it is to changes in luminance, Y. Since Y is mostly green, and U and V are mostly red and blue, respectively, this can also be described as human visual sensitivity being higher for green than for red and blue. However, although U and V are treated the same in MPEG-1, MPEG-2, and MPEG-4, the human visual system is more sensitive to U (with its red component) than to V (with its blue component).
This difference in chroma sensitivity is embodied in the 1951 NTSC-2 color standard that is used for television. NTSC-2 uses a YIQ color space, where I and Q are similar to U and V (with slightly different weightings). That is, the I channel primarily represents red minus luminance and the Q channel primarily represents blue minus luminance. In NTSC-2, the luminance is given 4.5 MHz of analog bandwidth, and the I chroma channel is given 1.5 MHz of analog bandwidth. The Q channel, representing the blue-yellow axis, is given only 0.5 MHz of analog bandwidth.
Thus, the NTSC-2 television system allocates three times as much information to the I channel as it does to the Q channel, and three times as much information to the Y luminance channel as to the I channel. Therefore, the bandwidth ratio between the Y luminance channel and the Q (blue minus luminance) channel is nine. These MPEG YUV and NTSC-2 relationships are summarized in the following table:
Ratio                        YUV 4:4:4   YUV 4:2:2   YUV 4:2:0   NTSC-2
Red, U, and I pixels to Y    1:1         2:1         4:1         3:1
Blue, V, and Q pixels to Y   1:1         2:1         4:1         9:1
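The ratios in the table follow from simple arithmetic, sketched below. The NTSC-2 bandwidths are the figures given in the text. The dictionary and function names are hypothetical, and the chroma decimation factors encode the half-resolution relationships described earlier for 4:2:2 and 4:2:0.

```python
# NTSC-2 analog bandwidths in MHz, as given in the text.
NTSC_BANDWIDTH_MHZ = {"Y": 4.5, "I": 1.5, "Q": 0.5}

# (horizontal, vertical) chroma decimation factors for each MPEG chroma format.
CHROMA_DECIMATION = {
    "4:4:4": (1, 1),  # full chroma resolution
    "4:2:2": (2, 1),  # half horizontal resolution
    "4:2:0": (2, 2),  # half horizontal and half vertical resolution
}


def luma_to_chroma_ratio(chroma_format):
    """Number of Y samples per U (or V) sample for an MPEG chroma format."""
    horizontal, vertical = CHROMA_DECIMATION[chroma_format]
    return horizontal * vertical


def ntsc_ratio(wide_channel, narrow_channel):
    """Bandwidth ratio between two NTSC-2 channels."""
    return NTSC_BANDWIDTH_MHZ[wide_channel] / NTSC_BANDWIDTH_MHZ[narrow_channel]


for chroma_format in CHROMA_DECIMATION:
    print(chroma_format, luma_to_chroma_ratio(chroma_format))

print("Y:I", ntsc_ratio("Y", "I"))  # 3.0
print("I:Q", ntsc_ratio("I", "Q"))  # 3.0
print("Y:Q", ntsc_ratio("Y", "Q"))  # 9.0
```

Note that the MPEG formats give U and V the same ratio to Y, while NTSC-2 gives the Q channel three times less bandwidth than the I channel.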