The availability of high speed digital devices and large, fast memories has made it possible to give practical expression to an old idea: utilizing a given bandwidth more efficiently for video transmission by transmitting only encoded digital signals representing the changes between successive frames, or groups of frames. To achieve high compression, one must resort not only to redundancy reduction but also to irrelevancy reduction, that is, coarse coding that exploits characteristics of human visual perception. Spatial limits in human vision have been exploited extensively in many systems, especially in adaptive quantization using the discrete cosine transform (e.g., in the DCT quantization matrix), and in other techniques such as subband coding and multiresolution representation. Temporal data reduction is based upon the recognition that there is high correlation between successive frames of video images. However, there has been very little work on applying temporal characteristics of human vision to image coding systems, except in the most basic ways such as determining a frame rate, e.g., 24-60 frames/sec. This is partly because temporal processing is expected to be more complex than spatial processing, and partly because of the difficulty of including the temporal dimension in a standard measure of perceptual quality for video sequences.
In a standard promulgated in November 1991 by the Moving Picture Experts Group (MPEG), identified as ISO-IEC/JTC1/SC2/WG12, the sequence of raw image data frames is divided into successive groups known as GOPs (groups of pictures). Each coded GOP is comprised of independent frames I, predicted frames P and bidirectionally predicted frames B, so that a GOP may be comprised, for example, as follows:
I, B, B, P, B, B, P, B, B, P, B, B, P, B, B.
The first P frame is derived from the previous I frame, while each remaining P frame is derived from the immediately preceding P frame. The I and P frames are reference frames. Since each B frame is derived from the closest reference frames on either side of it, the pertinent P frame must be derived before the B frames that precede it can be derived.
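The ordering constraint described above can be illustrated with a minimal sketch. It assumes a GOP is given as a list of frame-type letters in display order; the function names `coding_order` and `reference_spacing` are hypothetical and are not taken from the MPEG standard. Trailing B frames, whose following reference would be the I frame of the next GOP, are simply emitted last in this sketch.

```python
def coding_order(gop):
    """Return display-order indices rearranged into a derivation order.

    Each reference frame (I or P) is emitted first, followed by the
    B frames that precede it in display order, because those B frames
    need the reference frames on both sides of them.
    """
    order = []
    waiting_b = []  # B frames still waiting for their following reference
    for index, frame_type in enumerate(gop):
        if frame_type == "B":
            waiting_b.append(index)
        else:  # "I" or "P": a reference frame
            order.append(index)
            order.extend(waiting_b)
            waiting_b = []
    # Assumption: trailing B frames depend on the next GOP's I frame,
    # which is outside this list, so they are emitted at the end.
    order.extend(waiting_b)
    return order


def reference_spacing(gop):
    """Return the display-order distances between successive reference frames."""
    positions = [i for i, t in enumerate(gop) if t in ("I", "P")]
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    gop = "I B B P B B P B B P B B P B B".split()
    order = coding_order(gop)
    print(order)
    print(" ".join(gop[i] for i in order))
    print(reference_spacing(gop))
```

For the example GOP, the sketch places each P frame ahead of the two B frames that precede it in display order, and the spacing between successive reference frames comes out as three frames.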
Because frame differential encoding is used, the high definition independent frames I at the beginning of each GOP are required to avoid accumulated error. The purpose of quantizing is to control the number of bits used in representing the changes between the frames. The corresponding portions of frames are conveyed by motion vectors indicating where blocks in the reference frames may be located in the frame being derived. Since differences generally arise from motion, and there is likely to be more motion between P frames than between a P frame and a B frame, more bits are required to derive a P frame from a P frame than to derive a B frame from the P frames on either side of it.
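As an illustration of how a motion vector relates a block in the frame being derived to a block in a reference frame, the following is a minimal sketch of exhaustive block matching using the sum of absolute differences (SAD). This is a generic textbook technique chosen for illustration, not a method prescribed by the standard. Frames are assumed to be plain lists of rows of integer pixel values, the vector is assumed to point from the current block position to the matching position in the reference frame, and the names `make_frame`, `sad` and `find_motion_vector` are hypothetical.

```python
def make_frame(width, height, square_x, square_y, size=2, value=255):
    """Build a black test frame containing one bright square (illustrative only)."""
    frame = [[0] * width for _ in range(height)]
    for y in range(square_y, square_y + size):
        for x in range(square_x, square_x + size):
            frame[y][x] = value
    return frame


def sad(ref, cur, rx, ry, cx, cy, n):
    """Sum of absolute differences between an n x n block of ref and of cur."""
    return sum(
        abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
        for j in range(n)
        for i in range(n)
    )


def find_motion_vector(ref, cur, cx, cy, n=4, search=2):
    """Search ref around (cx, cy) for the block that best matches cur's block.

    Returns (cost, dx, dy), where (dx, dy) is the displacement from the
    current block position to the best match in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidate blocks that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


if __name__ == "__main__":
    reference = make_frame(8, 8, square_x=2, square_y=2)
    current = make_frame(8, 8, square_x=3, square_y=2)  # square moved right by 1
    print(find_motion_vector(reference, current, cx=2, cy=1))
```

When the match is exact, the residual cost is zero and only the vector needs to be conveyed; the larger the motion or change between frames, the larger the residual that remains to be quantized and coded.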
In typical MPEG systems, the three-frame spacing of reference frames, each of which uses a large number of bits, is required in order to adequately convey motion. If, however, there is little or no motion, the number of bits devoted to representing these frames is excessive.