Real-time transmission of moving pictures is employed in several applications, such as video conferencing, net meetings, TV broadcasting and video telephony.
However, representing moving pictures requires a large amount of information, as digital video is typically described by representing each pixel in a picture with 8 bits (1 byte). Such uncompressed video data results in large bit volumes and cannot be transferred over conventional communication networks and transmission lines in real time because of their limited bandwidth.
Thus, enabling real-time video transmission requires a high degree of data compression. Data compression may, however, compromise picture quality. Therefore, great efforts have been made to develop compression techniques that allow real-time transmission of high-quality video over bandwidth-limited data connections.
In video compression systems, the main goal is to represent the video information with as little capacity as possible. Capacity is measured in bits, either as a constant value or as bits per time unit. In both cases, the main goal is to reduce the number of bits.
Many video compression standards have been developed over the last 20 years. Many of those methods are standardized through ISO (the International Organization for Standardization) or ITU (the International Telecommunication Union). In addition, a number of proprietary methods have been developed. The main standardized methods are:
ITU: H.261, H.262, H.263, H.264
ISO: MPEG1, MPEG2, MPEG4/AVC
The video data undergo four main processes before transmission, namely prediction, transformation, quantization and entropy coding.
The prediction process significantly reduces the number of bits required for each picture in a video sequence to be transferred. It takes advantage of the similarity of parts of the sequence with other parts of the sequence. Since the predictor part is known to both encoder and decoder, only the difference has to be transferred. This difference typically requires much less capacity for its representation. The prediction is mainly based on picture content from previously reconstructed pictures, where the location of the content is defined by motion vectors.
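The idea of transferring only the difference can be sketched as follows. This is a minimal illustration, assuming blocks are given as lists of pixel rows and ignoring the transformation, quantization and entropy coding steps; the function names `residual` and `reconstruct` are hypothetical and are not taken from any standard.

```python
def residual(block, prediction):
    """Encoder side: per-pixel difference between a block and its prediction."""
    return [[b - p for b, p in zip(block_row, pred_row)]
            for block_row, pred_row in zip(block, prediction)]


def reconstruct(prediction, diff):
    """Decoder side: add the received difference back onto the same prediction."""
    return [[p + d for p, d in zip(pred_row, diff_row)]
            for pred_row, diff_row in zip(prediction, diff)]


block = [[52, 55], [61, 59]]        # block to be coded
prediction = [[50, 55], [60, 60]]   # predictor known to encoder and decoder
diff = residual(block, prediction)  # [[2, 0], [1, -1]]
assert reconstruct(prediction, diff) == block
```

The difference values are small compared with the pixel values themselves, which is why they typically need less capacity to represent.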
In a typical video sequence, the content of a present block M would be similar to a corresponding block in a previously decoded picture. If no changes have occurred since the previously decoded picture, the content of M would be equal to the block at the same location in the previously decoded picture. In other cases, an object in the picture may have moved so that the content of M is more similar to a block at a different location in the previously decoded picture. Such movements are represented by motion vectors (V). As an example, a motion vector of (3; 4) means that the content of M has moved 3 pixels to the left and 4 pixels upwards since the previously decoded picture.
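A motion vector can be read as an offset into the previously decoded picture. The sketch below assumes pictures are lists of pixel rows with row 0 at the top, so that content which has moved 3 pixels left and 4 pixels up is found 3 columns to the right and 4 rows further down in the previous picture. The name `predict_block` is hypothetical, and no bounds checking is done.

```python
def predict_block(reference, top, left, size, mv):
    """Fetch the size x size block that the motion vector points to.

    mv = (dx, dy): the content now at column `left`, row `top` is assumed to
    have been at column left + dx, row top + dy in the reference picture.
    """
    dx, dy = mv
    return [row[left + dx:left + dx + size]
            for row in reference[top + dy:top + dy + size]]


# 8x8 reference picture whose pixel value encodes its position (10*row + col).
reference = [[10 * r + c for c in range(8)] for r in range(8)]

# Block M sits at row 1, column 2; vector (3; 4) points to row 5, column 5.
prediction = predict_block(reference, top=1, left=2, size=2, mv=(3, 4))
assert prediction == [[55, 56], [65, 66]]
```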
A motion vector associated with a block is determined by executing a motion search. The search is carried out by successively comparing the content of the block with blocks at different spatial offsets in previous pictures. The offset, relative to the present block, of the comparison block that best matches the present block is taken as the associated motion vector.
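One way to carry out such a search is an exhaustive comparison over a small window. The sketch below is illustrative only: the text does not specify the matching criterion, so the sum of absolute differences (SAD) is an assumption, as are the names `sad`, `get_block` and `motion_search` and the `search_range` parameter.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def get_block(picture, top, left, size):
    return [row[left:left + size] for row in picture[top:top + size]]


def motion_search(current, reference, top, left, size, search_range):
    """Try every integer offset within +/- search_range; keep the best match."""
    height, width = len(reference), len(reference[0])
    target = get_block(current, top, left, size)
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = top + dy, left + dx
            # Skip candidates that fall outside the reference picture.
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue
            cost = sad(target, get_block(reference, r, c, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Reference with unique pixel values; the current picture shows the same
# content moved 1 pixel left and 2 pixels up (uncovered pixels set to 0).
reference = [[8 * r + c for c in range(8)] for r in range(8)]
current = [[reference[r + 2][c + 1] if r + 2 < 8 and c + 1 < 8 else 0
            for c in range(8)] for r in range(8)]

assert motion_search(current, reference, top=2, left=2, size=2,
                     search_range=2) == ((1, 2), 0)
```

An exhaustive search like this is simple but costly; practical encoders usually restrict or reorder the candidates, which is outside the scope of this sketch.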
In H.262, H.263, MPEG1 and MPEG2, the same concept is extended so that motion vectors can also take ½ pixel values. A vector component of 5.5 then implies that the motion is midway between 5 and 6 pixels. More specifically, the prediction is obtained by taking the average of the pixel representing a motion of 5 and the pixel representing a motion of 6. This is called a 2-tap filter because it operates on 2 pixels to obtain the prediction of a pixel in between. Motion vectors of this kind are often referred to as having fractional pixel resolution, or as fractional motion vectors. All filter operations can be defined by an impulse response. The operation of averaging 2 pixels can be expressed with an impulse response of (½, ½). Similarly, averaging over 4 pixels implies an impulse response of (¼, ¼, ¼, ¼).
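The half-pixel case can be illustrated in one dimension. In this sketch, positions are given in half-pixel units (so 11 stands for 5.5), and the integer rounding that real codecs apply to the average is deliberately left out. The names `apply_taps` and `sample_half_pel` are hypothetical.

```python
def apply_taps(samples, taps):
    """Weighted sum of samples with the filter's impulse response."""
    return sum(s * t for s, t in zip(samples, taps))


def sample_half_pel(row, half_pel_position):
    """Prediction sample at a position expressed in half-pixel units."""
    left = half_pel_position // 2
    if half_pel_position % 2 == 0:
        return row[left]                              # integer position
    return apply_taps(row[left:left + 2], (0.5, 0.5)) # 2-tap filter


row = [10, 20, 30, 40, 50, 60, 80]

assert sample_half_pel(row, 10) == 60      # motion of exactly 5 pixels
assert sample_half_pel(row, 11) == 70.0    # 5.5: average of 60 and 80

# Averaging over 4 pixels: impulse response (1/4, 1/4, 1/4, 1/4).
assert apply_taps([10, 20, 30, 40], (0.25, 0.25, 0.25, 0.25)) == 25.0
```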
The different frames are typically classified based on the respective coding methods that are being used in the coding and decoding of each frame. There are three different frame types being referred to in the MPEG standards—I-frames, B-frames and P-frames. An I-frame is encoded as a single image, with no reference to any past or future frames.
A P-frame is encoded relative to the past reference frame. A reference frame in this context is a P- or I-frame. The past reference frame is the closest preceding reference frame. Each macroblock in a P-frame can be encoded either as an I-macroblock or as a P-macroblock. An I-macroblock is encoded just like a macroblock in an I-frame. A P-macroblock is encoded using a prediction based on the past reference frame, plus an error term. To specify the prediction based on the reference frame, one or more motion vectors are included.
A B-frame—as defined in e.g. MPEG1/2—is encoded relative to the past reference frame, the future reference frame, or both frames. The future reference frame is the closest following reference frame (I or P). The encoding for B-frames is similar to P-frames, except that motion vectors may refer to areas in the future reference frames.
The oldest of the standards mentioned above, H.261, used simple forward prediction, as illustrated in FIG. 1: prediction was made frame by frame in temporal order, using only the most recently reconstructed frame.
The concept of B-frames and bidirectional coding was introduced in MPEG1 and MPEG2. This is illustrated in FIG. 2. In bidirectional coding, the coding order and the temporal order are not necessarily the same. That is, a B-frame can be predicted based on both past and future frames relative to the B-frame. Predictions for a block to be coded may also use data from more than one previously reconstructed frame.
Both these aspects are illustrated in FIG. 2, which shows a sequence of alternating P- and B-frames. The coding order in this example would be: 1p-3p-2b-5p-4b. Predictions can only be derived from P-frames, and P-frames are predicted from the previous P-frame only. B-frames may be predicted from the previous P-frame, from the (temporally) next P-frame, or from both. This is possible because both the preceding and the following P-frame are coded and reconstructed before the B-frame is predicted and coded.
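The reordering in this example can be reproduced with a short sketch: each B-frame is held back until the next P-frame in temporal order has been coded. The function name `coding_order` and the (number, type) tuple layout are assumptions made for illustration.

```python
def coding_order(display_order):
    """Reorder frames from temporal order to coding order.

    display_order: list of (frame_number, frame_type) tuples, where
    frame_type is "P" or "B". Each B-frame is emitted right after the next
    P-frame, so both of its references are already coded.
    """
    coded, pending_b = [], []
    for frame in display_order:
        if frame[1] == "B":
            pending_b.append(frame)
        else:
            coded.append(frame)
            coded.extend(pending_b)
            pending_b = []
    # Trailing B-frames have no future reference here; they are appended
    # as-is, and a real encoder would have to handle this case explicitly.
    coded.extend(pending_b)
    return coded


temporal = [(1, "P"), (2, "B"), (3, "P"), (4, "B"), (5, "P")]
order = coding_order(temporal)
assert [f"{n}{t.lower()}" for n, t in order] == ["1p", "3p", "2b", "5p", "4b"]
```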
In H.264/MPEG4-AVC, the coding order and prediction structure are defined to be even more general. For example, the coding order may be defined almost arbitrarily, and the prediction of a block is typically limited to using pixels from two previously decoded frames. However, the two frames need not be exactly one before and one after the frame to be coded.
One special feature of H.264/MPEG4-AVC is a so-called “direct” prediction mode. In this mode, the motion vectors used to code a block in a B-frame are obtained from already known vectors. Two cases are typically used. The first is often referred to as “temporal direct”. Here, the motion vector of a block in a P-frame is used to derive suitably downscaled motion vectors to predict the collocated block in the B-frame from two frames. The downscaling depends on the time position of the B-frame relative to the two frames. The second type is referred to as “spatial direct”. Two motion vectors to predict a block in a B-frame from two frames are produced in a similar way. However, the motion vectors are adapted from motion vectors of previously coded blocks in the B-frame. The main benefit of spatial direct is that motion vectors from the P-frame need not be saved.
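The downscaling in the temporal direct case can be sketched as a linear scaling by temporal distance. This is a simplified illustration under the assumption of constant motion between the two frames; it uses floating-point values rather than the integer arithmetic of the standard, and the function name `temporal_direct` and its time parameters are hypothetical.

```python
def temporal_direct(ref_mv, t_past, t_b, t_future):
    """Derive two B-frame vectors from one P-frame vector.

    ref_mv:   vector of the collocated block in the future P-frame,
              pointing to the past frame (spans t_past .. t_future).
    t_b:      time position of the B-frame, t_past < t_b < t_future.
    Returns (vector towards the past frame, vector towards the future frame).
    """
    span = t_future - t_past
    dist = t_b - t_past
    mv_past = (ref_mv[0] * dist / span, ref_mv[1] * dist / span)
    # The remainder of the motion, with opposite sign, points to the future frame.
    mv_future = (mv_past[0] - ref_mv[0], mv_past[1] - ref_mv[1])
    return mv_past, mv_future


# B-frame midway between the two frames: the vector is split in half.
assert temporal_direct((4, -2), 0, 1, 2) == ((2.0, -1.0), (-2.0, 1.0))

# B-frame one quarter of the way: a quarter goes to the past frame.
assert temporal_direct((8, 4), 0, 1, 4) == ((2.0, 1.0), (-6.0, -3.0))
```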
Generally, the flexible ordering and prediction from multiple frames may result in more efficient video coding. On the other hand, the complexity of the prediction process typically also increases considerably.
In the following, the vectors used for downscaling are called “reference vectors”. These are vectors from a P-frame for “temporal direct” and vectors of previously coded blocks in the B-frame for “spatial direct”.