Transmission of moving pictures in real-time is employed in several applications such as video conferencing, net meetings, TV broadcasting and video telephony.
However, representing moving pictures requires a large amount of information, as digital video is typically described by representing each pixel in a picture with 8 bits (1 byte). Such uncompressed video data results in large bit volumes and cannot be transferred over conventional communication networks and transmission lines in real time due to limited bandwidth.
Thus, enabling real-time video transmission requires a high degree of data compression. Data compression may, however, compromise the picture quality. Therefore, great efforts have been made to develop compression techniques allowing real-time transmission of high-quality video over bandwidth-limited data connections.
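As a rough illustration of the bit volumes involved, the raw bit rate can be estimated from the picture size and picture rate. The 352×288 format and 30 pictures per second used here are assumptions chosen only for the example, and the function name is hypothetical; the 8 bits per pixel follows the text.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=8):
    """Return the uncompressed bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

# Assumed example format: 352x288 pixels at 30 pictures per second.
bits_per_second = raw_bit_rate(352, 288, 30)
print(f"{bits_per_second / 1e6:.1f} Mbit/s")  # prints "24.3 Mbit/s"
```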
Many video compression standards have been developed over the last 20 years. Many of those methods are standardized through ISO (the International Organization for Standardization) or ITU (the International Telecommunication Union). In addition, a number of proprietary methods have been developed. The main standardized methods are:
ITU: H.261, H.262, H.263, H.264
ISO: MPEG1, MPEG2, MPEG4/AVC
In video compression systems, the main goal is to represent the video information with as little capacity as possible. Capacity is measured in bits, either as a fixed total or as bits per time unit. In both cases, the main goal is to reduce the number of bits.
The first step in the coding process according to these standards is to divide the picture into square blocks of pixels, for instance 16×16. These blocks are typically denoted as Macroblocks (MB). This is done for luminance information as well as for chrominance information. A scanning order of the MBs is established. A scanning order defines the encoding/decoding order of the MBs in a picture. A raster scan is typically used. This means that MBs are scanned as MB-lines from left to right and then the MB-lines from top to bottom. A raster scan order is illustrated in FIG. 1.
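The raster scan order described above can be sketched as follows. This is a minimal illustration, not taken from any of the standards: the function name, the (index, x, y) tuple layout and the 64×32 picture size are assumptions made for the example.

```python
def macroblock_raster_scan(width, height, mb_size=16):
    """List (mb_index, x, y) for every macroblock in raster-scan order.

    x and y are the pixel coordinates of the macroblock's upper left corner.
    """
    mbs_per_line = (width + mb_size - 1) // mb_size   # round up for partial MBs
    mb_lines = (height + mb_size - 1) // mb_size
    order = []
    for line in range(mb_lines):            # MB-lines from top to bottom
        for col in range(mbs_per_line):     # MBs from left to right within a line
            order.append((line * mbs_per_line + col, col * mb_size, line * mb_size))
    return order

# Assumed example: a 64x32 picture gives 2 MB-lines of 4 macroblocks each.
scan = macroblock_raster_scan(64, 32)
for mb_index, x, y in scan:
    print(mb_index, x, y)
```

The first macroblock of the second MB-line (index 4) starts at pixel position (0, 16), directly after the last macroblock of the first line.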
The subsequent prediction process significantly reduces the number of bits required for each picture in a video sequence to be transferred. It takes advantage of the similarity of parts of the sequence with other parts of the sequence, and produces a prediction for the pixels in the block. This may be based on pixels in an already coded/decoded picture (called inter prediction) or on already coded/decoded pixels in the same picture (intra prediction).
The prediction is mainly based on vectors representing movements. In a typical video sequence, the content of a present block M would be similar to a corresponding block in a previously decoded picture. If no changes have occurred since the previously decoded picture, the content of M would be equal to the block at the same location in the previously decoded picture. In other cases, an object in the picture may have moved so that the content of M is more similar to a block at a different location in the previously decoded picture. Such movements are represented by motion vectors (V). As an example, a motion vector of (3;4) means that the content of M has moved 3 pixels to the left and 4 pixels upwards since the previously decoded picture. For improved accuracy, the vector may also include decimals, requiring interpolation between the pixels. Since the predictor part is known to both encoder and decoder, only the difference has to be transferred. This difference typically requires much less capacity for its representation. The difference between the pixels to be coded and the predicted pixels is often referred to as a residual.
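A minimal sketch of inter prediction with a whole-pixel displacement may make the relationship between prediction and residual concrete. The function names, the tiny 8×8 reference picture and the displacement (dx, dy) are hypothetical; the displacement here is simply an offset into the reference picture and does not follow the sign convention of any particular standard. Fractional vectors, which need interpolation, are left out.

```python
def predict_block(reference, x, y, dx, dy, size):
    """Copy the size x size block at (x + dx, y + dy) from the reference picture."""
    return [[reference[y + dy + r][x + dx + c] for c in range(size)]
            for r in range(size)]

def residual(current, prediction):
    """Pixel-wise difference between the block to be coded and its prediction."""
    return [[cur - pred for cur, pred in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, prediction)]

# Assumed 8x8 reference picture with pixel value = 8 * row + column.
reference = [[8 * r + c for c in range(8)] for r in range(8)]

# Current 2x2 block located at x=2, y=2; it resembles the reference
# block one pixel to the left and one pixel down, except for one pixel.
current_block = [[25, 27], [33, 34]]

prediction = predict_block(reference, x=2, y=2, dx=-1, dy=1, size=2)
diff = residual(current_block, prediction)
print(prediction)  # [[25, 26], [33, 34]]
print(diff)        # [[0, 1], [0, 0]]
```

Only the displacement and the mostly-zero residual would need to be transferred, since the decoder can form the same prediction from its own copy of the reference picture.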
The residual represented as a block of data (e.g. 4×4 pixels) still contains internal correlation. A well-known method of taking advantage of this is to perform a two dimensional block transform. In H.263 an 8×8 Discrete Cosine Transform (DCT) is used, whereas H.264 uses an N×N (where N can be 4 or 8) integer type transform. This transforms N×N pixels into N×N transform coefficients, which can usually be represented by fewer bits than the pixel representation. Transforming a 4×4 array of pixels with internal correlation will typically result in a 4×4 block of transform coefficients with far fewer non-zero values than the original 4×4 pixel block.
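The energy-compacting effect of a block transform can be illustrated with a textbook floating-point DCT. This sketch is an assumption made for illustration: it uses an orthonormal DCT-II rather than the integer transform of H.264, and the 4×4 block with a smooth horizontal gradient is invented for the example.

```python
import math

def dct_1d(values):
    """Orthonormal one-dimensional DCT-II."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out

def dct_2d(block):
    """Two-dimensional DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]   # transpose back to row-major

# Assumed 4x4 block with strong internal correlation (a horizontal gradient).
block = [[10, 12, 14, 16] for _ in range(4)]
coefficients = dct_2d(block)

# Count coefficients that are non-zero beyond floating-point noise.
significant = sum(1 for row in coefficients for c in row if abs(c) > 1e-9)
print(significant)  # 3 of the 16 coefficients, all in the top row
```

All 16 pixels of the input are non-zero, whereas only three coefficients carry information after the transform, with the DC coefficient in the upper left corner.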
Direct representation of the transform coefficients is still too costly for many applications. The transform coefficients therefore undergo quantization to further reduce the data representation. A simple version of quantization is to divide parameter values by a number, resulting in a smaller number that may be represented by fewer bits. This is the major tool for controlling the bit production and reconstructed picture quality. Because of quantization, the reconstructed video sequence is somewhat different from the uncompressed sequence. This phenomenon is referred to as “lossy coding”, and it means that the reconstructed pictures typically have lower quality than the original pictures. The output from the quantization process is a set of integers that do not represent the original transform coefficients exactly. These integers, together with integers representing the side information, are coded in a lossless way and transmitted to the decoder.
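The simple divide-and-round form of quantization mentioned above can be sketched as follows. The step size of 4, the coefficient values and the function names are assumptions for the example; practical codecs use more elaborate scaling.

```python
import math

def quantize(coefficients, step):
    """Divide each coefficient by the step size and round to the nearest integer."""
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c))
            for c in coefficients]

def dequantize(levels, step):
    """Decoder side: scale the integer levels back up by the step size."""
    return [level * step for level in levels]

coefficients = [52.0, -6.3, 0.4, -0.3]   # assumed transform coefficients
levels = quantize(coefficients, step=4)
reconstructed = dequantize(levels, step=4)
print(levels)         # [13, -2, 0, 0]
print(reconstructed)  # [52, -8, 0, 0]
```

The reconstructed value -8 differs from the original -6.3, and the two small coefficients are lost entirely, which is the lossy effect described above. A larger step size gives smaller integers and fewer bits at the cost of a larger reconstruction error.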
Finally, the two dimensional transform coefficient data within the MBs is scanned into a one dimensional set of data, and the one dimensional set is further transformed according to an entropy coding scheme. Entropy coding implies lossless representation of the quantized transform coefficients. In depicting the transform coefficients it is common to position the lowest-frequency coefficient (or DC coefficient) in the upper left corner. The horizontal and vertical spatial frequencies then increase to the right and downwards. The scanning usually starts with the coefficient in the upper left corner and follows a zig-zag pattern around the diagonal direction towards the lower right corner of the MB, but in other cases the entropy coding may be more efficient if “inverse scanning” (high to low frequency) is used.
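A zig-zag scan of this kind can be sketched for an N×N block as follows. The function names and the example block of quantized coefficients are hypothetical, and the exact scan tables used by a given standard may differ from this generic pattern.

```python
def zigzag_order(n):
    """Return (row, col) positions of an n x n block in zig-zag order.

    Positions are grouped by anti-diagonal (row + col); odd diagonals are
    walked downwards and even diagonals upwards.
    """
    positions = [(r, c) for r in range(n) for c in range(n)]
    return sorted(positions,
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def zigzag_scan(block):
    """Flatten a square block into a one dimensional list, low frequencies first."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

# Assumed block of quantized coefficients, DC coefficient in the upper left corner.
quantized = [
    [13, -2, 0, 0],
    [ 1,  0, 0, 0],
    [ 0,  0, 0, 0],
    [ 0,  0, 0, 0],
]
scanned = zigzag_scan(quantized)
print(scanned)  # [13, -2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The scan gathers the non-zero low-frequency coefficients at the start of the list and leaves a long run of zeros at the end, which an entropy coder can represent compactly.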
The above steps are listed in a natural order for the encoder. The decoder will to some extent perform the operations in the opposite order and perform “inverse” operations, such as inverse transform instead of transform and de-quantization instead of quantization.
In connection with the introduction of video formats of higher resolution in video conferencing, an increasing number of pixels will represent a picture of the same physical segment. Hence, an image section will contain an increasing number of 16×16 MBs. The probability of many adjacent MBs having the same characteristics, e.g. motion vectors and zero-transforms, is therefore increasing, and consequently so is the redundancy in the data representation.