Transmission of moving pictures in real time is employed in several applications, e.g. video conferencing, net meetings, TV broadcasting and video telephony.
Digital video is typically described by representing each pixel in a picture with 8 bits (1 Byte) or more. Such uncompressed video data results in large bit volumes, and cannot be transferred over conventional communication networks and transmission lines in real time due to bandwidth limitations.
Thus, enabling real-time video transmission requires a high degree of data compression. Data compression may, however, compromise picture quality. Therefore, great efforts have been made to develop compression techniques allowing real-time transmission of high-quality video over bandwidth-limited data connections.
In video compression systems, the main goal is to represent the video information with as little capacity as possible. Capacity is defined in bits, either as a constant value or as bits per time unit. In both cases, the main goal is to reduce the number of bits. For delay and processor resource concerns, it is also important to keep the processing time and resource consumption to a minimum.
The most common video coding methods are described in the MPEG* and H.26* standards. The video data undergo four main processes before transmission, namely prediction, transformation, quantization and entropy coding.
The prediction process significantly reduces the number of bits required for each picture in a video sequence to be transferred. It takes advantage of the similarity of parts of the sequence with other parts of the sequence. Since the predictor part is known to both encoder and decoder, only the difference has to be transferred. This difference typically requires much less capacity for its representation. The prediction is mainly based on vectors representing movements. The prediction process is typically performed on square blocks (e.g. 16×16 pixels). Encoders based on motion vectors are often referred to as motion-based encoders.
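The motion-based prediction described above can be illustrated with a minimal block-matching sketch. It is not taken from any particular standard or encoder: the exhaustive search, the sum of absolute differences (SAD) as matching cost, the 4×4 block size, the ±2 pixel search range and all function names are assumptions chosen to keep the example small.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a block in the current frame
    and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def best_motion_vector(cur, ref, bx, by, n=4, search=2):
    """Exhaustive search for the displacement with the lowest SAD.
    Candidates that would fall outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def residual(cur, ref, bx, by, dx, dy, n=4):
    """Difference between the current block and its prediction; this is
    what would be passed on to the transform stage."""
    return [[cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx]
             for x in range(n)] for y in range(n)]


# Synthetic 8x8 frames: the current frame is the reference frame with its
# content moved one pixel to the right.
ref = [[10 * x + y for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]

cost, dx, dy = best_motion_vector(cur, ref, bx=4, by=4)
res = residual(cur, ref, 4, 4, dx, dy)
print((cost, dx, dy))  # (0, -1, 0)
```

In this synthetic case the search finds the vector (-1, 0) and the residual block is all zeros, so only the vector would have to be transferred for that block.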
Note that in some cases, as in H.264/AVC, predictions of pixels based on adjacent pixels in the same picture, rather than on pixels of preceding pictures, are used. This is referred to as intra prediction, as opposed to inter prediction.
The residual, represented as a block of data (e.g. 4×4 pixels), still contains internal correlation. A well-known method of taking advantage of this is to perform a two-dimensional block transform. In H.263 an 8×8 Discrete Cosine Transform (DCT) is used, whereas H.264 uses a 4×4 integer-type transform. This transforms 4×4 pixels into 4×4 transform coefficients, which can usually be represented by fewer bits than the pixel representation. Transform of a 4×4 array of pixels with internal correlation will most probably result in transform coefficients much more suited for further compression than the original 4×4 pixel block.
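As an illustration of how a block transform concentrates correlated data into few coefficients, the sketch below applies the 4×4 integer core transform matrix commonly given for H.264 as Y = C·X·Cᵀ. The normalisation factors, which H.264 folds into the quantization step, are omitted here, and the helper names and the two input blocks are assumptions made for the example.

```python
# Forward 4x4 integer core transform matrix (scaling omitted).
C = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]


def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(block):
    """Y = C * X * C^T, computed in integer arithmetic only."""
    return matmul(matmul(C, block), transpose(C))


def nonzero_count(m):
    return sum(1 for row in m for v in row if v != 0)


flat = [[10] * 4 for _ in range(4)]        # constant block
ramp = [[x for x in range(4)] for _ in range(4)]  # horizontal gradient

flat_coeffs = forward_transform(flat)
ramp_coeffs = forward_transform(ramp)

print(flat_coeffs[0][0], nonzero_count(flat_coeffs))  # 160 1
print(ramp_coeffs[0], nonzero_count(ramp_coeffs))     # [24, -28, 0, -4] 3
```

Both input blocks have 16 samples, yet the constant block yields a single nonzero coefficient and the gradient block yields three, which is the property that the subsequent quantization and entropy coding exploit.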
A macroblock is a part of the picture consisting of several sub-blocks for luminance (luma) as well as for chrominance (chroma).
There are typically two chrominance components (Cr, Cb) with half the resolution both horizontally and vertically compared with luminance.
This format is in some contexts denoted as YUV 4:2:0. The abbreviation is not very self-explanatory. It means that the chrominance has half the resolution of luminance horizontally as well as vertically. For the conventional video format CIF, this means that a luminance frame has 352×288 samples whereas each of the chrominance components has 176×144 samples.
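The sample counts above, and the resulting volume of uncompressed data, can be checked with a short calculation. The frame rate of 30 frames per second and the function name are assumptions for illustration; 8 bits per sample follows the figure given earlier.

```python
def yuv420_frame_bytes(width, height, bits_per_sample=8):
    """Size of one uncompressed 4:2:0 frame: one full-resolution luma plane
    plus two chroma planes at half resolution in both directions."""
    luma_samples = width * height
    chroma_samples = (width // 2) * (height // 2)
    total_samples = luma_samples + 2 * chroma_samples
    return total_samples * bits_per_sample // 8


CIF_WIDTH, CIF_HEIGHT = 352, 288
FRAMES_PER_SECOND = 30  # assumed frame rate, not stated in the text

frame_bytes = yuv420_frame_bytes(CIF_WIDTH, CIF_HEIGHT)
raw_bits_per_second = frame_bytes * 8 * FRAMES_PER_SECOND

print(frame_bytes)          # 152064
print(raw_bits_per_second)  # 36495360
```

Under these assumptions, even the modest CIF format requires roughly 36.5 Mbit/s without compression.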
This is in contrast to, for instance, RGB (red, green, blue), which is typically the representation used in the camera sensor and the monitor display. FIG. 1 illustrates a typical denotation and grouping of pixels in a macroblock for luminance and chrominance, respectively. The macroblock consists of 16×16 luminance pixels and two chrominance components with 8×8 pixels each. Each of the components is here further broken down into 4×4 blocks, which are represented by the small squares. For coding purposes, both luma and chroma 4×4 blocks are grouped together in 8×8 sub-blocks and designated Y0-Y3 and Cr, Cb.
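The grouping described above can be written out as a small data structure. The raster ordering of Y0-Y3 (left to right, then top to bottom) and of the 4×4 blocks inside each sub-block is an assumption, since the text does not state it, and the function name is hypothetical.

```python
BLOCK = 4      # side of the smallest block, in pixels
SUB_BLOCK = 8  # side of a sub-block, in pixels


def sub_block_layout(component_size):
    """Map each 8x8 sub-block index to the (x, y) origins of its 4x4 blocks.
    Raster order is assumed for both sub-blocks and 4x4 blocks."""
    layout = {}
    per_row = component_size // SUB_BLOCK
    for sy in range(per_row):
        for sx in range(per_row):
            index = sy * per_row + sx
            layout[index] = [
                (sx * SUB_BLOCK + bx, sy * SUB_BLOCK + by)
                for by in range(0, SUB_BLOCK, BLOCK)
                for bx in range(0, SUB_BLOCK, BLOCK)
            ]
    return layout


luma = sub_block_layout(16)   # 16x16 luminance pixels -> Y0..Y3
chroma = sub_block_layout(8)  # 8x8 pixels per chrominance component

macroblock = {f"Y{i}": blocks for i, blocks in luma.items()}
macroblock["Cr"] = list(chroma[0])
macroblock["Cb"] = list(chroma[0])

total_4x4_blocks = sum(len(blocks) for blocks in macroblock.values())
print(sorted(macroblock))  # ['Cb', 'Cr', 'Y0', 'Y1', 'Y2', 'Y3']
print(total_4x4_blocks)    # 24
```

With this layout a macroblock comprises six 8×8 sub-blocks and 24 blocks of 4×4 pixels in total: 16 for luminance and 4 for each chrominance component.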
In digital video applications such as video conferencing, large parts of the image often do not change considerably between consecutive frames. From the perspective of a motion-based encoder as described above, this means that many macroblocks often do not differ considerably from their reference macroblocks, i.e. the previous macroblocks. Thus, the motion vectors for those blocks are zero. Yet, for such an encoder to conclude that a macroblock is unchanged, or that it has indeed changed but by so small an amount that the residual after motion compensation falls below the quantization threshold, it still has to read all the data of the new macroblock and compare it to a reference macroblock.
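The cost of this check can be made concrete with a minimal sketch of a zero-motion comparison. The use of a sum of absolute differences, the threshold value of 64 and the function names are assumptions; an actual encoder would derive its decision from the quantization settings in use.

```python
MB = 16  # macroblock side in luminance pixels


def macroblock_sad(new_frame, ref_frame, mbx, mby):
    """Compare a macroblock with the co-located reference macroblock.
    Returns the sum of absolute differences and the number of pixel reads."""
    total = 0
    reads = 0
    for y in range(mby * MB, (mby + 1) * MB):
        for x in range(mbx * MB, (mbx + 1) * MB):
            total += abs(new_frame[y][x] - ref_frame[y][x])
            reads += 2  # one read from each frame buffer
    return total, reads


def is_unchanged(new_frame, ref_frame, mbx, mby, threshold):
    """Treat the macroblock as unchanged when the difference is small.
    The threshold is a hypothetical stand-in for the quantization limit."""
    total, reads = macroblock_sad(new_frame, ref_frame, mbx, mby)
    return total <= threshold, reads


# Synthetic frames that are two macroblocks wide; only the second one changes.
ref_frame = [[128] * 32 for _ in range(16)]
new_frame = [row[:] for row in ref_frame]
for y in range(4):
    for x in range(16, 20):
        new_frame[y][x] = 200

results = [is_unchanged(new_frame, ref_frame, mbx, 0, threshold=64)
           for mbx in range(2)]
print(results)  # [(True, 512), (False, 512)]
```

Both macroblocks cost the same 512 pixel reads, even though the first one turns out to be unchanged, which is the overhead the paragraph above points to.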
FIG. 1 is a block diagram of a typical frame buffer arrangement inside a camera between the camera sensor and the encoder. The sensor consecutively feeds an image processing part with raw pixel data. The most recent frame resides at all times in the new frame buffer, and the preceding frame is stored in the reference frame buffer.
As an encoder would typically store both data for the current frame and the reference frame in off-chip memory, it would be advantageous if the encoder knew beforehand that a particular macroblock is similar to the reference macroblock. Then it would not have to do the comparison with the reference macroblock, and more importantly it would not have to access these blocks of the new frame buffer in the camera. For cache-based encoders, less cache thrashing would also occur.
Also, for a real-time encoder implementation it might be advantageous for the encoder to know how many of the blocks have changed compared to the reference frame without having to read the actual data. With that information, the encoder can more easily optimize quality within its given processing power limitations.
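One way such a count could be used is sketched below. The per-macroblock flags are assumed to be supplied to the encoder by an earlier stage; how they are produced is outside this sketch, and the cycle budget, the even split and the function name are purely hypothetical.

```python
def plan_effort(changed_flags, cycle_budget):
    """Split a fixed per-frame processing budget evenly over the macroblocks
    flagged as changed. Returns (changed_count, cycles_per_changed_block)."""
    changed = sum(1 for flag in changed_flags if flag)
    if changed == 0:
        return 0, 0
    return changed, cycle_budget // changed


# A CIF frame holds (352 // 16) * (288 // 16) = 396 macroblocks.
flags = [False] * 300 + [True] * 96  # assumed flags for one frame
changed, cycles_per_block = plan_effort(flags, cycle_budget=960000)
print(changed, cycles_per_block)  # 96 10000
```

The fewer macroblocks that are flagged as changed, the more processing can be spent on each of them, for example on a wider motion search, without exceeding the budget for the frame.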