Transmitting video requires coding it in a form suited to the transmission channel. Typically, this involves effective compression, because the stream of pictures that constitutes a video contains a vast amount of information.
ITU-T H.263 is an International Telecommunication Union (ITU) video coding recommendation which specifies the bit-stream syntax and the decoding of a bit-stream. In this standard, pictures are coded using a luminance component and two colour difference (chrominance) components (Y, CB and CR). The chrominance components are each sampled at half resolution along both co-ordinate axes compared to the luminance component.
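As an illustration of the half-resolution chrominance sampling described above, the following minimal sketch computes the plane sizes of one picture. The function names are hypothetical, the 176×144 picture size is only an example input, and even picture dimensions are assumed.

```python
def plane_sizes(width, height):
    """Return ((luma_w, luma_h), (chroma_w, chroma_h)) for one picture.

    Assumes chrominance is sampled at half resolution along both axes,
    as described in the text, and that both dimensions are even.
    """
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even picture dimensions")
    luma = (width, height)
    chroma = (width // 2, height // 2)  # applies to each of CB and CR
    return luma, chroma


def samples_per_picture(width, height):
    """Total number of samples in Y, CB and CR together."""
    (lw, lh), (cw, ch) = plane_sizes(width, height)
    return lw * lh + 2 * cw * ch


if __name__ == "__main__":
    print(plane_sizes(176, 144))
    print(samples_per_picture(176, 144))
```

With this sampling, the two chrominance planes together hold half as many samples as the luminance plane.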
Each coded picture, as well as the corresponding coded bit stream, is arranged in a hierarchical structure with four layers being, from top to bottom, a picture layer, a picture segment layer, a macroblock (MB) layer and a block layer. The picture segment layer can be either a group of blocks layer or a slice layer.
The picture layer data contains parameters affecting the whole picture area and the decoding of the picture data. By default, each picture is divided into groups of blocks. A group of blocks (GOB) typically comprises a row of macroblocks (16 consecutive pixel lines) or a multiple thereof. Data for each GOB consists of an optional GOB header followed by data for MBs. As an alternative to GOBs, so-called slices can be used, in which case each picture is divided into slices instead of GOBs. Data for each slice consists of a slice header followed by data for MBs.
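The division of a picture into GOBs can be sketched as follows. This is an illustrative helper only: the name gob_line_ranges and the parameter mb_rows_per_gob are assumptions, and the picture height is assumed to be a multiple of 16 lines.

```python
MB_SIZE = 16  # one macroblock row spans 16 consecutive pixel lines


def gob_line_ranges(picture_height, mb_rows_per_gob=1):
    """Return (first_line, last_line) for each GOB, from top to bottom.

    mb_rows_per_gob is the number of macroblock rows per GOB (one row,
    or a multiple thereof). The last GOB may be shorter when the rows
    do not divide evenly.
    """
    if picture_height <= 0 or picture_height % MB_SIZE:
        raise ValueError("this sketch assumes a height that is a multiple of 16")
    if mb_rows_per_gob < 1:
        raise ValueError("a GOB holds at least one macroblock row")
    lines_per_gob = MB_SIZE * mb_rows_per_gob
    ranges = []
    for top in range(0, picture_height, lines_per_gob):
        bottom = min(top + lines_per_gob, picture_height) - 1
        ranges.append((top, bottom))
    return ranges


if __name__ == "__main__":
    for number, lines in enumerate(gob_line_ranges(144)):
        print(number, lines)
```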
The slices define regions within a coded picture. Each region is a number of MBs in a normal scanning order. There are no prediction dependencies across slice boundaries within the same coded picture. However, temporal prediction can generally cross slice boundaries unless ITU-T H.263 Annex R (Independent Segment Decoding) is used. Slices can be decoded independently from the rest of the picture data (except for the picture header). Consequently, slices improve error resilience in packet-lossy networks.
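The notion of a slice as a run of MBs in scanning order can be illustrated with a small sketch. The representation used here (a list of starting macroblock addresses) and the function names are assumptions made for illustration; they are not the H.263 slice header syntax.

```python
def slice_map(mb_cols, mb_rows, slice_starts):
    """Label each macroblock with the index of the slice containing it.

    slice_starts lists, in increasing order, the scan-order address of
    the first macroblock of every slice. Returns a grid of slice indices
    with mb_rows rows and mb_cols columns.
    """
    total = mb_cols * mb_rows
    if not slice_starts or slice_starts[0] != 0:
        raise ValueError("the first slice must start at macroblock 0")
    if sorted(set(slice_starts)) != list(slice_starts) or slice_starts[-1] >= total:
        raise ValueError("slice starts must be strictly increasing and inside the picture")
    starts = set(slice_starts)
    labels = []
    current = -1
    for address in range(total):  # normal scanning order: row by row
        if address in starts:
            current += 1
        labels.append(current)
    return [labels[r * mb_cols:(r + 1) * mb_cols] for r in range(mb_rows)]


def may_predict_from(grid, target, neighbour):
    """Within one picture, prediction is allowed only inside the same slice."""
    (tr, tc), (nr, nc) = target, neighbour
    return grid[tr][tc] == grid[nr][nc]


if __name__ == "__main__":
    grid = slice_map(4, 2, [0, 3, 6])
    for row in grid:
        print(row)
```

Note that a slice formed in this way can wrap from the end of one macroblock row to the start of the next, so it is not necessarily rectangular.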
Each GOB or slice is divided into MBs. An MB relates to 16×16 pixels of luminance data and the spatially corresponding 8×8 pixels of chrominance data. In other words, an MB consists of four 8×8 luminance blocks and two spatially corresponding 8×8 chrominance blocks.
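The block structure of an MB can be sketched as follows. The function name and the ordering of the returned blocks (top-left, top-right, bottom-left, bottom-right) are choices made for this illustration.

```python
BLOCK = 8  # block size in pixels
MB = 16    # macroblock size in luminance pixels


def split_luma_macroblock(luma):
    """Split a 16x16 list of luminance rows into four 8x8 blocks.

    Blocks are returned in the order top-left, top-right, bottom-left,
    bottom-right (an ordering chosen for this sketch).
    """
    if len(luma) != MB or any(len(row) != MB for row in luma):
        raise ValueError("expected 16 rows of 16 samples")
    blocks = []
    for top in (0, BLOCK):
        for left in (0, BLOCK):
            blocks.append([row[left:left + BLOCK] for row in luma[top:top + BLOCK]])
    return blocks


if __name__ == "__main__":
    # Fill the macroblock with its own scan-order sample index.
    luma = [[r * MB + c for c in range(MB)] for r in range(MB)]
    luma_blocks = split_luma_macroblock(luma)
    # Four luminance blocks plus one CB block and one CR block.
    print(len(luma_blocks) + 2, "blocks per macroblock")
```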
Rather than using regions formed of a number of MBs in the normal scan order, rectangular regions consisting of N×M macroblocks (N, M greater than or equal to one), which would replace the slice and GOB structures, were proposed for ITU-T H.263 by Sen-ching Cheung, “Proposal on using Region Layer in H.263+”, ITU-T SG15 WP1 document LBC-96-213, July 1996. However, the proposal was not adopted for H.263.
In ITU-T H.263 Independent Segment Decoding mode (ITU-T H.263 Annex R), segment boundaries (as defined by the boundaries of the slices or the upper boundaries of the GOBs for which GOB headers are sent, or the boundaries of the picture, whichever bounds a region in the smallest way) are treated similarly to picture boundaries, which eliminates all error propagation from neighboring slices. For example, errors cannot be propagated due to motion compensation or de-blocking loop filtering from neighboring slices. Segment boundaries can only be changed at INTRA pictures, i.e. when no inter-coding is required.
The ISO/IEC standard draft 14496-2:1999(E), referred to as MPEG-4 visual or MPEG-4 video, is a standard draft whose design is centered around a basic unit of content called an audio-visual object (AVO). Examples of AVOs are a musician (in motion) in an orchestra, the sound generated by that musician, the chair she is sitting on, the (possibly moving) background behind the orchestra, and explanatory text for the current passage. In MPEG-4 video, each AVO is represented separately and becomes the basis for an independent stream.
The coding of natural two-dimensional motion video is a part of MPEG-4 video. MPEG-4 video is capable of coding both conventional rectangular video objects and arbitrarily shaped two-dimensional video objects. The basic video AVO is called a video object (VO). The VOs can be scalable, i.e. they may be split up, coded, and sent in two or more video object layers (VOL). One of these VOLs is called the base layer, which all terminals must receive in order to display any kind of video. The remaining VOLs are called enhancement layers, which may be expendable in case of transmission errors or restricted transmission capacity. In case of non-scalable video coding, one VOL per VO is coded.
A snapshot in time of a video object layer is called a video object plane (VOP). For a rectangular video, this corresponds to a picture or a frame. However, in general, the VOPs can have an arbitrary shape. Each VOP can be divided into video packets. Each VOP and video packet is further divided into macroblocks similarly to ITU-T H.263. The colour (YUV) information of the macroblock is coded similarly to ITU-T H.263, i.e., the macroblock is further divided into 8×8 blocks. In addition, if the VOP has an arbitrary shape, the shape of the macroblock is coded as explained in the next paragraph.
The MPEG-4 video VOs may be of any shape, and furthermore the shape, size, and position of the object may vary from one frame to the next. In terms of its general representation, a video object is composed of three colour components (YUV) and an alpha component. The alpha component defines the object's shape on a picture-by-picture basis. Binary objects form the simplest class of objects. They are represented by a sequence of binary alpha maps, i.e. two-dimensional pictures where each pixel is either black or white. MPEG-4 video provides a binary-shape-only mode for compressing these objects. The compression process is defined exclusively by a binary shape encoder for coding the sequence of alpha maps. In addition to binary objects, a grey-level alpha map can be used to define the opacity of the object. The object boundary is coded using a binary alpha map, while the grey-level alpha information is coded similarly to texture coding using the discrete cosine transform (DCT). In addition to the sequence of object shape and opacity definitions, the representation comprises the colours of all the pixels within the interior of the object shape. MPEG-4 video encodes these objects using a binary shape encoder and then a motion-compensated DCT-based algorithm for the interior texture coding.
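The roles of the binary and grey-level alpha maps can be illustrated with a small sketch. The 8-bit alpha range, the rule that a pixel belongs to the object when its alpha value is non-zero, and the linear blending formula are all assumptions made for illustration; they are not taken from the MPEG-4 specification, and the function names are hypothetical.

```python
def binary_support(grey_alpha):
    """Derive a binary alpha map (1 = inside the object, 0 = outside).

    Assumption for this sketch: a pixel is inside the object whenever
    its grey-level alpha value is non-zero.
    """
    return [[1 if a > 0 else 0 for a in row] for row in grey_alpha]


def composite_pixel(fg, bg, alpha):
    """Blend one 8-bit object sample over a background sample.

    alpha = 0 means fully transparent and alpha = 255 fully opaque
    (an assumed convention). The result is rounded to the nearest integer.
    """
    if not 0 <= alpha <= 255:
        raise ValueError("alpha must be in the range 0..255")
    return (alpha * fg + (255 - alpha) * bg + 127) // 255


if __name__ == "__main__":
    grey_alpha = [[0, 128], [255, 0]]
    print(binary_support(grey_alpha))
    print(composite_pixel(200, 50, 128))
```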
It is also known to be advantageous to segment a video bit-stream into portions of different priorities, for example by scalable video coding, data partitioning, or region-based coding discussed above.
Scalable video coding and data partitioning suffer, however, from dependencies between different coding elements. An enhancement layer, for example, cannot be decoded correctly if the base layer has not been received correctly. Correspondingly, a low-priority partition is of no use if the corresponding high-priority partition has not been received. This makes the use of scalable video coding and data partitioning disadvantageous in some cases. Scalable coding and data partitioning do not provide means to handle spatial regions of interest differently from subjectively less important areas. Moreover, many forms of scalable coding, such as conventional signal-to-noise ratio (SNR) and spatial scalability, suffer from worse compression efficiency than non-scalable coding. In region-based video coding, on the other hand, the GOBs or slices may contain macroblocks of different subjective importance. Thus, no prioritisation of GOBs and slices is typically possible.
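The layer dependency described at the start of the preceding paragraph can be sketched as follows. The sketch assumes a simple linear chain in which every enhancement layer depends on all layers below it; the text itself only states the dependency on the base layer, and the function name is hypothetical.

```python
def usable_layers(received):
    """Return the indices of layers that can be decoded.

    received[0] is the base layer and received[1:] are enhancement
    layers in increasing order. Assumption for this sketch: a layer is
    usable only if it and every layer below it were received correctly.
    """
    usable = []
    for index, ok in enumerate(received):
        if not ok:
            break  # everything above a lost layer is of no use
        usable.append(index)
    return usable


if __name__ == "__main__":
    # The second enhancement layer is lost, so the third cannot be used.
    print(usable_layers([True, True, False, True]))
    # The base layer is lost, so nothing can be decoded.
    print(usable_layers([False, True]))
```

Under this assumption, correctly received data above a lost layer is wasted, which is the disadvantage noted in the text.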
Coding of arbitrarily shaped objects is currently considered too complex for handheld devices. This is further exemplified by the fact that MPEG-4 video shape coding tools are typically excluded from mobile video communication services of the planned third generation mobile telephones.