In video compression, scalability is the functionality expected to address the ever-growing constraints of video transmission over heterogeneous networks (bandwidth, error rate, etc.) and the varying capabilities and demands of receivers (CPU, display size, application). It allows a progressive transmission of information (in layers or not), so that the quality level of the reconstructed video sequence is proportional to the amount of information that is taken from the bitstream.
Although they were not initially designed to address these issues, current standards have tried to upgrade their video coding schemes in order to include this functionality. In quality or SNR scalable compression schemes, temporal and spatial resolutions are kept the same, but the image quality is intended to vary depending on how much of the bitstream is decoded. In practice, most standards provide SNR scalability by means of a layered structure without giving up the classical single-scale scheme. The base layer (BL) is generally highly and efficiently compressed by a hybrid predictive encoding loop. The enhancement layer (EL) improves the quality of the compressed video signal by encoding the residual error (or prediction error), which is the difference between the original image and a reconstructed image. In MPEG-4 version 4, for instance, the EL uses DCT bit-planes to re-encode this residual error.
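The layered BL/EL structure described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the base layer is a plain coarse quantizer rather than a hybrid predictive loop, and the bit-planes are taken directly on the integer residual samples rather than on DCT coefficients. The names `encode_layers`, `decode_layers`, `base_step` and `planes_used` are hypothetical and do not come from any standard.

```python
def encode_layers(samples, base_step):
    """Split integer samples into a coarse base layer and residual bit-planes."""
    # Base layer: coarse uniform quantization (stand-in for the real BL codec).
    base = [round(s / base_step) for s in samples]
    base_recon = [q * base_step for q in base]
    # Enhancement layer: residual between the original and the BL reconstruction.
    residual = [s - r for s, r in zip(samples, base_recon)]
    num_planes = max((abs(r).bit_length() for r in residual), default=0)
    signs = [r < 0 for r in residual]
    # Bit-planes of the residual magnitudes, most significant plane first.
    planes = [[(abs(r) >> p) & 1 for r in residual]
              for p in range(num_planes - 1, -1, -1)]
    return base, signs, planes


def decode_layers(base, signs, planes, base_step, planes_used):
    """Reconstruct samples from the base layer plus the first planes_used bit-planes."""
    num_planes = len(planes)
    magnitudes = [0] * len(base)
    for k in range(min(planes_used, num_planes)):
        shift = num_planes - 1 - k
        for i, bit in enumerate(planes[k]):
            magnitudes[i] |= bit << shift
    return [q * base_step + (-m if negative else m)
            for q, m, negative in zip(base, magnitudes, signs)]
```

Decoding with `planes_used = 0` yields the base-layer quality only; each additional bit-plane can only reduce the reconstruction error, and decoding all planes restores the original samples exactly in this lossless toy setting.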
The resulting scalability is however suboptimal for two main reasons. First, it is based only on an additional encoding of the prediction error and does not involve any refinement of the motion estimation and compensation processes. Second, it employs coding techniques such as DCT that are not intrinsically designed to provide a progressive transmission of information. For more efficient scalable video coding, hierarchical strategies therefore appear to be promising candidates. The main idea is to design schemes that provide a generalized hierarchical representation of the information, opening the way to scalability. Schematically, a simple hierarchical video coding scheme may be composed of several levels, each of which delivers a better-reconstructed image by means of a global refinement process (for instance, the hierarchy may use a pyramid composed of several image resolutions).
In parallel, hierarchical hybrid predictive coding schemes using non-block image representations like triangular meshes also constitute an interesting alternative. Meshes are well adapted to prediction error coding since they efficiently make the distinction between smooth regions and contours, well-compensated areas and occlusion regions. However, existing mesh-based methods generally encode the prediction error in a traditional way (block-based DCT, for example), i.e. by treating the error image as a whole picture without using the mesh employed during the motion estimation and compensation stages. These methods suffer from a lack of flexibility, especially at low bit-rates, and do not provide an embedded bitstream.
An alternative to DCT, better suited to low bit rates and tested in MPEG-4, is based on the so-called Matching Pursuit (MP) algorithm, described for instance in “Matching pursuits with time-frequency dictionaries”, by S. Mallat and Z. Zhang, IEEE Transactions on Signal Processing, vol.41, n°12, December 1993, pp.3397–3415. Indeed, MP is particularly well suited to the progressive texture encoding of arbitrarily shaped objects. Moreover, an intrinsic way of providing SNR scalability with MP is through the number of encoded “atoms”. MP naturally achieves scalability by encoding the motion prediction error in decreasing order of energy. The procedure is applied iteratively until either the bit budget is exhausted or the distortion falls below a prespecified threshold. The granularity of MP is the coding cost of one atom, which is approximately 20 bits.
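The iterative MP procedure and its two stopping conditions can be sketched as follows. This is a generic one-dimensional illustration, not the coder of any standard: it assumes a small explicit dictionary of unit-norm atoms, and the names `matching_pursuit`, `max_atoms` and `energy_threshold` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, dictionary, max_atoms, energy_threshold):
    """Greedy MP decomposition; dictionary atoms are assumed to have unit norm."""
    residual = list(signal)
    atoms = []  # list of (atom index, coefficient), in encoding order
    for _ in range(max_atoms):  # stop 1: atom (bit) budget exhausted
        if dot(residual, residual) <= energy_threshold:
            break  # stop 2: distortion below the threshold
        # Select the atom that removes the most energy from the residual.
        best_index, best_coeff = max(
            ((i, dot(residual, atom)) for i, atom in enumerate(dictionary)),
            key=lambda pair: abs(pair[1]),
        )
        if best_coeff == 0.0:
            break  # no atom correlates with the residual any more
        atoms.append((best_index, best_coeff))
        residual = [r - best_coeff * a
                    for r, a in zip(residual, dictionary[best_index])]
    return atoms, residual
```

With the figure of approximately 20 bits per atom given above, a bit budget would translate into `max_atoms = bit_budget // 20` in this sketch. Truncating the returned atom list at any point still gives a valid, coarser approximation, which is where the SNR scalability comes from.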
In the European patent application filed on Dec. 28th, 1999, under the filing number 99403307.4 (PHF99627), an MP prediction error coding method has been included inside a hierarchical mesh-based video coding scheme, which makes it possible to benefit from the advantages of triangular meshes concerning spatial adaptability, deformation capacity, and compact and robust motion estimation even at low bit rates. The mesh hierarchy of this scheme is obtained through a coarse-to-fine strategy: it begins at the first level with a coarse regular triangular mesh, and then applies a mesh refinement process that locally subdivides the triangles of the current level where the prediction error signal is still important after motion compensation (the new mesh is taken as the input for the following level). Based on the MP algorithm, this method benefits from the mesh characteristics, while being especially designed to match the triangle support. Given any selected triangle, the issue is first to find the optimal strategy for atom positioning inside said triangle, resulting in a fast energy decrease of the error signal and a precise and smooth signal decomposition.
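One level of the coarse-to-fine refinement described above can be sketched as follows. The subdivision rule (splitting a triangle into four at its edge midpoints), the caller-supplied `error_energy` function and the `threshold` parameter are all assumptions made for illustration; the cited application may use a different rule.

```python
def midpoint(p, q):
    """Midpoint of two 2-D points given as (x, y) tuples."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def subdivide(triangle):
    """Split one triangle into four by joining its edge midpoints (assumed rule)."""
    a, b, c = triangle
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def refine_mesh(triangles, error_energy, threshold):
    """Produce the next mesh level: subdivide only where the error is still high.

    error_energy is a hypothetical callable returning the prediction error
    energy measured inside a triangle after motion compensation.
    """
    refined = []
    for triangle in triangles:
        if error_energy(triangle) > threshold:
            refined.extend(subdivide(triangle))
        else:
            refined.append(triangle)
    return refined
```

Calling `refine_mesh` repeatedly, each time on the mesh returned by the previous call, yields the successive levels of the hierarchy. Note that this naive local subdivision leaves hanging vertices on edges shared with unsplit neighbours, which a practical scheme would have to handle.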
A first positioning method, of a geometrical type, results in a bit budget gain in comparison with the block-based approach, for which each atom position has to be encoded. Although this geometrical choice ensures that the atoms stay in the middle of the triangle, it results in losing the positioning freedom that is a property of MP. By reusing the error energy information for atom center positioning, an atom coding efficiency closer to that of the block-based approach is obtained. This second implementation may be further improved by adding the possibility to orient one atom axis along the direction of the most important energy. A better atom positioning is thus obtained, the atom axes being aligned with the error signal that has to be approximated.
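One plausible way to derive an atom center and axis orientation from the error energy is sketched below. It is an assumption, not the method of the cited application: the center is taken as the energy-weighted centroid of the error pixels in the triangle, and the axis as the principal direction of the energy-weighted second-order moments. The name `atom_placement` and the `(x, y, error)` input format are hypothetical.

```python
import math


def atom_placement(error_pixels):
    """Estimate an atom center and axis angle from the error inside one triangle.

    error_pixels: list of (x, y, error_value) for the pixels of the triangle.
    Returns ((cx, cy), angle_in_radians), or None if the error is zero everywhere.
    """
    weighted = [(x, y, e * e) for x, y, e in error_pixels]  # energy = squared error
    total = sum(w for _, _, w in weighted)
    if total == 0:
        return None
    # Center: energy-weighted centroid.
    cx = sum(x * w for x, _, w in weighted) / total
    cy = sum(y * w for _, y, w in weighted) / total
    # Orientation: principal axis of the energy-weighted second-order moments.
    sxx = sum(w * (x - cx) ** 2 for x, _, w in weighted) / total
    syy = sum(w * (y - cy) ** 2 for _, y, w in weighted) / total
    sxy = sum(w * (x - cx) * (y - cy) for x, y, w in weighted) / total
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), angle
```

Because both quantities are computed from the error signal, which the decoder does not have, a real coder would have to transmit them or derive them from information available on both sides; the sketch does not address that.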
With respect to the method thus described, the triangular mesh-based video coding scheme may be improved by a hierarchical representation. Hierarchy addresses the issue of finding optimal patch sizes and provides a tool for building a description that is progressively refined from level to level (thus allowing scalability). The hierarchy may be initialized with an arbitrary coarse mesh that is successively refined according to a specified criterion (energy, for instance). The hierarchy used in the present case combines a mesh grid with the image at each resolution, so that the coarsest mesh is coupled to the lowest resolution image (here, the term resolution refers to a low-pass filtering that is performed on source images without any downsampling, and not to a decimation). Thus, the image and mesh couples consist of elements whose information accuracy increases with the level.
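The coupling of image and mesh at each level can be sketched as follows. The box filter, the rule linking the filter radius to the level, and the halving of the mesh spacing are illustrative assumptions; the text only specifies that low-pass filtering is applied without downsampling and that the coarsest mesh goes with the lowest resolution image. The names `box_blur`, `build_hierarchy` and `coarsest_spacing` are hypothetical.

```python
def box_blur(image, radius):
    """Low-pass filter a 2-D image (list of rows) with a box window; size is kept."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[yy][xx]
                    count += 1
            output[y][x] = total / count
    return output


def build_hierarchy(image, levels, coarsest_spacing):
    """Pair each level's filtered image with a mesh grid spacing.

    Level 0 is the coarsest: strongest filtering, largest triangles.
    The last level uses the unfiltered source image and the finest mesh.
    """
    hierarchy = []
    for level in range(levels):
        radius = levels - 1 - level                 # assumed filter schedule
        spacing = max(1, coarsest_spacing >> level)  # assumed: halve per level
        hierarchy.append({
            "level": level,
            "image": box_blur(image, radius),  # same dimensions at every level
            "mesh_spacing": spacing,
        })
    return hierarchy
```

Every level keeps the dimensions of the source image, so mesh node coordinates remain directly comparable from one level to the next.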
For instance, small triangles provide a precise motion modeling but are not well suited to large movements. Conversely, the coarsest mesh allows global motion estimation. Propagating and updating this movement on refined meshes made of smaller triangles then produces a local optimization. Furthermore, no regularization or smoothing constraints are needed because of this mesh size control. FIG. 1 shows an example of mesh hierarchy, the image quality evolving along with said mesh hierarchy.
However, considering only the hierarchical feature of these tools, it appears that they do not provide scalability on their own. The reason is that motion estimation is performed at each hierarchy level between the same source images as for the first level.