The present invention relates to the field of video encoders and, more particularly, to an encoding method based on an adaptation of the so-called matching pursuit method (MP) and provided to be implemented in a signal-to-noise ratio (SNR) scalable video encoder.
In the multimedia domain, new interactive applications such as Internet video streaming, video database browsing or multi-quality video services are becoming widespread. This leads to an expansion of video services over heterogeneous networks (the Internet, mobile nets and In Home Digital Networks), which has raised new issues in terms of varying transport conditions (bandwidth, error rate . . . ) as well as varying consumer demands and terminal decoding capabilities (CPU, display size, application . . . ). Consequently, there is a need for new flexible and intrinsically robust video compression systems which can handle these new constraints, especially at very low bit rates. The notion of scalability is the expected functionality to address these issues. For instance, only one part of an embedded bitstream may be sent to a particular user so as to avoid network congestion, or the entire bitstream can be partially decoded according to the targeted level of quality. Current standards such as MPEG-4 are working toward video coding schemes that provide this functionality.
MPEG-4 lies on a classical block-based predictive hybrid scheme in which the prediction error coding after motion compensation is realized by means of a Discrete Cosine Transform (DCT). However, DCT appears to suffer from very annoying visual artifacts at low bit rates (especially block edge artifacts), because the number of coded coefficients becomes very low and the quantization is coarse. An alternative to DCT, better suited to low bit rate and tested in MEPG-4, is based on the so-called Matching Pursuit (MP) algorithm, described for instance in xe2x80x9cMatching pursuits with time-frequency dictionariesxe2x80x9d, by S. Mallat and Z. Zhang, IEEE Transactions on Signal Processing, vol.41, nxc2x012, December 1993, pp.3397-3415.
As described in said document, MP is a time-frequency transform that expands the motion residual signal in an iterative manner, using an overcomplete dictionary of bidimensional (2D) functions called atoms. This algorithm iteratively calculates the atom minimizing the mean-squared error of the signal that has to be encoded. The multistage algorithm proceeds according to the following steps:
the highest-energy image block on a rectangular search window is computed; around this block, the projection that maximizes the inner product between the basis functions and the residual image at the current iteration is calculated: a high inner product value indicates that the corresponding atom fits the signal structure;
this weighted atom is then subtracted from the current error image to produce the new residual image for the next stage.
The method complexity comes from the fact that the algorithm searches the best atom among all atoms in the dictionary at each iteration. Table 1 shows the atom dictionary that is traditionally used in the classical block-based Matching Pursuit implementation (the parameters Si, Fi and Pi respectively designating the scale, the frequency and the phase of the atom):
The redundancy of the dictionary allows a larger degree of freedom in the choice of the signal decomposition. It is all the more interesting in the case of low bit rate coding, since MP only employs the required functions to represent accurately the signal, thus resulting in a better compression efficiency. Moreover, MP is iterative, and the residual energy therefore decreases at each iteration. The procedure is iterated until this residual energy falls down below a given threshold or the available amount of bits is exhausted. The representation is adaptive in the sense that the most significant coefficients are first extracted. Therefore, MP encodes the residual error in decreasing order of importance, which is attractive for quality (SNR) scalability. Additional coefficients corresponding to finer details in the signal structure may be generated during the next step of the algorithm.
The dictionary is generally composed of bidimensional (2D) separable Gabor functions represented by parameters governing the scale of a modulated Gaussian function, and the frequency and phase of a modulating function. The choice of the dictionary is important since it is strongly related to the convergence speed of the decomposition and thus to the efficiency of the algorithm. Gabor functions are chosen for their smooth decay at edges, which limits visual artifacts. Moreover, atoms may be positioned at any pixel location of the entire error image in order to match the signal structures. Atom overlapping is enabled, which increases the expansion smoothness.
Although the superiority of this MP method over the DCT has been clearly shown by several studies, the DCT remains the only texture coding method used in MPEG-4 (the reason is that the DCT is a very well mature method, for which many software and hardware optimizations exist). Moreover, MP is an iterative algorithm, for which no parallelism techniques exist and which is complex at the encoder side. However, MP seems well suited to texture description inside arbitrarily shaped objects and provides a progressive representation of information in an embedded bitstream, thus allowing scalability.
As current video standards have not been originally designed to provide scalability (they have been upgraded so as to include this new functionality while preserving their existing architecture), this scalability is generally achieved through a layer structure that delivers two or more separated bitstreams. This approach suffers from a lack of flexibility and a low coding efficiency of the enhancement layer. Moreover, the basic types of scalability (spatial, temporal, SNR) cannot be reached at the same time. As a result, ITU and MPEG committee experts are looking, in the domain of very low bit rate, for fully scalable video coding schemes that allow a greater interactivity. A promising approach deals with the use of a triangular mesh-based representation (i.e. a subdivision of the image domain into elementary patches, most currently triangular because this configuration is very flexible) for object-based compression and composition. Such an operation aims at defining homogeneous regions with a limited number of elements, according to the targeted accuracy of the scene description, and can even be hierarchical. The hierarchy is then initialized to an arbitrary coarse mesh that is successively refined according to a specific criterion, thus allowing a description whose accuracy increases from level to level.
It has then be proposed to exploit the main advantages provided by meshes: spatial adaptability for efficient object contour encoding, deformation capacity for object tracking, compact and robust motion estimation even at low bit rates without the visual artifacts generally encountered with other representations. Triangular meshes are attractive for functionalities such as texture animation and deformation, content-based retrieval of video-objects, augmented reality, transfiguration, etc. Moreover, their extension to hierarchical meshes is a possible response to scalability. However, only few known works are related to the subject of texture coding inside meshes, these existing studies being based on adaptations of the DCT, originally designed only for rectangular decompositions.
It is therefore an object of the invention to propose a video encoding method that uses and adapts the MP algorithm to a texture error coding inside triangular meshes, in order to benefit from MP features, namely a progressive information transmission in decreasing order of importance and a smooth signal representation even at low bit rates.
To this end, the invention relates to based on an adaptation of the so-called matching pursuit method (MP) in which said error texture coding process inside triangular meshes is restricted, at each iteration of said MP method, to a specific triangle selected according to a predefined criterion and with respect to which the optimal strategy for atom positioning inside this triangle, in view of a fast energy decrease of the error signal, consists in arbitrarily selecting the geometrical barycenter of the selected triangle as the center of the atoms, no additional information concerning atom positions being transmitted to the decoder.
The proposed solution provides the targeted video-coding scheme with a prediction error coding tool well suited to mesh-based coding. Based on the MP algorithm, this method benefits from the mesh characteristics, while being especially designed to match the triangle support. Given any selected triangle, the issue is to find the optimal strategy for atom positioning inside said triangle, resulting in a fast energy decrease of the error signal and a precise and smooth signal decomposition. This geometrical positioning method results in a bit budget gain in comparison to the block-based approach for which each atom position has to be encoded.
In another embodiment, the invention relates to an encoding method based on an adaptation of the so-called matching pursuit method (MP) in which said error texture coding process inside triangular meshes is restricted, at each iteration of said MP method, to a specific triangle selected according to a predefined criterion and with respect to which the optimal strategy for atom positioning inside this triangle, in view of a fast energy decrease of the error signal, consists in re-using the error energy information for atom center positioning, the Displaced Frame Difference (DFD) energy barycenter inside the selected triangle that has the highest energy being then arbitrarily chosen as the center of the atoms, and this position inside the triangle being an additional information transmitted to the decoder.
If the geometrical choice (corresponding to the first embodiment) ensures that the atoms stay in the middle of the triangle, it result however in loosing the property of the MP with respect to the positioning freedom. By re-using the error energy information for atom center positioning, an atom coding efficiency more similar to the block-based approach is obtained.
This implementation may still be improved by adding to it the possibility to orient one atom axis along the direction of the most important energy. A better atom positioning is thus obtained, the atom axes being aligned with the error signal that has to be approximated.