In the coding standards known as hybrid standards, for example MPEG-1, MPEG-2, MPEG-4 and H.264, as in the majority of 2D+t sub-band coding schemes, for example MC-EZBC (Motion Compensated Embedded Zero Block Context), the first step in the coding sequence consists in exploiting the temporal redundancy between successive images, before exploiting the spatial redundancy within an image.
FIG. 1 shows a video coder scheme according to the prior art.
The video signal is transmitted to a temporal analysis circuit 1. A motion estimation circuit 2 is connected to this first circuit in order to estimate the movement between two images received by the coder. The motion information is transmitted to the circuit 1 and to a coding circuit 6, for example in the form of motion vector fields. The output of the circuit 1 is transmitted to a spatial analysis circuit 3 that extracts the image frequency coefficients from the texture. These coefficients are subsequently quantized, then coded by an entropy coding circuit 4. This coded information and the motion information are transmitted to a packet generation circuit or packetizer 5 that sends the video data in the form of video packets which form the video data stream.
The temporal analysis circuit 1 performs a motion compensated temporal prediction in the case of a hybrid scheme or MCTF (Motion Compensated Temporal Filtering) in the case of a sub-band coding scheme. The coding algorithms with temporal prediction consist in applying motion compensation in order to generate prediction images which will later be used in the coding process. These algorithms are based on the same principle. The images to be coded are predicted starting from one or more previously coded images, called reference images. This is the case in the video MPEG standards with Predicted (P) images and Bi-directional or Bi-predicted (B) images. The prediction consists in performing a motion compensation using these reference images and motion vectors associated with the current image. What is subsequently coded is the residue of the prediction, in other words the difference between the current image and the temporal prediction image. The motion is generally described per block of pixels, and the motion compensation is carried out block by block.
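By way of illustration only, the block-based prediction and residue computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the coder of FIG. 1: the function names `predict_block` and `residue_block`, the integer-pixel motion vectors, the border clamping and the toy 4x4 images are all hypothetical choices made for the example.

```python
def predict_block(reference, top, left, mv, size):
    """Fetch a size x size block from the reference image, displaced by mv = (dy, dx)."""
    dy, dx = mv
    height, width = len(reference), len(reference[0])
    block = []
    for y in range(size):
        row = []
        for x in range(size):
            # Assumption: coordinates falling outside the image are clamped to the border.
            ry = min(max(top + y + dy, 0), height - 1)
            rx = min(max(left + x + dx, 0), width - 1)
            row.append(reference[ry][rx])
        block.append(row)
    return block


def residue_block(current, reference, top, left, mv, size):
    """Residue = current block minus its motion compensated prediction."""
    prediction = predict_block(reference, top, left, mv, size)
    return [
        [current[top + y][left + x] - prediction[y][x] for x in range(size)]
        for y in range(size)
    ]


# Toy images: the current image is the reference shifted right by one pixel.
reference = [[10 * y + x for x in range(4)] for y in range(4)]
current = [[reference[y][max(x - 1, 0)] for x in range(4)] for y in range(4)]

zero_mv_residue = residue_block(current, reference, 0, 0, (0, 0), 4)
good_mv_residue = residue_block(current, reference, 0, 0, (0, -1), 4)
```

With the motion vector (0, -1) that matches the true displacement, the residue is zero everywhere, whereas the zero vector leaves non-zero differences that would have to be coded.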
The spatial analysis circuit 3 performs, for example, a decomposition into wavelets or a discrete cosine transform. The entropy coding of the circuit 4 can be a coding of the VLC (Variable Length Coding) type or a coding of the arithmetic type.
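As an illustration of how a spatial transform extracts frequency coefficients from the texture, the following sketch computes a one-dimensional orthonormal DCT-II on a row of samples. It is given under assumptions: the function name `dct_ii`, the orthonormal scaling and the 8-sample row are choices made for the example, and practical coders apply a two-dimensional transform to blocks or whole images.

```python
import math


def dct_ii(samples):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        # Scaling chosen so that the transform is orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        coefficients.append(scale * total)
    return coefficients


# A flat row of texture: all the energy is expected in the first (DC) coefficient.
flat_row = [8.0] * 8
flat_coefficients = dct_ii(flat_row)
```

For a uniform row, only the first coefficient is significantly different from zero, which is what makes the subsequent quantization and entropy coding efficient on smooth textures.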
The function of the packetization circuit 5 is to divide up the texture information coming from the entropy coding circuit 4 and the motion information coming from the coding circuit 6 for the motion fields into coherent subsets according to their spatial and temporal frequency and to their importance, for example their weight in a bit-plane coding approach. The binary stream obtained is thus independently scalable in resolution, in frame rate and in fidelity.
The estimated motion fields correspond to the resolution of the source. The motion compensation step of the coder, and also its inverse in the decoder, whether done by filtering or prediction, must therefore be executed on full resolution images in order to be coherent.
Spatial scalability, that is, the possibility of transmitting and therefore of reconstructing images at various levels of resolution, for example images in SD (Standard Definition), CIF or QCIF format, is currently often exploited in video data transmission. The conventional coding schemes by spatio-temporal analysis, such as that previously described using wavelet decomposition or a discrete cosine transform, lend themselves to such scalability. They do not, however, allow the motion information to be adapted in an optimal manner to this scalability, in other words to the various resolutions of the image, and hence do not allow the data compression to be optimized. A video coder that follows the architecture described can be spatially scalable for the texture, but not for the motion. Yet this motion information is not negligible. As an example, it represents around 30% of the whole of the binary stream when a low-rate 15 Hz CIF sequence is encoded. The usual architectures therefore suffer from an over-definition of the motion information which substantially affects the compression performance at low resolution.
Solutions exist for preserving the scalability of both the texture and the motion. The simplest means is to estimate the motion at the lowest spatial resolution allowed for decoding. The spatial decomposition is therefore carried out first. The temporal redundancy existing between the successive spatial high frequencies then remains to be exploited. For this purpose, several solutions have been proposed which re-introduce conventional temporal decorrelation tools: prediction or motion compensated filtering. However, these conventional techniques are less efficient in the transform domain than in the pixel domain because of the phase problem that gives rise to the phenomenon known as ‘shift-variance’ of spatial transforms. Indeed, both the discrete wavelet transform (DWT) and the discrete cosine transform (DCT) are such that successive image coefficients, corresponding to the same pixel pattern, can be very different in sign and in absolute value, depending on the direction and amplitude of the movement and on the direction and length of the spatial filter. The shift-variance intrinsic to spatial transforms requires a new approach for motion estimation, since it makes the temporal high frequencies unsuitable for coding by prediction or filtering.
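The shift-variance described above can be illustrated with a small sketch. It assumes a single decomposition level of an unnormalized Haar wavelet (pairwise averages and differences); the function name `haar_level` and the 8-sample edge signal are hypothetical and chosen only to make the effect visible.

```python
def haar_level(samples):
    """One level of an unnormalized Haar decomposition (even-length input assumed)."""
    low = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    high = [(samples[i] - samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    return low, high


# The same edge pattern at two positions one pixel apart.
edge = [0, 0, 0, 0, 8, 8, 8, 8]
shifted_edge = [0, 0, 0, 0, 0, 8, 8, 8]

_, high_before = haar_level(edge)
_, high_after = haar_level(shifted_edge)
```

Here `high_before` contains only zeros, because the edge falls between two sample pairs, whereas `high_after` contains a coefficient of -4 at the pair that straddles the edge. A one-pixel displacement of the same pattern thus changes the high-frequency coefficients entirely, so a prediction of one set of coefficients from the other leaves a large residue.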