The present invention relates to an encoding method for the compression of a video sequence divided in groups of frames decomposed by means of a tridimensional (3D) wavelet transform leading to a given number of successive resolution levels, said method being based on a hierarchical subband encoding process called "set partitioning in hierarchical trees" (SPIHT) and leading from the original set of picture elements (pixels) of each group of frames to transform coefficients encoded with a binary format and constituting a hierarchical pyramid, said coefficients being ordered by means of magnitude tests involving the pixels represented by three ordered lists called list of insignificant sets (LIS), list of insignificant pixels (LIP) and list of significant pixels (LSP), said tests being carried out in order to divide said original set of picture elements into partitioning subsets according to a division process that continues until each significant coefficient is encoded within said binary representation, and a spatio-temporal orientation tree (in which the roots are formed with the pixels of the approximation subband resulting from the 3D wavelet transform and the offspring of each of these pixels is formed with the pixels of the higher subbands corresponding to the image volume defined by these root pixels) defining the spatio-temporal relationship inside said hierarchical pyramid.
In video compression schemes, the reduction of temporal redundancy is mainly achieved by two types of approaches. According to the first one, the so-called "hybrid" or predictive approach, a prediction of the current frame is computed based on the previously transmitted frames, and only the prediction error is intra-coded and transmitted. In the second one, the temporal redundancy is exploited by means of a temporal transform, in a manner similar to the spatial techniques for removing redundancies. In this latter technique, called the 3D or 2D+t approach, the sequence of frames is processed as a 3D volume, and the subband decomposition used in image coding is extended to 3D spatio-temporal data by using separable transforms (for example, wavelet or wavelet packet transforms implemented by means of filter banks). The anisotropy in the 3D structure can be taken into account by using different filter banks in the temporal and spatial directions (Haar filters are usually chosen for temporal filtering, since the added delay observed with longer filters is undesirable; furthermore, Haar filters, which are two-tap filters, are the only perfect reconstruction orthogonal filters which do not present boundary effects).
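The two-tap Haar filtering mentioned above can be illustrated by a minimal sketch. It assumes the orthonormal form of the Haar filters (scaling by 1/sqrt(2)) and represents each frame as a flat list of pixel values; the function names are hypothetical and motion compensation is not included.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling (an assumption of this sketch)


def haar_analysis(frame_a, frame_b):
    """Filter one pair of frames into a low-pass and a high-pass temporal frame."""
    low = [(a + b) * _S for a, b in zip(frame_a, frame_b)]
    high = [(a - b) * _S for a, b in zip(frame_a, frame_b)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis: recover the two original frames."""
    frame_a = [(l + h) * _S for l, h in zip(low, high)]
    frame_b = [(l - h) * _S for l, h in zip(low, high)]
    return frame_a, frame_b
```

Because each output pair depends only on one input pair, no samples beyond the pair are needed, which is consistent with the absence of boundary effects noted above.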
It was observed that the coding efficiency of the 3D coding scheme can be improved by performing motion estimation/compensation in the low temporal subbands, at each level of the temporal decomposition. Therefore, the present scheme includes motion estimation/compensation inside subbands, and the 3D subband decomposition is applied to the compensated group of frames. A complete three-stage temporal decomposition is shown in FIG. 1. Each group of frames in the input video sequence must contain a number of frames equal to a power of two (usually 16; in the present example, 8). The rectilinear arrows indicate the low-pass (L) temporal filtering (continuous arrows) and the high-pass (H) temporal filtering (dotted arrows), and the curved arrows designate the motion compensation between two frames. At the last temporal decomposition level, there are two frames in the lowest temporal subband. A spatial decomposition is then performed on each frame of the temporal subbands. In this framework, subband coding of the three-dimensional data structure can be realized as an extension of the spatial subband coding techniques.
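The multi-level temporal decomposition can be sketched as repeated Haar filtering of the low-pass frames. This is only an illustration under stated assumptions: motion estimation/compensation is omitted, frames are flat lists of values, and the function name and subband labels ("H", "LH", "LLH", "LLL") are hypothetical and not taken from FIG. 1.

```python
import math


def temporal_decompose(frames, levels):
    """Split a group of frames into temporal subbands by repeated Haar filtering.

    Returns a dict mapping a subband label to its list of frames.
    Motion compensation is deliberately left out of this sketch.
    """
    s = 1.0 / math.sqrt(2.0)
    subbands = {}
    low = frames
    prefix = ""
    for _ in range(levels):
        if len(low) % 2 != 0:
            raise ValueError("each level needs an even number of frames")
        pairs = list(zip(low[0::2], low[1::2]))
        high = [[(a - b) * s for a, b in zip(fa, fb)] for fa, fb in pairs]
        low = [[(a + b) * s for a, b in zip(fa, fb)] for fa, fb in pairs]
        subbands[prefix + "H"] = high  # detail frames of this level
        prefix += "L"
    subbands[prefix] = low  # lowest temporal subband
    return subbands
```

For a hypothetical group of 16 frames and three levels, this sketch yields 8, 4 and 2 high-pass frames at the successive levels and 2 frames in the lowest temporal subband.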
One of the best-performing wavelet-based schemes for image compression, which was recently extended to the 3D structure of subbands, is the bidimensional set partitioning in hierarchical trees, or 2D SPIHT, described in the document "A new, fast, and efficient image codec based on set partitioning in hierarchical trees", by A. Said and W. A. Pearlman, IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, No 3, June 1996, pp. 243-250. The basic concepts used in this 3D coding technique are the following: spatio-temporal trees corresponding to the same location are formed in the wavelet domain; then, the wavelet transform coefficients in these trees are partitioned into sets defined by the level of the highest significant bit in a bit-plane representation of their magnitudes; finally, the highest remaining bit planes are coded and the resulting bits transmitted.
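The magnitude test underlying this partitioning can be sketched as follows. The sketch assumes integer-valued coefficients and shows only the significance test against a bit-plane threshold 2^n, not the full list management (LIS, LIP, LSP) of SPIHT; the function names are hypothetical.

```python
def top_bit_plane(coeffs):
    """Return the largest n such that 2**n <= max |c| over the coefficients."""
    peak = max(abs(int(c)) for c in coeffs)
    if peak < 1:
        raise ValueError("all coefficients are zero")
    return peak.bit_length() - 1


def is_significant(coeffs, n):
    """A set is significant at bit plane n if any magnitude reaches 2**n."""
    threshold = 1 << n
    return any(abs(int(c)) >= threshold for c in coeffs)
```

A set that fails the test at plane n can be signalled with a single bit, and it is split into subsets only once it becomes significant at a lower plane.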
A common characteristic of the SPIHT algorithm presented above, in both its 2D and its 3D versions, is that the spatial (respectively spatio-temporal) orientation trees are defined beginning with the lowest frequency subband, and represent the coefficients related to the same spatial (or spatio-temporal) location. In this way, with the exception of the lowest frequency band, all parents have four (in 2D) or eight (in 3D) children. Let (i,j,k) represent the coordinates of a picture element (pixel) in the 3D transform domain: if it is not in the lowest spatio-temporal frequency subband and it is not in one of the last resolution level subbands, then its offspring have the coordinates:
O={(2i,2j,2k), (2i+1,2j,2k), (2i,2j+1,2k), (2i,2j,2k+1), (2i+1,2j+1,2k), (2i+1,2j,2k+1), (2i,2j+1,2k+1), (2i+1,2j+1,2k+1)}.
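The offspring set O can be generated directly from the parent coordinates. The sketch below covers only the general case stated above (a pixel that is neither in the lowest spatio-temporal frequency subband nor in one of the last resolution level subbands); the function name is hypothetical.

```python
from itertools import product


def offspring(i, j, k):
    """Return the eight offspring coordinates of pixel (i, j, k) in the 3D case."""
    return [(2 * i + di, 2 * j + dj, 2 * k + dk)
            for di, dj, dk in product((0, 1), repeat=3)]
```

The order in which the eight coordinates are returned is an artifact of this sketch and does not imply a scanning order.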
For the sake of simplicity, the still picture case is illustrated in FIG. 2 (subbands s-LLLL, s-LLLH, etc.).
In the image coding domain, zerotree compression algorithms have been extensively studied in recent years and several improvements have been proposed. For example, in the MPEG-4 standard, a variant of such an algorithm (see for instance the document "Embedded image coding using zerotrees of wavelet coefficients", by J. M. Shapiro, IEEE Transactions on Signal Processing, vol. 41, No 12, December 1993, pp. 3445-3462) was adopted for the still picture coding mode, in which the lowest spatial subband is independently coded using a DPCM technique. Subsequently, spatial orientation trees are formed starting in the detail subbands (all subbands except s-LLLL, the first one), as illustrated in FIG. 3.
It is an object of the invention to propose a new type of video encoding method, in the 3D case.
To this end, the invention relates to an encoding method such as defined in the introductory paragraph and which is moreover characterized in that:
(A) a vectorial differential pulse code modulation (DPCM) is used to separately encode the lowest frequency spatio-temporal subband, or approximation subband, according to the following conditions:
(a) a spatio-temporal predictor, using not only values at the same location in past frames of the video sequence but also neighbouring values in the current frame, is constructed for each vector of coefficients having components in each frame of the approximation subband, said vectorial coding feature coming from the fact that the lowest frequency subband contains spatial low frequency subbands from at least two frames;
(b) said DPCM uses constant prediction coefficients;
(B) the quantization of the prediction error is carried out by means of a scalar quantization of the two vector components, followed by an assignment of a unique binary code associated with the probability computed for each given pair of quantized values;
(C) the binary stream resulting from the steps (A) and (B) is encoded by a lossless process minimizing the entropy of the whole message.
In another embodiment, the invention relates to a similar method, but characterized in that:
(A) a vectorial differential pulse code modulation (DPCM) is used to separately encode the lowest frequency spatio-temporal subband, or approximation subband, according to the following conditions:
(a) a spatio-temporal predictor, using not only values at the same location in past frames of the video sequence but also neighbouring values in the current frame, is constructed for each vector of coefficients having components in each frame of the approximation subband, said vectorial coding feature coming from the fact that the lowest frequency subband contains spatial low frequency subbands from at least two frames;
(b) said DPCM uses constant prediction coefficients;
(B) the quantization of the prediction error is carried out by means of a vectorial quantization using an optimal quantizer based on a generalized Lloyd-Max algorithm, a joint Laplacian probability density function for the two components of the quantized prediction error vector being considered for said optimization;
(C) the binary stream resulting from the steps (A) and (B) is encoded by a lossless process minimizing the entropy of the whole message.
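Steps (A) and (B) of the first embodiment can be sketched as a closed-loop vector DPCM followed by scalar quantization of the two components. This is an illustration under stated assumptions, not the claimed method itself. The approximation subband is modelled as a grid of two-component vectors (one component per frame). The predictor uses only the left and upper neighbouring vectors. The constant 2x2 prediction matrices A_LEFT and A_UP, the uniform quantizer step and all function names are hypothetical. The final lossless coding of step (C) is not shown.

```python
from collections import Counter

# Hypothetical constant prediction matrices; the off-diagonal terms let each
# component also draw on the other frame's neighbouring values.
A_LEFT = ((0.45, 0.05), (0.05, 0.45))
A_UP = ((0.45, 0.05), (0.05, 0.45))


def _mat_vec(a, v):
    return (a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1])


def predict(recon, m, n):
    """Predict vector (m, n) from already reconstructed left and upper vectors."""
    left = recon[m][n - 1] if n > 0 else (0.0, 0.0)
    up = recon[m - 1][n] if m > 0 else (0.0, 0.0)
    pl, pu = _mat_vec(A_LEFT, left), _mat_vec(A_UP, up)
    return (pl[0] + pu[0], pl[1] + pu[1])


def dpcm_encode(band, step):
    """Return quantized prediction-error pairs and the encoder's reconstruction."""
    rows, cols = len(band), len(band[0])
    recon = [[(0.0, 0.0)] * cols for _ in range(rows)]
    symbols = []
    for m in range(rows):
        for n in range(cols):
            p = predict(recon, m, n)
            q = tuple(round((x - px) / step) for x, px in zip(band[m][n], p))
            symbols.append(q)
            recon[m][n] = tuple(px + qi * step for px, qi in zip(p, q))
    return symbols, recon


def dpcm_decode(symbols, rows, cols, step):
    """Rebuild the subband from the quantized pairs with the same predictor."""
    recon = [[(0.0, 0.0)] * cols for _ in range(rows)]
    it = iter(symbols)
    for m in range(rows):
        for n in range(cols):
            p = predict(recon, m, n)
            q = next(it)
            recon[m][n] = tuple(px + qi * step for px, qi in zip(p, q))
    return recon


def pair_probabilities(symbols):
    """Empirical probability of each pair of quantized values."""
    counts = Counter(symbols)
    return {pair: c / len(symbols) for pair, c in counts.items()}
```

The probabilities returned by `pair_probabilities` are the kind of statistics from which a binary code per pair could be assigned; how that assignment is made is left open here.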
Whatever the embodiment, said DPCM may also be adaptive, the coefficients of the spatio-temporal predictor now taking into account scene changes by means of a least mean squares estimation of these coefficients for each group of frames.
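One possible reading of this adaptive variant is a least-squares fit of the predictor coefficients, recomputed for each group of frames. The sketch below is an assumption-laden simplification: it fits a scalar two-coefficient predictor (left and upper neighbours) on a single plane of coefficients by solving the 2x2 normal equations, rather than the vectorial spatio-temporal predictor of the method; the function name is hypothetical.

```python
def fit_predictor(plane):
    """Least-squares fit of x[m][n] ~ a * x[m][n-1] + b * x[m-1][n].

    Returns (a, b) minimizing the mean squared prediction error over the
    interior samples of the plane.
    """
    sll = suu = slu = slx = sux = 0.0
    for m in range(1, len(plane)):
        for n in range(1, len(plane[0])):
            x, left, up = plane[m][n], plane[m][n - 1], plane[m - 1][n]
            sll += left * left
            suu += up * up
            slu += left * up
            slx += left * x
            sux += up * x
    det = sll * suu - slu * slu
    if abs(det) < 1e-12:
        raise ValueError("degenerate data: coefficients are not identifiable")
    a = (slx * suu - sux * slu) / det
    b = (sux * sll - slx * slu) / det
    return a, b
```

Calling such a fit once per group of frames would let the coefficients follow a scene change, at the cost of transmitting or re-deriving them at the decoder.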