1. Field of the Invention
The present invention relates to an encoding method for the compression of a video sequence divided into frames or groups of frames decomposed by means of a wavelet transform leading to a given number of successive resolution levels that correspond to the decomposition levels of said transform, said encoding method being based on the hierarchical subband encoding process called "set partitioning in hierarchical trees" (SPIHT) and leading from the original set of picture elements of the video sequence to wavelet transform coefficients encoded with a binary format, said coefficients being organized in trees and ordered into partitioning sets corresponding to respective levels of significance, said sets being defined by means of magnitude tests leading to a classification of the significance information in three ordered lists called list of insignificant sets (LIS), list of insignificant pixels (LIP) and list of significant pixels (LSP), and said tests being carried out in order to divide said original set of picture elements into said partitioning sets according to a division process that continues until each significant coefficient is encoded within said binary representation.
2. Description of the Related Art
Classical video compression schemes may be considered as comprising four main modules: motion estimation and compensation, transformation into coefficients (for instance, discrete cosine transform or wavelet decomposition), quantization and encoding of the coefficients, and entropy coding. When a video encoder must moreover be scalable, it must be able to encode images from low to high bit rates, the quality of the video increasing with the rate. By naturally providing a hierarchical representation of images, a transform by means of a wavelet decomposition appears to be better suited to scalable schemes than the conventional discrete cosine transform (DCT).
A wavelet decomposition allows an original input signal to be described by a set of subband signals. Each subband in fact represents the original signal at a given resolution level and in a particular frequency range. This decomposition into uncorrelated subbands is generally implemented by means of a set of monodimensional filter banks applied first to the lines of the current image and then to the columns of the resulting filtered image. An example of such an implementation is described in "Displacements in wavelet decomposition of images", by S. S. Goh, Signal Processing, vol. 44, no. 1, June 1995, pp. 27-38. In practice, two filters (a low-pass one and a high-pass one) are used to separate the low and high frequencies of the image. This operation is first carried out on the lines and followed by a sub-sampling operation, by a factor of 2, and then carried out on the columns of the sub-sampled image, the resulting image being also down-sampled by 2. Four images, four times smaller than the original one, are thus obtained: a low-frequency sub-image (or "smoothed image"), which includes the major part of the initial content of the concerned original image and therefore represents an approximation of said image, and three high-frequency sub-images, which contain only horizontal, vertical and diagonal details of said original image. This decomposition process continues until it is clear that there is no more useful information to be derived from the last smoothed image.
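The line-then-column filtering described above can be sketched as follows. This is only a minimal illustration: it assumes the Haar pair (average and half-difference of adjacent samples) as the low-pass and high-pass filters, which the text does not specify, and it assumes even image dimensions. All function names are hypothetical.

```python
def haar_1d(signal):
    """Low-pass and high-pass filter a sequence, each output sub-sampled by 2.

    Assumption: Haar filters (average / half-difference of adjacent samples).
    """
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def transpose(matrix):
    """Swap rows and columns of a list-of-lists image."""
    return [list(column) for column in zip(*matrix)]


def filter_columns(image):
    """Apply haar_1d to every column; each result has half as many rows."""
    pairs = [haar_1d(column) for column in transpose(image)]
    low = transpose([pair[0] for pair in pairs])
    high = transpose([pair[1] for pair in pairs])
    return low, high


def decompose_once(image):
    """One decomposition level: filter the lines, then the columns.

    Returns four sub-images, each four times smaller than the input:
    the smoothed image and three detail images.
    """
    filtered_lines = [haar_1d(line) for line in image]
    low_lines = [pair[0] for pair in filtered_lines]
    high_lines = [pair[1] for pair in filtered_lines]
    smoothed, low_high = filter_columns(low_lines)
    high_low, high_high = filter_columns(high_lines)
    return smoothed, low_high, high_low, high_high


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
smoothed, low_high, high_low, high_high = decompose_once(image)
# smoothed is [[3.5, 5.5], [11.5, 13.5]], a half-resolution approximation
```

Applying `decompose_once` again to `smoothed` would give the next, coarser resolution level of the pyramid.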
A computationally rather simple technique for image compression, using a two-dimensional (2D) wavelet decomposition, is described in "A new, fast, and efficient image codec based on set partitioning in hierarchical trees (SPIHT)", by A. Said and W. A. Pearlman, IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, June 1996, pp. 243-250. As explained in said document, the original image is supposed to be defined by a set of pixel values p(x,y), where x and y are the pixel coordinates, and coded by a hierarchical subband transformation, represented by the following formula (1):
$$c(x,y) = \Omega(p(x,y)) \tag{1}$$
where $\Omega$ represents the transformation and each element c(x,y) is called "transform coefficient for the pixel coordinates (x,y)".
The major objective is then to select the most important information to be transmitted first, which leads to ordering these transform coefficients according to their magnitude (coefficients with larger magnitude have a larger content of information and should be transmitted first, or at least their most significant bits). If the ordering information is explicitly transmitted to the decoder, images of rather good quality can be recovered as soon as a relatively small fraction of the pixel coordinates has been transmitted. If the ordering information is not explicitly transmitted, it is then supposed that the execution path of the coding algorithm is defined by the results of comparisons on its branching points, and that the decoder, having the same sorting algorithm, can duplicate this execution path of the encoder if it receives the results of the magnitude comparisons. The ordering information can then be recovered from the execution path.
One important fact in said sorting algorithm is that it is not necessary to sort all coefficients, but only the coefficients such that $2^n \le |c_{x,y}| < 2^{n+1}$, with n decremented in each pass. Given n, if $|c_{x,y}| \ge 2^n$ (n being called the level of significance), it is said that a coefficient is significant; otherwise it is called insignificant. The sorting algorithm divides the set of pixels into partitioning subsets $T_m$ and performs the magnitude test (2):
$$\max_{(x,y) \in T_m} \{ |c_{x,y}| \} \ge 2^n \; ? \tag{2}$$
If the decoder receives a "no" (the whole concerned subset is insignificant), then it knows that all coefficients in this subset $T_m$ are insignificant. If the answer is "yes" (the subset is significant), then a predetermined rule shared by the encoder and the decoder is used to partition $T_m$ into new subsets $T_{m,l}$, the significance test being further applied to these new subsets. This set division process continues until the magnitude test has been applied to all single-coordinate significant subsets, so that each significant coefficient is identified and can be encoded with a binary format.
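The observation that each pass only needs to sort the coefficients whose magnitude lies between two consecutive powers of two can be illustrated with a short sketch. It assumes integer coefficients stored in a dictionary keyed by (x,y), and it assumes that the first level of significance is the largest n such that some coefficient reaches 2^n, as in the cited SPIHT paper; the function name is hypothetical.

```python
def sorting_passes(coeffs):
    """List, for each level n, the coefficients that become significant at n.

    coeffs maps (x, y) to a nonzero-maximum set of integer coefficients.
    Assumption: the first n is floor(log2(max |c|)).
    """
    largest = max(abs(c) for c in coeffs.values())
    n = largest.bit_length() - 1  # floor(log2(largest)) for integers >= 1
    passes = []
    while n >= 0:
        # Only coefficients with 2^n <= |c| < 2^(n+1) are sorted in this pass.
        newly_significant = [xy for xy, c in coeffs.items()
                             if 2 ** n <= abs(c) < 2 ** (n + 1)]
        passes.append((n, newly_significant))
        n -= 1  # n is decremented in each pass
    return passes


coeffs = {(0, 0): 34, (1, 0): -3, (0, 1): 5, (1, 1): -20}
passes = sorting_passes(coeffs)
# passes[0] is (5, [(0, 0)]) and passes[1] is (4, [(1, 1)])
```

Each coefficient appears in exactly one pass, which is why the largest ones are reached first.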
To reduce the number of transmitted magnitude comparisons (i.e. of message bits), one may define a set partitioning rule that uses an expected ordering in the hierarchy defined by the subband pyramid. The objective is to create new partitions such that subsets expected to be insignificant contain a large number of elements, and subsets expected to be significant contain only one element. To make clear the relationship between magnitude comparisons and message bits, the following function is used:
$$S_n(T) = \begin{cases} 1, & \max_{(x,y) \in T} \{ |c_{x,y}| \} \ge 2^n, \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$
to indicate the significance of a subset of coordinates T.
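The significance function and the set division process can be sketched together as below. This is an illustration under stated assumptions, not the partitioning rule of SPIHT itself: the "predetermined rule" is replaced here by a plain halving of the coordinate list, subsets are assumed non-empty, and the function names are hypothetical. Each call to the magnitude test produces exactly one message bit.

```python
def significance(coeffs, subset, n):
    """S_n(T): 1 if the largest |c| in the subset reaches 2^n, else 0."""
    return 1 if max(abs(coeffs[xy]) for xy in subset) >= 2 ** n else 0


def find_significant(coeffs, subset, n, bits):
    """Divide a subset until every significant coefficient is isolated.

    Appends one bit per magnitude test to `bits`, in the order a decoder
    using the same rule would read them.
    Assumption: the shared partitioning rule is a simple split in two halves.
    """
    bit = significance(coeffs, subset, n)
    bits.append(bit)
    if bit == 0:
        return []            # the whole subset is insignificant
    if len(subset) == 1:
        return list(subset)  # a single significant coordinate
    middle = len(subset) // 2
    return (find_significant(coeffs, subset[:middle], n, bits)
            + find_significant(coeffs, subset[middle:], n, bits))


coeffs = {(0, 0): 34, (1, 0): -3, (0, 1): 5, (1, 1): -20}
subset = [(0, 0), (1, 0), (0, 1), (1, 1)]
bits = []
found = find_significant(coeffs, subset, 5, bits)
# found is [(0, 0)] and bits is [1, 1, 1, 0, 0]
```

In this example the pair containing 5 and -20 is dismissed with a single "no" bit, which is the saving the partitioning rule aims for when insignificant subsets are large.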
Furthermore, it has been observed that there is a spatial self-similarity between subbands. The coefficients are expected to be better magnitude-ordered if one moves downward in the pyramid following the same spatial orientation. For instance, if low-activity areas are expected to be identified in the highest levels of the pyramid, then they are replicated in the lower levels at the same spatial locations, but with a higher resolution. A tree structure, called spatial orientation tree, naturally defines the spatial relationship in the hierarchical subband pyramid of the wavelet decomposition. FIG. 1 shows how the spatial orientation tree is defined in a pyramid constructed with recursive four-subband splitting. Each node of the tree corresponds to the pixels of the same spatial orientation in such a way that each node has either no offspring (the leaves) or four offspring, which always form a group of 2×2 adjacent pixels. In FIG. 1, the arrows are oriented from the parent node to its offspring. The pixels in the highest level of the pyramid are the tree roots and are grouped in 2×2 adjacent pixels. However, their offspring branching rule is different, and in each group, one of them (indicated by the dot in FIG. 1) has no descendant.
The following sets of coordinates are used to present this coding method, (x,y) representing the location of the coefficient:
O(x,y): set of coordinates of all offspring of node (x,y);
D(x,y): set of coordinates of all descendants of the node (x,y);
H: set of coordinates of all spatial orientation tree roots (nodes in the highest pyramid level);
L(x,y) = D(x,y) − O(x,y).
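The three coordinate sets attached to a node can be sketched as follows. The sketch assumes the usual offspring rule of the cited SPIHT paper for nodes outside the highest pyramid level, namely that node (x,y) has the four offspring (2x,2y), (2x+1,2y), (2x,2y+1) and (2x+1,2y+1); the different branching rule of the tree roots is not modeled, and only the corner (0,0) is treated as a node without descendants. Function names are hypothetical.

```python
def offspring(x, y, width, height):
    """O(x,y) in a width x height coefficient array.

    Assumption: the 2x2 child group sits at doubled coordinates; leaves
    (finest subbands) return an empty list. The special branching rule of
    the tree roots is not modeled, and (0, 0) is given no offspring.
    """
    if (x, y) == (0, 0) or 2 * x >= width or 2 * y >= height:
        return []
    return [(2 * x, 2 * y), (2 * x + 1, 2 * y),
            (2 * x, 2 * y + 1), (2 * x + 1, 2 * y + 1)]


def descendants(x, y, width, height):
    """D(x,y): offspring, their offspring, and so on down to the leaves."""
    result = []
    frontier = offspring(x, y, width, height)
    while frontier:
        result.extend(frontier)
        frontier = [child
                    for (cx, cy) in frontier
                    for child in offspring(cx, cy, width, height)]
    return result


def indirect_descendants(x, y, width, height):
    """L(x,y) = D(x,y) - O(x,y): descendants that are not direct offspring."""
    direct = set(offspring(x, y, width, height))
    return [xy for xy in descendants(x, y, width, height) if xy not in direct]


# In an 8x8 array, node (1, 0) has 4 offspring and 20 descendants in total.
node_offspring = offspring(1, 0, 8, 8)
node_descendants = descendants(1, 0, 8, 8)
```

The 2×2 group returned by `offspring` corresponds to the same spatial location at the next finer resolution, which is the self-similarity the tree is meant to exploit.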
As it has been observed that the order in which the subsets are tested for significance is important, in a practical implementation the significance information is stored in three ordered lists, called list of insignificant sets (LIS), list of insignificant pixels (LIP), and list of significant pixels (LSP). In all these lists, each entry is identified by coordinates (x,y), which in the LIP and LSP represent individual pixels, and in the LIS represent either the set D(x,y) or L(x,y) (to differentiate between them, an LIS entry is said to be of type A if it represents D(x,y), and of type B if it represents L(x,y)). The SPIHT algorithm is in fact based on the manipulation of the three lists LIS, LIP and LSP.
For the entropy coding module, the arithmetic coding technique is more effective in video compression than Huffman encoding for the following reasons: the obtained code length is very close to the optimal length, the method particularly suits adaptive models (the statistics of the source are estimated on the fly), and it can be split into two independent modules (the modeling one and the coding one). The following description relates mainly to modeling, which involves the determination of certain source-string events and their context, and the way to estimate their related statistics.
The context is intended to capture the redundancies of the entire set of source strings under consideration. In an original video sequence, the value of a pixel indeed depends on those of the pixels surrounding it. After the wavelet decomposition, the same property of "geographic" interdependency holds in each subband. If the coefficients are sent in an order that preserves these dependencies, it is possible to take advantage of the "geographic" information in the framework of universal coding of bounded memory tree sources, as described for instance in the document "A universal finite memory source", by M. J. Weinberger et al., IEEE Transactions on Information Theory, vol. 41, no. 3, May 1995, pp. 643-652. A finite memory tree source has the property that the next symbol probabilities depend on the actual values of the most recent symbols. Binary sequential universal source coding procedures for finite memory tree sources often make use of a context tree which contains, for each string (context), the number of occurrences of zeros and ones given the considered context. This tree makes it possible to estimate the probability of a symbol, given the d previous bits:
$\hat{P}(x_n \mid x_{n-1} \ldots x_{n-d})$, where $x_n$ is the value of the examined bit and $x_{n-1} \ldots x_{n-d}$ represents the context, i.e. the previous sequence of d bits. This estimation turns out to be a difficult task when the number of conditioning events increases because of the context dilution problem or the model cost.
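The counting of zeros and ones per context, and the resulting probability estimate, can be sketched as follows. The text does not say which estimator is applied to the counts; the sketch assumes the Krichevsky-Trofimov estimator, (count + 1/2) / (total + 1), and it gathers the counts over a whole bit sequence at once rather than updating them on the fly as an adaptive coder would. The flat dictionary stands in for the context tree, and the function names are hypothetical.

```python
from collections import defaultdict


def context_counts(bits, d):
    """Count zeros and ones following each context of d previous bits.

    Contexts are stored oldest bit first: (x[n-d], ..., x[n-1]).
    """
    counts = defaultdict(lambda: [0, 0])
    for n in range(d, len(bits)):
        context = tuple(bits[n - d:n])
        counts[context][bits[n]] += 1
    return counts


def estimate(counts, context, bit):
    """Estimated probability of `bit` given the context.

    Assumption: Krichevsky-Trofimov estimator on the stored counts.
    An unseen context yields 0.5 for either bit value.
    """
    zeros, ones = counts.get(tuple(context), (0, 0))
    seen = ones if bit == 1 else zeros
    return (seen + 0.5) / (zeros + ones + 1.0)


bits = [0, 1, 0, 1, 0, 1, 0, 1]
counts = context_counts(bits, d=2)
p = estimate(counts, (0, 1), 0)  # context 0,1 was followed by 0 three times
```

With d = 2 there are only four possible contexts; each extra bit of context doubles that number while the observations are spread more thinly, which is the context dilution problem mentioned above.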
One way to solve this problem is the context-tree weighting method, detailed in "The context-tree weighting method: basic properties", by F. M. J. Willems et al., IEEE Transactions on Information Theory, vol. 41, no. 3, May 1995, pp. 653-664. The principle of this method is to estimate weighted probabilities using the most efficient context for the examined bit. Indeed, sometimes it can be better to use shorter contexts to encode a bit (if the last bits of the context have no influence on the current bit, they might not be taken into account). This technique reduces the length of the final code. The determination of efficient models and contexts is therefore a crucial stage in arithmetic encoding.
The 2D SPIHT algorithm, which mainly consists in comparing a set of pixels corresponding to the same image area at different resolutions to the value previously called "level of significance", is based on a key concept: the prediction of the absence of significant information across scales of the wavelet decomposition by exploiting the self-similarity inherent in natural images. This means that if a coefficient is insignificant at the lowest scale of the wavelet decomposition, the coefficients corresponding to the same area at the other scales are very likely to be insignificant too. Unfortunately, the SPIHT algorithm, which exploits the redundancy between the subbands, "destroys" the dependencies between neighboring pixels inside each subband.
It is therefore a first object of the present invention to improve the scanning order in the SPIHT algorithm in order to re-establish the neighborhood relations between pixels of the same subband.
To this end, the invention relates to an encoding method for the compression of a video sequence including successive frames, each frame being decomposed by means of a two-dimensional (2D) wavelet transform leading to a given number of successive resolution levels corresponding to the decomposition levels of said transform, said encoding method being based on the hierarchical subband encoding process called "set partitioning in hierarchical trees" (SPIHT) and leading from the original set of picture elements of the video sequence to wavelet transform coefficients encoded with a binary format, said coefficients being organized into spatial orientation trees rooted in the lowest frequency, or spatial approximation, subband and completed by an offspring in the higher frequency subbands, the coefficients of said trees being further ordered into partitioning sets corresponding to respective levels of significance and defined by means of magnitude tests leading to a classification of the significance information in three ordered lists called list of insignificant sets (LIS), list of insignificant pixels (LIP) and list of significant pixels (LSP), said tests being carried out in order to divide said original set of picture elements into said partitioning sets according to a division process that continues until each significant coefficient is encoded within said binary representation, said method being further characterized in that it comprises the following steps:
(A) an initialization step, in which, each pixel having coordinates (x,y) varying from 0 to size_x and from 0 to size_y respectively, said list LIS is then initialized with the coefficients of said spatial approximation subband, excepting the coefficient having the coordinates x=0(mod 2) and y=0(mod 2), the initialization order of the LIS being the following:
(a) put in the list all the pixels that verify x=1(mod 2) and y=0(mod 2), for the luminance component Y and then for the chrominance components U and V;
(b) put in the list all the pixels that verify x=1(mod 2) and y=1(mod 2), for Y and then for U and V;
(c) put in the list all the pixels that verify x=0(mod 2) and y=1(mod 2), for Y and then for U and V;
(B) an exploration step, in which the spatial orientation trees defining the spatial relationship in the hierarchical subband pyramid of the wavelet decomposition are explored from the lowest resolution level to the highest one, while keeping neighboring pixels together and taking account of the orientation of the details, said exploration of the offspring coefficients being implemented thanks to a zig-zag scanning order of the offspring coefficients that is shown in FIG. 7, in the case of horizontal and diagonal detail subbands, for a group of four offspring and the passage of said group to the next one in the horizontal direction, in FIG. 8 for a group of four offspring and the passage of said group to the next one in the vertical direction, and in FIGS. 9 and 10 respectively for the lowest resolution level and for the finer resolution levels.
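The LIS initialization order of step (A) can be sketched as follows. Several points are assumptions rather than statements of the text: coordinates are taken to run from 0 to size - 1, pixels within one parity class are visited in raster order, each of the sub-steps (a) to (c) handles Y, then U, then V before moving on, and the approximation subband size is given per component. The function name and the entry format are hypothetical.

```python
def init_lis_2d(sizes):
    """Build the initial LIS for the 2D case.

    sizes maps each component ("Y", "U", "V") to the (size_x, size_y) of its
    spatial approximation subband. Entries are (component, x, y) tuples.
    Assumptions: coordinates run from 0 to size - 1 and each parity class is
    scanned in raster order.
    """
    # (x mod 2, y mod 2) for sub-steps (a), (b) and (c); the class (0, 0) is
    # excluded because those coefficients are not put in the LIS.
    parity_order = [(1, 0), (1, 1), (0, 1)]
    lis = []
    for x_parity, y_parity in parity_order:
        for component in ("Y", "U", "V"):
            size_x, size_y = sizes[component]
            for y in range(size_y):
                for x in range(size_x):
                    if x % 2 == x_parity and y % 2 == y_parity:
                        lis.append((component, x, y))
    return lis


lis = init_lis_2d({"Y": (4, 2), "U": (2, 2), "V": (2, 2)})
# lis starts with ("Y", 1, 0), ("Y", 3, 0), ("U", 1, 0), ("V", 1, 0)
```

Grouping the entries by parity class rather than by 2×2 root group keeps coefficients of the same orientation next to each other in the list.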
It is another object of the invention to implement a similar principle in the case of a 3D SPIHT algorithm.
To this end, the invention relates to an encoding method for the compression of a video sequence including successive groups of frames, each group of frames being decomposed by means of a three-dimensional (3D) wavelet transform leading to a given number of successive resolution levels corresponding to the decomposition levels of said transform, said encoding method being based on the hierarchical subband encoding process called "set partitioning in hierarchical trees" (SPIHT) and leading from the original set of picture elements of the video sequence to wavelet transform coefficients encoded with a binary format, said coefficients being organized into spatio-temporal orientation trees rooted in the lowest frequency, or spatio-temporal approximation, subband and completed by an offspring in the higher frequency subbands, the coefficients of said trees being further ordered into partitioning sets corresponding to respective levels of significance and defined by means of magnitude tests leading to a classification of the significance information in three ordered lists called list of insignificant sets (LIS), list of insignificant pixels (LIP) and list of significant pixels (LSP), said tests being carried out in order to divide said original set of picture elements into said partitioning sets according to a division process that continues until each significant coefficient is encoded within said binary representation, said method being further characterized in that it comprises the following steps:
(A) an initialization step, in which the spatio-temporal approximation subband that results from the 3D wavelet transform contains the spatial approximation subbands of the two frames in the temporal approximation subband, indexed by z=0 and z=1, and, each pixel having coordinates (x,y,z) varying for x and y from 0 to size_x and from 0 to size_y respectively, said list LIS is then initialized with the coefficients of said spatio-temporal approximation subband, excepting the coefficient having the coordinates of the form z=0(mod 2), x=0(mod 2) and y=0(mod 2), the initialization order of the LIS being the following:
(a) put in the list all the pixels that verify x=0(mod 2) and y=0(mod 2) and z=1, for the luminance component Y and then for the chrominance components U and V;
(b) put in the list all the pixels that verify x=1(mod 2) and y=0(mod 2) and z=0, for Y and then for U and V;
(c) put in the list all the pixels that verify x=1(mod 2) and y=1(mod 2) and z=0, for Y and then for U and V;
(d) put in the list all the pixels that verify x=0(mod 2) and y=1(mod 2) and z=0, for Y and then for U and V;
(B) an exploration step, in which the spatio-temporal orientation trees defining the spatio-temporal relationship in the hierarchical subband pyramid of the wavelet decomposition are explored from the lowest resolution level to the highest one, while keeping neighboring pixels together and taking account of the orientation of the details, said exploration of the offspring coefficients being implemented thanks to a scanning order of the offspring coefficients that is shown in FIG. 7, in the case of horizontal and diagonal detail subbands, for a group of four offspring and the passage of said group to the next one in the horizontal direction, in FIG. 8 for a group of four offspring and the passage of said group to the next one in the vertical direction, and in FIGS. 9 and 10 respectively for the lowest resolution level and for the finer resolution levels.
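The LIS initialization order of the 3D case can be sketched in the same spirit. The sketch follows sub-steps (a) to (d) literally as listed, with the temporal index z fixed by each sub-step. As assumptions not stated in the text, coordinates are taken to run from 0 to size - 1, pixels of one class are visited in raster order, and each sub-step handles Y, then U, then V. The function name and the entry format are hypothetical.

```python
def init_lis_3d(sizes):
    """Build the initial LIS for the 3D case.

    sizes maps each component ("Y", "U", "V") to the (size_x, size_y) of the
    spatial approximation subband of one frame. Entries are
    (component, x, y, z) tuples, with z = 0 or 1 indexing the two frames of
    the temporal approximation subband.
    Assumptions: coordinates run from 0 to size - 1, raster scan per class.
    """
    # (x mod 2, y mod 2, z) for sub-steps (a), (b), (c) and (d).
    step_order = [(0, 0, 1), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
    lis = []
    for x_parity, y_parity, z in step_order:
        for component in ("Y", "U", "V"):
            size_x, size_y = sizes[component]
            for y in range(size_y):
                for x in range(size_x):
                    if x % 2 == x_parity and y % 2 == y_parity:
                        lis.append((component, x, y, z))
    return lis


lis = init_lis_3d({"Y": (4, 2), "U": (2, 2), "V": (2, 2)})
# lis starts with ("Y", 0, 0, 1), ("Y", 2, 0, 1), ("U", 0, 0, 1), ("V", 0, 0, 1)
```

Compared with the 2D case, the only new class is the one of sub-step (a), which brings in the second frame of the temporal approximation subband.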
The initialization of the LIS plays an important role in the progress of the algorithm. A special organization of this list, a particular scan of the offspring coefficients and slight modifications of the original algorithm make it possible to explore the trees in depth while keeping neighboring pixels together and taking account of the orientation of the details.