1. Field of the Invention
The invention relates to methods of encoding and decoding a bit stream comprising a representation of a sequence of n-dimensional data structures or matrices, in which n is typically 2. The invention is particularly relevant to in-band motion estimation/motion compensation of video images.
2. Description of the Related Art
Wavelet-based coding has been generally accepted as the most efficient technique for still-picture compression. Wavelet transform schemes are described in detail in “Wavelets and Subbands”, by Abbate, DeCusatis and Das, Birkhäuser Press, 2002. The insertion of discrete wavelet transforms (DWT) in the JPEG-2000 coding standard led to increased coding efficiency in comparison to previous standards in this area, and additionally provides a number of interesting features, including quality and resolution scalability, which stem from the multiresolution nature of the transform. In the video-coding arena, although such scalability features (along with temporal scalability) are highly desirable in a number of applications (such as video streaming and multimedia over networks), wavelets are at present employed only for texture coding in the MPEG-4 standard. To address scalability, the MPEG-4 standard adopts the multiresolution DCT approach within a hybrid coding structure, which performs relatively poorly in the complexity-versus-coding-efficiency sense in comparison to wavelets. For these reasons, many authors have begun to explore wavelet-based scalable video-coding schemes. Until recently, research efforts were mainly directed towards the use of 3-D wavelet decompositions of each input group of frames (GOF) in order to remove the spatial and temporal redundancies in the video stream. This work was pioneered mainly by Karlsson and Vetterli [1], Lewis and Knowles [2], and more recently by Ohm [3] and Taubman and Zakhor [4], who introduced 3-D decompositions coupled with motion estimation (ME) and motion compensation (MC). More recent algorithms proposed by Kim, Xiong and Pearlman [5] and Bottreau et al. [6] support all types of scalability (spatial, quality and temporal) by using 3-D versions of the SPIHT algorithm [7] and hierarchical spatial-domain techniques for block-based ME and MC.
A wavelet decomposition using a short filter-pair such as the Haar transform is performed in the temporal direction to remove the redundancies between successive residual frames. Furthermore, a 2-D wavelet decomposition of the motion-compensated sequence (i.e. the residual frames) is performed to reduce spatial redundancies and to compact the energy into the lower-frequency subbands (using classical filters from still-image coding, such as the 9/7 filter-pair). Quality scalability can be obtained with this type of algorithm by coding the three-dimensional transform-domain coefficients using the 3-D extensions [5] of the classical 2-D embedded zerotree-based [7] or block-based wavelet image coders [8][9]. Spatial scalability can be achieved only if the motion compensation is performed in a level-by-level manner. In addition, temporal scalability is inherent to such schemes, since in a multilevel temporal decomposition each resolution reconstructs to a dyadically-reduced frame-rate for the decoded sequence. In conclusion, these schemes algorithmically satisfy the scalability requirements, and moreover they provide good coding performance. Nevertheless, their limitation comes from the implementation point of view, because they require a large memory budget to apply the 3-D transform to each GOF, and they distribute the computational load almost equally between the encoder and the decoder, thus making the decoder implementation relatively complex. In addition, the overall codec delay is increased, since the decoder can receive compressed data only after the encoder has completed the full 3-D transform of the current GOF. Thus, such schemes are ill-suited to bi-directional communications and to applications where power dissipation and memory are major cost issues, i.e. portable systems.
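The temporal filtering step described above can be sketched minimally as follows (pure Python; a single decomposition level, the Haar filter-pair, and frames represented as flat lists of pixel values are simplifying assumptions, and motion compensation is omitted):

```python
import math

def temporal_haar(gof):
    """One level of temporal Haar filtering over a group of frames (GOF).

    Frames are processed in pairs: each pair yields a low-pass
    (scaled average) and a high-pass (scaled difference) temporal
    subband frame, computed pixel by pixel.
    """
    lows, highs = [], []
    for f0, f1 in zip(gof[0::2], gof[1::2]):
        lows.append([(a + b) / math.sqrt(2) for a, b in zip(f0, f1)])
        highs.append([(a - b) / math.sqrt(2) for a, b in zip(f0, f1)])
    return lows, highs

# Two identical frames: the high-pass subband is zero, i.e. all the
# temporal redundancy is compacted into the low-pass subband, which
# would then be decomposed spatially by a 2-D DWT.
gof = [[10, 20, 30, 40], [10, 20, 30, 40]]
lows, highs = temporal_haar(gof)
```

In a full scheme each low-pass output would be filtered again at the next temporal level, which is why every level of the decomposition reconstructs to a halved frame-rate.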
Other approaches to scalable wavelet video coding, which try to reduce the implementation complexity and the system delay, follow the classical MPEG-like hybrid coding structure, in which the ME/MC is performed in the spatial domain and the DCT is replaced with a wavelet transform. Typical examples of such systems are described in [10] and [11]. Although scalability in quality can be achieved by embedded wavelet coding [7][8][9], the main drawback of such techniques is that they fail to take advantage of the inherent multiresolution structure of the wavelet transform to provide drift-free spatial scalability. In addition, there is an inverse transform in the coding loop, resulting in two transform applications (one forward and one inverse) per frame and per spatial resolution. This may also lead to large codec delays, since no parallelism is possible and each wavelet transform is applied to the complete frame. More recent research efforts tie the classic hybrid coding structure to motion estimation and compensation techniques in the wavelet domain, leading to the so-called in-band ME/MC class of wavelet video codecs [12][13][14][16]. This class of codecs presents a conceptually more appealing approach, since the multiresolution features of the transform can be used to provide inherent spatial and quality scalability similar to that of wavelet-based still-image coding. Hence, if motion compensation is performed in a level-by-level manner in the wavelet subbands, a decoder can decode a video sequence without drift at half or one quarter of the horizontal and vertical frame dimensions, since the same information as at the encoder is utilized. In addition, the complexity is reduced, since the inverse wavelet transform is removed from the coding loop.
However, a major bottleneck for this approach is that the classical dyadic wavelet decomposition (also known as the critically-sampled representation) is only periodically shift-invariant [14][16][17], with a period that corresponds to the subsampling factor of the specific decomposition level. Hence, accurate motion estimation is not feasible using only the critically-sampled pyramid. Extensive research effort has been spent in recent years on overcoming the shift-variance problem of the critically-sampled wavelet transform. One common alternative is to use near shift-invariant wavelet transforms, and many such transforms can be found in the literature. However, their main limitation stems from the fact that they all imply some degree of redundancy in comparison to the critically-sampled decomposition [18][19]. An example of a video-coding scheme that utilizes such a near shift-invariant transform, namely the complex wavelet transform of Kingsbury [18], is presented in [15]. The redundancy factor for this transform is four. Although the coding results obtained with this technique seem promising, the main disadvantage is that, after performing in-band motion estimation/motion compensation (ME/MC), the error frames contain four times more wavelet coefficients than the input frame has samples. As a consequence, the error-frame coding tends to be inefficient, and more complex error-frame coding algorithms must be envisaged to improve the coding performance. In this respect, it is important to note that there is a trade-off between critical sampling, which implies efficient error-frame coding, and redundancy of the transform, which implies near shift-invariance. A completely different solution that breaks this trade-off, that is, one that overcomes the shift-variance problem of the DWT while still producing critically-sampled error frames, is the low-band shift method (LBS), introduced theoretically in [16] and used for in-band ME/MC in [14].
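The periodic shift-invariance described above can be illustrated with a minimal 1-D sketch (pure Python; the Haar filter-pair and circular boundary handling are simplifying assumptions made for illustration):

```python
import math

def haar_analysis(signal, shift=0):
    """Single-level critically-sampled Haar DWT of a circularly
    shifted copy of the input (filtering followed by decimation by 2)."""
    s = signal[shift:] + signal[:shift]      # integer circular shift
    low = [(s[i] + s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
    high = [(s[i] - s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
    return low, high

x = [1, 4, 9, 16, 25, 36, 49, 64]
low0, high0 = haar_analysis(x)       # zero shift
low1, high1 = haar_analysis(x, 1)    # unit shift: entirely new coefficients
low2, high2 = haar_analysis(x, 2)    # shift by the level-1 subsampling
                                     # factor (2): subbands merely rotate
```

A one-sample shift of the input produces completely different subband coefficients, whereas a two-sample shift (the subsampling factor of the first decomposition level) only shifts the subbands; this is why block matching restricted to the critically-sampled subbands cannot track odd displacements.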
First, this algorithm spatially reconstructs each reference frame by performing the inverse DWT. Subsequently, the LBS method is employed to produce the corresponding overcomplete wavelet representation, which is then used to perform in-band ME and MC, since this representation is shift-invariant. Basically, the overcomplete wavelet decomposition is produced for each reference frame by performing the “classical” DWT, followed by a unit shift of the low-frequency subband of every level and an additional decomposition of the shifted subband. Hence, the LBS method effectively retains separately the even and odd polyphase components of the undecimated wavelet decomposition [17]. The “classical” DWT (i.e. the critically-sampled transform) can be seen as merely a subset of this overcomplete pyramid, namely the subset corresponding to a zero shift of each produced low-frequency subband, or equivalently to the even-polyphase components of each level's undecimated decomposition. The motion vectors can be detected by searching directly in the overcomplete wavelet representation of the reference frame for the best match to the subband information present in the critically-sampled transform of the current frame. The motion compensation for the current frame is then performed directly on its critically-sampled decomposition. Hence, the produced error frames are still critically sampled. In comparison to image-domain ME/MC, the in-band ME/MC results of [14] demonstrate competitive coding performance, especially at high coding rates.
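The LBS principle can be sketched for a single 1-D decomposition level (pure Python; the Haar filter-pair, circular shifts, and the function names are illustrative assumptions; note that at the first level the low band to be shifted is the input signal itself):

```python
import math

def haar_subbands(s):
    """Critically-sampled single-level Haar DWT: filtering followed by
    decimation by 2, i.e. the even-polyphase subband samples."""
    low = [(s[i] + s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
    high = [(s[i] - s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
    return low, high

def low_band_shift(s):
    """Single-level low-band shift (LBS) sketch: the ordinary DWT gives
    the even-polyphase subbands; repeating it on the unit-shifted band
    gives the odd-polyphase subbands. Together they form the
    shift-invariant undecimated decomposition of this level."""
    even = haar_subbands(s)
    odd = haar_subbands(s[1:] + s[:1])       # unit circular shift
    return even, odd

x = [3, 1, 4, 1, 5, 9, 2, 6]
(even_low, even_high), (odd_low, odd_high) = low_band_shift(x)
# Interleaving the even- and odd-polyphase subbands reproduces the
# undecimated (overcomplete) low-pass band of this level.
undecimated_low = [v for pair in zip(even_low, odd_low) for v in pair]
```

In the multilevel case the same shift-and-decompose step would be repeated on the low-frequency subband of every level; motion estimation then searches this overcomplete pyramid, while the current frame and the resulting error frame stay in the critically-sampled (even-polyphase) subset.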