For processing digital imagery into scalable bitstreams, the discrete wavelet transform (DWT) uses multi-resolution analysis to decompose an image into a set of subbands, each containing specific image information relevant to a given resolution. For example, a lowpass subband at a particular decomposition level may appear as a reduced version of the original image, while the detail subbands contain horizontal, vertical, and diagonal information related to local texture and edges at that resolution. Wavelets can yield a signal representation in which lowpass coefficients represent the most slowly changing data while highpass coefficients represent faster, more localized changes. Thus, DWT provides a schema in which short-term changes and long-term trends can be analyzed, compared, and processed on equal footing. Because of the ability of DWT to support spatial scalability, more recent compression standards have begun to adopt DWT as the spatial energy compaction tool instead of the discrete cosine transform (DCT).
Conventionally, many implementations of DWT employ a filter bank consisting of a pair of complementary 1-dimensional (1D) highpass/lowpass filters followed by a subsampling operation. In the conventional case of the 2-dimensional (2D) horizontal and vertical dimensions of a video frame, identical 1D filter banks are applied, first along each image row and then along each image column, producing four subbands (referred to as LL, HL, LH, and HH). In one setup, for an n-level transformation, the 2D filter bank is recursively applied “n” times to the LL subband obtained at each level. The four subbands, LL, HL, LH, and HH, are designated with “L” or “H” according to whether the lowpass filter (L) or the highpass filter (H) is applied horizontally (the first letter) and vertically (the second letter).
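As a rough sketch of how such a separable 2D analysis stage works, the following Python fragment applies a 1D lowpass/highpass pair along rows and then columns to produce the four subbands. The simple Haar averaging/differencing filters stand in for a codec's actual filter pair, and names such as dwt_1d and dwt_2d are illustrative only.

```python
# Sketch of one separable 2D DWT stage. The Haar averaging/differencing
# pair stands in for the codec's actual lowpass/highpass filters.

def dwt_1d(signal):
    """One level of 1D analysis: filter each adjacent pair, subsample by 2."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high

def dwt_2d(image):
    """Apply the 1D filter bank along rows, then columns: LL, HL, LH, HH."""
    row_low, row_high = [], []
    for row in image:                     # horizontal pass (first letter)
        lo, hi = dwt_1d(row)
        row_low.append(lo)
        row_high.append(hi)

    def column_pass(half):                # vertical pass (second letter)
        lows, highs = [], []
        for col in zip(*half):            # transpose to walk columns
            lo, hi = dwt_1d(list(col))
            lows.append(lo)
            highs.append(hi)
        return list(zip(*lows)), list(zip(*highs))  # transpose back

    LL, LH = column_pass(row_low)         # lowpass applied horizontally
    HL, HH = column_pass(row_high)        # highpass applied horizontally
    return LL, HL, LH, HH
```

An n-level transform would simply call dwt_2d again on the LL output obtained at each level.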
The lowpass (LL) information is often used as the basis for motion prediction since much of the signal energy is concentrated in this subband, and it is often the first to be sent in progressive transmission schemata so that LL is available for deriving the other bands at a decoder. The LH and HL subbands contain a majority of the highpass energy. These subbands have frequency responses that overlap with the LL band over a wide range of frequencies. The aliasing introduced by decimation in the wavelet decomposition precludes direct band-to-band motion estimation between the highpass subbands of neighboring video frames. Thus, to avoid this aliasing effect, lowpass subbands (e.g., LL) are relied upon for motion estimation in the wavelet domain.
A lifting schema is an alternative way to compute the DWT. Lifting schemata usually replace the lowpass/highpass filter pair with a “ladder” of dual lifting steps: “prediction” steps using a prediction operator P( ) and “update” steps using an update operator U( ). At the end of the ladder procedure, a scaling step is applied to obtain the lowpass and highpass subbands. This lifting technique using a ladder procedure provides several benefits over conventional filter banks. For example, it may reduce computations and allow more efficient filter management. Lifting-based wavelet transforms may use the 9/7 wavelet base, which provides lossy compression, or the 5/3 wavelet base, which can be used as an “integer wavelet transform” for lossless coding.
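The dual lifting steps of the 5/3 integer transform can be sketched as follows. This is a minimal illustration, assuming an even-length integer signal and symmetric boundary extension; the function names are hypothetical and the scaling step is omitted, as it is for the reversible 5/3 base.

```python
# Minimal sketch of the 5/3 integer lifting ladder: a predict step forms
# the highpass band d, an update step forms the lowpass band s. Assumes an
# even-length integer signal and symmetric boundary extension.

def lift53_forward(x):
    """Predict then update; returns (lowpass s, highpass d), all integers."""
    n, h = len(x), len(x) // 2
    mirror = lambda i: i if i < n else 2 * (n - 1) - i   # symmetric extension
    # Predict: each odd sample minus the average of its even neighbors.
    d = [x[2 * i + 1] - (x[2 * i] + x[mirror(2 * i + 2)]) // 2
         for i in range(h)]
    # Update: lift each even sample with the neighboring highpass values.
    s = [x[2 * i] + (d[max(i - 1, 0)] + d[i] + 2) // 4 for i in range(h)]
    return s, d

def lift53_inverse(s, d):
    """Undo the ladder in reverse order; reconstruction is exact."""
    h = len(s)
    n = 2 * h
    mirror = lambda i: i if i < n else 2 * (n - 1) - i
    even = [s[i] - (d[max(i - 1, 0)] + d[i] + 2) // 4 for i in range(h)]
    x = [0] * n
    for i in range(h):
        x[2 * i] = even[i]
        x[2 * i + 1] = d[i] + (even[i] + even[mirror(2 * i + 2) // 2]) // 2
    return x
```

Because every step adds or subtracts an integer quantity that the inverse can recompute exactly, the transform is invertible without rounding loss, which is what makes the 5/3 base suitable for lossless coding.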
In-band motion compensated temporal filtering (IBMCTF or just “in-band MCTF”) is based on the extension of a conventional MCTF concept into the wavelet domain. In 3-dimensional (3D) wavelet coding, the entire video sequence is decomposed into many temporal-spatial subbands through a number of motion aligned temporal transforms and spatial transforms. These subbands are assumed to be independent and some of them can be dropped when some type of resolution scalability is demanded. For example, to support spatial scalability, the spatial high-pass subbands are usually dropped and the decoder just carries out the decoding process with only the received data that is in spatial lowpass subbands, e.g., the LL subband.
In the in-band MCTF schema, the original video is first spatially decomposed and then the MCTF is carried out in the wavelet domain, possibly with subsequent further spatial decompositions. In-band MCTF allows adaptive processing for each subband, that is, each subband can have a different motion estimation accuracy, different interpolation filters, different temporal filter taps, etc. Thus, in-band MCTF is gaining popularity because it is a general and flexible coding framework that directly supports and offers advantages for spatial scalability as compared with spatial domain MCTF schemata.
Conventionally, for a Common Intermediate Format (CIF) video sequence, if one-level spatial scalability is demanded at the decoder, the encoder only has to include the content of the spatial LL band in the bitstream being encoded. The content of the LH, HL, and HH subbands can be dropped to meet bandwidth characteristics or limitations. However, to reduce the effect of wavelet shift-variance on the efficiency of motion estimation and motion compensation in the wavelet domain, a “low-band shift method” (LBS) was developed to perform the motion estimation and motion compensation more efficiently with an “overcomplete” form of the reference band (Hyun-Wook Park, Hyung-Sun Kim, “Motion Estimation Using Low-Band-Shift Method for Wavelet-Based Moving-Picture Coding,” IEEE Trans. on Image Processing, vol. 9, no. 4, pp. 577-587, April 2000). This LBS method allows wavelet domain motion estimation and motion compensation using shift-invariant overcomplete wavelets. Overcomplete lowpass (LL) band information is thus distinguishable from “ordinary” spatial lowpass (LL) band information.
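The shift-variance that motivates LBS, and the overcomplete lowpass representation that addresses it, can be illustrated in 1D. The sketch below is a simplifying assumption throughout: a Haar lowpass filter, a circular one-sample shift, and illustrative function names, not the actual LBS algorithm of the cited paper.

```python
# 1D sketch of the low-band-shift idea: because the lowpass band is
# decimated, its samples depend on which sampling phase is kept, so an
# overcomplete representation retains the lowpass subband of BOTH phases.

def haar_low(x):
    """Lowpass filter and subsample by 2 (pairwise averages)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def overcomplete_low(x):
    """Lowpass subbands of the zero-shift and one-sample-shift phases."""
    phase0 = haar_low(x)                  # "ordinary" LL phase
    phase1 = haar_low(x[1:] + x[:1])      # phase after a one-sample shift
    return phase0, phase1
```

For x = [0, 1, ..., 7] the two phases come out as [0.5, 2.5, 4.5, 6.5] and [1.5, 3.5, 5.5, 3.5]: a one-sample shift of the input does not merely shift the lowpass band. This is the shift-variance that defeats direct subband motion estimation and that the overcomplete form works around.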
As shown in FIG. 1, problems can arise when an LBS reference frame, denoted as IP_LBS 100, is used with in-band MCTF for a bitstream that is to provide a low resolution mode within spatial scalability. Even though MCTF that is based on LBS can remarkably improve coding efficiency in the wavelet domain, some of the spatial high band information that the LBS schema includes when coding the low band information into the overcomplete LL band 102 used at the encoder 104 cannot be obtained at the decoder 106 when the decoder 106 executes a low spatial resolution display. That is, in some cases only reference frames based on ordinary LL band information 108 may be obtainable at the decoder 106.
For example, assume that the original video sequence is CIF video and one-level spatial scalability is demanded at the decoder 106. In the case of a quarter-pixel mode of motion estimation 110 and motion compensation 112, the interpolation reference frame, IP_LBS 100, is obtained at the encoder 104 by half-pixel interpolation of each band in the corresponding overcomplete subband of the original video. At the decoder 106, when decoding the lower resolution QCIF (quarter CIF) video sequence, only the ordinary spatial LL band 108 (i.e., the spatial lowpass band, which represents the low-resolution video signal) can be obtained. Instead of half-pixel interpolation as at the encoder 104, direct quarter-pixel interpolation is applied to this spatial LL band 108 at the decoder 106 to generate the reference frame, in this case denoted by IP_DIR 114. Because of the mismatch of interpolation reference frames between encoder 104 and decoder 106, the well-known phenomenon of drifting error will occur when decoding at the lower resolution when IP_LBS 100 is used as the reference for the LL band. However, since IP_LBS 100 contains more information from the original video frames, including low-pass information and high-pass information, IP_LBS 100 is inherently a better overall reference than IP_DIR 114.
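The interpolation mismatch can be made concrete with a toy 1D example. Everything below is an illustrative assumption rather than the actual CIF pipeline: the sample values, the Haar lowpass, the circular shift, and plain linear interpolation in place of the codec's real interpolation filters.

```python
# Toy 1D illustration of the encoder/decoder reference mismatch that
# causes drifting error at the lower resolution.

def haar_low(x):
    """Lowpass filter and subsample by 2 (pairwise averages)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

x = [1, 9, 2, 8, 1, 7, 4, 6]           # toy full-resolution row
phase0 = haar_low(x)                   # ordinary spatial LL band
phase1 = haar_low(x[1:] + x[:1])       # shifted phase retained by LBS

# Encoder (IP_LBS-like): the half-sample position between phase0[0] and
# phase0[1] is covered by a true sample from the shifted phase.
enc_ref = phase1[0]                    # 5.5

# Decoder (IP_DIR-like): only the ordinary LL band arrived, so the same
# position must be filled by interpolating the decimated band.
dec_ref = (phase0[0] + phase0[1]) / 2  # 5.0

# enc_ref != dec_ref: the prediction references diverge, so residuals
# coded against IP_LBS no longer match the decoder's reference -> drift.
```

The two references agree only where the signal has no highpass content; wherever they differ, the decoded low-resolution frames accumulate error relative to the encoder's prediction loop.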
In FIG. 2, another technique is adopted in an attempt to resolve the drifting error problem just described. The encoder 104 uses MCTF with only the ordinary spatial lowpass band information 108. This technique, however, incurs a loss of coding performance when the full spatial resolution sequence is decoded. This is because the ordinary spatial lowpass band 108, by itself, does not have all of the high band information 102 that the LL band includes when IP_LBS 100 is used as the reference for the LL band.