In most existing video compression systems and standards such as MPEG-2 and JVT/H.264/MPEG AVC, encoders and decoders mainly rely on intra-coding and inter-coding in order to achieve compression. In intra-coding, spatial prediction methods are used, while for inter-coding compression is achieved by exploiting the temporal correlation that may exist between pictures.
More specifically, previously encoded/decoded pictures are used as references for future pictures, while motion estimation and compensation are employed in order to compensate for any motion activity between these pictures. FIG. 1A illustrates motion compensation in P pictures (frames), while FIG. 1B illustrates motion compensation in B pictures (frames). More advanced codecs such as H.264 also consider lighting variations (e.g., during fade in/out) in order to generate a more accurate prediction if necessary. Finally, deblocking methods may also be used in an effort to reduce blocking artifacts created through the prediction and quantization processes.
Fractional sample interpolation is one of the techniques employed to further enhance the quality of motion compensated prediction, since it allows for a more precise representation of motion. Instead of using the actual samples of a reference, a filtering mechanism is employed where the samples within a reference are first filtered (interpolated) using a previously defined filter. FIG. 2 illustrates integer samples (shaded blocks with upper-case letters) and fractional sample positions (un-shaded blocks with lower-case letters) for quarter sample luma interpolation. Due to the non-ideal nature of the low-pass filters used during the image acquisition process, aliasing can be generated which can deteriorate the interpolation and the motion compensated prediction.
Most video coding architectures and coding standards, such as MPEG-1/2, H.263 and H.264 (or JVT or MPEG-4 AVC), employ fractional sample motion compensation to further improve the efficiency of motion compensated prediction. Older standards are primarily based on bilinear interpolation strategies for the generation of the fractional sample positions. In an attempt to reduce aliasing, the H.264 standard (or JVT or MPEG-4 AVC) uses a 6-tap Wiener interpolation filter, with filter coefficients (1, −5, 20, 20, −5, 1)/32, during the interpolation process down to a ¼ fractional sample position. FIG. 3 illustrates the interpolation process in H.264. Referring to FIG. 3, a non-adaptive 6-tap filter is used to generate sample values at ½ fractional sample positions. Then, a non-adaptive bilinear filter filters the samples at the ½ fractional positions to generate sample values at ¼ fractional sample positions. More specifically, for luma, given the samples ‘A’ to ‘U’ at full-sample locations (xAL, yAL) to (xUL, yUL), the samples ‘a’ to ‘s’ at fractional sample positions need to be derived. This is done by first computing the prediction values at half sample positions (‘aa’ to ‘hh’, and ‘b’, ‘h’, ‘j’, ‘m’, and ‘s’) by applying the filter mentioned above. Afterwards, the prediction values at quarter sample positions are derived by averaging samples at full and half sample positions. For chroma, on the other hand, bilinear interpolation down to ⅛th sample positions is used. However, different video signals may have different non-stationary statistical properties (e.g., aliasing, texture, and motion), and therefore the use of fixed filters may still be insufficient.
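The two-stage luma interpolation described above can be illustrated with a minimal one-dimensional sketch. It assumes 8-bit samples, a single row of full-sample values, and edge replication at the row boundaries. The function names (`half_sample`, `quarter_sample`, `clip8`) are hypothetical and chosen only for illustration. Only the filter coefficients (1, −5, 20, 20, −5, 1)/32 and the averaging step come from the description.

```python
# Fixed 6-tap filter coefficients from the text; they sum to 32.
TAPS = (1, -5, 20, 20, -5, 1)


def clip8(value):
    """Clamp a value to the 8-bit sample range (assumption: 8-bit video)."""
    return max(0, min(255, value))


def half_sample(row, i):
    """Half-sample value between row[i] and row[i + 1].

    The six taps cover row[i - 2] .. row[i + 3]; indices outside the row
    are replicated from the nearest edge (an assumption of this sketch).
    """
    last = len(row) - 1
    acc = 0
    for k, tap in enumerate(TAPS):
        idx = min(max(i - 2 + k, 0), last)
        acc += tap * row[idx]
    # Divide by 32 with rounding, then clamp to the sample range.
    return clip8((acc + 16) >> 5)


def quarter_sample(first, second):
    """Quarter-sample value: rounded average of two neighbouring values,
    for example a full sample and the adjacent half sample."""
    return (first + second + 1) >> 1


row = [0, 10, 20, 30, 40, 50, 60, 70]
b = half_sample(row, 3)          # half-sample position between 30 and 40
a = quarter_sample(row[3], b)    # quarter-sample position between 30 and b
```

In a two-dimensional implementation the same filter would be applied along rows and columns. The sketch shows only the horizontal case.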
Adaptive fractional sample interpolation schemes have been discussed that allow better consideration of aliasing during the interpolation process. Instead of a fixed 6-tap filter such as the one used by H.264, additional side information is transmitted for every frame which represents the coefficients of the filter that will be used during interpolation. More specifically, an adaptive filter of the form {a1, a2, a3, a3, a2, a1} can be used to generate all ½ sample positions, followed by bilinear interpolation for the generation of ¼ samples. Considering the symmetric nature of the above filter, only 3 coefficients (a1, a2, and a3) need to be encoded. This method could be easily extended to allow longer or shorter tap filters.
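As a rough illustration of the symmetric form {a1, a2, a3, a3, a2, a1}, the following sketch expands the three transmitted coefficients into six taps and applies them to six neighbouring full samples. It assumes integer coefficients whose taps have a positive sum, and it normalises by that sum. The names `expand_symmetric` and `interpolate_half` are hypothetical, and the normalisation rule is an assumption of the sketch rather than part of the described scheme.

```python
def expand_symmetric(a1, a2, a3):
    """Build the 6-tap symmetric filter from its 3 unique coefficients."""
    return (a1, a2, a3, a3, a2, a1)


def interpolate_half(samples, coeffs):
    """Apply the symmetric filter to six consecutive full samples.

    samples: six full-sample values centred on the half-sample position.
    coeffs:  (a1, a2, a3), the only values that would need to be signalled.
    Assumes integer taps with a positive sum (assumption of this sketch).
    """
    taps = expand_symmetric(*coeffs)
    norm = sum(taps)
    acc = sum(tap * s for tap, s in zip(taps, samples))
    # Normalise by the tap sum with rounding.
    return (acc + norm // 2) // norm


window = [10, 20, 30, 40, 50, 60]
fixed = interpolate_half(window, (1, -5, 20))   # same taps as the H.264 filter
bilinear = interpolate_half(window, (0, 0, 1))  # degenerates to a 2-tap average
```

With the coefficients (1, −5, 20) the sketch reproduces the fixed filter described earlier, so the fixed filter can be viewed as one point in the adaptive filter's parameter space.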
In another prior technique, instead of coding the coefficients of the filters explicitly, a codebook of filters is generated based on a typical distribution of filter coefficients. This could provide both a decrease in complexity at the encoder (only a given set of coefficients may need to be tested, although one may argue that an a priori decision could also be used to determine an appropriate range of filtering coefficients) and, most importantly, a somewhat more compact representation of the filtering coefficients: instead of requiring 3*12 bits to represent the filtering coefficients, one now needs only N bits to represent up to 2^N different filters, assuming that all filters have equal probability. Additional considerations may be made by considering different filters at different ½ or ¼ sample positions, which can essentially be seen as an adaptation of the interpolation filter using the sample position as the indicator.
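The bit savings of the codebook approach can be checked with a small calculation. The sketch below assumes fixed-length coding of the codebook index (consistent with the equal-probability assumption above) and uses the figure of 12 bits per coefficient from the text. The function names are hypothetical.

```python
def explicit_bits(num_coeffs, bits_per_coeff):
    """Bits needed to send every unique coefficient explicitly."""
    return num_coeffs * bits_per_coeff


def index_bits(num_filters):
    """Bits needed for a fixed-length index into a codebook of filters.

    Equals ceil(log2(num_filters)); a single-entry codebook needs 0 bits.
    """
    return (num_filters - 1).bit_length()


explicit = explicit_bits(3, 12)   # 3 coefficients at 12 bits each
indexed = index_bits(16)          # codebook of 2^4 filters
saving = explicit - indexed       # bits saved per signalled filter
```

For a codebook of 16 filters this gives a 4-bit index in place of 36 bits of explicit coefficients. The codebook restricts the encoder to the filters it contains, which is the trade-off for the smaller overhead.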
Apart from frame/global based filter adaptation, the possibility of adapting filtering parameters at the block level has been discussed. In one prior technique, for each block, a 4-tap filter is used and transmitted during encoding. Although this method could improve the motion compensated prediction signal, the improvement could not justify the significant increase in bit overhead due to the additional transmission of the filters. It was also noted that little correlation is seen between interpolation filters of adjacent blocks. Therefore, this method appears to be impractical and inefficient. However, a Macroblock (MB) based interpolation method may be used in which only a predefined set of interpolation filters is signaled and considered. A decision of whether to transmit and use these interpolation filters is made at the picture level.
Some global based interpolation methods do not consider the local characteristics of the signal and therefore their performance might be limited. Furthermore, no proper consideration is made in the presence of multiple references such as in bi-prediction. In one prior technique, the interpolation filter for each reference is essentially signaled only once per reference, and the same interpolation is used for subsequent pictures that reference this picture. However, one may argue that for every coded frame, different interpolation filters may be required for all its references, since characteristics of motion, texture, etc., and the relationship between references may be changing in time. For example, assume that a transformation Pn=fn,k(Pk) is required to generate a picture Pn from its reference Pk. Pk, on the other hand, may have a relationship Pk=fk,j(Pj) with a reference Pj, which implies that the use of fk,j() when referencing Pj may not be appropriate. Furthermore, no consideration is made for bi-predicted partitions, for which the uni-prediction interpolation filters might not be appropriate.
On the other hand, block based methods may suffer from either considerably increased overhead for the encoding of the interpolation filters, or lack of flexibility in terms of the filters used. Again, no consideration of bi-prediction is made.
Thus, adaptive interpolation schemes were recently proposed that try to take into account such properties and adapt the interpolation filters for every frame. Such schemes essentially require the transmission of the filtering parameters used for every frame, and an estimation process for such parameters is also necessary. Unfortunately, the methods presented do not present a best mode of operation for such techniques, resulting in increased overhead and therefore reduced performance. Furthermore, adaptation is essentially performed at a frame (global) level, and no local characteristics are considered.