Typical video codecs are based on motion compensated prediction and prediction error coding. Motion compensated prediction is obtained by analyzing and coding motion between video frames and reconstructing image segments using the motion information. Prediction error coding is used to code the difference between motion compensated image segments and corresponding segments in the original image. The accuracy of prediction error coding can be adjusted depending on the available bandwidth and the required quality of the coded video. In a typical Discrete Cosine Transform (DCT) based system this is done by varying the quantizer parameter (QP) used in quantizing the DCT coefficients to a specific accuracy.
Coding systems, in general, provide a set of parameters to represent the coded signals. These parameters are entropy coded and sent to a decoder for decoding and reconstruction of the coded signal. To improve the compression performance of the entropy coder, the parameters are often predicted from information available to both the encoder and the decoder. By doing this, the entropy coder needs to code only the low-variance differences between the actual parameter values and the predicted ones, leading to a coding gain.
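The parameter prediction described above can be illustrated with a minimal sketch. It assumes the simplest possible predictor, in which each parameter is predicted by the previously coded one and the first prediction is zero. The function names and the sample values are hypothetical and are not taken from any standard.

```python
def encode_residuals(values):
    """Return the differences between each value and its prediction.

    Assumption: the prediction is simply the previously coded value,
    and both encoder and decoder start from a prediction of 0.
    """
    residuals = []
    prediction = 0
    for v in values:
        residuals.append(v - prediction)
        prediction = v
    return residuals


def decode_residuals(residuals):
    """Invert encode_residuals using the same predictor."""
    values = []
    prediction = 0
    for r in residuals:
        v = prediction + r
        values.append(v)
        prediction = v
    return values


params = [12, 13, 13, 15, 14]          # hypothetical parameter values
residuals = encode_residuals(params)   # [12, 1, 0, 2, -1]
restored = decode_residuals(residuals)
```

After the first value, the residuals cluster around zero, which is what allows an entropy coder to represent them with fewer bits than the raw parameters.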
A digital image is usually represented by equally spaced samples arranged in the form of an N×M array as shown below, where each element of the array is a discrete quantity. Elements F(x, y) of this array are referred to as image elements, picture elements, pixels or pels. Coordinates (x, y) denote the location of the pixels within the image and pixel values F(x, y) are only given for integer values of x and y.
$$
\begin{bmatrix}
F(0,0) & F(0,1) & \cdots & F(0,M-1) \\
F(1,0) & F(1,1) & \cdots & F(1,M-1) \\
\vdots & \vdots & \ddots & \vdots \\
F(N-1,0) & F(N-1,1) & \cdots & F(N-1,M-1)
\end{bmatrix}
$$
A typical video coder employs three types of pictures: intra pictures (I-pictures), predicted pictures (P-pictures) and bi-directionally predicted or bi-predicted pictures (B-pictures). FIG. 1a shows a typical example of a video sequence consisting of an I-picture and a P-picture. I-pictures are independently decodable in the sense that the blocks in an I-picture (I-blocks) do not depend on any reference pictures. A P-picture can depend on available reference pictures such that a block in a P-picture can be either an I-block, or a P-block that depends on one reference picture. FIG. 1b shows a typical example of a video sequence consisting of an I-picture, a B-picture and a P-picture. A B-picture can depend on temporally preceding and following pictures. A block in a B-picture can be an I-block, a P-block or a B-block that depends on two reference pictures.
P-pictures exploit temporal redundancies between the successive frames in the video sequence. When a picture of the original video sequence is encoded as a P-picture, it is partitioned into rectangular regions (blocks), which are predicted from one of the previously coded and transmitted frames Fref, called a reference picture. The prediction information of a block is represented by a two-dimensional motion vector (Δx, Δy) where Δx is the horizontal and Δy is the vertical displacement. The motion vectors, together with the reference picture, are used during motion compensation to construct samples in prediction picture Fpred:
$$F_{\mathrm{pred}}(x, y) = F_{\mathrm{ref}}(x + \Delta x,\; y + \Delta y)$$
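A minimal sketch of this motion compensation step for an integer motion vector follows. It assumes the reference picture is stored as a list of rows indexed as ref[y][x], with x horizontal, and that coordinates falling outside the picture are clamped to the nearest edge pixel. The function name and the clamping rule are illustrative assumptions, not part of the described system.

```python
def motion_compensate_block(ref, x0, y0, width, height, dx, dy):
    """Predict a width x height block whose top-left pixel is (x0, y0).

    Implements Fpred(x, y) = Fref(x + dx, y + dy) for an integer
    motion vector (dx, dy). Assumption: ref is indexed as ref[y][x]
    and out-of-picture coordinates are clamped to the picture edge.
    """
    rows, cols = len(ref), len(ref[0])
    pred = []
    for y in range(y0, y0 + height):
        row = []
        for x in range(x0, x0 + width):
            sx = min(max(x + dx, 0), cols - 1)  # clamp horizontally
            sy = min(max(y + dy, 0), rows - 1)  # clamp vertically
            row.append(ref[sy][sx])
        pred.append(row)
    return pred


# Hypothetical 4x4 reference picture.
ref = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
block = motion_compensate_block(ref, x0=0, y0=0, width=2, height=2, dx=1, dy=1)
```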
The motion vectors are found during the motion estimation process. The prediction error, i.e., the difference between the original picture and the prediction picture Fpred, is compressed by representing its values as a set of weighted basis functions of some discrete transform. The transform is typically performed on an 8×8 or 4×4 block basis. The weights, which are the transform coefficients, are subsequently quantized. Quantization introduces a loss of information since the quantized coefficients have lower precision than the original ones.
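The information loss introduced by quantization can be seen in a small sketch. It assumes a plain uniform quantizer with a single step size and rounding to the nearest level. Real codecs derive the step size from the quantizer parameter through codec-specific rules that are not reproduced here, and the coefficient values below are hypothetical.

```python
import math


def quantize(coeffs, step):
    """Map each coefficient to the nearest integer multiple of step.

    Assumption: uniform quantizer, halves rounded away from zero.
    """
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c)) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from quantized levels."""
    return [level * step for level in levels]


coeffs = [103, -47, 12, 5, -3, 0]   # hypothetical transform coefficients
levels = quantize(coeffs, step=10)
approx = dequantize(levels, step=10)
errors = [c - a for c, a in zip(coeffs, approx)]
```

A larger step size yields smaller levels, which cost fewer bits, at the price of larger errors in the reconstructed coefficients.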
The quantized transform coefficients, together with motion vectors and some control information, form a complete coded P-picture representation. These different forms of information are known collectively as syntax elements. Prior to transmission from the encoder to the decoder, all syntax elements are entropy coded, which further reduces the number of bits needed for their representation. Entropy coding is a lossless operation aimed at minimizing the number of bits required to represent transmitted or stored symbols by utilizing properties of their distribution (some symbols occur more frequently than others).
In the decoder, a P-picture is obtained by first constructing the prediction picture in the same manner as in the encoder and by adding to the prediction picture the compressed prediction error. The compressed prediction error is found by weighting the transform basis functions using the quantized transform coefficients. The difference between the reconstructed picture Frec and the original picture is called the reconstruction error.
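The decoder-side reconstruction can be sketched as follows. The sketch assumes 8-bit samples, so the sum of prediction and decoded prediction error is clipped to the range 0 to 255. The function names and sample values are hypothetical.

```python
def reconstruct(pred, decoded_error):
    """Add the decoded prediction error to the prediction picture.

    Assumption: 8-bit samples, so results are clipped to [0, 255].
    """
    return [
        [min(max(p + e, 0), 255) for p, e in zip(pred_row, err_row)]
        for pred_row, err_row in zip(pred, decoded_error)
    ]


def reconstruction_error(original, reconstructed):
    """Difference between the original and the reconstructed picture."""
    return [
        [o - r for o, r in zip(orig_row, rec_row)]
        for orig_row, rec_row in zip(original, reconstructed)
    ]


pred = [[100, 250]]            # hypothetical prediction picture
decoded_error = [[-3, 10]]     # hypothetical decoded prediction error
original = [[98, 255]]         # hypothetical original picture
rec = reconstruct(pred, decoded_error)
err = reconstruction_error(original, rec)
```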
Since motion vectors (Δx, Δy) can have non-integer values, motion compensated prediction requires evaluating picture values of the reference picture Fref at non-integer locations (x′, y′)=(x+Δx, y+Δy). A picture value at a non-integer location is referred to as a sub-pixel value and the process of determining such a value is called interpolation. Calculation of a sub-pixel value F(x′, y′) is done by filtering surrounding pixels:
$$F(x', y') = \sum_{k=-K+1}^{K} \sum_{l=-L+1}^{L} f(k, l)\, F(n + k,\; m + l),$$
where f(k,l) are filter coefficients and n and m are obtained by truncating x′ and y′, respectively, to integer values. The filter coefficients are typically dependent on the x′ and y′ values. The interpolation filters employed are usually separable, in which case sub-pixel value F(x′, y′) can be calculated as follows:
$$F(x', y') = \sum_{k=-K+1}^{K} f(k) \sum_{l=-L+1}^{L} f(l)\, F(n + k,\; m + l).$$
In the case of B-pictures, it is possible to predict one block from two different reference pictures. For each block there can be two sets of motion vectors (Δx1, Δy1) and (Δx2, Δy2), one for each reference picture used. The prediction is a combination of pixel values from those two pictures. Typically, pixel values of the two reference pictures are averaged:
$$F_{\mathrm{pred}}(x, y) = \bigl(F_1(x + \Delta x_1,\; y + \Delta y_1) + F_2(x + \Delta x_2,\; y + \Delta y_2)\bigr) / 2$$
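The averaging of two reference blocks can be sketched as below. The sketch assumes both blocks have already been fetched (and, if necessary, interpolated) from their reference pictures, and that the division by 2 truncates. Whether a codec truncates or rounds at this step is codec-specific, so the truncation here is an assumption.

```python
def bi_predict(block1, block2):
    """Average two motion compensated blocks sample by sample.

    Assumption: non-negative integer samples and truncating division.
    """
    return [
        [(a + b) // 2 for a, b in zip(row1, row2)]
        for row1, row2 in zip(block1, block2)
    ]


block_from_ref1 = [[100, 50], [0, 255]]   # hypothetical samples
block_from_ref2 = [[110, 51], [2, 255]]   # hypothetical samples
prediction = bi_predict(block_from_ref1, block_from_ref2)
```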
Interpolation of pixels in non-integer positions is performed by applying a filter on the neighboring pixel values. Usually, higher order filters produce better results. When multi-picture prediction is used (in B-pictures, for example), interpolation has to be performed for each picture from which pixels are fetched. Therefore, prediction from two reference pictures requires twice the number of interpolations compared with prediction from only one picture. Thus, the complexity of multi-picture prediction is significantly higher than that of single picture prediction.
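As a concrete instance of the separable interpolation sum given earlier, the sketch below takes K = L = 1, so that k and l each run over 0 and 1, and uses bilinear weights that depend on the fractional parts of x′ and y′. The choice of a bilinear filter, the ref[y][x] indexing, the restriction to non-negative coordinates and the edge clamping are all assumptions made for illustration.

```python
import math


def interpolate_bilinear(ref, xp, yp):
    """Evaluate F(x', y') with a separable 2-tap (bilinear) filter.

    Assumptions: ref is indexed as ref[y][x], xp and yp are
    non-negative, and samples past the edge repeat the edge pixel.
    """
    n, m = math.floor(xp), math.floor(yp)   # integer parts of x', y'
    ax, ay = xp - n, yp - m                 # fractional parts
    fx = (1 - ax, ax)                       # f(k) for k = 0, 1
    fy = (1 - ay, ay)                       # f(l) for l = 0, 1
    rows, cols = len(ref), len(ref[0])
    total = 0.0
    for k in (0, 1):
        inner = 0.0
        for l in (0, 1):
            sx = min(n + k, cols - 1)
            sy = min(m + l, rows - 1)
            inner += fy[l] * ref[sy][sx]
        total += fx[k] * inner
    return total


ref = [[10, 20], [30, 40]]   # hypothetical 2x2 reference picture
center = interpolate_bilinear(ref, 0.5, 0.5)
```

For a bi-predicted block this routine would run once per reference picture for every predicted sample, which is where the doubling of interpolation work comes from.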
In the image coding system of the present invention, all the motion information that is used for motion compensation is similar to that specified in existing video coding standards such as H.263 and H.264. For example, according to the draft version of the H.264 video coding standard presented in the document by T. Wiegand: “Joint Committee Draft (CD) of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)”, Doc. JVT-C167, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, May 2002, all P-blocks are predicted using combinations of a 6-tap interpolation filter with coefficients (1, −5, 20, 20, −5, 1)/32 and a bilinear filter. This filtering scheme will now be described in conjunction with FIG. 2. In the figure, the positions labeled “A” represent reference picture samples at integer positions. Other symbols represent interpolated values at fractional sample positions.
According to the H.264 video coding standard, sub-pixel value interpolation can be applied to both the luminance (luma) and chrominance (chroma) components of a picture. However, for simplicity, only interpolation of sub-pixel values in the luminance component will be described here. Depending on the complexity and resolution requirements of the motion compensation process, sub-pixel value prediction in the luminance component can be carried out at quarter sample resolution or one-eighth sample resolution. Again, for simplicity, only quarter sample interpolation will be described below, but it should be appreciated that the exact details of the sub-pixel value interpolation process and the resolution of the interpolation do not affect the applicability of the method according to the present invention.
According to the quarter sample resolution sub-pixel value interpolation procedure defined according to H.264, prediction values at quarter sample positions are generated by averaging samples at integer and half sample positions. The process for each position is described below, with reference to FIG. 2.
The samples at half sample positions labeled ‘bh’ are obtained by first calculating an intermediate value b by applying the 6-tap filter (described above) to the nearest samples ‘A’ at integer positions in the horizontal direction. The final value of ‘bh’ is calculated according to:
$$\mathrm{bh} = \mathrm{clip1}\bigl((b + 16) \gg 5\bigr)$$
where $x \gg n$ denotes the arithmetic right shift of a two's complement integer representation of x by n binary digits and the mathematical function ‘clip1’ is defined as follows:
$$\mathrm{clip1}(c) = \mathrm{clip3}(0, 255, c)$$
$$\mathrm{clip3}(a, b, c) = \begin{cases} a & \text{if } c < a, \\ b & \text{if } c > b, \\ c & \text{otherwise.} \end{cases}$$
The samples at half sample positions labeled ‘bv’ are obtained equivalently with the filter applied in the vertical direction.
The samples at half sample positions labeled ‘cm’ are obtained by applying the 6-tap filter to the intermediate values b of the closest half sample positions in either the vertical or horizontal direction to form an intermediate result c. The final value is calculated using the relationship
$$\mathrm{cm} = \mathrm{clip1}\bigl((c + 512) \gg 10\bigr).$$
The samples at quarter sample positions labeled ‘d’, ‘g’, ‘e’ and ‘f’ are obtained by averaging with truncation the two nearest samples at integer or half sample position, as follows:
$$d = (A + \mathrm{bh}) \gg 1$$
$$g = (\mathrm{bv} + \mathrm{cm}) \gg 1$$
$$e = (A + \mathrm{bv}) \gg 1$$
$$f = (\mathrm{bh} + \mathrm{cm}) \gg 1.$$
The samples at quarter sample positions labeled ‘h’ are obtained by averaging with truncation the closest ‘bh’ and ‘bv’ samples in a diagonal direction using the relationship
$$h = (\mathrm{bh} + \mathrm{bv}) \gg 1.$$
The samples at quarter sample positions labeled ‘i’ are computed using the four nearest samples at integer positions using the relationship
$$i = (A_1 + A_2 + A_3 + A_4 + 2) \gg 2.$$
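The horizontal part of this procedure can be sketched for a single row of integer samples. The sketch computes a ‘bh’ half sample between row[x] and row[x + 1] with the 6-tap filter, then a ‘d’ quarter sample by averaging with truncation. It assumes 8-bit samples, that samples beyond the row ends repeat the edge sample, and that the ‘A’ sample used for ‘d’ is the one immediately to the left of ‘bh’. The last two points are assumptions about layout that FIG. 2 would settle, and the function names are hypothetical.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # 6-tap filter; the taps sum to 32


def clip3(a, b, c):
    """Clamp c to the range [a, b]."""
    if c < a:
        return a
    if c > b:
        return b
    return c


def clip1(c):
    """Clamp c to the 8-bit sample range."""
    return clip3(0, 255, c)


def half_sample_h(row, x):
    """'bh' half sample between row[x] and row[x + 1].

    Assumption: indices outside the row are clamped to the row ends.
    """
    b = 0
    for i, tap in enumerate(TAPS):
        idx = min(max(x - 2 + i, 0), len(row) - 1)
        b += tap * row[idx]
    return clip1((b + 16) >> 5)  # round, divide by 32, clip


def quarter_sample_d(row, x):
    """'d' quarter sample between row[x] and the 'bh' to its right."""
    return (row[x] + half_sample_h(row, x)) >> 1


edge = [0, 0, 0, 255, 255, 255]       # hypothetical step edge
bh = half_sample_h(edge, 2)           # half way across the edge
d = quarter_sample_d(edge, 2)
```

The clipping step matters because the negative taps can push the filtered value outside the sample range next to sharp edges; for the row [0, 0, 255, 255, 0, 0] the unclipped half sample at x = 2 would exceed 255.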
In existing video coding standards, such as MPEG-1, MPEG-2, MPEG-4, H.263 and H.264, the same interpolation filter is applied regardless of the type of prediction. It has been found that application of the interpolation filter in this manner is not always efficient. It is advantageous and desirable to provide a method and system for digital image coding which reduces the complexity in picture prediction.