Digital video sequences, like ordinary motion pictures recorded on film, comprise a sequence of still images, the illusion of motion being created by displaying the images one after the other at a relatively fast frame rate, typically 15 to 30 frames per second. Because of the relatively fast frame rate, images in consecutive frames tend to be quite similar and thus contain a considerable amount of redundant information. For example, a typical scene may comprise some stationary elements, such as background scenery, and some moving areas, which may take many different forms, for example the face of a newsreader, moving traffic and so on. Alternatively, the camera recording the scene may itself be moving, in which case all elements of the image have the same kind of motion. In many cases, this means that the overall change between one video frame and the next is rather small. Of course, this depends on the nature of the movement. For example, the faster the movement, the greater the change from one frame to the next. Similarly, if a scene contains a number of moving elements, the change from one frame to the next is likely to be greater than in a scene where only one element is moving.
It should be appreciated that each frame of a raw, that is uncompressed, digital video sequence comprises a very large amount of image information. Each frame of an uncompressed digital video sequence is formed from an array of image pixels. For example, in a commonly used digital video format, known as the Quarter Common Interchange Format (QCIF), a frame comprises an array of 176×144 pixels, in which case each frame has 25,344 pixels. In turn, each pixel is represented by a certain number of bits, which carry information about the luminance and/or colour content of the region of the image corresponding to the pixel. Commonly, a so-called YUV colour model is used to represent the luminance and chrominance content of the image. The luminance, or Y, component represents the intensity (brightness) of the image, while the colour content of the image is represented by two chrominance components, labelled U and V.
Colour models based on a luminance/chrominance representation of image content provide certain advantages compared with colour models that are based on a representation involving primary colours (that is Red, Green and Blue, RGB). The human visual system is more sensitive to intensity variations than it is to colour variations; YUV colour models exploit this property by using a lower spatial resolution for the chrominance components (U, V) than for the luminance component (Y). In this way the amount of information needed to code the colour information in an image can be reduced with an acceptable reduction in image quality.
The lower spatial resolution of the chrominance components is usually attained by sub-sampling. Typically, a block of 16×16 image pixels is represented by one block of 16×16 pixels comprising luminance information and the corresponding chrominance components are each represented by one block of 8×8 pixels representing an area of the image equivalent to that of the 16×16 pixels of the luminance component. The chrominance components are thus spatially sub-sampled by a factor of 2 in the x and y directions. The resulting assembly of one 16×16 pixel luminance block and two 8×8 pixel chrominance blocks is commonly referred to as a YUV macroblock, or macroblock, for short.
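The sub-sampling step can be illustrated with a minimal Python sketch. The text specifies only the factor of 2 in each direction, so the averaging of each 2×2 neighbourhood used here is an assumption (simple decimation would also fit the description), and the function name `subsample_2x2` is hypothetical.

```python
def subsample_2x2(plane):
    """Halve the resolution of a chrominance plane in x and y.

    Assumption: each output sample is the rounded average of a 2 x 2
    neighbourhood; the text only states the sub-sampling factor.
    """
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
             for x in range(0, len(plane[0]), 2)]
            for y in range(0, len(plane), 2)]

# A 16 x 16 chrominance area becomes the 8 x 8 block of a macroblock.
full_res = [[(x + y) % 256 for x in range(16)] for y in range(16)]
chroma_block = subsample_2x2(full_res)
print(len(chroma_block), len(chroma_block[0]))
```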
A QCIF image comprises 11×9 macroblocks. If the luminance blocks and chrominance blocks are represented with 8 bit resolution (that is by numbers in the range 0 to 255), the total number of bits required per macroblock is (16×16×8)+2×(8×8×8)=3072 bits. The number of bits needed to represent a video frame in QCIF format is thus 99×3072=304,128 bits. This means that the amount of data required to transmit/record/display a video sequence in QCIF format, represented using a YUV colour model, at a rate of 30 frames per second, is more than 9 Mbps (million bits per second). This is an extremely high data rate and is impractical for use in video recording, transmission and display applications because of the very large storage capacity, transmission channel capacity and hardware performance required.
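The arithmetic above can be checked with a short Python sketch. The constant names are illustrative only; the figures themselves are those given in the text.

```python
WIDTH, HEIGHT = 176, 144      # QCIF luminance resolution in pixels
MB_SIZE = 16                  # macroblock edge, in luminance pixels
BITS_PER_SAMPLE = 8
FRAME_RATE = 30               # frames per second

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)      # 11 x 9
luma_bits = MB_SIZE * MB_SIZE * BITS_PER_SAMPLE               # one 16 x 16 block
chroma_bits = 2 * (MB_SIZE // 2) ** 2 * BITS_PER_SAMPLE       # two 8 x 8 blocks
bits_per_mb = luma_bits + chroma_bits
bits_per_frame = mbs_per_frame * bits_per_mb
bits_per_second = bits_per_frame * FRAME_RATE

print(mbs_per_frame, bits_per_mb, bits_per_frame, bits_per_second)
# prints: 99 3072 304128 9123840
```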
If video data is to be transmitted in real-time over a fixed line network such as an ISDN (Integrated Services Digital Network) or a conventional PSTN (Public Switched Telephone Network), the available data transmission bandwidth is typically of the order of 64 kbits/s. In mobile videotelephony, where transmission takes place at least in part over a radio communications link, the available bandwidth can be as low as 20 kbits/s. This means that a significant reduction in the amount of information used to represent video data must be achieved in order to enable transmission of digital video sequences over low bandwidth communication networks. For this reason video compression techniques have been developed which reduce the amount of information transmitted while retaining an acceptable image quality.
Video compression methods are based on reducing the redundant and perceptually irrelevant parts of video sequences. The redundancy in video sequences can be categorised into spatial, temporal and spectral redundancy. ‘Spatial redundancy’ is the term used to describe the correlation between neighbouring pixels within a frame. The term ‘temporal redundancy’ expresses the fact that the objects appearing in one frame of a sequence are likely to appear in subsequent frames, while ‘spectral redundancy’ refers to the correlation between different colour components of the same image.
Sufficiently efficient compression cannot usually be achieved by simply reducing the various forms of redundancy in a given sequence of images. Thus, most current video encoders also reduce the quality of those parts of the video sequence which are subjectively the least important. In addition, the redundancy of the compressed video bit-stream is itself reduced by means of efficient loss-less encoding. Typically, this is achieved using a technique known as ‘variable length coding’ (VLC).
Modern video compression standards, such as ITU-T recommendations H.261, H.263(+)(++), H.26L and the Moving Picture Experts Group standard MPEG-4, make use of ‘motion compensated temporal prediction’. This is a form of temporal redundancy reduction in which the content of some (often many) frames in a video sequence is ‘predicted’ from other frames in the sequence by tracing the motion of objects or regions of an image between frames.
Compressed images which do not make use of temporal redundancy reduction are usually called INTRA-coded or I-frames, whereas temporally predicted images are called INTER-coded or P-frames. In the case of INTER frames, the predicted (motion-compensated) image is rarely precise enough to represent the image content with sufficient quality, and therefore a spatially compressed prediction error (PE) frame is also associated with each INTER frame. Many video compression schemes can also make use of bi-directionally predicted frames, which are commonly referred to as B-pictures or B-frames. B-pictures are inserted between reference or so-called ‘anchor’ picture pairs (I or P frames) and are predicted from either one or both of the anchor pictures. B-pictures are not themselves used as anchor pictures, that is no other frames are predicted from them, and therefore, they can be discarded from the video sequence without causing deterioration in the quality of future pictures.
The different types of frame that occur in a typical compressed video sequence are illustrated in FIG. 3 of the accompanying drawings. As can be seen from the figure, the sequence starts with an INTRA or I frame 30. In FIG. 3, arrows 33 denote the ‘forward’ prediction process by which P-frames (labelled 34) are formed. The bidirectional prediction process by which B-frames (36) are formed is denoted by arrows 31a and 31b, respectively.
A schematic diagram of an example video coding system using motion compensated prediction is shown in FIGS. 1 and 2. FIG. 1 illustrates an encoder 10 employing motion compensation and FIG. 2 illustrates a corresponding decoder 20. The encoder 10 shown in FIG. 1 comprises a Motion Field Estimation block 11, a Motion Field Coding block 12, a Motion Compensated Prediction block 13, a Prediction Error Coding block 14, a Prediction Error Decoding block 15, a Multiplexing block 16, a Frame Memory 17, and an adder 19. The decoder 20 comprises a Motion Compensated Prediction block 21, a Prediction Error Decoding block 22, a Demultiplexing block 23 and a Frame Memory 24.
The operating principle of video coders using motion compensation is to minimise the amount of information in a prediction error frame En(x,y), which is the difference between a current frame In(x,y) being coded and a prediction frame Pn(x,y). The prediction error frame is thus:

$$E_n(x,y) = I_n(x,y) - P_n(x,y). \tag{1}$$
The prediction frame Pn(x,y) is built using pixel values of a reference frame Rn(x,y), which is generally one of the previously coded and transmitted frames, for example the frame immediately preceding the current frame and is available from the Frame Memory 17 of the encoder 10. More specifically, the prediction frame Pn(x,y) is constructed by finding so-called ‘prediction pixels’ in the reference frame Rn(x,y) which correspond substantially with pixels in the current frame. Motion information, describing the relationship (e.g. relative location, rotation, scale etc.) between pixels in the current frame and their corresponding prediction pixels in the reference frame is derived and the prediction frame is constructed by moving the prediction pixels according to the motion information. In this way, the prediction frame is constructed as an approximate representation of the current frame, using pixel values in the reference frame. The prediction error frame referred to above therefore represents the difference between the approximate representation of the current frame provided by the prediction frame and the current frame itself. The basic advantage provided by video encoders that use motion compensated prediction arises from the fact that a comparatively compact description of the current frame can be obtained by representing it in terms of the motion information required to form its prediction together with the associated prediction error information in the prediction error frame.
However, due to the very large number of pixels in a frame, it is generally not efficient to transmit separate motion information for each pixel to the decoder. Instead, in most video coding schemes, the current frame is divided into larger image segments Sk and motion information relating to the segments is transmitted to the decoder. For example, motion information is typically provided for each macroblock of a frame and the same motion information is then used for all pixels within the macroblock. In some video coding standards, such as H.26L, a macroblock can be divided into smaller blocks, each smaller block being provided with its own motion information.
The motion information usually takes the form of motion vectors [Δx(x,y),Δy(x,y)]. The pair of numbers Δx(x,y) and Δy(x,y) represents the horizontal and vertical displacements of a pixel at location (x,y) in the current frame In(x,y) with respect to a pixel in the reference frame Rn(x,y). The motion vectors [Δx(x,y),Δy(x,y)] are calculated in the Motion Field Estimation block 11 and the set of motion vectors of the current frame [Δx(•),Δy(•)] is referred to as the motion vector field.
Typically, the location of a macroblock in a current video frame is specified by the (x,y) co-ordinate of its upper left-hand corner. Thus, in a video coding scheme in which motion information is associated with each macroblock of a frame, each motion vector describes the horizontal and vertical displacement Δx(x,y) and Δy(x,y) of a pixel representing the upper left-hand corner of a macroblock in the current frame In(x,y) with respect to a pixel in the upper left-hand corner of a substantially corresponding block of prediction pixels in the reference frame Rn(x,y) (as shown in FIG. 4b).
Motion estimation is a computationally intensive task. Given a reference frame Rn(x,y) and, for example, a square macroblock comprising N×N pixels in a current frame (as shown in FIG. 4a), the objective of motion estimation is to find an N×N pixel block in the reference frame that matches the characteristics of the macroblock in the current picture according to some criterion. This criterion can be, for example, a sum of absolute differences (SAD) between the pixels of the macroblock in the current frame and the block of pixels in the reference frame with which it is compared. This process is known generally as ‘block matching’. It should be noted that, in general, the geometry of the block to be matched and that in the reference frame do not have to be the same, as real-world objects can undergo scale changes, as well as rotation and warping. However, in current international video coding standards, only a translational motion model is used (see below) and thus fixed rectangular geometry is sufficient.
Ideally, in order to achieve the best chance of finding a match, the whole of the reference frame should be searched. However, this is impractical as it imposes too high a computational burden on the video encoder. Instead, the search is restricted to a region [−p,p] around the original location of the macroblock in the current frame, as shown in FIG. 4c.
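Block matching with a restricted search region can be sketched as follows. This is a minimal exhaustive search using the SAD criterion mentioned above; the function names, the row-major list-of-lists frame layout and the policy of skipping candidates that fall outside the reference frame are assumptions made for illustration.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, bx, by, n, p):
    """Return (dx, dy, sad) for the best displacement within [-p, p]."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block lies partly outside the frame
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best

# Synthetic test: the current frame is the reference displaced so that the
# block at (2, 2) matches the reference block at (3, 1).
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y - 1][x + 1] if 1 <= y and x + 1 < 8 else 0
        for x in range(8)] for y in range(8)]
vector = full_search(cur, ref, 2, 2, 2, 2)
print(vector)  # prints: (1, -1, 0)
```

The cost of this search grows with (2p+1)² candidate positions per block, which is why the search region is kept small in practice.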
In order to reduce the amount of motion information to be transmitted from the encoder 10 to the decoder 20, the motion vector field is coded in the Motion Field Coding block 12 of the encoder 10, by representing it with a motion model. In this process, the motion vectors of image segments are re-expressed using certain predetermined functions or, in other words, the motion vector field is represented with a model. Almost all currently used motion vector field models are additive motion models, complying with the following general formula:

$$\Delta x(x,y) = \sum_{i=0}^{N-1} a_i f_i(x,y) \tag{2}$$

$$\Delta y(x,y) = \sum_{i=0}^{M-1} b_i g_i(x,y) \tag{3}$$

where coefficients ai and bi are called motion coefficients. The motion coefficients are transmitted to the decoder 20 (information stream 2 in FIGS. 1 and 2). Functions ƒi and gi are called motion field basis functions, and are known both to the encoder and decoder. An approximate motion vector field (Δ̃x(x,y),Δ̃y(x,y)) can be constructed using the coefficients and the basis functions. As the basis functions are known to (that is stored in) both the encoder 10 and the decoder 20, only the motion coefficients need to be transmitted to the decoder, thus reducing the amount of information required to represent the motion information of the frame.
The simplest motion model is the translational motion model which requires only two coefficients to describe the motion vectors of each segment. The values of motion vectors are given by:

$$\Delta x(x,y) = a_0, \qquad \Delta y(x,y) = b_0 \tag{4}$$
This model is widely used in various international standards (ISO MPEG-1, MPEG-2, MPEG-4, ITU-T Recommendations H.261 and H.263) to describe the motion of 16×16 and 8×8 pixel blocks. Systems which use a translational motion model typically perform motion estimation at full pixel resolution or some integer fraction of full pixel resolution, for example at half or one quarter pixel resolution.
The prediction frame Pn(x,y) is constructed in the Motion Compensated Prediction block 13 in the encoder 10, and is given by:

$$P_n(x,y) = R_n[x + \tilde{\Delta}x(x,y),\; y + \tilde{\Delta}y(x,y)] \tag{5}$$
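For the translational model, equations (1), (4) and (5) reduce to copying displaced blocks from the reference frame and subtracting the result from the current frame. The following Python sketch illustrates this under stated assumptions: one full-pixel vector per block, vectors keyed by the block's upper left-hand corner, and displacements that stay inside the reference frame. The names `predict_frame` and `prediction_error` are hypothetical.

```python
def predict_frame(ref, vectors, n):
    """Build P(x, y) = R[x + dx, y + dy] with one (dx, dy) per n x n block.

    `vectors` maps each block's top-left corner (bx, by) to its vector.
    Assumption: every displaced pixel lies inside the reference frame.
    """
    height, width = len(ref), len(ref[0])
    pred = [[0] * width for _ in range(height)]
    for (bx, by), (dx, dy) in vectors.items():
        for j in range(n):
            for i in range(n):
                pred[by + j][bx + i] = ref[by + j + dy][bx + i + dx]
    return pred

def prediction_error(cur, pred):
    """E(x, y) = I(x, y) - P(x, y), computed pixel by pixel."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur, pred)]

ref = [[x + 4 * y for x in range(4)] for y in range(4)]
vectors = {(0, 0): (1, 1), (2, 0): (0, 0), (0, 2): (0, 0), (2, 2): (-1, -1)}
pred = predict_frame(ref, vectors, 2)

# A current frame that differs from its prediction in a single pixel
# leaves a prediction error frame that is zero almost everywhere.
cur = [row[:] for row in pred]
cur[0][0] += 3
err = prediction_error(cur, pred)
print(pred[0][0], pred[3][3], err[0][0])  # prints: 5 10 3
```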
In the Prediction Error Coding block 14, the prediction error frame En(x,y) is typically compressed by representing it as a finite series (transform) of some 2-dimensional functions. For example, a 2-dimensional Discrete Cosine Transform (DCT) can be used. The transform coefficients are quantised and entropy (for example Huffman) coded before they are transmitted to the decoder (information stream 1 in FIGS. 1 and 2). Because of the error introduced by quantisation, this operation usually produces some degradation (loss of information) in the prediction error frame En(x,y). To compensate for this degradation, the encoder 10 also comprises a Prediction Error Decoding block 15, where a decoded prediction error frame Ẽn(x,y) is constructed using the transform coefficients. This locally decoded prediction error frame is added to the prediction frame Pn(x,y) in the adder 19 and the resulting decoded current frame Ĩn(x,y) is stored in the Frame Memory 17 for further use as the next reference frame Rn+1(x,y).
The information stream 2 carrying information about the motion vectors is combined with information about the prediction error in multiplexer 16 and an information stream 3 containing typically at least those two types of information is sent to the decoder 20.
The operation of a corresponding video decoder 20 will now be described.
The Frame Memory 24 of the decoder 20 stores a previously reconstructed reference frame Rn(x,y). The prediction frame Pn(x,y) is constructed in the Motion Compensated Prediction block 21 of the decoder 20 according to equation 5, using received motion coefficient information and pixel values of the previously reconstructed reference frame Rn(x,y). The transmitted transform coefficients of the prediction error frame En(x,y) are used in the Prediction Error Decoding block 22 to construct the decoded prediction error frame Ẽn(x,y). The pixels of the decoded current frame Ĩn(x,y) are then reconstructed by adding the prediction frame Pn(x,y) and the decoded prediction error frame Ẽn(x,y):

$$\tilde{I}_n(x,y) = P_n(x,y) + \tilde{E}_n(x,y) = R_n[x + \tilde{\Delta}x(x,y),\; y + \tilde{\Delta}y(x,y)] + \tilde{E}_n(x,y). \tag{6}$$
This decoded current frame may be stored in the Frame Memory 24 as the next reference frame Rn+1(x,y).
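The reconstruction step, and the reason the encoder decodes its own prediction error, can be sketched in Python. As a simplifying assumption, lossy prediction error coding is modelled here by uniform quantisation directly in the pixel domain; a real codec would quantise transform (for example DCT) coefficients instead. All function names and the step size are hypothetical.

```python
def quantise(err, step):
    """Stand-in for lossy prediction error coding (pixel-domain, uniform)."""
    return [[int(round(e / step)) for e in row] for row in err]

def dequantise(levels, step):
    """Stand-in for prediction error decoding."""
    return [[level * step for level in row] for row in levels]

def reconstruct(pred, decoded_err):
    """Decoded frame = prediction frame + decoded prediction error frame."""
    return [[p + e for p, e in zip(pred_row, err_row)]
            for pred_row, err_row in zip(pred, decoded_err)]

cur = [[100, 104], [98, 107]]
pred = [[97, 100], [99, 100]]
err = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(cur, pred)]

levels = quantise(err, 4)            # what would be sent to the decoder
decoded_err = dequantise(levels, 4)  # computed identically at both ends
decoded = reconstruct(pred, decoded_err)
print(decoded)  # prints: [[101, 104], [99, 108]]
```

Because the encoder stores this decoded frame, rather than the original, as its next reference, the reference frames held in the encoder and the decoder remain identical despite the quantisation loss.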
In the description of motion compensated encoding and decoding of digital video presented above, the motion vector [Δx(x,y),Δy(x,y)] describing the motion of a macroblock in the current frame with respect to the reference frame Rn(x,y) can point to any of the pixels in the reference frame. This means that motion between frames of a digital video sequence can only be represented at a resolution which is determined by the image pixels in the frame (so-called full pixel resolution). Real motion, however, has arbitrary precision, and thus the system described above can only provide approximate modelling of the motion between successive frames of a digital video sequence. Typically, modelling of motion between video frames with full pixel resolution is not sufficiently accurate to allow efficient minimisation of the prediction error (PE) information associated with each macroblock/frame. Therefore, to enable more accurate modelling of real motion and to help reduce the amount of PE information that must be transmitted from encoder to decoder, many video coding standards, such as H.263(+)(++) and H.26L, allow motion vectors to point ‘in between’ image pixels. In other words, the motion vectors can have ‘sub-pixel’ resolution. Allowing motion vectors to have sub-pixel resolution adds to the complexity of the encoding and decoding operations that must be performed, so it is still advantageous to limit the degree of spatial resolution a motion vector may have. Thus, video coding standards, such as those previously mentioned, typically only allow motion vectors to have full-, half- or quarter-pixel resolution.
Motion estimation with sub-pixel resolution is usually performed as a two-stage process, as illustrated in FIG. 5, for a video coding scheme which allows motion vectors to have full- or half-pixel resolution. In the first stage, a motion vector having full-pixel resolution is determined using any appropriate motion estimation scheme, such as the block-matching process described in the foregoing. The resulting motion vector, having full-pixel resolution, is shown in FIG. 5.
In the second stage, the motion vector determined in the first stage is refined to obtain the desired half-pixel resolution. In the example illustrated in FIG. 5, this is done by forming eight new search blocks of 16×16 pixels, the location of the top-left corner of each block being marked with an X in FIG. 5. These locations are denoted as [Δx+m/2,Δy+n/2], where m and n can take the values −1, 0 and +1, but cannot be zero at the same time. As only the pixel values of original image pixels are known, the values (for example luminance and/or chrominance values) of the sub-pixels residing at half-pixel locations must be estimated for each of the eight new search blocks, using some form of interpolation scheme.
Having interpolated the values of the sub-pixels at half-pixel resolution, each of the eight search blocks is compared with the macroblock whose motion vector is being sought. As in the block matching process performed in order to determine the motion vector with full pixel resolution, the macroblock is compared with each of the eight search blocks according to some criterion, for example a SAD. As a result of the comparisons, a minimum SAD value will generally be obtained. Depending on the nature of the motion in the video sequence, this minimum value may correspond to the location specified by the original motion vector (having full-pixel resolution), or it may correspond to a location having a half-pixel resolution. Thus, it is possible to determine whether a motion vector should point to a full-pixel or sub-pixel location and if sub-pixel resolution is appropriate, to determine the correct sub-pixel resolution motion vector. It should also be appreciated that the scheme just described can be extended to other sub-pixel resolutions (for example, one-quarter-pixel resolution) in an entirely analogous fashion.
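The second, refinement stage can be sketched in Python as follows. The interpolator used here is plain bilinear averaging of neighbouring full pixels, which is an assumption chosen for brevity and not the scheme of any particular standard. Coordinates and vectors are held in half-pixel units, and the function names are hypothetical. The sketch also assumes that every candidate block lies inside the reference frame.

```python
def sample_half(ref, x2, y2):
    """Value at location (x2 / 2, y2 / 2), given in half-pixel units.

    Assumption: bilinear averaging of the surrounding full pixels.
    """
    x0, y0 = x2 // 2, y2 // 2
    x1, y1 = x0 + x2 % 2, y0 + y2 % 2
    return (ref[y0][x0] + ref[y0][x1] + ref[y1][x0] + ref[y1][x1]) / 4.0

def refine_half_pel(cur, ref, bx, by, size, dx, dy):
    """Compare the full-pixel vector (dx, dy) and its eight half-pixel
    neighbours; return (vx2, vy2, sad) with the vector in half-pixel units."""
    best = None
    for n in (-1, 0, 1):            # vertical offset, in half pixels
        for m in (-1, 0, 1):        # horizontal offset, in half pixels
            vx2, vy2 = 2 * dx + m, 2 * dy + n
            cost = sum(
                abs(cur[by + j][bx + i]
                    - sample_half(ref, 2 * (bx + i) + vx2, 2 * (by + j) + vy2))
                for j in range(size) for i in range(size))
            if best is None or cost < best[2]:
                best = (vx2, vy2, cost)
    return best

# The current frame is the reference shifted by half a pixel horizontally.
ref = [[10 * x + 100 * y for x in range(6)] for y in range(6)]
cur = [[10 * x + 5 + 100 * y for x in range(6)] for y in range(6)]
best = refine_half_pel(cur, ref, 2, 2, 2, 0, 0)
print(best)  # prints: (1, 0, 0.0)
```

The candidate with offsets (0, 0) is the original full-pixel vector, so the function returns it unchanged when none of the eight half-pixel locations gives a smaller SAD.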
In practice, the estimation of a sub-pixel value in the reference frame is performed by interpolating the value of the sub-pixel from surrounding pixel values. In general, interpolation of a sub-pixel value F(x,y) situated at a non-integer location (x, y)=(n+Δx, m+Δy), can be formulated as a two-dimensional operation, represented mathematically as:

$$F(x,y) = \sum_{k=-K}^{K-1} \sum_{l=-L}^{L-1} f(k+K,\, l+L)\, F(n+k,\, m+l) \tag{7}$$

where f(k,l) are filter coefficients and n and m are obtained by truncating x and y, respectively, to integer values. Typically, the filter coefficients are dependent on the x and y values and the interpolation filters are usually so-called ‘separable filters’, in which case sub-pixel value F(x,y) can be calculated as follows:

$$F(x,y) = \sum_{k=-K}^{K-1} f(k+K) \sum_{l=-K}^{K-1} f(l+K)\, F(n+k,\, m+l) \tag{8}$$
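The relationship between the general two-dimensional form and the separable form can be checked numerically with a small Python sketch. It assumes that the summation indices run from −K to K−1 (and −L to L−1), that the two-dimensional coefficients are the products f(k+K)·f(l+K), and that the image is stored row-major so that F(n+k, m+l) is `F[m + l][n + k]`. The tap values and function names are arbitrary illustrations.

```python
def interp_2d(F, n, m, f2d, K, L):
    """General form: double sum over a 2K x 2L window of full pixels."""
    return sum(f2d[k + K][l + L] * F[m + l][n + k]
               for k in range(-K, K) for l in range(-L, L))

def interp_separable(F, n, m, f, K):
    """Separable form: the same 1-D taps applied vertically, then
    horizontally."""
    total = 0.0
    for k in range(-K, K):
        column = sum(f[l + K] * F[m + l][n + k] for l in range(-K, K))
        total += f[k + K] * column
    return total

F = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
taps = [0.25, 0.75]                               # arbitrary 1-D taps, K = 1
taps_2d = [[a * b for b in taps] for a in taps]   # outer product of the taps

general = interp_2d(F, 1, 1, taps_2d, 1, 1)
separable = interp_separable(F, 1, 1, taps, 1)
print(general, separable)  # prints: 4.0 4.0
```

For a window of 2K × 2K pixels, the separable form allows each inner (column) sum to be computed once and reused, which is the main reason separable filters are preferred.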
The motion vectors are calculated in the encoder. Once the corresponding motion coefficients are transmitted to the decoder, it is a straightforward matter to interpolate the required sub-pixels using an interpolation method identical to that used in the encoder. In this way, a frame following a reference frame in the Frame Memory 24, can be reconstructed from the reference frame and the motion vectors.
The simplest way of applying sub-pixel value interpolation in a video coder is to interpolate each sub-pixel value every time it is needed. However, this is not an efficient solution in a video encoder, because it is likely that the same sub-pixel value will be required several times and thus calculations to interpolate the same sub-pixel value will be performed multiple times. This results in an unnecessary increase of computational complexity/burden in the encoder.
An alternative approach, which limits the complexity of the encoder, is to pre-calculate and store all sub-pixel values in a memory associated with the encoder. This solution is called ‘before-hand’ interpolation hereafter in this document. While limiting complexity, before-hand interpolation has the disadvantage of increasing memory usage by a large margin. For example, if the motion vector accuracy is one quarter pixel in both horizontal and vertical dimensions, storing pre-calculated sub-pixel values for a complete image results in a memory usage that is 16 times that required to store the original, non-interpolated image. In addition, it involves the calculation of some sub-pixels which might not actually be required in calculating motion vectors in the encoder. Before-hand interpolation is also particularly inefficient in a video decoder, as the majority of pre-calculated sub-pixel values will never be required by the decoder. Thus, it is advantageous not to use pre-calculation in the decoder.
So-called ‘on-demand’ interpolation can be used to reduce memory requirements in the encoder. For example, if the desired pixel precision is quarter pixel resolution, only sub-pixels at one half unit resolution are interpolated before-hand for the whole frame and stored in the memory. Values of one-quarter pixel resolution sub-pixels are only calculated during the motion estimation/compensation process as and when they are required. In this case memory usage is only 4 times that required to store the original, non-interpolated image.
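The memory figures quoted for the two approaches follow from the number of stored sample positions per original pixel. A minimal Python sketch, using a QCIF luminance frame as an illustrative example (the frame size and function name are assumptions, and boundary effects are ignored):

```python
def stored_samples(width, height, subdivisions):
    """Samples held when each pixel interval is split into `subdivisions`
    positions in both the horizontal and vertical directions."""
    return (width * subdivisions) * (height * subdivisions)

original = stored_samples(176, 144, 1)     # full-pixel image only
quarter_pel = stored_samples(176, 144, 4)  # all quarter-pixel values stored
half_pel = stored_samples(176, 144, 2)     # half-pixel values stored only

print(quarter_pel // original, half_pel // original)  # prints: 16 4
```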
It should be noted that when before-hand interpolation is used, the interpolation process constitutes only a small fraction of the total encoder computational complexity/burden, since every pixel is interpolated just once. Therefore, in the encoder, the complexity of the interpolation process itself is not very critical when before-hand sub-pixel value interpolation is used. On the other hand, on-demand interpolation poses a much higher computational burden on the encoder, since sub-pixels may be interpolated many times. Hence the complexity of the interpolation process, which may be considered in terms of the number of computational operations or operational cycles that must be performed in order to interpolate the sub-pixel values, becomes an important consideration.
In the decoder, the same sub-pixel values are used a few times at most and some are not needed at all. Therefore, in the decoder it is advantageous not to use before-hand interpolation at all, that is, it is advantageous not to pre-calculate any sub-pixel values.
Two interpolation schemes have been developed as part of the work ongoing in the ITU-Telecommunications Standardization Sector, Study Group 16, Video Coding Experts Group (VCEG), Questions 6 and 15. These approaches were proposed for incorporation into ITU-T recommendation H.26L and have been implemented in test models (TML) for the purposes of evaluation and further development. The test model corresponding to Question 15 is referred to as Test Model 5 (TML5), while that resulting from Question 6 is known as Test Model 6 (TML6). The interpolation schemes proposed in both TML5 and TML6 will now be described.
Throughout the description of the sub-pixel value interpolation scheme used in test model TML5, reference will be made to FIG. 12a, which defines a notation for describing pixel and sub-pixel locations specific to TML5. A separate notation, defined in FIG. 13a, will be used in the discussion of the sub-pixel value interpolation scheme used in TML6. A still further notation, illustrated in FIG. 14a, will be used later in the text in connection with the sub-pixel value interpolation method according to the invention. It should be appreciated that the three different notations used in the text are intended to assist in the understanding of each interpolation method and to help distinguish differences between them. However, in all three figures, the letter A is used to denote original image pixels (full pixel resolution). More specifically, the letter A represents the location of pixels in the image data representing a frame of a video sequence, the pixel values of pixels A being either received as current frame I_n(x,y) from a video source, or reconstructed and stored as a reference frame R_n(x,y) in the Frame Memory 17, 24 of the encoder 10 or the decoder 20. All other letters represent sub-pixel locations, the values of the sub-pixels situated at the sub-pixel locations being obtained by interpolation.
Certain other terms will also be used in a consistent manner throughout the text to identify particular pixel and sub-pixel locations. These are as follows:
The term ‘unit horizontal location’ is used to describe the location of any sub-pixel that is constructed in a column of the original image data. Sub-pixels c and e in FIGS. 12a and 13a, as well as sub-pixels b and e in FIG. 14a have unit horizontal locations.
The term ‘unit vertical location’ is used to describe any sub-pixel that is constructed in a row of the original image data. Sub-pixels b and d in FIGS. 12a and 13a as well as sub-pixels b and d in FIG. 14a have unit vertical locations.
By definition, pixels A have unit horizontal and unit vertical locations.
The term ‘half horizontal location’ is used to describe the location of any sub-pixel that is constructed in a column that lies at half pixel resolution. Sub-pixels b, c, and e shown in FIGS. 12a and 13a fall into this category, as do sub-pixels b, c and f in FIG. 14a. In a similar manner, the term ‘half vertical location’ is used to describe the location of any sub-pixel that is constructed in a row that lies at half-pixel resolution, such as sub-pixels c and d in FIGS. 12a and 13a, as well as sub-pixels b, c and g in FIG. 14a. 
Furthermore, the term ‘quarter horizontal location’ refers to any sub-pixel that is constructed in a column which lies at quarter-pixel resolution, such as sub-pixels d and e in FIG. 12a, sub-pixels d and g in FIG. 13a and sub-pixels d, g and h in FIG. 14a. Analogously, the term ‘quarter vertical location’ refers to sub-pixels that are constructed in a row which lies at quarter-pixel resolution. In FIG. 12a, sub-pixels e and f fall into this category, as do sub-pixels e, f and g in FIG. 13a and sub-pixels e, f and h in FIG. 14a. 
The definition of each of the terms described above is shown by ‘envelopes’ drawn on the corresponding figures.
It should further be noted that it is often convenient to denote a particular pixel with a two-dimensional reference. In this case, the appropriate two-dimensional reference can be obtained by examining the intersection of the envelopes in FIGS. 12a, 13a and 14a. Applying this principle, sub-pixel d in FIG. 12a, for example, has a quarter horizontal and half vertical location and sub-pixel e has a unit horizontal and quarter vertical location. In addition, and for ease of reference, sub-pixels that reside at half unit horizontal and unit vertical locations, unit horizontal and half unit vertical locations as well as half unit horizontal and half unit vertical locations, will be referred to as ½ resolution sub-pixels. Sub-pixels which reside at any quarter unit horizontal and/or quarter unit vertical location will be referred to as ¼ resolution sub-pixels.
It should also be noted that in the descriptions of the two test models and in the detailed description of the invention itself, it will be assumed that pixels have a minimum value of 0 and a maximum value of 2^n − 1, where n is the number of bits reserved for a pixel value. The number of bits is typically 8. After a sub-pixel has been interpolated, if the value of that interpolated sub-pixel falls outside the range [0, 2^n − 1], it is restricted to that range, i.e. values lower than the minimum allowed value will become the minimum value (0) and values larger than the maximum will become the maximum value (2^n − 1). This operation is called clipping.
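The clipping operation just described can be sketched as follows. This is an illustrative helper only; the function name `clip` and the default of 8 bits are assumptions made for the example.

```python
def clip(value, n_bits=8):
    """Restrict an interpolated value to the range [0, 2^n - 1]."""
    max_value = (1 << n_bits) - 1  # 2^n - 1, i.e. 255 when n = 8
    return max(0, min(value, max_value))
```

For 8-bit pixels, a value of −3 would thus become 0 and a value of 300 would become 255, while values already inside the range are left unchanged.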
The sub-pixel value interpolation scheme according to TML5 will now be described in detail with reference to FIGS. 12a, 12b and 12c.
1. The value for the sub-pixel at half unit horizontal and unit vertical location, that is ½ resolution sub-pixel b in FIG. 12a, is calculated using a 6-tap filter. The filter interpolates a value for ½ resolution sub-pixel b based upon the values of the 6 pixels (A1 to A6) situated in a row at unit horizontal locations and unit vertical locations symmetrically about b, as shown in FIG. 12b, according to the formula b = (A1 − 5·A2 + 20·A3 + 20·A4 − 5·A5 + A6 + 16)/32. The operator / denotes division with truncation. The result is clipped to lie in the range [0, 2^n − 1].
2. Values for the ½ resolution sub-pixels labelled c are calculated using the same 6-tap filter as used in step 1 and the six nearest pixels or sub-pixels (A or b) in the vertical direction. Referring now to FIG. 12c, the filter interpolates a value for the ½ resolution sub-pixel c located at unit horizontal and half vertical location based upon the values of the 6 pixels (A1 to A6) situated in a column at unit horizontal locations and unit vertical locations symmetrically about c, according to the formula c = (A1 − 5·A2 + 20·A3 + 20·A4 − 5·A5 + A6 + 16)/32. Similarly, a value for the ½ resolution sub-pixel c at half horizontal and half vertical location is calculated according to c = (b1 − 5·b2 + 20·b3 + 20·b4 − 5·b5 + b6 + 16)/32. Again, the operator / denotes division with truncation. The values calculated for the c sub-pixels are further clipped to lie in the range [0, 2^n − 1].
At this point in the interpolation process the values of all ½ resolution sub-pixels have been calculated and the process proceeds to the calculation of ¼ resolution sub-pixel values.
3. Values for the ¼ resolution sub-pixels labelled d are calculated using linear interpolation and the values of the nearest pixels and/or ½ resolution sub-pixels in the horizontal direction. More specifically, values for ¼ resolution sub-pixels d located at quarter horizontal and unit vertical locations are calculated by taking the average of the immediately neighbouring pixel at unit horizontal and unit vertical location (pixel A) and the immediately neighbouring ½ resolution sub-pixel at half horizontal and unit vertical location (sub-pixel b), i.e. according to d = (A + b)/2. Values for ¼ resolution sub-pixels d located at quarter horizontal and half vertical locations are calculated by taking the average of the immediately neighbouring ½ resolution sub-pixels c which lie at unit horizontal and half vertical location and half horizontal and half vertical location respectively, i.e. according to d = (c1 + c2)/2. Again, operator / indicates division with truncation.
4. Values for the ¼ resolution sub-pixels labelled e are calculated using linear interpolation and the values of the nearest pixels and/or ½ resolution sub-pixels in the vertical direction. In particular, ¼ resolution sub-pixels e at unit horizontal and quarter vertical locations are calculated by taking the average of the immediately neighbouring pixel at unit horizontal and unit vertical location (pixel A) and the immediately neighbouring sub-pixel at unit horizontal and half vertical location (sub-pixel c), according to e = (A + c)/2. ¼ resolution sub-pixels e at half horizontal and quarter vertical locations are calculated by taking the average of the immediately neighbouring sub-pixel at half horizontal and unit vertical location (sub-pixel b) and the immediately neighbouring sub-pixel at half horizontal and half vertical location (sub-pixel c), according to e = (b + c)/2. Furthermore, ¼ resolution sub-pixels e at quarter horizontal and quarter vertical locations are calculated by taking the average of the immediately neighbouring sub-pixel at quarter horizontal and unit vertical location and the corresponding sub-pixel at quarter horizontal and half vertical location (sub-pixels d), according to e = (d1 + d2)/2. Once more, operator / indicates division with truncation.
5. The value for ¼ resolution sub-pixel f is interpolated by averaging the values of the 4 closest pixels at unit horizontal and vertical locations, according to f = (A1 + A2 + A3 + A4 + 2)/4, where pixels A1, A2, A3 and A4 are the four nearest original pixels.
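The arithmetic of the TML5 steps can be illustrated with a short sketch. It is not taken from the test model software: the function names, the 8-bit default and the example row of pixel values are assumptions made for illustration, and the layout of pixels in FIGS. 12a to 12c is not modelled.

```python
def div_trunc(a, b):
    """Integer division that truncates toward zero (b must be positive).
    Python's // operator rounds toward minus infinity instead."""
    return a // b if a >= 0 else -((-a) // b)


def clip(value, n_bits=8):
    """Restrict a value to the range [0, 2^n - 1]."""
    return max(0, min(value, (1 << n_bits) - 1))


def half_pel(samples, n_bits=8):
    """Steps 1 and 2: 6-tap filter over six samples (A1..A6 or b1..b6)
    lying symmetrically about the 1/2 resolution sub-pixel."""
    a1, a2, a3, a4, a5, a6 = samples
    total = a1 - 5 * a2 + 20 * a3 + 20 * a4 - 5 * a5 + a6 + 16
    return clip(div_trunc(total, 32), n_bits)


def quarter_pel(p, q):
    """Steps 3 and 4: average of two neighbouring values, truncated."""
    return div_trunc(p + q, 2)


def centre_f(a1, a2, a3, a4):
    """Step 5: rounded average of the four nearest original pixels."""
    return div_trunc(a1 + a2 + a3 + a4 + 2, 4)


# Hypothetical row of six original pixels A1..A6 containing an edge.
row = [10, 10, 10, 100, 100, 100]
b = half_pel(row)            # 1/2 resolution value between A3 and A4
d = quarter_pel(row[2], b)   # 1/4 resolution value between A3 and b
print(b, d)
```

With the hypothetical row above, b evaluates to 55 and d to 32. Note that d is computed from the already truncated and clipped value of b, which is the dependency discussed in the following paragraph.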
A disadvantage of TML5 is that the decoder is computationally complex. This results from the fact that TML5 uses an approach in which interpolation of ¼ resolution sub-pixel values depends upon the interpolation of ½ resolution sub-pixel values. This means that in order to interpolate the values of the ¼ resolution sub-pixels, the values of the ½ resolution sub-pixels from which they are determined must be calculated first. Furthermore, since the values of some of the ¼ resolution sub-pixels depend upon the interpolated values obtained for other ¼ resolution sub-pixels, truncation of the ¼ resolution sub-pixel values has a deleterious effect on the precision of some of the ¼ resolution sub-pixel values. Specifically, the ¼ resolution sub-pixel values are less precise than they would be if calculated from values that had not been truncated and clipped. Another disadvantage of TML5 is that it is necessary to store the values of the ½ resolution sub-pixels in order to interpolate the ¼ resolution sub-pixel values. Therefore, excess memory is required to store a result which is not ultimately required.
The sub-pixel value interpolation scheme according to TML6, referred to herein as direct interpolation, will now be described. In the encoder the interpolation method according to TML6 works like the previously described TML5 interpolation method, except that maximum precision is retained throughout. This is achieved by using intermediate values which are neither rounded nor clipped. In the formulas below, b and c appearing on the right-hand side of a formula denote these intermediate values. A step-by-step description of the interpolation method according to TML6 as applied in the encoder is given below with reference to FIGS. 13a, 13b and 13c.
1. The value for the sub-pixel at half unit horizontal and unit vertical location, that is ½ resolution sub-pixel b in FIG. 13a, is obtained by first calculating an intermediate value b using a 6-tap filter. The filter calculates the intermediate value b based upon the values of the 6 pixels (A1 to A6) situated in a row at unit horizontal locations and unit vertical locations symmetrically about b, as shown in FIG. 13b, according to the formula b = (A1 − 5·A2 + 20·A3 + 20·A4 − 5·A5 + A6). The final value of b is then calculated from the intermediate value as b = (b + 16)/32 and is clipped to lie in the range [0, 2^n − 1]. As before, the operator / denotes division with truncation.
2. Values for the ½ resolution sub-pixels labelled c are obtained by first calculating intermediate values c. Referring to FIG. 13c, an intermediate value c for the ½ resolution sub-pixel c located at unit horizontal and half vertical location is calculated based upon the values of the 6 pixels (A1 to A6) situated in a column at unit horizontal locations and unit vertical locations symmetrically about c, according to the formula c = (A1 − 5·A2 + 20·A3 + 20·A4 − 5·A5 + A6). The final value for the ½ resolution sub-pixel c located at unit horizontal and half vertical location is then calculated from the intermediate value according to c = (c + 16)/32. Similarly, an intermediate value c for the ½ resolution sub-pixel c at half horizontal and half vertical location is calculated from intermediate values b according to c = (b1 − 5·b2 + 20·b3 + 20·b4 − 5·b5 + b6). A final value for this ½ resolution sub-pixel is then calculated according to (c + 512)/1024. Again, the operator / denotes division with truncation and the values calculated for ½ resolution sub-pixels c are further clipped to lie in the range [0, 2^n − 1].
3. Values for the ¼ resolution sub-pixels labelled d are calculated as follows. Values for ¼ resolution sub-pixels d located at quarter horizontal and unit vertical locations are calculated from the value of the immediately neighbouring pixel at unit horizontal and unit vertical location (pixel A) and the intermediate value b calculated in step (1) for the immediately neighbouring ½ resolution sub-pixel at half horizontal and unit vertical location (½ resolution sub-pixel b), according to d = (32·A + b + 32)/64. Values for ¼ resolution sub-pixels d located at quarter horizontal and half vertical locations are interpolated using the intermediate values c calculated for the immediately neighbouring ½ resolution sub-pixels c which lie at unit horizontal and half vertical location and half horizontal and half vertical location respectively, according to d = (32·c1 + c2 + 1024)/2048. Again, operator / indicates division with truncation and the finally obtained ¼ resolution sub-pixel values d are clipped to lie in the range [0, 2^n − 1].
4. Values for the ¼ resolution sub-pixels labelled e are calculated as follows. Values for ¼ resolution sub-pixels e located at unit horizontal and quarter vertical locations are calculated from the value of the immediately neighbouring pixel at unit horizontal and unit vertical location (pixel A) and the intermediate value c calculated in step (2) for the immediately neighbouring ½ resolution sub-pixel at unit horizontal and half vertical location, according to e = (32·A + c + 32)/64. Values for ¼ resolution sub-pixels e located at half horizontal and quarter vertical locations are calculated from the intermediate value b calculated in step (1) for the immediately neighbouring ½ resolution sub-pixel at half horizontal and unit vertical location and the intermediate value c calculated in step (2) for the immediately neighbouring ½ resolution sub-pixel at half horizontal and half vertical location, according to e = (32·b + c + 1024)/2048. Once more, operator / indicates division with truncation and the finally obtained ¼ resolution sub-pixel values e are clipped to lie in the range [0, 2^n − 1].
5. Values for ¼ resolution sub-pixels labelled g are computed using the value of the nearest original pixel A and the intermediate values of the three nearest neighbouring ½ resolution sub-pixels, according to g = (1024·A + 32·b + 32·c1 + c2 + 2048)/4096. As before, operator / indicates division with truncation and the finally obtained ¼ resolution sub-pixel values g are clipped to lie in the range [0, 2^n − 1].
6. The value for ¼ resolution sub-pixel f is interpolated by averaging the values of the 4 closest pixels at unit horizontal and vertical locations, according to f = (A1 + A2 + A3 + A4 + 2)/4, where pixels A1, A2, A3 and A4 are the four nearest original pixels.
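The use of unrounded intermediate values in the TML6 encoder steps can be sketched as follows. This is an illustration under stated assumptions rather than the test model code: all function names, the 8-bit default and the example pixel values are hypothetical, and only a subset of the sub-pixel positions is covered.

```python
def div_trunc(a, b):
    """Integer division that truncates toward zero (b must be positive)."""
    return a // b if a >= 0 else -((-a) // b)


def clip(value, n_bits=8):
    """Restrict a value to the range [0, 2^n - 1]."""
    return max(0, min(value, (1 << n_bits) - 1))


def six_tap(samples):
    """Intermediate value: 6-tap filter output, neither rounded nor clipped."""
    a1, a2, a3, a4, a5, a6 = samples
    return a1 - 5 * a2 + 20 * a3 + 20 * a4 - 5 * a5 + a6


def half_pel_final(intermediate, n_bits=8):
    """Final b (step 1) or final c at unit horizontal location (step 2)."""
    return clip(div_trunc(intermediate + 16, 32), n_bits)


def half_half_final(c_int, n_bits=8):
    """Final c at half horizontal and half vertical location (step 2);
    c_int is the 6-tap filter applied to six intermediate b values."""
    return clip(div_trunc(c_int + 512, 1024), n_bits)


def quarter_d(a, b_int, n_bits=8):
    """d at quarter horizontal and unit vertical location (step 3)."""
    return clip(div_trunc(32 * a + b_int + 32, 64), n_bits)


def quarter_g(a, b_int, c1_int, c2_int, n_bits=8):
    """g from pixel A and three intermediate 1/2 resolution values (step 5)."""
    total = 1024 * a + 32 * b_int + 32 * c1_int + c2_int + 2048
    return clip(div_trunc(total, 4096), n_bits)


# Hypothetical row of six original pixels A1..A6 containing an edge.
row = [10, 10, 10, 100, 100, 100]
b_int = six_tap(row)          # intermediate value, kept at full precision
b = half_pel_final(b_int)     # final 1/2 resolution value
d = quarter_d(row[2], b_int)  # uses b_int, not the rounded and clipped b
print(b_int, b, d)
```

For the hypothetical row above, the intermediate value is 1760, the final value of b is 55 and d is 33. The point of the sketch is that d is derived from the intermediate value rather than from the final, truncated and clipped value of b.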
In the decoder, sub-pixel values can be obtained directly by applying 6-tap filters in horizontal and vertical directions. In the case of ¼ sub-pixel resolution, referring to FIG. 13a, the filter coefficients applied to pixels and sub-pixels at unit vertical location are [0, 0, 64, 0, 0, 0] for a set of six pixels A, [1, −5, 52, 20, −5, 1] for a set of six sub-pixels d, [2, −10, 40, 40, −10, 2] for a set of six sub-pixels b, and [1, −5, 20, 52, −5, 1] for a set of six sub-pixels d. These filter coefficients are applied to respective sets of pixels or sub-pixels in the same row as the sub-pixel values being interpolated.
After applying the filters in the horizontal and vertical directions, interpolated value c is normalized according to c = (c + 2048)/4096 and clipped to lie in the range [0, 2^n − 1]. When a motion vector points to an integer pixel position in either the horizontal or vertical direction, many zero coefficients are used. In a practical implementation of TML6, different branches are used in the software which are optimised for the different sub-pixel cases so that there are no multiplications by zero coefficients.
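The decoder-side direct interpolation described above can be sketched as a pair of 6-tap filter passes followed by a single normalisation. The sketch rests on several assumptions that are not stated in the text: that the same four coefficient sets are used in the vertical direction as in the horizontal direction, that the input is a 6×6 window of original pixels with the target sub-pixel lying between the third and fourth row and column, and that the names `FILTERS` and `direct_interpolate` are illustrative. It does not cover sub-pixel f, which is obtained by averaging four pixels, and it does not include the optimised branches that avoid multiplications by zero.

```python
def div_trunc(a, b):
    """Integer division that truncates toward zero (b must be positive)."""
    return a // b if a >= 0 else -((-a) // b)


def clip(value, n_bits=8):
    """Restrict a value to the range [0, 2^n - 1]."""
    return max(0, min(value, (1 << n_bits) - 1))


# Coefficient sets indexed by quarter-pixel offset (0, 1/4, 1/2, 3/4)
# from the third sample of the six. Each set sums to 64.
FILTERS = {
    0: (0, 0, 64, 0, 0, 0),
    1: (1, -5, 52, 20, -5, 1),
    2: (2, -10, 40, 40, -10, 2),
    3: (1, -5, 20, 52, -5, 1),
}


def direct_interpolate(window, fx, fy, n_bits=8):
    """Interpolate one value from a 6x6 window of original pixels.

    fx, fy: horizontal and vertical offsets in quarter-pixel units (0..3)
    relative to window[2][2]. Assumes the same coefficients in both
    directions, which is an assumption of this sketch.
    """
    cx, cy = FILTERS[fx], FILTERS[fy]
    # Horizontal pass: one unnormalised value per row of the window.
    horizontal = [sum(c * p for c, p in zip(cx, row)) for row in window]
    # Vertical pass over the six horizontal results.
    total = sum(c * h for c, h in zip(cy, horizontal))
    # Single normalisation by 64 * 64 = 4096 with rounding, then clipping.
    return clip(div_trunc(total + 2048, 4096), n_bits)


# Hypothetical window: every row contains the same horizontal edge.
window = [[10, 10, 10, 100, 100, 100] for _ in range(6)]
print([direct_interpolate(window, fx, 0) for fx in range(4)])
```

For the hypothetical window above, the four horizontal offsets at unit vertical location give 10, 33, 55 and 78. Because each coefficient set sums to 64, two passes scale the result by 4096, which is why a single normalisation of (c + 2048)/4096 suffices.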
It should be noted that in TML6, ¼ resolution sub-pixel values are obtained directly using the intermediate values referred to above and are not derived from rounded and clipped values for ½ resolution sub-pixels. Therefore, in obtaining the ¼ resolution sub-pixel values, it is not necessary to calculate final values for any of the ½ resolution sub-pixels. Specifically, it is not necessary to carry out the truncation and clipping operations associated with the calculation of final values for the ½ resolution sub-pixels. Neither is it necessary to have stored final values for ½ resolution sub-pixels for use in the calculation of the ¼ resolution sub-pixel values. Therefore TML6 is computationally less complex than TML5, as fewer truncating and clipping operations are required. However, a disadvantage of TML6 is that high precision arithmetic is required both in the encoder and in the decoder. High precision interpolation requires more silicon area in ASICs and requires more computations in some CPUs. Furthermore, implementation of direct interpolation as specified in TML6 in an on-demand fashion has a high memory requirement. This is an important factor, particularly in embedded devices.
In view of the previously presented discussion, it should be appreciated that due to the different requirements of the video encoder and decoder with regard to sub-pixel interpolation, there exists a significant problem in developing a method of sub-pixel value interpolation capable of providing satisfactory performance in both the encoder and decoder. Furthermore, neither of the current test models (TML5, TML6) described in the foregoing can provide a solution that is optimum for application in both encoder and decoder.