1. Field of the Invention
The present invention relates to a video coding/decoding system, and to a video coder and a video decoder used with that system, for implementing a motion compensation method in which the pixels associated with the same patch are not restricted to a common motion vector and in which the horizontal and vertical components of a pixel's motion vector may assume arbitrary values, not merely integral multiples of the distance between adjacent pixels.
2. Description of the Related Art
In the high-efficiency coding and decoding of image sequences, motion compensation, which exploits the similarity between temporally proximate frames, is well known to offer a great advantage in compressing the amount of information.
FIGS. 1A and 1B are diagrams showing a general circuit configuration of a video coder 1 and a video decoder 2 to which the motion compensation method described above is applied.
In FIG. 1A, a frame memory 2-1 stores a reference image R, the decoded image of the previous, already coded frame. A motion estimation section 3-1 estimates the motion and outputs motion information using the original image I of the current frame to be coded and the reference image R read out of the frame memory 2-1. A predicted image synthesis circuit 4-1 synthesizes a predicted image P for the original image I using the motion information and the reference image R. A subtractor 5-1 calculates the difference between the original image I and the predicted image P and outputs a prediction error. The prediction error is subjected to the DCT or a similar transform at a prediction error coder 6-1, and the resulting prediction error information is transmitted together with the motion information to the receiving end. At the same time, the prediction error information is decoded by the inverse DCT or the like at a prediction error decoder 7-1. An adder 8-1 adds the decoded prediction error to the predicted image P and outputs the decoded image of the current frame, which is newly stored in the frame memory 2-1 as the reference image R.
In FIG. 1B, a frame memory 2-2 stores a reference image R, the decoded image of the previous frame. A synthesis circuit 4-2 synthesizes a predicted image P using the reference image R read out of the frame memory 2-2 and the received motion information. The received prediction error information is decoded by the inverse DCT or the like at a prediction error decoder 7-2. An adder 8-2 adds the decoded prediction error to the predicted image P and outputs the decoded image of the current frame, which is newly stored in the frame memory 2-2 as the reference image R.
The motion compensation method constituting the mainstream of current video coding and decoding techniques is the "block matching of half-pixel accuracy" employed by MPEG1 and MPEG2, the international standards for video coding/decoding.
In the "block matching of half-pixel accuracy", the original image of the current frame to be coded is segmented into a number n of blocks at the motion estimation section 3-1 in FIG. 1A, and a motion vector is determined for each block as motion information. The horizontal and vertical components of this motion vector have a minimum unit length equal to one half of the distance between horizontally and vertically adjacent pixels, respectively. In the description that follows, let the horizontal component of the motion vector of the ith block (1≤i≤n) be ui and the vertical component thereof be vi. In the method most widely used for estimating the motion vector (ui,vi), a search range such as −15≤ui≤15, −15≤vi≤15 is predetermined, and the motion vector (ui,vi) which minimizes the prediction error Ei(ui,vi) in the block is searched for. The prediction error Ei(ui,vi) is expressed by Equation 1 using the mean absolute error (MAE) as the evaluation criterion.
Ei(ui,vi) = (1/Ni) Σ_{(x,y)∈Bi} |I(x,y) − R(x−ui, y−vi)|    (1)
In Equation 1, I(x,y) denotes the original image of the current frame to be coded, and R(x,y) the reference image stored in memory. It is assumed that pixels exist at the points whose x and y coordinates are integers on the original image I and the reference image R. Bi designates the set of pixels contained in the ith block of the original image I, and Ni the number of pixels contained in that block. The process of evaluating the prediction error for candidate motion vectors of a block and searching for the motion vector associated with the smallest prediction error is called matching. Also, the process of calculating Ei(ui,vi) for every vector (ui,vi) conceivable within a predetermined search range and selecting the vector yielding the minimum value is called the full search.
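The full search of Equation 1 can be sketched as follows. This is an illustration only, not part of the specification; images are assumed to be 2-D arrays indexed as I[y, x], and the small search range is a placeholder for the ±15 range mentioned above.

```python
import numpy as np

def full_search(I, R, block, search=2):
    """Integer-accuracy full search for one block.

    I, R  : 2-D numpy arrays (current frame, reference frame), indexed [y, x]
    block : (y0, x0, size) -- top-left corner and side length of block Bi
    search: candidate vectors (ui, vi) are scanned over [-search, search]^2
    Returns the (ui, vi) minimizing the mean absolute error of Equation 1.
    """
    y0, x0, s = block
    target = I[y0:y0 + s, x0:x0 + s].astype(np.float64)
    best, best_mae = (0, 0), np.inf
    for vi in range(-search, search + 1):
        for ui in range(-search, search + 1):
            ys, xs = y0 - vi, x0 - ui          # block position R(x - ui, y - vi)
            if ys < 0 or xs < 0 or ys + s > R.shape[0] or xs + s > R.shape[1]:
                continue                        # candidate leaves the reference frame
            mae = np.abs(target - R[ys:ys + s, xs:xs + s]).mean()
            if mae < best_mae:
                best_mae, best = mae, (ui, vi)
    return best
```

If a block of I is an exact shifted copy of R, the search recovers that shift as the motion vector.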
In the motion estimation for the "block matching of half-pixel accuracy", ui and vi are determined with one half of the distance between adjacent pixels, i.e., ½, as the minimum unit. As a result, (x−ui, y−vi) is not necessarily a pair of integers, and the luminance value of a point lacking a pixel on the reference image R must actually be determined when calculating the prediction error using Equation 1. The process of determining the luminance value of a point lacking a pixel is called interpolation, and the point where interpolation is effected is referred to as an interpolated point or an intermediate point. Bilinear interpolation, which uses the four pixels around the interpolated point, is often used as the interpolation process.
When the process of bilinear interpolation is described as a formula, the luminance value R(x+p, y+q) at the interpolated point (x+p, y+q) of the reference image R can be expressed by Equation 2, with the fractional components of the coordinate values of the interpolated point given as p and q (0≤p<1, 0≤q<1).
R(x+p, y+q) = (1−q){(1−p)R(x,y) + pR(x+1,y)} + q{(1−p)R(x,y+1) + pR(x+1,y+1)}    (2)
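Equation 2 translates directly into code; the sketch below (illustrative only, with R indexed as R[y][x]) accepts arbitrary fractional coordinates and splits them into integer and fractional parts itself.

```python
import math

def bilinear(R, x, y):
    """Luminance at the interpolated point (x, y) of reference image R
    per Equation 2.  R is indexed as R[y][x]; x and y may be fractional."""
    xi, yi = math.floor(x), math.floor(y)
    p, q = x - xi, y - yi                      # fractional parts, 0 <= p, q < 1
    return ((1 - q) * ((1 - p) * R[yi][xi]     + p * R[yi][xi + 1]) +
            q       * ((1 - p) * R[yi + 1][xi] + p * R[yi + 1][xi + 1]))
```

At p = q = ½ this reduces to the average of the four surrounding pixels, the half-pel centre point used in the two-step search below.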
In the motion estimation by "block matching of half-pixel accuracy", a two-step search is widely used in which, first, a full search of single-pixel accuracy is effected over a wide search range to estimate a motion vector approximately, followed by a full search of half-pixel accuracy over a very small range, say, plus or minus half a pixel in the horizontal and vertical directions around that motion vector. In the second-step search, a method is frequently used in which the luminance values of the interpolated points on the reference image R are determined in advance. An example of the process according to this method is shown in FIGS. 2A, B, C and D. In this example, a block containing four pixels each in the longitudinal and lateral directions is used. In FIGS. 2A, B, C and D, the points assuming integral coordinate values and originally having a pixel in the reference image are expressed by a white circle, and the interpolated points for which a luminance value is newly determined are represented by a cross (X). Also, the pixels in a block of the original image of the current frame are expressed by a white square. The motion vector obtained by the first-step search is assumed to be (uc,vc). FIG. 2A shows the state of matching for the motion vector (uc,vc) in the first-step search. The prediction error is evaluated between each overlapped pair of a circle and a square. FIGS. 2B, C and D show the cases in which the motion vector is (uc+½,vc), (uc+½,vc+½) and (uc−½,vc−½), respectively, in the second-step search. The prediction error is evaluated between each overlapped pair of a cross and a square in FIGS. 2B, C and D.
As seen from these drawings, in the case where the range for the second-step search is ±½ pixel in each of the longitudinal and lateral directions, the matching process for the eight motion vectors (uc,vc±½), (uc±½,vc), (uc+½,vc±½), (uc−½,vc±½) can be accomplished by determining the luminance values of 65 interpolated points (the number of crosses in each drawing) in advance. In the process, all the interpolated points whose luminance values were determined are used for matching.
On the other hand, if the interpolation calculation were made on the reference image at each matching, a total of 128 (=16×8, where 16 is the number of white squares in FIGS. 2B, C and D, and 8 is the number of times the matching is made) interpolations would be required.
As described above, since the same interpolated point on the reference image R is used a plurality of times, the number of interpolation operations can be reduced by determining the luminance values of the interpolated points on the reference image R in advance.
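The precomputation described above can be sketched as a one-time construction of a half-pel plane by the bilinear interpolation of Equation 2; afterwards every half-pel luminance is a plain table lookup. This is a minimal illustration assuming a numpy array as the reference image.

```python
import numpy as np

def half_pel_plane(R):
    """Precompute the luminance at every half-pel position of R by the
    bilinear interpolation of Equation 2.  In the result H,
    H[2y, 2x] equals R[y, x]; the remaining entries are interpolated."""
    h, w = R.shape
    H = np.empty((2 * h - 1, 2 * w - 1), dtype=np.float64)
    H[::2, ::2] = R                                     # original pixels
    H[::2, 1::2] = (R[:, :-1] + R[:, 1:]) / 2           # horizontal midpoints
    H[1::2, ::2] = (R[:-1, :] + R[1:, :]) / 2           # vertical midpoints
    H[1::2, 1::2] = (R[:-1, :-1] + R[:-1, 1:] +
                     R[1:, :-1] + R[1:, 1:]) / 4        # centre points
    return H
```

Each interpolated value is computed exactly once, matching the 65-point precomputation of the figures instead of the 128 on-the-fly interpolations.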
Also, in the "block matching of half-pixel accuracy", a predicted image is synthesized using the relation of Equation 3 in the synthesis circuits 4-1 and 4-2 shown in FIGS. 1A and 1B.
P(x,y) = R(x−ui, y−vi), (x,y)∈Bi (1≤i≤n)    (3)
In Equation 3, P(x,y) denotes the prediction of the original image I(x,y) of the current frame to be coded, obtained by use of the reference image R(x,y) and the motion vector (ui,vi). Also, assuming that the predicted image P is segmented into a number n of blocks corresponding to those of the original image I, Bi represents the set of pixels contained in the ith block of the predicted image P.
In the "block matching of half-pixel accuracy", as described above, the value of (x−ui, y−vi) is not necessarily a pair of integers, and therefore an interpolation process such as the bilinear interpolation of Equation 2 is carried out in synthesizing the predicted image.
The "block matching of half-pixel accuracy" is currently widely used as a motion compensation method. Applications requiring an information compression ratio higher than that of MPEG1 and MPEG2, however, demand an even more sophisticated motion compensation method. The disadvantage of the "block matching" method is that all the pixels in the same block are required to have the same motion vector.
In order to solve this problem, a motion compensation method allowing adjacent pixels to have different motion vectors has recently been proposed. The "motion compensation based on spatial transformation", which is an example of such a method, is briefly explained below.
In the "motion compensation based on spatial transformation", the relation between the predicted image P and the reference image R in synthesizing a predicted image at the synthesis circuits 4-1 and 4-2 in FIGS. 1A and 1B is expressed by Equation 4 below.
P(x,y) = R(fi(x,y), gi(x,y)), (x,y)∈Pi (1≤i≤n)    (4)
In Equation 4, on the assumption that the predicted image P is segmented into a number n of patches corresponding to those of the original image I, Pi represents the set of pixels contained in the ith patch of the predicted image P. The transformation functions fi(x,y) and gi(x,y) represent the spatial correspondence between the predicted image P and the reference image R, and the motion vector of a pixel (x,y) in Pi can be represented by (x−fi(x,y), y−gi(x,y)). The predicted image P is synthesized by calculating the transformation functions fi(x,y) and gi(x,y) for each pixel in each patch and determining the luminance value of the corresponding point in the reference image R in accordance with Equation 4. In the process, (fi(x,y), gi(x,y)) is not necessarily a pair of integers, and therefore an interpolation process such as the bilinear interpolation of Equation 2 is performed, as in the "block matching of half-pixel accuracy".
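The synthesis of Equation 4, combined with the bilinear interpolation of Equation 2, can be sketched as follows. This is an illustration only; the patch representation (a set of pixel coordinates plus two transformation functions) is a simplifying assumption, and the reference image is indexed as R[y][x].

```python
import math

def synthesize(R, patches):
    """Predicted image P per Equation 4: P(x,y) = R(fi(x,y), gi(x,y)).

    R       : reference image as a list of rows, indexed R[y][x]
    patches : list of (pixels, f, g), where `pixels` is the set Pi of
              (x, y) pairs and f, g are the patch's transformation functions.
    Luminance at a non-integer (f, g) is bilinearly interpolated (Equation 2).
    Returns P as a dict mapping (x, y) to the predicted luminance.
    """
    P = {}
    for pixels, f, g in patches:
        for (x, y) in pixels:
            fx, gy = f(x, y), g(x, y)              # corresponding point on R
            xi, yi = math.floor(fx), math.floor(gy)
            p, q = fx - xi, gy - yi
            P[(x, y)] = ((1 - q) * ((1 - p) * R[yi][xi] + p * R[yi][xi + 1])
                         + q * ((1 - p) * R[yi + 1][xi] + p * R[yi + 1][xi + 1]))
    return P
```

With a pure half-pel translation as f and g, each predicted pixel is the average of four reference pixels, as expected from Equation 2.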
The "block matching" can be interpreted as a special case of the "motion compensation based on spatial transformation" in which the transformation functions yield a constant motion vector within each block.
Nevertheless, the words "motion compensation based on spatial transformation" as used in the present specification are not assumed to include the "block matching".
Examples of the transformation functions fi(x,y) and gi(x,y) in the "motion compensation based on spatial transformation" include the case using the affine transformation of Equation 5 (refer to "Basic Study of Motion Compensation Based on Triangular Patches" by Nakaya et al., Technical Report of IEICE, IE90-106, H2-03) shown below,
fi(x,y) = ai1x + ai2y + ai3,  gi(x,y) = ai4x + ai5y + ai6,    (5)
the case using the bilinear transformation given in Equation 6 (G. J. Sullivan and R. L. Baker, "Motion compensation for video compression using control grid interpolation", Proc. ICASSP '91, M9.1, pp. 2713-2716, 1991-05) shown below,
fi(x,y) = bi1xy + bi2x + bi3y + bi4,  gi(x,y) = bi5xy + bi6x + bi7y + bi8,    (6)
and the case using the perspective transformation given in Equation 7 (V. Seferdis and M. Ghanbari, "General approach to block-matching motion estimation", Optical Engineering, vol. 32, no. 7, pp. 1464-1474, 1993-07) shown below.

fi(x,y) = (ci4x + ci5y + ci6) / (ci1x + ci2y + ci3),  gi(x,y) = (ci7x + ci8y + ci9) / (ci1x + ci2y + ci3)    (7)
In Equations 5, 6 and 7, aij (j = 1 to 6), bij (j = 1 to 8) and cij (j = 1 to 9) designate the motion parameters estimated for each patch as motion information at the motion estimation section 3-1 in FIG. 1A. An image identical to the predicted image P produced at the synthesis circuit 4-1 of the video coder 1 can be obtained at the synthesis circuit 4-2 of the video decoder 2 at the receiving end, provided that information capable of specifying the motion parameters of the transformation functions of each patch in some form is transmitted by the video coder 1 as motion information to the video decoder 2. Assume, for example, that the affine transformation (Equation 5) is used as the transformation function and the patch is triangular in shape. In such a case, the six motion parameters can be transmitted directly as motion information. Alternatively, the motion vectors of the three vertices of the patch may be transmitted, so that the six motion parameters of Equation 5 are calculated from the motion vectors of the three vertices at the receiving end. Similarly, in the case where the bilinear transformation (Equation 6) is used as the transformation function with a quadrilateral patch, either the eight motion parameters or the motion vectors of the four vertices of the patch may be transmitted.
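The recovery of the six affine parameters of Equation 5 from the motion vectors of three vertices amounts to solving two 3×3 linear systems, one for fi and one for gi. The sketch below is illustrative (the vertex representation is an assumption, not taken from the specification).

```python
import numpy as np

def affine_from_vertices(src, dst):
    """Recover the six affine parameters of Equation 5 from the motion of
    the three vertices of a triangular patch.

    src : three (x, y) vertex positions in the current frame
    dst : the corresponding (x, y) positions on the reference image,
          so vertex k carries the motion vector src[k] - dst[k].
    Returns (a1, ..., a6) with fi = a1*x + a2*y + a3, gi = a4*x + a5*y + a6.
    """
    A = np.array([[x, y, 1.0] for (x, y) in src])
    fx = np.linalg.solve(A, np.array([x for (x, _) in dst]))   # a1, a2, a3
    gy = np.linalg.solve(A, np.array([y for (_, y) in dst]))   # a4, a5, a6
    return (*fx, *gy)
```

For a pure translation of all three vertices, the solution reduces to a1 = a5 = 1, a2 = a4 = 0, with a3 and a6 carrying the translation, as expected.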
The following explanation refers to the case using the affine transformation (Equation 5) as the transformation function. The explanation applies substantially unchanged to the cases where the other transformations (Equations 6, 7, etc.) are employed.
Even after a transformation function is established, many variations are conceivable for the "motion compensation based on spatial transformation". An example is shown in FIG. 3. In this case, the motion vector is restricted so as to change continuously at the patch boundaries. First, an original image I202 of the current frame is segmented into a plurality of polygonal patches, thereby constituting a patch-segmented original image I208. The vertices of these patches are called grid points, each of which is shared by a plurality of patches. A patch 209 in FIG. 3, for example, is composed of grid points 205, 206, 207, which also function as vertices of other patches. After the original image I202 is segmented into a plurality of patches in this way, motion estimation is performed. In the example shown, motion estimation is performed against a reference image R201 with respect to each grid point. As a result, each patch is deformed on the reference image R203 after motion estimation. The patch 209, for instance, corresponds to the deformed patch 204, by reason of the fact that the grid points 205, 206, 207 on the original image I208 are estimated to have been translated to the grid points 210, 211, 212, respectively, on the reference image R203. Since most of the grid points are shared by multiple patches in this example, the amount of transmitted data can be reduced by transmitting the motion vectors of the grid points rather than the affine transformation parameters of each patch.
In the "motion compensation based on spatial transformation", as in the "block matching", it has been pointed out that motion estimation based on matching is effective. An example algorithm for motion estimation based on matching is described below. This scheme is called the "hexagonal matching" and is effectively applied to the case where the motion vector changes continuously at the patch boundaries. The scheme is configured of two processes:
(1) Coarse motion estimation of grid points by "block matching"; and
(2) Correction of the motion vectors by a "refinement algorithm".
In process (1), block matching is applied to a block of a given size containing a grid point, and the motion vector of this block is adopted as a coarse motion vector for the grid points existing in the particular block. The object of process (1) is merely to determine a coarse motion vector for each grid point, and need not necessarily be achieved by block matching. The manner in which process (2) is carried out is shown in FIG. 4. FIG. 4 shows a part of the patches and grid points in the reference image R, corresponding to the reference image R203 in FIG. 3. Thus, changing the position of a grid point in FIG. 4 amounts to changing the motion vector of that grid point. In refining the motion vector of the grid point 301, the first thing to do is to fix the motion vectors of the grid points 303 to 308 representing the vertices of the polygon 302 configured of all the patches involving the grid point 301. The motion vector of the grid point 301 is then changed within a predetermined search range; for example, the grid point 301 is translated to the position of the grid point 309. As a result, the prediction error within each patch contained in the polygon 302 also undergoes a change. The motion vector minimizing the prediction error within the polygon 302 over the search range is registered as the refined motion vector of the grid point 301. The refinement of the motion vector of the grid point 301 is thus completed, and a similar operation of refinement is continued at another grid point. Once all the grid points are refined, the prediction error can be further reduced by repeating the refinement from the first grid point. The appropriate number of repetitions of the refinement process is reported to be two or three.
A typical search range for the refinement algorithm is ±3 pixels in each of the horizontal and vertical directions. In such a case, a total of 49 (=7×7) matching operations is performed for each grid point of the polygon 302. Since each patch is involved in the refinement of three grid points, it follows that a total of 147 (=49×3) evaluations of the prediction error is performed for each pixel in a patch. Further, each repetition of the refinement process increases the number of prediction error evaluations correspondingly. Consequently, at each evaluation of the prediction error, interpolation computations are carried out for the interpolated points involved on the reference image, enormously increasing the amount of computation.
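The refinement step of process (2) can be sketched as the following loop. This is a simplified illustration, not the algorithm of the specification: the prediction-error evaluation over the surrounding polygon is abstracted into a caller-supplied `error` function, and grid positions are moved in whole-pixel steps.

```python
def refine_grid(grid, error, search=3, passes=2):
    """Sketch of the refinement algorithm of hexagonal matching.

    grid   : dict mapping grid-point id -> (x, y) position on the reference
    error  : callable error(grid, gid) returning the prediction error over
             the polygon of patches surrounding grid point `gid`
             (a stand-in for the per-patch evaluation described in the text)
    Each grid point is moved, with its neighbours held fixed, to the position
    in a (2*search+1)^2 neighbourhood minimizing the error; the scan over all
    points is repeated `passes` times (two or three, per the text).
    """
    for _ in range(passes):
        for gid in grid:
            x0, y0 = grid[gid]
            best, best_err = (x0, y0), error(grid, gid)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    grid[gid] = (x0 + dx, y0 + dy)   # trial position
                    e = error(grid, gid)
                    if e < best_err:
                        best_err, best = e, grid[gid]
            grid[gid] = best                          # register refined vector
    return grid
```

A toy error function with a unique minimum shows why repeating the pass helps: a minimum outside the ±3 window is reached only on the second pass.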
The problem of the interpolation computations in the motion estimation for the "motion compensation based on spatial transformation" is complicated by an essential difference from the corresponding problem in the motion estimation for the "block matching of half-pixel accuracy". In the "motion compensation based on spatial transformation", even when the horizontal and vertical components of the motion vector of each grid point are restricted to integral multiples of ½, the horizontal and vertical components of the motion vector of each pixel in a patch are not necessarily integral multiples of ½. Also, in view of the fact that the fractional components of the motion vector of each pixel in a patch can generally assume arbitrary values, the luminance value of the same interpolated point on the reference image R is rarely used a plurality of times in the matching operation.
A further feature of the "motion compensation based on spatial transformation" is that a numerical operation is required for determining the motion vector of each pixel. In the case where the computation accuracy in computing a motion vector (i.e., the transformation functions) differs between the transmitting and receiving ends, a mismatch may occur in which the predicted image P obtained at the synthesis circuit 4-1 of the video coder 1 differs from the predicted image P produced at the synthesis circuit 4-2 of the video decoder 2. This mismatch of the predicted images has the property of accumulating at the receiving end. Even when the error is small for each frame, therefore, the quality of the decoded image output from the video decoder 2 may be seriously affected in the end. This problem is not posed by the "block matching", in which all the pixels in a block follow the same motion vector and this motion vector is coded and transmitted directly as motion information.
An example employing the affine transformation (Equation 5) as the transformation function is explained with regard to this problem. One way of solving the problem is to enhance the computation accuracy of Equation 5 sufficiently, so that its computation error is kept well below the quantization step size of the luminance value. A case using this solution is studied below.
Assume, for example, that the luminance value is quantized in 8 bits with a quantization step size of 1, the maximum luminance value being 255 (11111111) and the minimum value 0 (00000000). Also, assume that the luminance values of four adjacent pixels on the reference image R are R(0,0)=0, R(0,1)=0, R(1,0)=255 and R(1,1)=255, respectively. Further, it is assumed that the computation of Equation 5 is carried out to determine fi(x,y) when the horizontal and vertical coordinates of the point on the reference image R corresponding to a pixel P(x,y) of the predicted image P satisfy 0 < fi(x,y) < 1 and 0 < gi(x,y) < 1, respectively. This condition is hereinafter referred to as the worst condition.
Under this worst condition, a computation error of fi(x,y) more than 1/255 in magnitude always leads to an error in the quantized value of the luminance. For a mismatch to be prevented, therefore, both the video coder 1 and the video decoder 2 must be fabricated in such a manner that the computation error of Equation 5 is kept sufficiently smaller than 1/255. Improving the computation accuracy, however, generally increases the number of digits of the internal expression of a numerical value, thereby complicating the computation process. In the motion compensation process, Equation 5 is computed so frequently that any added complication of this computation has a serious adverse effect on the total amount of processing.
With the "motion compensation based on spatial transformation", motion estimation based on matching poses the problem of a greatly increased amount of computation required for interpolating the luminance values at points lacking a pixel on the reference image R. A more complicated computation is another problem posed if the computation accuracy for synthesizing the predicted image P in the video coder and the video decoder is to be improved in order to prevent a mismatch between the predicted image P obtained at the sending end and the predicted image P obtained at the receiving end.
An object of the present invention is to realize a motion estimation process with a small amount of computations by reducing the number of calculations for interpolation of luminance values.
Another object of the invention is to provide a method of reducing the computation accuracy required for the transformation functions at the time of synthesizing a predicted image P, while also preventing the mismatch between the predicted images P attributable to the computation accuracy of the transformation functions.
Prior to motion estimation, a high-resolution reference image R′ is prepared, in which the luminance value of each point whose x and y coordinates are integral multiples of 1/m1 and 1/m2 (m1 and m2 being positive integers), respectively, is determined by interpolation on the reference image R. It follows that in the high-resolution reference image R′, pixels exist at the points whose x and y coordinate values are integral multiples of 1/m1 and 1/m2, respectively. In the case where the luminance value of the reference image R at a position having non-integral coordinate values becomes necessary in the process of motion estimation, that value is approximated by the luminance value of the pixel of the high-resolution reference image R′ nearest to the particular position. The object of reducing the number of interpolation computations is thus achieved.
In the above-mentioned process for preparing the high-resolution reference image R′, m1×m2−1 interpolation computations per pixel of the original image I are required. Once this interpolation process for achieving the high resolution is completed, however, the motion estimation process requires no further interpolation computations. In the case of the "motion compensation based on spatial transformation" described with reference to the related art above, more than 147 interpolation computations are required for each pixel in the motion estimation. When m1=m2=2, the number of required interpolation computations is not more than three per pixel, or about one fiftieth of the conventional requirement. Even when m1=m2=4, the number of required interpolation computations is only 15, which is as small as about one tenth. The amount of computation can thus be reduced remarkably.
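The preparation of R′ and the subsequent nearest-sample lookup can be sketched as follows. This is an illustration only, assuming a numpy reference image and the bilinear interpolation of Equation 2; the function names are hypothetical.

```python
import numpy as np

def upsample(R, m1=2, m2=2):
    """Prepare the high-resolution reference R', whose pixels sit at x, y
    coordinates that are integral multiples of 1/m1 and 1/m2 (bilinear
    interpolation of Equation 2, done once, before motion estimation)."""
    h, w = R.shape
    ys = np.arange((h - 1) * m2 + 1) / m2              # sample y coordinates
    xs = np.arange((w - 1) * m1 + 1) / m1              # sample x coordinates
    yi = np.minimum(ys.astype(int), h - 2)
    xi = np.minimum(xs.astype(int), w - 2)
    q = (ys - yi)[:, None]                             # fractional parts
    p = (xs - xi)[None, :]
    return ((1 - q) * ((1 - p) * R[yi][:, xi] + p * R[yi][:, xi + 1])
            + q * ((1 - p) * R[yi + 1][:, xi] + p * R[yi + 1][:, xi + 1]))

def lookup(Rp, x, y, m1=2, m2=2):
    """Approximate R(x, y) by the nearest pixel of the high-resolution R' --
    a table read, with no interpolation during motion estimation."""
    return Rp[int(round(y * m2)), int(round(x * m1))]
```

Every luminance request during matching then costs one rounding and one array access, regardless of how many times the refinement loop re-evaluates the prediction error.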
Also, the horizontal and vertical components of the motion vector of each pixel used for synthesizing the predicted image P in the video coder and the video decoder are defined to take only values equal to integral multiples of 1/d1 and 1/d2 (d1 and d2 being positive integers), respectively, of the distance between adjacent pixels. The object of reducing the required computation accuracy of the transformation functions and preventing a mismatch is thus achieved.
In the case where the above-mentioned rule on motion vectors is employed, the magnitude of the computation error of the transformation function fi(x,y) that always leads to an error in the quantized value of the luminance under the "worst condition" described with reference to the related art above is 1/d1. Supposing d1=4, for example, the risk of causing a mismatch of the predicted images under the "worst condition" is maintained at substantially the same level even when the computation accuracy of fi(x,y) is reduced by 6 bits as compared with the solution described above with reference to the related art.
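The effect of the rule can be sketched in one line: each motion-vector component is snapped to the nearest multiple of 1/d1 before synthesis, so two implementations whose raw computations of fi(x,y) differ by a small round-off error still agree exactly after quantization. The numeric values below are hypothetical, chosen only to illustrate the idea.

```python
def quantize(v, d):
    """Snap a motion-vector (or mapped-coordinate) component to the
    nearest integral multiple of 1/d, per the rule described above."""
    return round(v * d) / d

# Coder and decoder compute fi(x, y) with slightly different round-off
# (hypothetical values near 1/4); after quantization they coincide,
# so no mismatch of the predicted images can accumulate.
fi_coder = 0.2500004
fi_decoder = 0.2499997
assert quantize(fi_coder, 4) == quantize(fi_decoder, 4) == 0.25
```

As long as each end's raw error stays below half a quantization step (1/(2·d1)), the quantized components, and hence the synthesized predicted images, are identical at both ends.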
The foregoing and other objects, advantages, manner of operation and novel features of the present invention will be understood from the following detailed description when read in conjunction with the accompanying drawings.