In video encoding that uses inter-frame prediction (motion compensation), in which prediction is performed between different frames, a motion vector is obtained by referring to an already decoded frame so as to minimize, for example, the prediction error energy. The residual signal is then subjected to an orthogonal transform, quantized, and entropy-encoded to obtain binary data. To increase the coding efficiency, it is indispensable to reduce the prediction error energy, and thus a prediction scheme that provides higher prediction accuracy is required.
A great number of tools that increase the accuracy of inter-frame prediction have been introduced into video encoding standards. For example, if there is occlusion in a nearest frame, the prediction error energy can be further reduced by referring to a frame that is distant in the time domain to some extent, and thus, in H.264/AVC, a plurality of frames can be referred to. This tool is called multiple reference frame prediction.
In addition, in order to deal with motions having complex shapes, a block can be subdivided into sizes such as 16×8, 8×16, 8×4, 4×8, and 4×4, in addition to 16×16 and 8×8. This tool is called variable block size prediction.
Furthermore, pixels of ½ accuracy are interpolated from integer-accuracy pixels of a reference frame using a 6-tap filter, and pixels of ¼ accuracy are then generated by linear interpolation using these pixels. Accordingly, accurate prediction becomes possible for motions of fractional accuracy. This tool is called ¼ pixel accuracy prediction. In this Description, “fractional accuracy” refers to arranging a motion vector on a fractional-pixel basis having higher accuracy than that of an integer multiple of a pixel spacing or arranging a motion vector at an integer multiple position. For example, the position obtained by dividing each pixel spacing into two equal halves is called ½ accuracy, and the positions obtained by dividing each pixel spacing into three equal parts are called ⅓ accuracy.
In order to develop next-generation video coding standards that provide higher coding efficiency than that of the H.264/AVC, various proposals are now being gathered from all over the world by the international organization for standardization ISO/IEC “MPEG” (International Organization for Standardization/International Electrotechnical Commission “Moving Picture Experts Group”) and ITU-T “VCEG” (International Telecommunication Union-Telecommunication Standardization Sector “Video Coding Experts Group”). Among them, in particular, many proposals relating to inter-frame prediction (motion compensation) have been presented, and software for the next-generation video coding that is being drawn up under the leadership of the VCEG (hereinafter referred to as KTA (Key Technical Area) software) employs a tool for reducing the bit-rates of motion vectors, a tool for extending the block size to 16×16 or larger, and the like.
In particular, a tool for adaptively changing interpolation filter coefficients for a fractional-accuracy pixel is called an adaptive interpolation filter. It is effective for almost all pictures, and it was first adopted in the KTA software. This technology is also employed in many contributions submitted in response to the call for proposals for a next-generation video coding test model issued by the JCT-VC (Joint Collaborative Team on Video Coding), which is being jointly promoted by the MPEG and the VCEG for development of next-generation video coding standards.
Moreover, in addition to the adaptive interpolation filter, methods for improving a fixed interpolation filter have also been proposed, in which a plurality of sets of fixed interpolation filter coefficients are prepared, an optimum set is selected therefrom, and interpolation is performed; these methods have been introduced into the test model under consideration (TMuC) of the JCT-VC. Since the above-described methods for improving the interpolation filter contribute greatly to an improvement in coding efficiency, interpolation filtering is one of the most promising fields, along with in-loop filters and block size extension (a method using a size of 16×16, which is the conventional size, or larger, such as 32×32 and 64×64), which are similarly highly effective.
[Fixed Interpolation]
In the H.264/AVC, as shown in FIG. 14, when the position of a ½ pixel is interpolated, interpolation is performed using 6 integer pixels including three points on the left side of the pixel to be interpolated and three points on the right side of the pixel to be interpolated. With respect to the vertical direction, interpolation is performed using 6 integer pixels including three points on the upper side and three points on the lower side. The filter coefficients are [(1, −5, 20, 20, −5, 1)/32]. After the positions of ½ pixels have been interpolated, the positions of ¼ pixels are interpolated using a mean filter of [½, ½]. Since it is necessary to interpolate the positions of all the ½ pixels, the computational complexity is high, but high-performance interpolation is possible, so that the coding efficiency is improved. Non-Patent Document 1 and Non-Patent Document 2 disclose the details of the above fixed interpolation filter.
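The fixed interpolation described above can be illustrated with a minimal one-dimensional sketch. It assumes 8-bit samples, integer rounding of the form (sum + 16) >> 5, and clamping of indices at the row borders; the function names `half_pel` and `quarter_pel` are hypothetical and are used only for illustration.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # 6-tap filter, normalized by 32


def half_pel(row, i):
    """Half-sample value between row[i] and row[i + 1].

    Uses three integer pixels on each side of the interpolated position.
    Border handling (index clamping) is an assumption of this sketch.
    """
    n = len(row)
    acc = 0
    for k, tap in enumerate(TAPS):
        idx = min(max(i - 2 + k, 0), n - 1)  # clamp at the row borders
        acc += tap * row[idx]
    value = (acc + 16) >> 5                  # divide by 32 with rounding
    return min(max(value, 0), 255)           # clip to the 8-bit range


def quarter_pel(row, i):
    """Quarter-sample value between row[i] and the half sample to its right.

    Implements the mean filter [1/2, 1/2] with rounding.
    """
    return (row[i] + half_pel(row, i) + 1) >> 1
```

For a step edge such as `[0, 0, 0, 255, 255, 255]`, `half_pel(row, 2)` evaluates the position in the middle of the edge, and `quarter_pel(row, 2)` averages that value with the integer pixel to its left.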
In order to improve the performance of an interpolation filter of the H.264/AVC, a technology in which a plurality of sets of fixed interpolation filter coefficients are prepared and the interpolation filter coefficients are flexibly switched for each frame has been proposed. This scheme is called a switched interpolation filter with offset (hereinafter referred to as SIFO), and it is a technology of improving the coding efficiency by separately calculating an offset for adjusting a luminance signal and transmitting the offset in addition to an interpolation filter. As an improvement of this mechanism, in order to reduce the computational cost for switching the interpolation filter, a technology of executing, in a single pass, a determination of an interpolation filter used in the current frame using information on past frames that have been encoded has also been proposed. The above matters are disclosed in Non-Patent Document 3 and Non-Patent Document 4.
[Adaptive Interpolation]
In the H.264/AVC, the values of filter coefficients are constant, irrespective of conditions of an input picture (the type of a sequence, the size of a picture, and a frame rate) and encoding conditions (the block size, the structure of a GOP (group of pictures), and QP (quantization parameters)). When the values of filter coefficients are fixed, for example, effects that vary over time, such as aliasing, a quantization error, an error resulting from motion estimation, and camera noise, are not taken into consideration. Therefore, it is considered that improvement in performance is limited in terms of the coding efficiency. Accordingly, Non-Patent Document 5 proposes a scheme of adaptively changing interpolation filter coefficients, which is called a non-separable adaptive interpolation filter.
In Non-Patent Document 5, a two-dimensional interpolation filter (6×6=36 filter coefficients) is intended, and the filter coefficients are determined so as to minimize the prediction error energy. Although it is possible to realize higher coding efficiency than that obtained by using a one-dimensional 6-tap fixed interpolation filter employed in the H.264/AVC, the computational complexity for obtaining filter coefficients is very high, and thus Non-Patent Document 6 introduces a proposal for reducing the computational complexity.
The technique introduced in the Non-Patent Document 6 is called a separable adaptive interpolation filter (SAIF), and it uses a one-dimensional 6-tap interpolation filter rather than a two-dimensional interpolation filter.
FIG. 15A to FIG. 15C are diagrams illustrating a method for interpolating a pixel of non-integer accuracy in the separable adaptive interpolation filter (SAIF). In this procedure, first, as shown by step 1 in FIG. 15B, pixels in the horizontal direction (a, b, and c) are interpolated. Integer-accuracy pixels C1 to C6 are used for determining filter coefficients. Filter coefficients in the horizontal direction that minimize a prediction error energy function E_x^2 of Equation (1) are analytically determined by the commonly known least squares method (see Non-Patent Document 5).
[Equation 1]
$$E_x^2 = \sum_{x,y} \left( S_{x,y} - \sum_{c_i} w_{c_i} \cdot P_{\tilde{x}+c_i,\,\tilde{y}} \right)^2 \qquad (1)$$
Here, S denotes an original picture, P denotes a decoded reference picture, and x and y respectively denote positions in the horizontal direction and the vertical direction in a picture. Moreover, ˜x (˜ is a symbol placed above x; the same applies to the other symbols) satisfies ˜x=x+MVx−FilterOffset, where MVx denotes the horizontal component of a motion vector that has been obtained beforehand, and FilterOffset denotes an offset for adjustment (the value obtained by dividing the filter length in the horizontal direction by 2). With respect to the vertical direction, ˜y=y+MVy is satisfied, where MVy denotes the vertical component of the motion vector. w_{c_i} denotes a group of filter coefficients in the horizontal direction c_i (0≤c_i<6) that is to be determined.
From Equation (1), as many linear equations as there are filter coefficients to be determined are obtained, and the minimization is performed independently for each fractional pixel position in the horizontal direction. These minimizations yield three groups of 6-tap filter coefficients, and the fractional-accuracy pixels a, b, and c are interpolated using these filter coefficients.
After the interpolation of the pixels in the horizontal direction has been completed, as shown by step 2 in FIG. 15C, an interpolation process in the vertical direction is performed. Filter coefficients in the vertical direction are determined by solving a linear problem similar to that in the horizontal direction. Specifically, filter coefficients in the vertical direction that minimize a prediction error energy function E_y^2 of Equation (2) are analytically determined.
[Equation 2]
$$E_y^2 = \sum_{x,y} \left( S_{x,y} - \sum_{c_j} w_{c_j} \cdot \hat{P}_{\tilde{x},\,\tilde{y}+c_j} \right)^2 \qquad (2)$$
Here, S denotes an original picture, ^P (^ is a symbol placed above P) denotes a picture which has been decoded and then interpolated in the horizontal direction, and x and y respectively denote positions in the horizontal direction and the vertical direction in a picture. Moreover, ˜x is represented as 4·(x+MVx), where MVx denotes the horizontal component of a motion vector that has been rounded off to the nearest whole number. With respect to the vertical direction, ˜y is represented as y+MVy−FilterOffset, where MVy denotes the vertical component of the motion vector, and FilterOffset denotes an offset for adjustment (the value obtained by dividing the filter length by 2). w_{c_j} denotes a group of filter coefficients in the vertical direction c_j (0≤c_j<6) that is to be determined.
Minimizing processes are performed for fractional-accuracy pixels independently of one another, and 12 groups of 6-tap filter coefficients are obtained. The remaining fractional-accuracy pixels are interpolated using these filter coefficients.
As stated above, it is necessary to encode 90 (=6×15) filter coefficients and transmit them to a decoder. In particular, since the overhead becomes large in low resolution encoding, filter coefficients to be transmitted are reduced using the symmetry of a filter. For example, as shown in FIG. 15A, b, h, i, j, and k are positioned at the centers with respect to interpolation directions, and with respect to the horizontal direction, coefficients obtained by inverting coefficients used for three points on the left side can be applied to three points on the right side. Similarly, with respect to the vertical direction, coefficients obtained by inverting coefficients used for three points on the upper side can be applied to three points on the lower side (c1=c6, c2=c5, and c3=c4).
Additionally, since d and l are symmetric about h, inverted filter coefficients can be used. That is, by transmitting 6 coefficients for d, their values can also be applied to l: c(d)1=c(l)6, c(d)2=c(l)5, c(d)3=c(l)4, c(d)4=c(l)3, c(d)5=c(l)2, and c(d)6=c(l)1 are satisfied. This symmetry is also used for e and m, f and n, and g and o. Although the same theory holds for a and c, since the result of the horizontal interpolation affects the interpolation in the vertical direction, a and c are transmitted separately without using symmetry. With the use of the symmetry described above, the number of filter coefficients to be transmitted for each frame becomes 51 (15 for the horizontal direction and 36 for the vertical direction).
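The reduction from 90 to 51 coefficients can be checked with a short counting sketch. The position labels follow FIG. 15A as described in the text; the helper names (`transmitted_count`, `total_transmitted`, `mirrored`, `expand_center`) are hypothetical and only illustrate the bookkeeping.

```python
TAPS = 6
HORIZONTAL = ["a", "b", "c"]            # interpolated in step 1
VERTICAL = list("defghijklmno")         # interpolated in step 2
CENTER = set("bhijk")                   # centered positions: c1=c6, c2=c5, c3=c4
MIRROR_OF = {"l": "d", "m": "e", "n": "f", "o": "g"}  # reuse inverted taps


def transmitted_count(pos):
    """Number of coefficients actually transmitted for one fractional position."""
    if pos in MIRROR_OF:
        return 0              # derived by inverting the partner's coefficients
    if pos in CENTER:
        return TAPS // 2      # only one half of a symmetric filter is sent
    return TAPS               # a and c are sent in full, as stated in the text


def total_transmitted():
    return sum(transmitted_count(p) for p in HORIZONTAL + VERTICAL)


def mirrored(coeffs):
    """Coefficients for the mirrored position, e.g. l from d."""
    return coeffs[::-1]


def expand_center(half):
    """Rebuild a symmetric 6-tap filter from its three transmitted taps."""
    return half + half[::-1]
```

Without symmetry, 15 positions with 6 taps each give 90 coefficients. With the rules above, the horizontal positions contribute 6 + 3 + 6 = 15 and the vertical positions contribute 4 × 6 + 4 × 3 + 0 = 36, for a total of 51.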
In the above adaptive interpolation filter of Non-Patent Document 6, the processing unit for the minimization of the prediction error energy is fixed to a frame, and 51 filter coefficients are determined per frame. If an encoding target frame is roughly divided into two (or more) types of texture regions, the optimum filter coefficients are a group of coefficients in which both (all) of the textures are taken into consideration. In a situation in which characteristic filter coefficients would essentially be obtained only in the vertical direction for a region A and only in the horizontal direction for a region B, the derived filter coefficients are an average of the two.
Non-Patent Document 7 and Non-Patent Document 8 propose a method for reducing the prediction error energy and improving the coding efficiency by preparing a plurality of groups of filter coefficients and switching among them in accordance with local properties of a picture, without being limited to one group of filter coefficients (51 coefficients) per frame.
As shown in FIG. 16A and FIG. 16B, consider the case in which an encoding target frame includes textures having different properties. As shown in FIG. 16A, when one group of optimized filter coefficients is transmitted for the entire frame, the properties of all the textures are taken into consideration. If the textures do not differ greatly from one another, the filter coefficients obtained by optimizing over the entire frame are considered to be the best; however, if the textures have properties that are opposite to one another, it is possible to further reduce the bit-rate of the entire frame by using filter coefficients that have been optimized for each texture, as shown in FIG. 16B.
As techniques for division into regions, Non-Patent Document 7 and Non-Patent Document 8 employ a motion vector (the horizontal component, the vertical component, and the direction), spatial coordinates (the position of a macroblock, and the x coordinate and the y coordinate of a block), and the like, and perform the division into regions taking various properties of a picture into consideration.
Although the above matter is based on adaptive interpolation filters, it is not limited to the adaptive interpolation filters, and, when the property of a picture varies in a frame, the same discussion can even be applied to the case in which selection from fixed interpolation filter coefficients is performed as disclosed in Non-Patent Document 3 and Non-Patent Document 4. That is, it is possible to improve the coding efficiency as compared to the case in which one type of fixed interpolation filter is applied to the entire frame, provided that a fixed interpolation filter suitable for a region can be selected.
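The benefit of selecting a fixed filter per region can be illustrated with a small sketch. It assumes one-dimensional signals, floating-point filtering, and a squared-error criterion; the candidate filters and the function names (`sse`, `predict`, `best_filter`, `frame_vs_region`) are hypothetical and do not correspond to the filter sets of Non-Patent Documents 3 and 4.

```python
def sse(orig, pred):
    """Sum of squared prediction errors."""
    return sum((o - p) ** 2 for o, p in zip(orig, pred))


def predict(ref, taps):
    """Apply a 1-D FIR filter; output sample k uses ref[k : k + len(taps)]."""
    n = len(taps)
    return [sum(t * ref[k + i] for i, t in enumerate(taps))
            for k in range(len(ref) - n + 1)]


def best_filter(orig, ref, candidates):
    """Return (index, error) of the candidate with the smallest squared error."""
    errors = [sse(orig, predict(ref, taps)) for taps in candidates]
    idx = min(range(len(candidates)), key=errors.__getitem__)
    return idx, errors[idx]


def frame_vs_region(regions, candidates):
    """Compare one filter for all regions with one filter per region.

    regions is a list of (orig, ref) pairs. Returns
    (error with a single frame-level filter, error with per-region filters).
    """
    frame_err = min(sum(sse(o, predict(r, taps)) for o, r in regions)
                    for taps in candidates)
    region_err = sum(best_filter(o, r, candidates)[1] for o, r in regions)
    return frame_err, region_err
```

Because each region is free to pick its own minimizer, the per-region error can never exceed the frame-level error under this criterion; the gain is largest when the regions prefer different candidates. The sketch ignores the side information needed to signal the selected filter for each region.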