In video encoding, inter-frame prediction (motion compensation) encoding performs prediction between different frames. A motion vector is obtained by referring to an already decoded frame so as to minimize the prediction error energy or a similar measure, the residual signal is subjected to an orthogonal transform, quantization is applied, and entropy encoding is performed, thereby obtaining binary data. To increase the encoding efficiency, a prediction scheme that provides higher prediction accuracy is required, and it is indispensable to reduce the prediction error energy.
A great number of tools that increase the accuracy of inter-frame prediction have been introduced into video encoding standards. For example, if there is occlusion in a nearest frame, the prediction error energy can be further reduced by referring to a frame that is distant in the time domain to some extent, and thus, in H.264/AVC, a plurality of frames can be referred to. This tool is called multiple reference frame prediction.
In addition, in order to deal with motions having complex shapes, a block can be subdivided into sizes such as 16×8, 8×16, 8×4, 4×8, and 4×4, in addition to 16×16 and 8×8. This tool is called variable block size prediction.
Similar to these, pixels of ½ accuracy are interpolated from integer-accuracy pixels of a reference frame using a 6-tap filter, and then pixels of ¼ accuracy are generated by linear interpolation using these pixels. Accordingly, it becomes possible to realize accurate prediction for motions of non-integer accuracy. This tool is called ¼ pixel accuracy prediction.
In order to develop next-generation video coding standards that provide higher encoding efficiency than H.264/AVC, various proposals from around the world are now being gathered by the international standardization organizations ISO/IEC “MPEG” (International Organization for Standardization/International Electrotechnical Commission “Moving Picture Experts Group”) and ITU-T “VCEG” (International Telecommunication Union-Telecommunication Standardization Sector “Video Coding Experts Group”). Among them, in particular, many proposals relating to inter-frame prediction (motion compensation) have been presented, and the software for next-generation video coding that is being drawn up under the leadership of the VCEG (hereinafter referred to as KTA (Key Technical Area) software) employs a tool for reducing the bit rate of motion vectors, a tool for extending the block size to 16×16 or larger, and the like.
In particular, a tool for adaptively changing interpolation filter coefficients for a fractional-accuracy pixel is called an adaptive interpolation filter. It is effective for almost all pictures, and it was first adopted in the KTA software. This technology is also employed in many of the contributions submitted in response to the call for proposals for a next-generation video encoding test model issued by the JCT-VC (Joint Collaborative Team on Video Coding), a group jointly established by the MPEG and the VCEG to develop next-generation video coding standards. Because adaptive interpolation filters contribute greatly to improvement in encoding efficiency, improving their performance is a highly promising area for the future.
Although the current situation is as described above, conventionally, the following filters have been used as interpolation filters in video coding.
[Fixed Interpolation]
In the conventional video coding standards MPEG-1/2, as shown in FIG. 15A, in order to interpolate a pixel of ½ accuracy, an interpolated pixel is generated by averaging two adjacent integer-accuracy pixels (also simply called integer pixels). That is, a mean filter of [½, ½] is applied to the two integer pixels. Because it is a very simple process, it is effective from the viewpoint of computational complexity, but the performance of the filter is not high for the purpose of obtaining a pixel of ¼ accuracy.
In the MPEG-4 Part 2, a pixel of ½ pixel accuracy is generated using a mean filter in a similar manner, but the advanced simple profile (ASP) also supports motion compensation of ¼ pixel accuracy. The position of a ½ pixel is calculated using a one-dimensional 8-tap filter as shown in FIG. 15B. Thereafter, the position of a ¼ pixel is derived using a mean filter.
Moreover, in the H.264/AVC, as shown in FIG. 15C, when the position of a ½ pixel is to be interpolated, interpolation is performed using 6 integer pixels including three points on the left side of the pixel to be interpolated and three points on the right side of the pixel to be interpolated. With respect to the vertical direction, interpolation is performed using 6 integer pixels including three points on the upper side and three points on the lower side. The filter coefficients are [(1, −5, 20, 20, −5, 1)/32]. After the positions of ½ pixels have been interpolated, the positions of ¼ pixels are interpolated using a mean filter of [½, ½]. Since it is necessary to interpolate the positions of all the ½ pixels, the computational complexity is high, but high-performance interpolation is possible, so that the encoding efficiency is improved. Non-Patent Document 1, Non-Patent Document 2, and Non-Patent Document 3 disclose the details of the above fixed interpolation filter.
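The fixed filters described above can be illustrated with a minimal one-dimensional sketch. It applies the 6-tap coefficients (1, −5, 20, 20, −5, 1)/32 for a ½ pixel and the mean filter [½, ½] for a ¼ pixel. The function names, the rounding offsets, the 8-bit clipping, and the clamping of indices at the row borders are assumptions made for illustration and are not taken from the description above.

```python
H264_TAPS = (1, -5, 20, 20, -5, 1)  # numerators; the divisor is 32


def mean_half_pel(row, i):
    """Half-pel sample between row[i] and row[i + 1] using the mean filter [1/2, 1/2]."""
    return (row[i] + row[i + 1] + 1) >> 1


def half_pel(row, i):
    """Half-pel sample between row[i] and row[i + 1] using the 6-tap filter."""
    last = len(row) - 1
    acc = 0
    for k, tap in enumerate(H264_TAPS):
        j = min(max(i - 2 + k, 0), last)  # assumption: clamp indices at the borders
        acc += tap * row[j]
    # Assumption: round to nearest, divide by 32, clip to the 8-bit range.
    return min(max((acc + 16) >> 5, 0), 255)


def quarter_pel(row, i):
    """Quarter-pel sample between row[i] and the half-pel sample to its right."""
    return (row[i] + half_pel(row, i) + 1) >> 1


ramp = [10, 20, 30, 40, 50, 60, 70, 80]
edge = [0, 0, 0, 255, 255, 255]
samples = (half_pel(ramp, 3), quarter_pel(ramp, 3), half_pel(edge, 2))
```

Every ½ pixel needs six multiplications where the mean filter needs a single addition, which is the complexity trade-off noted above.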
[Adaptive Interpolation]
In the H.264/AVC, the values of filter coefficients are constant, irrespective of conditions of an input picture (the type of a sequence, the size of a picture, and a frame rate) and encoding conditions (the block size, the structure of a GOP (group of pictures), and QP (quantization parameters)). When the values of filter coefficients are fixed, effects that vary over time, such as aliasing, a quantization error, an error resulting from motion estimation, and camera noise, are not taken into consideration. Therefore, it is considered that improvement in performance is limited in terms of the encoding efficiency. Accordingly, Non-Patent Document 4 proposes a scheme of adaptively changing interpolation filter coefficients, which is called a non-separable adaptive interpolation filter.
In Non-Patent Document 4, a two-dimensional interpolation filter (6×6=36 filter coefficients) is intended, and the filter coefficients are determined so as to minimize the prediction error energy. Although it is possible to realize higher encoding efficiency than that obtained by using a one-dimensional 6-tap fixed interpolation filter employed in the H.264/AVC, the computational complexity for obtaining filter coefficients is very high, and thus Non-Patent Document 5 introduces a proposal for reducing the computational complexity.
The technique introduced in the Non-Patent Document 5 is called a separable adaptive interpolation filter (SAIF), and it uses a one-dimensional 6-tap interpolation filter rather than a two-dimensional interpolation filter.
FIG. 16A to FIG. 16C are diagrams illustrating a method for interpolating a pixel of non-integer accuracy in the separable adaptive interpolation filter (SAIF). In this procedure, first, as shown by step 1 in FIG. 16B, pixels in the horizontal direction (a, b, and c) are interpolated. Integer-accuracy pixels C1 to C6 are used for determining filter coefficients. Filter coefficients in the horizontal direction that minimize a prediction error energy function E_h^2 of Equation (1) are analytically determined by the commonly known least squares method (see Non-Patent Document 4).
[Equation 1]
$$E_h^2 = \sum_{x,y} \left( S_{x,y} - \sum_{c_i} w_{c_i} \cdot P_{\tilde{x}+c_i,\,\tilde{y}} \right)^2 \qquad (1)$$
Here, S denotes an original picture, P denotes a decoded reference picture, and x and y respectively denote positions in the horizontal direction and the vertical direction in a picture. Moreover, ˜x (˜ is a symbol placed above x; the others are also the same) satisfies ˜x=x+MVx−FilterOffset, where MVx denotes the horizontal component of a motion vector that has been obtained beforehand, and FilterOffset denotes an offset for adjustment (the value obtained by dividing a filter length in the horizontal direction by 2). With respect to the vertical direction, ˜y=y+MVy is satisfied, where MVy denotes the vertical component of the motion vector. w_{c_i} denotes a group of filter coefficients in the horizontal direction (0≦c_i<6) that is to be determined.
Setting the derivative of Equation (1) with respect to each filter coefficient to zero yields as many linear equations as there are filter coefficients to be determined, and the minimizing process is performed for each fractional pixel position in the horizontal direction independently of the others. Through the minimizing processes, three groups of 6-tap filter coefficients are obtained, and fractional-accuracy pixels a, b, and c are interpolated using these filter coefficients.
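The minimization of Equation (1) can be sketched as follows for a single fractional position on a single row of samples. This is an illustrative sketch, not the procedure of Non-Patent Document 5: the function names are hypothetical, the displacement ˜x is assumed to be already folded into the indexing of `ref`, and the target signal is synthesized from known taps only so that the result can be checked.

```python
import random


def solve_linear(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_filter(target, ref, taps=6):
    """Taps w minimizing sum_x (target[x] - sum_c w[c] * ref[x + c])^2."""
    a = [[0.0] * taps for _ in range(taps)]  # normal-equation matrix
    b = [0.0] * taps                         # right-hand side
    for x, s in enumerate(target):
        window = ref[x:x + taps]
        for i in range(taps):
            b[i] += s * window[i]
            for j in range(taps):
                a[i][j] += window[i] * window[j]
    return solve_linear(a, b)


# Synthetic check: the "original" row is the reference filtered with known taps.
random.seed(0)
ref = [random.uniform(0, 255) for _ in range(200)]
true_w = [1 / 32, -5 / 32, 20 / 32, 20 / 32, -5 / 32, 1 / 32]
target = [sum(w * ref[x + c] for c, w in enumerate(true_w))
          for x in range(len(ref) - 5)]
est = fit_filter(target, ref)
```

In an encoder, the same accumulation would run over all pixels whose motion vector points to the fractional position in question, once per position.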
After the interpolation of the pixels in the horizontal direction has been completed, as shown by step 2 in FIG. 16C, an interpolation process in the vertical direction is performed. Filter coefficients in the vertical direction are determined by solving a linear problem similar to that in the horizontal direction. Specifically, filter coefficients in the vertical direction that minimize a prediction error energy function E_v^2 of Equation (2) are analytically determined.
[Equation 2]
$$E_v^2 = \sum_{x,y} \left( S_{x,y} - \sum_{c_j} w_{c_j} \cdot \hat{P}_{\tilde{x},\,\tilde{y}+c_j} \right)^2 \qquad (2)$$
Here, S denotes an original picture, ^P (^ is a symbol placed above P) denotes a picture which has been decoded and then interpolated in the horizontal direction, and x and y respectively denote positions in the horizontal direction and the vertical direction in a picture. Moreover, ˜x is represented as 4·(x+MVx), where MVx denotes the horizontal component of a motion vector that has been rounded off to the nearest whole number. With respect to the vertical direction, ˜y is represented as y+MVy−FilterOffset, where MVy denotes the vertical component of the motion vector, and FilterOffset denotes an offset for adjustment (the value obtained by dividing a filter length by 2). w_{c_j} denotes a group of filter coefficients in the vertical direction (0≦c_j<6) that is to be determined.
Minimizing processes are performed for fractional-accuracy pixels independently of one another, and 12 groups of 6-tap filter coefficients are obtained. The remaining fractional-accuracy pixels are interpolated using these filter coefficients.
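The two-step order of the separable filter (horizontal first, then vertical on the horizontally interpolated samples) can be sketched as follows for the position that is half-pel in both directions. The function names are hypothetical, the H.264/AVC coefficients stand in for adaptively determined ones, and the caller is assumed to keep the 6×6 support inside the picture.

```python
# Stand-in taps; an adaptive filter would supply its own optimized values.
STAND_IN_TAPS = [1 / 32, -5 / 32, 20 / 32, 20 / 32, -5 / 32, 1 / 32]


def filter_1d(samples, taps):
    """Weighted sum of six samples with six taps."""
    return sum(t * s for t, s in zip(taps, samples))


def interpolate_center(img, x, y, w_h, w_v):
    """Sample halfway between columns x, x+1 and rows y, y+1 of img."""
    # Step 1: horizontal interpolation on the six rows y-2 .. y+3.
    column = [filter_1d(img[y + dy][x - 2:x + 4], w_h) for dy in range(-2, 4)]
    # Step 2: vertical interpolation on the six horizontally interpolated samples.
    return filter_1d(column, w_v)


# A linear ramp is reproduced exactly by symmetric taps that sum to 1.
img = [[10 * col + 3 * row for col in range(8)] for row in range(8)]
value = interpolate_center(img, 3, 3, STAND_IN_TAPS, STAND_IN_TAPS)
```

Because the vertical step consumes the output of the horizontal step, any change to the horizontal coefficients also changes the input seen by the vertical filters.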
As stated above, it is necessary to encode 90 (=6×15) filter coefficients and transmit them to a decoding end. In particular, since the overhead becomes large in low resolution encoding, filter coefficients to be transmitted are reduced using the symmetry of a filter. For example, in FIG. 16A, b, h, i, j, and k are positioned at the centers of integer-accuracy pixels, and with respect to the horizontal direction, coefficients obtained by inverting coefficients used for three points on the left side can be applied to three points on the right side. Similarly, with respect to the vertical direction, coefficients obtained by inverting coefficients used for three points on the upper side can be applied to three points on the lower side (c1=c6, c2=c5, and c3=c4).
Additionally, since the relationship between d and l is symmetric about h, inverted filter coefficients can be used. That is, by transmitting 6 coefficients for d, their values can also be applied to l. c(d)1=c(l)6, c(d)2=c(l)5, c(d)3=c(l)4, c(d)4=c(l)3, c(d)5=c(l)2, and c(d)6=c(l)1 are satisfied. This symmetry is also used for e and m, f and n, and g and o. Although the same theory holds for a and c, since the result for the horizontal direction interpolation affects interpolation in the vertical direction, a and c are transmitted separately without using symmetry. With the use of the symmetry described above, the number of filter coefficients to be transmitted for each frame becomes 51 (15 for the horizontal direction and 36 for the vertical direction).
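The coefficient counts above (90 without symmetry, 51 with it) can be checked with a short tally. The per-position breakdown below is inferred from the symmetry rules stated in the text, and the helper names are hypothetical.

```python
def expand_symmetric(first_half):
    """Rebuild six taps from c1..c3 when c1=c6, c2=c5, and c3=c4."""
    return list(first_half) + list(reversed(first_half))


def mirror(taps):
    """Taps for the mirrored position, e.g. l from d: c(l)k = c(d)(7-k)."""
    return list(reversed(taps))


# Number of coefficients transmitted for each fractional position.
horizontal = {"a": 6, "b": 3, "c": 6}    # a and c are sent separately
vertical = {"d": 6, "e": 6, "f": 6, "g": 6,  # sent in full
            "h": 3, "i": 3, "j": 3, "k": 3,  # symmetric about their own center
            "l": 0, "m": 0, "n": 0, "o": 0}  # mirrored from d, e, f, g

without_symmetry = 6 * (len(horizontal) + len(vertical))
with_symmetry = sum(horizontal.values()) + sum(vertical.values())

taps_b = expand_symmetric([1, -5, 20])
taps_l = mirror([1, 2, 3, 4, 5, 6])
```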
In the adaptive interpolation filter of Non-Patent Document 5 described above, the processing unit for minimizing the prediction error energy is fixed to a frame, and 51 filter coefficients are determined per frame. If an encoding target frame is roughly divided into two (or more) types of texture regions, the optimum filter coefficients are a group of coefficients in which both (or all) of the textures are taken into consideration. In a situation in which characteristic filter coefficients would essentially be obtained only in the vertical direction for a region A and only in the horizontal direction for a region B, the filter coefficients are derived as the average of the two.
Non-Patent Document 6 proposes a method for reducing the prediction error energy and improving the encoding efficiency by preparing a plurality of groups of filter coefficients and switching between them in accordance with local characteristics of a picture, without being limited to one group of filter coefficients (51 coefficients) per frame.
As shown in FIG. 17A and FIG. 17B, assume a case in which an encoding target frame includes textures having different properties. As shown in FIG. 17A, when one group of optimized filter coefficients is transmitted for the entire frame, the properties of all the textures are taken into consideration. If the textures do not differ greatly from one another, the filter coefficients obtained by optimizing over the entire frame are considered to be best; however, if the textures have properties that are opposite to one another, it is possible to further reduce the bit rate of the entire frame by using filter coefficients that have been optimized for each texture as shown in FIG. 17B. For this reason, Non-Patent Document 6 proposes a method using, for one frame, a plurality of groups of filter coefficients that have been optimized by using division into regions.
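The effect of optimizing per region instead of per frame can be illustrated with a toy case. The sketch below uses a 2-tap filter rather than the 6-tap filters discussed above, and synthetic data in which regions A and B have opposite optimal coefficients; both simplifications, and all names, are assumptions made for illustration.

```python
def fit_2tap(samples):
    """Least-squares (w0, w1) for samples given as (s, p0, p1) triples."""
    a00 = sum(p0 * p0 for _, p0, _ in samples)
    a01 = sum(p0 * p1 for _, p0, p1 in samples)
    a11 = sum(p1 * p1 for _, _, p1 in samples)
    b0 = sum(s * p0 for s, p0, _ in samples)
    b1 = sum(s * p1 for s, _, p1 in samples)
    det = a00 * a11 - a01 * a01
    return ((b0 * a11 - b1 * a01) / det, (a00 * b1 - a01 * b0) / det)


def energy(samples, w):
    """Prediction error energy of filter w over the samples."""
    return sum((s - w[0] * p0 - w[1] * p1) ** 2 for s, p0, p1 in samples)


pairs = [(10, 20), (30, 25), (50, 90), (70, 40), (90, 60)]
region_a = [(0.75 * p0 + 0.25 * p1, p0, p1) for p0, p1 in pairs]
region_b = [(0.25 * p0 + 0.75 * p1, p0, p1) for p0, p1 in pairs]

# One group of coefficients for the whole frame.
w_frame = fit_2tap(region_a + region_b)
e_frame = energy(region_a + region_b, w_frame)

# One group of coefficients per region.
w_a, w_b = fit_2tap(region_a), fit_2tap(region_b)
e_regions = energy(region_a, w_a) + energy(region_b, w_b)
```

In this toy case the frame-level fit lands on the average of the two regional filters and leaves a residual in both regions, whereas the per-region fits remove it. A real encoder would also have to weigh the extra bits spent on transmitting the additional groups of coefficients.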
As a technique of the division into regions, Non-Patent Document 6 employs a motion vector (the horizontal component, the vertical component, and the direction), spatial coordinates (the position of a macroblock, and the x coordinate and the y coordinate of a block), and the like, and performs the division into regions taking various properties of a picture into consideration.
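Division into regions based on a motion vector component or on spatial coordinates can be sketched as below. The two classification rules, the block representation, and all names are hypothetical; the specific criteria of Non-Patent Document 6 are not reproduced here.

```python
from collections import defaultdict


def region_by_mv_x(mv_x):
    """Hypothetical rule: region 0 for leftward motion, region 1 otherwise."""
    return 0 if mv_x < 0 else 1


def region_by_row(block_y, rows, num_regions=2):
    """Hypothetical rule: split the frame into horizontal bands of block rows."""
    return block_y * num_regions // rows


def group_blocks(blocks, rule):
    """Collect blocks into regions according to a classification rule."""
    groups = defaultdict(list)
    for block in blocks:
        groups[rule(block)].append(block)
    return dict(groups)


blocks = [
    {"x": 0, "y": 0, "mv": (-3, 1)},
    {"x": 1, "y": 0, "mv": (2, 0)},
    {"x": 0, "y": 1, "mv": (-1, -2)},
    {"x": 1, "y": 1, "mv": (4, 4)},
]
by_motion = group_blocks(blocks, lambda b: region_by_mv_x(b["mv"][0]))
by_position = group_blocks(blocks, lambda b: region_by_row(b["y"], rows=2))
```

Each resulting group would then receive its own optimized group of interpolation filter coefficients.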
FIG. 18 illustrates an example of a configuration of a video encoding apparatus using a conventional region-dividing type adaptive interpolation filter as disclosed in Non-Patent Document 6.
In a video encoding apparatus 100, a region dividing unit 101 divides an encoding target frame of an input video signal into a plurality of regions which are a plurality of blocks serving as units for adaptively switching interpolation filter coefficients. An interpolation filter coefficient switching unit 102 switches interpolation filter coefficients for a fractional-accuracy pixel used for a reference picture in predictive encoding, for each of the regions divided by the region dividing unit 101. For example, as the interpolation filter coefficients to be switched, filter coefficients optimized by a filter coefficient optimizing unit 1021 are used. The filter coefficient optimizing unit 1021 calculates interpolation filter coefficients that minimize the prediction error energy between an original picture and an interpolated reference picture for each of the regions.
A prediction signal generating unit 103 is provided with a reference picture interpolating unit 1031 and a motion detecting unit 1032. The reference picture interpolating unit 1031 applies an interpolation filter using the interpolation filter coefficients selected by the interpolation filter coefficient switching unit 102 to a decoded reference picture stored in a reference picture memory 107. The motion detecting unit 1032 performs a motion search on the interpolated reference picture to calculate a motion vector. The prediction signal generating unit 103 generates a predicted signal by motion compensation using the motion vector of fractional-accuracy calculated by the motion detecting unit 1032.
A predictive encoding unit 104 calculates a residual signal between the input video signal and the predicted signal, performs orthogonal transform on the residual signal, and quantizes transform coefficients, thereby performing predictive encoding. Moreover, a decoding unit 106 performs decoding on the result of the predictive encoding, and stores a decoded signal in the reference picture memory 107 for the subsequent predictive encoding.
A variable-length encoding unit 105 performs variable-length encoding on the quantized transform coefficients and the motion vector, performs variable-length encoding on the interpolation filter coefficients selected by the interpolation filter coefficient switching unit 102 for each of the regions, and outputs them as an encoded bitstream.
FIG. 19 illustrates an example of a configuration of a video decoding apparatus using the conventional region-dividing type adaptive interpolation filter. The bitstream encoded by the video encoding apparatus 100 shown in FIG. 18 is decoded by a video decoding apparatus 200 shown in FIG. 19.
In the video decoding apparatus 200, a variable-length decoding unit 201 receives the encoded bitstream and decodes the quantized transform coefficients, the motion vector, the groups of interpolation filter coefficients, and the like. A region determining unit 202 determines regions that are units for adaptively switching interpolation filter coefficients for a decoding target frame. An interpolation filter coefficient switching unit 203 switches the interpolation filter coefficients decoded by the variable-length decoding unit 201 for each of the regions determined by the region determining unit 202.
A reference picture interpolating unit 2041 in a prediction signal generating unit 204 applies an interpolation filter using interpolation filter coefficients received from the interpolation filter coefficient switching unit 203 to a decoded reference picture stored in a reference picture memory 206, to restore pixels of fractional-accuracy of the reference picture. The prediction signal generating unit 204 generates a predicted signal of a decoding target block from a reference picture for which the pixels of fractional-accuracy have been restored.
A predictive decoding unit 205 performs inverse quantization, inverse orthogonal transform, and the like on the quantized transform coefficients decoded by the variable-length decoding unit 201, sums the resulting prediction residual signal and the predicted signal generated by the prediction signal generating unit 204 to generate a decoded signal, and outputs it as a decoded picture. In addition, the decoded signal generated by the predictive decoding unit 205 is stored in the reference picture memory 206 for the subsequent predictive decoding.