In recent years, devices have come into widespread use which handle image information as digital signals and subject images to compression encoding in order to transmit and store information with high efficiency. Such devices compress an image by orthogonal transform, such as the discrete cosine transform, and by motion compensation, taking advantage of redundancy which is a feature of image information. Examples of such encoding formats include MPEG (Moving Picture Experts Group) and so forth.
In particular, MPEG2 (ISO/IEC 13818-2) is defined as a general-purpose image encoding format, and is a standard encompassing both interlaced scanning images and progressive scanning images, as well as standard resolution images and high definition images. MPEG2 is now widely employed in a broad range of applications for both professional usage and consumer usage. By employing the MPEG2 compression format, a code amount (bit rate) of 4 through 8 Mbps is allocated in the event of an interlaced scanning image of standard resolution having 720×480 pixels, for example, and a code amount (bit rate) of 18 through 22 Mbps is allocated in the event of an interlaced scanning image of high resolution having 1920×1088 pixels. Thus, a high compression rate and excellent image quality can be realized.
MPEG2 has principally been aimed at high image quality encoding adapted to broadcasting usage, but does not handle a code amount (bit rate) lower than that of MPEG1, i.e., an encoding format having a higher compression rate. Demand for such an encoding format has been expected to increase due to the spread of personal digital assistants, and in response to this, standardization of the MPEG4 encoding format has been performed. With regard to the image encoding format, the specification thereof was confirmed as an international standard, ISO/IEC 14496-2, in December 1998.
Further, in recent years, standardization of a standard called H.26L (ITU-T Q6/16 VCEG) has progressed, originally with image encoding for television conference usage as the object. With H.26L, it is known that, though a greater computation amount is required for encoding and decoding as compared to a conventional encoding format such as MPEG2 or MPEG4, higher encoding efficiency is realized. Also, as part of MPEG4 activity, standardization taking H.26L as a base and incorporating functions not supported by H.26L to realize higher encoding efficiency has been performed as the Joint Model of Enhanced-Compression Video Coding. As a result of this standardization, H.264 and MPEG-4 Part 10 (Advanced Video Coding, hereafter referred to as H.264/AVC) became an international standard in March 2003.
Further, as an extension thereof, standardization of FRExt (Fidelity Range Extension), which includes coding tools necessary for business use such as RGB, 4:2:2, and 4:4:4, as well as the 8×8 DCT and quantization matrices stipulated by MPEG-2, was completed in February 2005. Accordingly, H.264/AVC can be used as an encoding format capable of suitably expressing even film noise included in movies, and has come to be employed for wide ranging applications such as Blu-ray Disc (registered trademark) and so forth.
However, nowadays, needs for further high-compression encoding have increased, such as the need to compress an image having around 4000×2000 pixels, which is quadruple the size of a high-vision image, or the need to distribute a high-vision image within an environment with limited transmission capacity such as the Internet. Therefore, with the above-mentioned VCEG (Video Coding Experts Group) under ITU-T, studies relating to improvement of encoding efficiency have continuously been performed.
Now, with motion prediction compensation according to the H.264/AVC format, prediction efficiency is improved by performing prediction/compensation processing with quarter-pixel precision.
For example, with the MPEG2 format, half-pixel precision motion prediction/compensation processing is performed by linear interpolation processing. On the other hand, with the H.264/AVC format, quarter-pixel precision prediction/compensation processing is performed using a 6-tap FIR (Finite Impulse Response) filter as an interpolation filter.
FIG. 1 is a diagram for describing prediction/compensation processing of quarter-pixel precision with the H.264/AVC format.
In the example in FIG. 1, positions A indicate integer-precision pixel positions, positions b, c, and d indicate half-pixel precision positions, and positions e1, e2, and e3 indicate quarter-pixel precision positions. First, Clip1( ) is defined as in the following Expression (1).
[Mathematical Expression 1]
\[
\mathrm{Clip1}(a) =
\begin{cases}
0; & \text{if } (a < 0) \\
a; & \text{otherwise} \\
\mathit{max\_pix}; & \text{if } (a > \mathit{max\_pix})
\end{cases}
\tag{1}
\]
Note that in the event that the input image is of 8-bit precision, the value of max_pix is 255.
The pixel values at positions b and d are generated as with the following Expression (2), using a 6-tap FIR filter.
[Mathematical Expression 2]
\[
F = A_{-2} - 5 \cdot A_{-1} + 20 \cdot A_{0} + 20 \cdot A_{1} - 5 \cdot A_{2} + A_{3}
\]
\[
b, d = \mathrm{Clip1}((F + 16) \gg 5) \tag{2}
\]
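Expressions (1) and (2) can be illustrated with a short Python sketch. The function names `clip1` and `half_pel` and the list layout of the six input samples are assumptions made for this illustration only; they do not come from the H.264/AVC specification or from any reference software.

```python
def clip1(a, max_pix=255):
    """Expression (1): limit a to the range 0..max_pix (255 for 8-bit input)."""
    if a < 0:
        return 0
    if a > max_pix:
        return max_pix
    return a


# 6-tap FIR coefficients of Expression (2); they sum to 32, hence the >> 5.
TAPS = (1, -5, 20, 20, -5, 1)


def half_pel(samples, max_pix=255):
    """Half-pixel value between samples[2] and samples[3].

    samples holds the six integer-precision pixels A_-2 .. A_3
    (hypothetical layout chosen for this sketch).
    """
    if len(samples) != 6:
        raise ValueError("exactly six integer-precision samples are required")
    f = sum(t * s for t, s in zip(TAPS, samples))
    # Add 16 for rounding, divide by 32, then clip to the valid pixel range.
    return clip1((f + 16) >> 5, max_pix)


if __name__ == "__main__":
    print(half_pel([0, 0, 0, 255, 255, 255]))  # a step edge
```

Note that the negative taps can push the filtered sum outside the pixel range near sharp edges, which is why the clip in Expression (1) is applied to the result.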
The pixel value at the position c is generated as with the following Expression (3), using a 6-tap FIR filter in the horizontal direction and vertical direction.
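[Mathematical Expression 3]
\[
F = b_{-2} - 5 \cdot b_{-1} + 20 \cdot b_{0} + 20 \cdot b_{1} - 5 \cdot b_{2} + b_{3}
\]
or
\[
F = d_{-2} - 5 \cdot d_{-1} + 20 \cdot d_{0} + 20 \cdot d_{1} - 5 \cdot d_{2} + d_{3}
\]
\[
c = \mathrm{Clip1}((F + 512) \gg 10) \tag{3}
\]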
Note that Clip processing is performed just once at the end, following having performed product-sum processing in both the horizontal direction and vertical direction.
The positions e1 through e3 are generated by linear interpolation as with the following Expression (4).
[Mathematical Expression 4]
\[
e_1 = (A + b + 1) \gg 1
\]
\[
e_2 = (b + d + 1) \gg 1
\]
\[
e_3 = (b + c + 1) \gg 1 \tag{4}
\]
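The following Python sketch illustrates Expressions (3) and (4). It assumes that the values b (or d) entering Expression (3) are the unclipped, unshifted product sums F of Expression (2), which is consistent with the note that Clip processing is performed just once at the end and with the shift by 10 bits. The function names and the 6×6 block layout are hypothetical and chosen only for this illustration.

```python
def clip1(a, max_pix=255):
    """Expression (1): limit a to the range 0..max_pix."""
    return max(0, min(a, max_pix))


TAPS = (1, -5, 20, 20, -5, 1)


def fir6(samples):
    """Unclipped 6-tap product sum F over six samples."""
    return sum(t * s for t, s in zip(TAPS, samples))


def center_half_pel(block, max_pix=255):
    """Value at position c, centered in a 6x6 block of integer pixels.

    block is a list of six rows of six pixels (assumed layout).
    """
    # Horizontal pass: keep the intermediate sums unclipped and unshifted.
    intermediate = [fir6(row) for row in block]
    # Vertical pass over the six intermediate sums.
    f = fir6(intermediate)
    # Two passes each scale by 32, so normalize by 1024 with rounding.
    return clip1((f + 512) >> 10, max_pix)


def quarter_pel(p, q):
    """Expression (4): rounded average of two neighboring sample values."""
    return (p + q + 1) >> 1


if __name__ == "__main__":
    flat = [[100] * 6 for _ in range(6)]
    c = center_half_pel(flat)
    print(c, quarter_pel(c, 103))
```

Because the filter is separable, running the horizontal pass first and the vertical pass second gives the same sum F as the opposite order, which corresponds to the two alternative forms of F in Expression (3).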
FIG. 2 is a diagram describing prediction/compensation processing relating to color difference signals with the H.264/AVC format. With the H.264/AVC format, quarter-pixel prediction/compensation processing is performed as described above with reference to FIG. 1, but in the case of 4:2:0 signals, ⅛-pixel precision prediction/compensation processing is performed regarding color difference signals.
In the example in FIG. 2, the black dots are pixels of integer-pixel precision stored in frame memory, and the A through D given to the black dots represent the pixel values of the pixels. If we say that the position (dx, dy) of a white dot is a position indicated by motion vector information in ⅛-pixel precision within a rectangular region surrounded by the pixels indicated by A through D, a prediction pixel value v at the position of the white dot is generated as with the following Expression (5).
[Mathematical Expression 5]
\[
v = \frac{(s - d_x)(s - d_y)A + d_x(s - d_y)B + (s - d_x)d_y C + d_x d_y D}{s^2}, \quad \text{where } s = 8. \tag{5}
\]
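Expression (5) is a bilinear weighting of the four surrounding integer-precision pixels, and can be sketched in Python as follows. The function name `chroma_predict` is hypothetical. The sketch evaluates the expression exactly as written, with a plain division by s²; an integer implementation would typically add a rounding offset before dividing, which is an assumption beyond what Expression (5) states.

```python
def chroma_predict(a, b, c, d, dx, dy, s=8):
    """Prediction value v of Expression (5).

    a, b, c, d are the pixel values A through D; dx and dy are the
    fractional offsets in units of 1/s pixel (s = 8 for 1/8-pixel precision).
    """
    if not (0 <= dx <= s and 0 <= dy <= s):
        raise ValueError("dx and dy must lie within 0..s")
    numerator = (
        (s - dx) * (s - dy) * a
        + dx * (s - dy) * b
        + (s - dx) * dy * c
        + dx * dy * d
    )
    # The four weights always sum to s * s, so this is a weighted average.
    return numerator / (s * s)


if __name__ == "__main__":
    print(chroma_predict(10, 20, 30, 40, dx=4, dy=4))  # midpoint of A..D
```

With dx = dy = 0 the result is simply A, and the weight of each pixel grows linearly as the white dot moves toward it.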
Also, the processing used to select motion vectors obtained in decimal-pixel precision as described above is important in obtaining compressed images with high encoding efficiency. One example of this processing is a method implemented in the reference software called JM (Joint Model), disclosed in NPL 1.
Next, a motion search method implemented in JM will be described with reference to FIG. 3.
In the example in FIG. 3, pixels A through I represent pixels having pixel values of integer-pixel precision (hereinafter referred to as integer-pixel precision pixels). Pixels 1 through 8 are pixels having pixel values of half-pixel precision around the pixel E (hereinafter referred to as half-pixel precision pixels). Pixels a through h are pixels having pixel values of quarter-pixel precision around the pixel 6 (hereinafter referred to as quarter-pixel precision pixels).
With JM, as a first step, a motion vector which minimizes a cost function value such as the SAD (Sum of Absolute Difference) within a predetermined search range is obtained. Let us say that the pixel corresponding to the motion vector obtained in this way is the pixel E.
Next, as a second step, a pixel with a pixel value which minimizes the above-described cost function value is obtained from the pixel E and the pixels 1 through 8 of half-pixel precision surrounding the pixel E, and this pixel (the pixel 6 in the case of the example in FIG. 3) is taken as the pixel corresponding to the optimal motion vector of half-pixel precision.
Then, as a third step, a pixel with a pixel value which minimizes the above-described cost function value is obtained from the pixel 6 and the pixels a through h of quarter-pixel precision surrounding the pixel 6. Thus, the motion vector corresponding to the obtained pixel is the optimal motion vector of quarter-pixel precision.
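The three steps above can be sketched in Python as a coarse-to-fine search. This is a minimal illustration under stated assumptions, not the JM implementation: the function names, the representation of motion vectors in quarter-pixel units, and the caller-supplied `cost` function (standing in for a cost such as SAD) are all hypothetical.

```python
def neighbors(center, step):
    """The center position and its eight surrounding positions at the given step."""
    cx, cy = center
    return [(cx + i * step, cy + j * step) for j in (-1, 0, 1) for i in (-1, 0, 1)]


def three_step_search(cost, search_range):
    """Return a motion vector (x, y) in quarter-pixel units.

    cost(x, y) returns the cost function value for a candidate vector given
    in quarter-pixel units; search_range is in integer pixels.
    """
    # Step 1: exhaustive search over integer-pixel positions (pixel E).
    integer_candidates = [
        (4 * x, 4 * y)
        for y in range(-search_range, search_range + 1)
        for x in range(-search_range, search_range + 1)
    ]
    best = min(integer_candidates, key=lambda mv: cost(*mv))
    # Step 2: pixel E and the eight half-pixel positions around it.
    best = min(neighbors(best, 2), key=lambda mv: cost(*mv))
    # Step 3: the half-pixel result and the eight quarter-pixel positions around it.
    best = min(neighbors(best, 1), key=lambda mv: cost(*mv))
    return best


if __name__ == "__main__":
    # Toy cost with its minimum at (7, -5) in quarter-pixel units.
    print(three_step_search(lambda x, y: (x - 7) ** 2 + (y + 5) ** 2, 4))
```

Each refinement step evaluates only nine candidates, so the fractional-pixel search adds a small, fixed number of cost evaluations on top of the integer-pixel search; the trade-off is that the result depends on the position chosen at the coarser step.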
As described above, quarter-pixel precision prediction/compensation processing is performed with the H.264/AVC format, and multiple techniques for further improving encoding efficiency have been proposed for this quarter-pixel precision prediction/compensation processing.
For example, with the H.264/AVC format, the filter coefficients for the interpolation filter to generate pixel values of sampling positions as to decimal-pixel precision motion vectors described above with reference to FIG. 1 have been predetermined, as described in NPL 2.
Accordingly, proposed in NPL 3 is to adaptively switch the filter coefficients such that the prediction residual is the smallest for each prediction frame.
That is to say, with NPL 3, first, as a first step, normal H.264/AVC format motion prediction processing is performed, and motion vector values are calculated for each motion compensation block.
As a second step, filter optimization is performed such that the prediction residual is minimal for the motion vector values obtained in the first step.
Then, as a third step, motion search is performed again using the filter coefficient obtained in the second step, and the motion vector value is updated. Thus, encoding efficiency can be improved.
Filter coefficients and motion vector values can be optimized by further repeating the above steps.
Also, as described above, the macroblock size is defined as 16×16 pixels with the H.264/AVC format. However, a macroblock size of 16×16 pixels is not optimal for a large image frame such as with UHD (Ultra High Definition; 4000 pixels×2000 pixels) which is the object of next-generation encoding formats.
Accordingly, it is proposed in NPL 4 and so forth to extend the macroblock size to be a size of 32 pixels×32 pixels, for example.
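As a rough illustration of why the macroblock size matters for large image frames, the following Python sketch counts the macroblocks needed to cover a frame. The function name `macroblock_count` is hypothetical, and the assumption that partial blocks at the frame edge are rounded up to whole macroblocks is made only for this sketch.

```python
def macroblock_count(width, height, mb_size):
    """Number of mb_size x mb_size macroblocks needed to cover the frame.

    Partial blocks at the right and bottom edges are counted as whole
    macroblocks (an assumption of this sketch).
    """
    blocks_wide = -(-width // mb_size)   # ceiling division
    blocks_high = -(-height // mb_size)  # ceiling division
    return blocks_wide * blocks_high


if __name__ == "__main__":
    for size in (16, 32):
        print(size, macroblock_count(4000, 2000, size))
```

For a 4000×2000 pixel frame this gives 31,250 macroblocks at 16×16 pixels and 7,875 at 32×32 pixels, so per-macroblock information such as mode and motion vector data is signaled roughly a quarter as often with the larger size.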
Note that the above-described FIG. 1 through FIG. 3 will also be used hereinafter to describe the present invention.