In recent years, devices that handle image information as digital signals and subject images to compression encoding have come into widespread use, in order to perform highly efficient transmission and storage of information. Such devices take advantage of the redundancy that is a feature of image information, compressing the image by an orthogonal transform such as the discrete cosine transform and by motion compensation. Examples of this encoding method include MPEG (Moving Picture Experts Group) and so forth.
In particular, MPEG2 (ISO/IEC 13818-2) is defined as a general-purpose image encoding format, and is a standard encompassing both interlaced scanning images and sequential-scanning images, as well as standard resolution images and high definition images. MPEG2 is now widely employed in a broad range of applications for professional usage and for consumer usage. By employing the MPEG2 compression format, a code amount (bit rate) of 4 through 8 Mbps is allocated in the event of an interlaced scanning image of standard resolution having 720×480 pixels, for example, and a code amount (bit rate) of 18 through 22 Mbps is allocated in the event of an interlaced scanning image of high resolution having 1920×1088 pixels, for example. Thus, a high compression rate and excellent image quality can be realized.
MPEG2 has principally been aimed at high image quality encoding adapted to broadcasting usage, but does not handle a code amount (bit rate) lower than that of MPEG1, i.e., an encoding format having a higher compression rate. Demand for such an encoding format was expected to increase due to the spread of personal digital assistants, and in response to this, standardization of the MPEG4 encoding format has been performed. With regard to the image encoding format, the specification thereof was confirmed as the international standard ISO/IEC 14496-2 in December 1998.
Further, in recent years, standardization of a standard called H.26L (ITU-T Q6/16 VCEG) has progressed, with image encoding for television conference usage as the original object. With H.26L, it is known that higher encoding efficiency is realized as compared to a conventional encoding format such as MPEG2 or MPEG4, though a greater computation amount is required for encoding and decoding. Also, as part of the activity of MPEG4, standardization that takes H.26L as a base and incorporates functions not supported by H.26L to realize higher encoding efficiency has been performed as the Joint Model of Enhanced-Compression Video Coding. As a result of this standardization, H.264 and MPEG-4 Part 10 (Advanced Video Coding, hereafter referred to as H.264/AVC) became an international standard in March 2003.
Further, as an extension thereof, standardization of FRExt (Fidelity Range Extension), which includes encoding tools necessary for operations such as RGB, 4:2:2, and 4:4:4, as well as the 8×8 DCT and quantization matrices stipulated in MPEG-2, was completed in February 2005. Accordingly, an encoding format capable of expressing well even the film noise included in movies was obtained using H.264/AVC, and is to be used in a wide range of applications such as Blu-Ray Disc®.
However, recently there are increasing needs for encoding with even higher compression, such as to compress images of around 4000×2000 pixels, which is four times the number of pixels of Hi-Vision images, or to distribute Hi-Vision images in an environment with limited transmission capacity, such as the Internet. Accordingly, the VCEG (Video Coding Experts Group) under ITU-T, described above, is continuing study relating to improved encoding efficiency.
Incidentally, with the MPEG2 format, for example, motion prediction/compensation processing with ½ pixel precision is performed by linear interpolation processing. On the other hand, with the H.264/AVC format, prediction/compensation processing with ¼ pixel precision is performed using a 6-tap FIR (Finite Impulse Response) filter as an interpolation filter.
FIG. 1 is a diagram for describing prediction/compensation processing with ¼ pixel precision according to the H.264/AVC format.
With the example in FIG. 1, positions A indicate the positions of integer precision pixels, positions b, c, and d indicate positions with ½ pixel precision, and positions e1, e2, and e3 indicate positions with ¼ pixel precision. First, Clip1( ) is defined as with the following Expression (1).
[Mathematical Expression 1]
$$\mathrm{Clip1}(a) = \begin{cases} 0 & \text{if } (a < 0) \\ \mathit{max\_pix} & \text{if } (a > \mathit{max\_pix}) \\ a & \text{otherwise} \end{cases} \qquad (1)$$
Note that in the event that the input image has 8-bit precision, the value of max_pix is 255.
The pixel values in the positions b and d are generated as with the following Expression (2) using a 6-tap FIR filter.
[Mathematical Expression 2]
$$F = A_{-2} - 5 \cdot A_{-1} + 20 \cdot A_{0} + 20 \cdot A_{1} - 5 \cdot A_{2} + A_{3}$$
$$b, d = \mathrm{Clip1}((F + 16) \gg 5) \qquad (2)$$
The pixel value in the position c is generated as with the following Expression (3) by applying a 6-tap FIR filter in the horizontal direction and the vertical direction.
[Mathematical Expression 3]
$$F = b_{-2} - 5 \cdot b_{-1} + 20 \cdot b_{0} + 20 \cdot b_{1} - 5 \cdot b_{2} + b_{3}$$
or
$$F = d_{-2} - 5 \cdot d_{-1} + 20 \cdot d_{0} + 20 \cdot d_{1} - 5 \cdot d_{2} + d_{3}$$
$$c = \mathrm{Clip1}((F + 512) \gg 10) \qquad (3)$$
Note that Clip processing is executed only once at the end, after the sum-of-products processing in both the horizontal direction and the vertical direction has been performed.
Positions e1 through e3 are generated by linear interpolation as shown in the following Expression (4).
[Mathematical Expression 4]
$$e_1 = (A + b + 1) \gg 1$$
$$e_2 = (b + d + 1) \gg 1$$
$$e_3 = (b + c + 1) \gg 1 \qquad (4)$$
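For illustration only, Expressions (1) through (4) may be sketched in Python as follows. The function names are assumptions introduced here and are not part of the H.264/AVC format. The sketch also assumes that the inputs to Expression (3) are the unclipped intermediate sums F of Expression (2), which is how the single final Clip and the shift by 10 bits are read here.

```python
def clip1(a, max_pix=255):
    """Expression (1): clamp a to the range [0, max_pix] (255 for 8-bit input)."""
    return max(0, min(a, max_pix))


def six_tap(p):
    """6-tap FIR sum F over six consecutive samples p[0..5] = X(-2)..X(3)."""
    return p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5]


def half_pel(pixels):
    """Expression (2): half-pel value b or d from six integer-position pixels."""
    return clip1((six_tap(pixels) + 16) >> 5)


def center_half_pel(intermediate_sums):
    """Expression (3): position c from six unclipped sums F (assumed input)."""
    return clip1((six_tap(intermediate_sums) + 512) >> 10)


def quarter_pel(p, q):
    """Expression (4): rounded average of two neighboring samples."""
    return (p + q + 1) >> 1


# A flat region of value 100 should interpolate to 100 at every position.
row = [100] * 6
b = half_pel(row)
c = center_half_pel([six_tap(row)] * 6)
e1 = quarter_pel(100, b)
```

The filter taps sum to 32, which is why one pass is normalized by a shift of 5 bits and two passes by a shift of 10 bits.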
Also, with the MPEG2 format, in the event of the frame motion compensation mode, motion prediction/compensation processing is performed in increments of 16×16 pixels, and in the event of the field motion compensation mode, motion prediction/compensation processing is performed as to each of the first field and the second field in increments of 16×8 pixels.
On the other hand, with motion prediction compensation with the H.264/AVC format, the macroblock size is 16×16 pixels, but motion prediction/compensation can be performed with the block size being variable.
FIG. 2 is a diagram illustrating an example of the block size of motion prediction/compensation according to the H.264/AVC format.
Macroblocks made up of 16×16 pixels divided into 16×16-pixel, 16×8-pixel, 8×16-pixel, and 8×8-pixel partitions are shown from the left in order on the upper tier in FIG. 2. 8×8-pixel partitions divided into 8×8-pixel, 8×4-pixel, 4×8-pixel, and 4×4-pixel sub partitions are shown from the left in order on the lower tier in FIG. 2.
That is to say, with the H.264/AVC format, one macroblock may be divided into one of 16×16-pixel, 16×8-pixel, 8×16-pixel, and 8×8-pixel partitions with each partition having independent motion vector information. Also, an 8×8-pixel partition may be divided into one of 8×8-pixel, 8×4-pixel, 4×8-pixel, and 4×4-pixel sub partitions with each sub partition having independent motion vector information.
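As a rough illustration of how the partitioning above affects the amount of motion vector information, the following Python sketch counts the motion vectors of one macroblock. The names are hypothetical, and the sketch assumes exactly one motion vector per partition or sub partition.

```python
MB_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]


def count_blocks(parent, part):
    """Number of part-sized blocks that tile a parent-sized block."""
    (pw, ph), (w, h) = parent, part
    return (pw // w) * (ph // h)


def motion_vector_count(mb_part, sub_parts=None):
    """Motion vectors in one macroblock, assuming one vector per block.

    sub_parts lists the sub partition size chosen for each of the four
    8x8 partitions; it is only used when mb_part is (8, 8).
    """
    if mb_part != (8, 8):
        return count_blocks((16, 16), mb_part)
    sub_parts = sub_parts or [(8, 8)] * 4
    return sum(count_blocks((8, 8), s) for s in sub_parts)


two_vectors = motion_vector_count((16, 8))
worst_case = motion_vector_count((8, 8), [(4, 4)] * 4)
mixed = motion_vector_count((8, 8), SUB_PARTITIONS)
```

Under these assumptions a macroblock carries between 1 and 16 motion vectors, depending on the chosen partitioning.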
Further, with the H.264/AVC format, motion prediction/compensation processing of multi-reference frames is also performed.
FIG. 3 is a diagram for describing the prediction/compensation processing of multi-reference frames according to the H.264/AVC format. With the H.264/AVC format, the motion prediction/compensation method of multi-reference frames (Multi-Reference Frame) is stipulated.
With the example in FIG. 3, the current frame Fn to be encoded from now on, and encoded frames Fn-5 through Fn-1, are shown. The frame Fn-1 is, on the temporal axis, a frame one frame before the current frame Fn, the frame Fn-2 is a frame two frames before the current frame Fn, and the frame Fn-3 is a frame three frames before the current frame Fn. Similarly, the frame Fn-4 is a frame four frames before the current frame Fn, and the frame Fn-5 is a frame five frames before the current frame Fn. In general, the closer a frame is to the current frame Fn on the temporal axis, the smaller the reference picture number (ref_id) added to it. Specifically, the frame Fn-1 has the smallest reference picture number, and thereafter the reference picture numbers increase in the order of Fn-2, . . . , Fn-5.
With the current frame Fn, a block A1 and a block A2 are shown. A motion vector V1 is searched on the assumption that the block A1 is correlated with a block A1′ of the frame Fn-2 that is two frames before the current frame Fn. Similarly, a motion vector V2 is searched on the assumption that the block A2 is correlated with a block A2′ of the frame Fn-4 that is four frames before the current frame Fn.
As described above, with the H.264/AVC format, different reference frames may be referenced in one frame (picture) with multi-reference frames stored in memory. That is, independent reference frame information (reference picture number (ref_id)) may be provided for each block in one picture, such that the block A1 references the frame Fn-2, and the block A2 references the frame Fn-4, for example.
Here, the blocks indicate one of 16×16-pixel, 16×8-pixel, 8×16-pixel, and 8×8-pixel partitions described with reference to FIG. 2. Reference frames within an 8×8-pixel sub-block partition have to agree.
As described above, with the H.264/AVC format, vast amounts of motion vector information are generated by performing the ¼-pixel motion prediction/compensation processing described above with reference to FIG. 1 and the motion prediction/compensation processing described above with reference to FIG. 2 and FIG. 3. If this information is encoded without change, deterioration in encoding efficiency is caused. In response to this, with the H.264/AVC format, reduction in motion vector coding information has been realized according to a method shown in FIG. 4.
FIG. 4 is a diagram for describing a motion vector information generating method according to the H.264/AVC format.
With the example in FIG. 4, a current block E to be encoded from now on (e.g., 16×16 pixels), and blocks A through D, which have already been encoded, adjacent to the current block E are shown.
That is to say, the block D is adjacent to the upper left of the current block E, the block B is adjacent above the current block E, the block C is adjacent to the upper right of the current block E, and the block A is adjacent to the left of the current block E. Note that the blocks A through D are not sectioned because each of the blocks represents a block having one of the structures from 16×16 pixels through 4×4 pixels described above with reference to FIG. 2.
For example, let us say that motion vector information as to X (=A, B, C, D, E) is represented with mvX. First, prediction motion vector information pmvE as to the current block E is generated as with the following Expression (5) by median prediction using motion vector information regarding the blocks A, B, and C.
$$\mathrm{pmv}_E = \mathrm{med}(\mathrm{mv}_A, \mathrm{mv}_B, \mathrm{mv}_C) \qquad (5)$$
The motion vector information regarding the block C may not be usable (may be unavailable) due to a reason such as being at the edge of an image frame, not having been encoded yet, or the like. In this case, the motion vector information regarding the block D is used instead of the motion vector information regarding the block C.
Data mvdE to be added to the header portion of the compressed image, serving as the motion vector information as to the current block E, is generated as with the following Expression (6) using pmvE.
$$\mathrm{mvd}_E = \mathrm{mv}_E - \mathrm{pmv}_E \qquad (6)$$
Note that, in reality, processing is independently performed as to the components in the horizontal direction and vertical direction of the motion vector information.
In this way, prediction motion vector information is generated based on correlation with adjacent blocks, and difference motion vector information, which is the difference between the prediction motion vector information and the motion vector information, is added to the header portion of the compressed image, whereby the motion vector information can be reduced.
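The median prediction of Expressions (5) and (6) may be sketched in Python as follows, purely as an illustration. The function names and the use of None to mark an unavailable block C are assumptions made here. The sketch further assumes that blocks A and B, and at least one of C and D, are available.

```python
def median3(a, b, c):
    """Median of three scalar values."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a, mv_b, mv_c, mv_d=None):
    """Expression (5), applied independently to the x and y components.

    When mv_c is None (block C unavailable), mv_d is used in its place.
    """
    if mv_c is None:
        mv_c = mv_d
    return tuple(median3(a, b, c) for a, b, c in zip(mv_a, mv_b, mv_c))


def mv_difference(mv_e, pmv_e):
    """Expression (6): difference motion vector sent in the header."""
    return tuple(m - p for m, p in zip(mv_e, pmv_e))


def reconstruct_mv(mvd_e, pmv_e):
    """Decoder side: recover mvE by adding the same prediction back."""
    return tuple(d + p for d, p in zip(mvd_e, pmv_e))


pmv = predict_mv((4, -2), (6, 0), (5, 8))
mvd = mv_difference((7, 1), pmv)
```

Because the decoder can form the same prediction from already-decoded neighbors, only the typically small difference needs to be transmitted.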
Also, though the information amount of the motion vector information regarding B pictures is vast, with the H.264/AVC format, a mode referred to as a direct mode is prepared. In the direct mode, motion vector information is not stored in a compressed image.
That is to say, on the decoding side, the motion vector information of the current block is extracted either from motion vector information around the current block or, in a reference picture, from the motion vector information of a co-located block that is a block having the same coordinates as the current block. Accordingly, the motion vector information does not have to be transmitted to the decoding side.
This direct mode includes two types, a spatial direct mode (Spatial Direct Mode) and a temporal direct mode (Temporal Direct Mode). The spatial direct mode is a mode for taking advantage of correlation of motion information principally in the spatial direction (horizontal and vertical two-dimensional space within a picture), and generally has an advantage in the event of an image including similar motions of which the motion speeds vary. On the other hand, the temporal direct mode is a mode for taking advantage of correlation of motion information principally in the temporal direction, and generally has an advantage in the event of an image including different motions of which the motion speeds are constant.
Which of the spatial direct mode and the temporal direct mode is to be employed can be switched for each slice.
Referencing FIG. 4 again, the spatial direct mode according to the H.264/AVC format will be described. With the example in FIG. 4, as described above, the current block E to be encoded from now on (e.g., 16×16 pixels), and the blocks A through D, which have already been encoded, adjacent to the current block E are shown. Also, the motion vector information as to X (=A, B, C, D, E) is represented with mvX, for example.
The prediction motion vector information pmvE as to the current block E is generated as with the above-described Expression (5) by median prediction using the motion vector information regarding the blocks A, B, and C. Also, motion vector information mvE as to the current block E in the spatial direct mode is represented with the following Expression (7).
$$\mathrm{mv}_E = \mathrm{pmv}_E \qquad (7)$$
That is to say, in the spatial direct mode, the prediction motion vector information generated by median prediction is taken as the motion vector information of the current block. In other words, the motion vector information of the current block is generated from the motion vector information of encoded blocks. Accordingly, the motion vector according to the spatial direct mode can also be generated on the decoding side, and the motion vector information does not have to be transmitted to the decoding side.
Next, the temporal direct mode according to the H.264/AVC format will be described with reference to FIG. 5.
With the example in FIG. 5, the temporal axis t represents elapse of time, and an L0 (List0) reference picture, the current picture to be encoded from now on, and an L1 (List1) reference picture are shown from the left in order. Note that, with the H.264/AVC format, the arrangement of the L0 reference picture, current picture, and L1 reference picture is not restricted to this order.
The current block of the current picture is included in a B slice, for example. Accordingly, with regard to the current block of the current picture, L0 motion vector information mvL0 and L1 motion vector information mvL1 based on the temporal direct mode are calculated as to the L0 reference picture and L1 reference picture.
Also, with the L0 reference picture, motion vector information mvcol in a co-located block that is a block positioned in the same spatial address (coordinates) as the current block to be encoded from now on is calculated based on the L0 reference picture and L1 reference picture.
Now, let us say that distance on the temporal axis between the current picture and L0 reference picture is taken as TDB, and distance on the temporal axis between the L0 reference picture and L1 reference picture is taken as TDD. In this case, the L0 motion vector information mvL0 in the current picture, and the L1 motion vector information mvL1 in the current picture can be calculated with the following Expression (8).
[Mathematical Expression 5]
$$\mathrm{mv}_{L0} = \frac{TD_B}{TD_D}\,\mathrm{mv}_{col}$$
$$\mathrm{mv}_{L1} = \frac{TD_D - TD_B}{TD_D}\,\mathrm{mv}_{col} \qquad (8)$$
Note that, with the H.264/AVC format, there is no information equivalent to distances TDB and TDD on the temporal axis t as to the current picture within the compressed image. Accordingly, POC (Picture Order Count) that is information indicating the output sequence of pictures is employed as the actual values of the distances TDB and TDD.
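As a sketch only, the scaling of Expression (8) with POC values standing in for the distances TDB and TDD might look as follows in Python. The function and parameter names are assumptions, and exact rational arithmetic is used here for clarity; an actual codec would use its own fixed-point rounding, which is not reproduced.

```python
from fractions import Fraction


def temporal_direct(mv_col, poc_cur, poc_l0, poc_l1):
    """Expression (8): derive mvL0 and mvL1 from the co-located vector.

    TDB is taken as the POC distance from the L0 reference picture to the
    current picture, and TDD as the POC distance from the L0 reference
    picture to the L1 reference picture (assumed mapping).
    """
    td_b = poc_cur - poc_l0
    td_d = poc_l1 - poc_l0
    if td_d == 0:
        raise ValueError("L0 and L1 reference pictures have the same POC")
    scale = Fraction(td_b, td_d)
    mv_l0 = tuple(scale * c for c in mv_col)
    mv_l1 = tuple((1 - scale) * c for c in mv_col)
    return mv_l0, mv_l1


# Current picture one quarter of the way from the L0 to the L1 reference.
mv_l0, mv_l1 = temporal_direct((8, -4), poc_cur=1, poc_l0=0, poc_l1=4)
```

With these inputs the co-located vector (8, -4) is split into (2, -1) for L0 and (6, -3) for L1, following the signs of Expression (8) as written.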
Also, with the H.264/AVC format, the direct mode can be defined with increments of 16×16 pixel macroblocks, or 8×8 pixel blocks.
Now, referencing FIG. 4, NPL 1 proposes the following method to improve motion vector encoding using median prediction.
That is to say, the proposal is to adaptively use one of spatial prediction motion vector information (Spatial Predictor) obtained from the above-described Expression (5), and also temporal prediction motion vector information (Temporal Predictor) and spatio-temporal prediction motion vector information (Spatio-Temporal Predictor) which will be described with reference to FIG. 6, as prediction motion vector information.
With the example in FIG. 6, there are shown a frame N which is the current frame to be encoded, and a frame N-1 which is a reference frame referenced at the time of performing searching of motion vectors.
In frame N, the current block to be encoded now has motion vector information mv indicated as to the current block, and the already-encoded blocks adjacent to the current block each have motion vector information mva, mvb, mvc, and mvd, as to the respective blocks.
Specifically, the block adjacent to the current block at the upper left has the motion vector information mvd indicated corresponding to that block, and the block adjacent above the current block has the motion vector information mvb indicated corresponding to that block. The block adjacent to the current block at the upper right has the motion vector information mvc indicated corresponding to that block, and the block adjacent to the current block at the left has the motion vector information mva indicated corresponding to that block.
In frame N-1, a corresponding block (Co-Located block) to the current block has motion vector information mvcol indicated as to the corresponding block. Note that here, a corresponding block is a block in an already-encoded frame that is different from the current frame (a frame situated before or after), and is a block at a position corresponding to the current block.
Also, in frame N-1, the blocks adjacent to the corresponding block have motion vector information mvt4, mvt0, mvt7, mvt1, mvt3, mvt5, mvt2, and mvt6, indicated respectively as to each block.
Specifically, the block adjacent to the corresponding block at the upper left has motion vector information mvt4 indicated corresponding to that block, and the block adjacent above the corresponding block has motion vector information mvt0 indicated corresponding to that block. The block adjacent to the corresponding block at the upper right has motion vector information mvt7 indicated corresponding to that block, and the block adjacent to the corresponding block at the left has motion vector information mvt1 indicated corresponding to that block. The block adjacent to the corresponding block at the right has motion vector information mvt3 indicated corresponding to that block, and the block adjacent to the corresponding block at the lower left has motion vector information mvt5 indicated corresponding to that block. The block adjacent below the corresponding block has motion vector information mvt2 indicated corresponding to that block, and the block adjacent to the corresponding block at the lower right has motion vector information mvt6 indicated corresponding to that block.
Also, while the prediction motion vector information pmv in the above-described Expression (5) was generated using motion vector information of blocks adjacent to the current block, the respective prediction motion vector information pmvtm5, pmvtm9, and pmvspt are defined as with the following Expressions (9) and (10). Note that of these, pmvtm5 and pmvtm9 are temporal prediction motion vector information, and pmvspt is spatio-temporal prediction motion vector information.
Temporal Predictor:
$$\mathrm{pmv}_{tm5} = \mathrm{med}(\mathrm{mv}_{col}, \mathrm{mv}_{t0}, \ldots, \mathrm{mv}_{t3})$$
$$\mathrm{pmv}_{tm9} = \mathrm{med}(\mathrm{mv}_{col}, \mathrm{mv}_{t0}, \ldots, \mathrm{mv}_{t7}) \qquad (9)$$
Spatio-Temporal Predictor:
$$\mathrm{pmv}_{spt} = \mathrm{med}(\mathrm{mv}_{col}, \mathrm{mv}_{col}, \mathrm{mv}_{a}, \mathrm{mv}_{b}, \mathrm{mv}_{c}) \qquad (10)$$
As to which prediction motion vector information of Expression (5), Expression (9), and Expression (10) to use, cost function values are calculated for the case of using each prediction motion vector information, and the selection is made based on these values. A flag indicating which prediction motion vector information has been used for each block is then transmitted to the decoding side.
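The candidate predictors of Expressions (5), (9), and (10) and the selection among them may be sketched in Python as follows. This is an illustration under stated assumptions, not the method of NPL 1: the names are hypothetical, the median is taken independently per component, and the cost function, which is not specified above, is replaced here by a simple stand-in (the sum of absolute components of the difference vector), with the lowest-cost candidate chosen.

```python
def median_mv(vectors):
    """Component-wise median of an odd number of (x, y) vectors."""
    n = len(vectors)
    if n % 2 == 0:
        raise ValueError("median_mv expects an odd number of vectors")
    return tuple(sorted(v[i] for v in vectors)[n // 2] for i in range(2))


def candidate_predictors(mv_a, mv_b, mv_c, mv_col, mv_t):
    """Build the candidates; mv_t holds the eight vectors mvt0..mvt7."""
    return {
        "spatial": median_mv([mv_a, mv_b, mv_c]),                  # (5)
        "tm5": median_mv([mv_col, *mv_t[:4]]),                     # (9)
        "tm9": median_mv([mv_col, *mv_t[:8]]),                     # (9)
        "spt": median_mv([mv_col, mv_col, mv_a, mv_b, mv_c]),      # (10)
    }


def select_predictor(mv, candidates):
    """Pick the candidate with the lowest stand-in cost (assumption)."""
    def cost(pmv):
        return abs(mv[0] - pmv[0]) + abs(mv[1] - pmv[1])
    name = min(candidates, key=lambda k: cost(candidates[k]))
    return name, candidates[name]


cands = candidate_predictors(
    (0, 0), (2, 2), (4, 4), mv_col=(10, 10), mv_t=[(10, 10)] * 8
)
flag, pmv = select_predictor((9, 9), cands)
```

In this example the spatial neighbors disagree with the actual motion while the co-located block matches it closely, so a temporal predictor is chosen, and the returned name plays the role of the flag sent to the decoding side.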
Note that the above-described drawings and Expressions will also be used in description of the present application as appropriate.