Conventionally, encoding schemes that use motion compensation and an orthogonal transform such as the discrete cosine transform, Karhunen-Loève transform, or wavelet transform, including MPEG (Moving Picture Experts Group), H.26x, etc., have generally been used for handling moving images. In these moving image encoding schemes, the amount of code is reduced by utilizing the correlation in the spatial direction and the time direction among the characteristics of an input image signal to be encoded.
For example, in H.264, unidirectional prediction or bidirectional prediction is used when an inter-frame that is a frame to be subjected to inter-frame prediction (inter-prediction) is generated utilizing the correlation in the time direction. Inter-frame prediction is designed to generate a prediction image on the basis of frames at different times.
FIG. 1 is a diagram illustrating an example of unidirectional prediction.
As illustrated in FIG. 1, in a case where a frame to be encoded P0, which is the frame at the present time and the encoding target, is to be generated through unidirectional prediction, motion compensation is performed using, as reference frames, encoded frames that precede or follow the present time. The residual between a prediction image and an actual image is encoded by utilizing the correlation in the time direction, thus making it possible to reduce the amount of code. Reference frame information specifies a reference frame, and a motion vector specifies the position to be referred to in that reference frame; both pieces of information are transmitted from the encoding side to the decoding side.
Here, the number of reference frames is not limited to one. For example, in H.264, it is possible to use a plurality of frames as reference frames. As illustrated in FIG. 1, in a case where two frames closer in time to the frame to be encoded P0 are denoted by reference frames R0 and R1 in this order, a pixel value in an arbitrary macroblock in the frame to be encoded P0 can be predicted from the pixel value of an arbitrary pixel in the reference frame R0 or R1.
In FIG. 1, a box indicated inside each frame represents a macroblock. If a macroblock in the frame to be encoded P0, which is a prediction target, is represented by a macroblock MBP0, then, the macroblock in the reference frame R0 corresponding to the macroblock MBP0 is a macroblock MBR0 that is specified by a motion vector MV0. Furthermore, the macroblock in the reference frame R1 is a macroblock MBR1 that is specified by a motion vector MV1.
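The relationship between a macroblock and the reference block specified by its motion vector can be sketched as follows. This is a minimal illustration only: the list-of-rows frame representation, the function name `fetch_block`, and the (dy, dx) integer motion vector layout are assumptions made for the example and are not taken from any standard.

```python
def fetch_block(frame, top, left, mv, size=16):
    """Return the size x size block of `frame` displaced by motion vector `mv`.

    frame: list of rows of integer pixel values (hypothetical representation).
    (top, left): position of the macroblock in the frame to be encoded.
    mv: (dy, dx) integer displacement into the reference frame.
    """
    dy, dx = mv
    r0, c0 = top + dy, left + dx
    # Assumption for this sketch: the referenced block must lie inside the frame.
    if r0 < 0 or c0 < 0 or r0 + size > len(frame) or c0 + size > len(frame[0]):
        raise ValueError("motion vector points outside the reference frame")
    return [row[c0:c0 + size] for row in frame[r0:r0 + size]]


# Toy 8x8 reference frame in which each pixel value encodes its own position.
ref = [[10 * r + c for c in range(8)] for r in range(8)]
mbr0 = fetch_block(ref, top=2, left=2, mv=(1, -1), size=4)
print(mbr0[0])  # [31, 32, 33, 34]
```

Here `mbr0` plays the role of the macroblock MBR0, and `mv` plays the role of the motion vector MV0.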
If pixel values in the macroblocks MBR0 and MBR1 (pixel values in motion compensation images) are represented by MC0(i, j) and MC1(i, j), then a pixel value in either motion compensation image is used as a pixel value in a prediction image in unidirectional prediction. Thus, a prediction image Pred(i, j) is represented by Equation (1) below. Here, (i, j) represents the relative position of a pixel in a macroblock, and satisfies 0≤i≤15 and 0≤j≤15 for a macroblock of 16×16 pixels. In Equation (1), “∥” indicates that one of the values MC0(i, j) and MC1(i, j) is taken.
[Math. 1]
$$\mathrm{Pred}(i,j) = \mathrm{MC}_0(i,j) \parallel \mathrm{MC}_1(i,j) \qquad (1)$$
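As a rough illustration of Equation (1), the sketch below takes one of the two motion compensation images as the prediction image and forms the residual to be encoded. The text states only that one of the two values is taken; the rule used here (smallest sum of absolute differences, SAD) and the names `sad` and `predict_unidirectional` are assumptions introduced for the example.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def predict_unidirectional(actual, mc0, mc1):
    """Take MC0 or MC1 as Pred, as in Equation (1).

    Assumed selection rule: the candidate with the smaller SAD against
    the actual block is chosen.
    """
    ref_idx = 0 if sad(actual, mc0) <= sad(actual, mc1) else 1
    pred = mc0 if ref_idx == 0 else mc1
    residual = [[x - p for x, p in zip(ra, rp)] for ra, rp in zip(actual, pred)]
    return ref_idx, pred, residual


# Toy 2x2 blocks standing in for 16x16 macroblocks.
actual = [[100, 102], [104, 106]]
mc0 = [[90, 95], [100, 100]]
mc1 = [[101, 102], [103, 107]]
ref_idx, pred, residual = predict_unidirectional(actual, mc0, mc1)
print(ref_idx, residual)  # 1 [[-1, 0], [1, -1]]
```

The returned `ref_idx` corresponds to the reference frame information that would be transmitted together with the motion vector.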
Note that it is also possible to divide a single macroblock of 16×16 pixels into sub-blocks of 16×8 pixels or the like and to perform motion compensation on each of the sub-blocks by referring to a different reference frame. Furthermore, motion vectors with fractional (sub-pixel) accuracy, rather than integer accuracy, are transmitted, and interpolation is performed using an FIR filter defined in a standard, so that the pixel values of pixels around the corresponding position to be referred to can also be used for motion compensation.
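For illustration, half-pixel interpolation along one row can be sketched with the 6-tap coefficients (1, -5, 20, 20, -5, 1) that H.264 uses for luma half-sample positions, followed by rounding, division by 32, and clipping to the 8-bit range. The one-dimensional simplification, the clamping of indices at the row edges, and the function name `half_pel` are assumptions of this sketch.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # 6-tap FIR coefficients; they sum to 32


def half_pel(row, x):
    """Interpolate the sample halfway between row[x] and row[x + 1].

    Assumption for this sketch: indices outside the row are clamped to
    the nearest valid sample.
    """
    n = len(row)
    acc = 0
    for k, tap in enumerate(TAPS):
        idx = min(max(x - 2 + k, 0), n - 1)  # samples x-2 .. x+3
        acc += tap * row[idx]
    value = (acc + 16) >> 5          # round and divide by 32
    return min(max(value, 0), 255)   # clip to the 8-bit range


flat = [50] * 8
ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(half_pel(flat, 3))  # 50
print(half_pel(ramp, 3))  # 35
```

On the linear ramp, the result 35 lies midway between the neighboring samples 30 and 40, which is what the filter is expected to produce for a smooth signal.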
FIG. 2 is a diagram illustrating an example of bidirectional prediction.
As illustrated in FIG. 2, in a case where a frame to be encoded B0, which is the frame at the present time and the encoding target, is to be generated through bidirectional prediction, motion compensation is performed using, as reference frames, encoded frames that precede and follow the present time. The residual between a prediction image and an actual image is encoded by using a plurality of encoded frames as reference frames and by utilizing the correlation therewith, thus making it possible to reduce the amount of code. In H.264, it is also possible to use a plurality of frames in the past and a plurality of frames in the future as reference frames.
As illustrated in FIG. 2, in a case where one frame in the past and one frame in the future with respect to the frame to be encoded B0 are used as reference frames L0 and L1, a pixel value in an arbitrary macroblock in the frame to be encoded B0 can be predicted from the pixel values of arbitrary pixels in the reference frames L0 and L1.
In the example in FIG. 2, the macroblock in the reference frame L0 corresponding to a macroblock MBB0 in the frame to be encoded B0 is set as a macroblock MBL0 that is specified by a motion vector MV0. Furthermore, the macroblock in the reference frame L1 corresponding to the macroblock MBB0 in the frame to be encoded B0 is set as a macroblock MBL1 that is specified by a motion vector MV1.
If pixel values of the macroblocks MBL0 and MBL1 are represented by MC0(i, j) and MC1(i, j), respectively, then the pixel value Pred(i, j) of a prediction image can be determined as the average value of these pixel values, as given in Equation (2) below.
[Math. 2]
$$\mathrm{Pred}(i,j) = \frac{\mathrm{MC}_0(i,j) + \mathrm{MC}_1(i,j)}{2} \qquad (2)$$
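The averaging in Equation (2) can be sketched as follows. Equation (2) does not state how a non-integer average is handled; rounding half up with `(a + b + 1) >> 1` is an assumption made here so that the result stays an integer pixel value. The function name `predict_bidirectional` is likewise hypothetical.

```python
def predict_bidirectional(mc0, mc1):
    """Average two motion compensation images pixel by pixel (Equation (2)).

    Assumption: integer pixel values, with the average rounded half up.
    """
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)]
            for r0, r1 in zip(mc0, mc1)]


# Toy 2x2 blocks standing in for the macroblocks MBL0 and MBL1.
mc0 = [[100, 101], [50, 51]]
mc1 = [[102, 104], [50, 54]]
pred = predict_bidirectional(mc0, mc1)
print(pred)  # [[101, 103], [50, 53]]
```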
In motion compensation using unidirectional prediction as described above, the accuracy of a prediction image is improved by increasing the accuracy of a motion vector or by reducing the size of a macroblock. The residuals from the actual image are thereby reduced, which improves encoding efficiency.
Furthermore, in motion compensation using bidirectional prediction, the average of the pixel values of pixels in reference frames located close in time is used as the pixel value of a pixel in a prediction image, thus making feasible a probabilistically stable reduction in prediction residual.
FIG. 3 is a diagram illustrating an example of intra-prediction.
The example in FIG. 3 illustrates how prediction is performed from decoded neighboring pixels in the same screen in order to decode the current block of an encoded frame I0. In images, nearby pixel values generally have significantly high correlation. Thus, prediction from neighboring pixels in this manner reduces the residual components of the current block, and encoding efficiency is thereby improved.
For example, in intra 4×4 prediction based on the H.264 standard, it is possible to predict the current block using nine methods by utilizing nearby encoded pixels. Two-dimensional directivity is incorporated into the correlation with nearby images, thus realizing improvement in prediction accuracy.
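As a hedged sketch, three of the nine intra 4×4 methods (vertical, horizontal, and DC) can be written as below, assuming that both the four pixels above and the four pixels to the left of the current block are available. The function names and the SAD-based mode choice are assumptions for illustration; the remaining directional methods are omitted.

```python
def intra4x4_vertical(top):
    """Each column repeats the encoded pixel directly above the block."""
    return [list(top) for _ in range(4)]


def intra4x4_horizontal(left):
    """Each row repeats the encoded pixel directly to the left of the block."""
    return [[p] * 4 for p in left]


def intra4x4_dc(top, left):
    """Every pixel is the rounded mean of the eight neighboring pixels."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_mode(block, top, left):
    """Assumed selection rule: pick the candidate with the smallest SAD."""
    candidates = {
        "vertical": intra4x4_vertical(top),
        "horizontal": intra4x4_horizontal(left),
        "dc": intra4x4_dc(top, left),
    }
    return min(candidates.items(), key=lambda kv: sad(block, kv[1]))[0]


top = [10, 20, 30, 40]
left = [12, 12, 12, 12]
block = [[10, 20, 30, 40] for _ in range(4)]  # vertical stripes
print(best_mode(block, top, left))  # vertical
```

For a block of vertical stripes, the vertical method reproduces the block exactly, which illustrates how directional prediction exploits two-dimensional correlation.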
As another intra-prediction method, a technique exists in which a high-correlation area is copied from within the screen. Specifically, a position in the decoded image is specified in order to decode the current block, and the area at that position is used as the prediction image of the current block.
This technique provides high prediction efficiency for a regular pattern, in a case where a plurality of objects having the same shape exist in a screen, or in similar cases.
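The copying technique can be illustrated by searching the already decoded part of the screen for the area that best matches the current block. In this sketch the decoded area is assumed to consist of the rows entirely above the current block, the match criterion is assumed to be SAD, and the names `copy_block` and `find_best_copy` are hypothetical.

```python
def copy_block(picture, top, left, size):
    """Cut a size x size area out of the picture at (top, left)."""
    return [row[left:left + size] for row in picture[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def find_best_copy(picture, cur_top, cur_left, size):
    """Exhaustively search blocks lying entirely above the current block.

    Returns (cost, top, left) of the first best match, or None when no
    candidate position exists.
    """
    target = copy_block(picture, cur_top, cur_left, size)
    best = None
    for top in range(0, cur_top - size + 1):
        for left in range(0, len(picture[0]) - size + 1):
            cost = sad(target, copy_block(picture, top, left, size))
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best


# Regular pattern: 2x2 tiles alternating between 200 and 50.
picture = [[200 if (r // 2 + c // 2) % 2 == 0 else 50 for c in range(8)]
           for r in range(8)]
best = find_best_copy(picture, cur_top=4, cur_left=4, size=2)
print(best)  # (0, 0, 0)
```

Because the pattern repeats, an identical area is found with zero cost, so only the position of that area would need to be conveyed for the current block.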
As still another intra-prediction method, a technology also exists in which signal components in a characteristic area or a texture area of an encoding target image are analyzed and an artificially synthesized image is used in place of the image to be encoded, thereby reducing the amount of code.
In this manner, with the emergence of various technologies for intra-prediction, the prediction accuracy of intra-prediction has been improved. In general moving images, however, the prediction accuracy of inter-prediction is still higher. For example, when a texture is stationary in a screen, inter-prediction yields almost zero prediction residual even if the texture is considerably complicated, whereas it is difficult to increase the accuracy of intra-prediction for such a texture.
Furthermore, as another prediction method, a technique has been considered in which the correlation in the time direction is converted into the spatial resolution by motion compensation and FIR filtering of pixel values and the spatial resolution is utilized (see, for example, NPL 1).
In the method described in NPL 1, the correlation in the time direction is utilized for the process of increasing the resolution of an input image sequence. Specifically, difference information on a motion-predicted/compensated image between the current image and the previous image is calculated, and is fed back to the target current image to recover the high-frequency components included in the input image.