Conventionally, encoding schemes that use motion compensation and an orthogonal transform such as the discrete cosine transform, Karhunen-Loève transform, or wavelet transform, including MPEG (Moving Picture Experts Group), H.26x, etc., have been generally utilized for handling moving images. In these moving image encoding schemes, the amount of code is reduced by utilizing the correlation in the spatial direction and the time direction among the characteristics of an input image signal to be encoded.
For example, in H.264, unidirectional prediction or bidirectional prediction is used when an inter-frame that is a frame to be subjected to inter-frame prediction (inter-prediction) is generated utilizing the correlation in the time direction. Inter-frame prediction is designed to generate a prediction image on the basis of frames at different times.
Furthermore, in SVC (Scalable Video Coding), which is a standard extension of H.264, an encoding scheme that takes spatial scalability into account has been established. The SVC (H.264/AVC Annex G) is the up-to-date video coding standard that was standardized in November 2007 by ITU-T (International Telecommunication Union Telecommunication Standardization Sector) and ISO/IEC (International Organization for Standardization/International Electrotechnical Commission).
FIG. 1 illustrates a reference relationship to create a prediction image for compression that takes spatial scalability into account in the SVC. In the SVC, encoding is performed at a plurality of resolutions in, for example, a base layer and an enhancement layer illustrated in FIG. 1. In the case of the example in FIG. 1, as a base layer, an image having a resolution of n×m [pixel (pix)] (n and m are integers) is encoded using spatial scalability. Together with this, an image having a resolution of N×M [pixel (pix)] (N and M are integers, where N>n and M>m), as an enhancement layer, is encoded using spatial scalability.
In the case of the base layer, the current frame is encoded utilizing intra-prediction or inter-prediction similarly to the case of encoding based on the H.264 standard. In the case of the example in FIG. 1, when encoding of the base layer is performed, two reference planes (Ref0, Ref1) are used. Motion compensation images (MC0, MC1) from the individual reference planes are extracted, and inter-prediction is performed.
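The extraction of MC0 and MC1 and their combination can be sketched as follows. This is a minimal illustration, not the procedure defined by the standard: the plane contents, block positions, and the helper names `extract_block` and `bi_predict` are hypothetical, motion vectors are restricted to whole pixels, and the two motion compensation images are combined by a rounded average.

```python
def extract_block(plane, top, left, height, width):
    """Copy a height x width block from a reference plane (a list of rows)."""
    return [row[left:left + width] for row in plane[top:top + height]]


def bi_predict(mc0, mc1):
    """Combine two motion compensation blocks by a rounded average."""
    return [[(a + b + 1) >> 1 for a, b in zip(row0, row1)]
            for row0, row1 in zip(mc0, mc1)]


# Hypothetical 4x4 reference planes (Ref0, Ref1); values are made up.
ref0 = [[10, 20, 30, 40],
        [50, 60, 70, 80],
        [90, 100, 110, 120],
        [130, 140, 150, 160]]
ref1 = [[12, 22, 32, 42],
        [52, 62, 72, 82],
        [92, 102, 112, 122],
        [132, 142, 152, 162]]

# Assumed integer-pixel block positions, one per reference plane.
mc0 = extract_block(ref0, 0, 0, 2, 2)
mc1 = extract_block(ref1, 1, 1, 2, 2)

prediction = bi_predict(mc0, mc1)
print(prediction)  # [[36, 46], [76, 86]]
```

Only the difference between the current block and `prediction` then needs to be encoded, which is where the reduction in the amount of code comes from.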
Also in the case of the enhancement layer, similarly to the case of the base layer, the current frame can be encoded utilizing intra-prediction or inter-prediction.
In the case of intra-prediction, prediction is performed utilizing spatial correlation in the enhancement layer of the current frame. Intra-prediction is effective in a moving image to be encoded when the correlation in the time direction is low, such as when the subject moves a large amount. In general moving images, however, the correlation in the time direction is in many cases higher than the correlation in the spatial direction, and intra-prediction cannot be said to be optimum in terms of encoding efficiency.
In the case of inter-prediction, decoded images in the enhancement layer of temporally preceding or following frames are used as reference planes. Inter-prediction uses correlation in the time direction, and thus makes high encoding efficiency feasible. However, it is necessary to decode in advance the high-resolution frame images in the enhancement layer that serve as reference planes. Furthermore, it is also necessary to save the high-resolution images in a memory in order to utilize them for reference. Moreover, it is necessary to read the high-resolution images, which have a large amount of data, from the memory. Accordingly, inter-prediction can be said to be a scheme that imposes a large load in terms of the amount of processing and implementation cost.
In this regard, in the case of the enhancement layer, in addition to the above two schemes, a prediction method based on spatial upsampling (upconversion) of the base layer (hereinafter referred to as upconversion prediction) can be used to encode the current frame.
An image in the base layer is a low-resolution version of an image in the enhancement layer, and can therefore be considered to include a signal corresponding to the low-frequency components of the image in the enhancement layer. That is to say, the image in the enhancement layer can be obtained by adding high-frequency components to the image in the base layer. Upconversion prediction is a method for performing prediction utilizing such correlation between layers, and is a prediction method useful to improve encoding efficiency particularly in a case where intra- or inter-prediction does not apply. Furthermore, this prediction method decodes the image in the enhancement layer of the current frame merely by decoding the image at the same time in the base layer, and can therefore be said to be a prediction scheme that is excellent (that imposes a small load) also in terms of the amount of processing.
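The relationship between the two layers can be sketched as follows. This is a minimal illustration under stated assumptions: nearest-neighbor repetition stands in for the upsampling filter (an actual codec would use an interpolation filter), the pixel values are made up, and the helper names `upsample_nearest`, `subtract`, and `add` are hypothetical.

```python
def upsample_nearest(image, factor):
    """Upsample a 2-D image by repeating each pixel factor x factor times."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def subtract(a, b):
    """Element-wise difference a - b of two equally sized images."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise sum a + b of two equally sized images."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical decoded base layer (2x2) and original enhancement layer (4x4).
base = [[100, 120],
        [140, 160]]
enhancement = [[98, 103, 118, 123],
               [101, 99, 121, 119],
               [138, 142, 161, 158],
               [141, 139, 159, 162]]

# Encoder side: predict from the base layer and keep only the residual,
# which plays the role of the high-frequency components.
predicted = upsample_nearest(base, 2)
residual = subtract(enhancement, predicted)

# Decoder side: only the base layer at the same time instant is needed.
reconstructed = add(upsample_nearest(base, 2), residual)
print(reconstructed == enhancement)  # True
```

In this sketch the residual is kept without quantization, so the reconstruction is exact; the point is that no enhancement-layer reference frame from another time instant has to be decoded, stored, or read.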
Meanwhile, processes for increasing resolution include a technique that performs motion compensation and FIR filtering of pixel values in order to convert the correlation in the time direction into spatial resolution (see, for example, NPL 1).
In the method described in NPL 1, the correlation in the time direction is utilized for the process for increasing the resolution of an input image sequence. Specifically, difference information on a motion-predicted/compensated image between the current image and the previous image is calculated, and is fed back to the target current image to recover the high-frequency component included in the input image.
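The feedback idea can be loosely illustrated on a one-dimensional signal. This sketch is not the algorithm of NPL 1: the integer shift used as the motion model, the `gain` parameter, the sample values, and the helper names `motion_compensate` and `feed_back` are all assumptions made for illustration, and the FIR filtering step is omitted.

```python
def motion_compensate(previous, shift):
    """Shift a 1-D signal by an integer displacement, repeating edge samples."""
    n = len(previous)
    return [previous[min(max(i - shift, 0), n - 1)] for i in range(n)]


def feed_back(current, previous, shift, gain):
    """Add a scaled motion-compensated difference back to the current signal."""
    compensated = motion_compensate(previous, shift)
    difference = [c - x for c, x in zip(compensated, current)]
    return [x + gain * d for x, d in zip(current, difference)]


# Hypothetical data: the previous signal holds a sharp detail at index 2;
# the current signal shows the same detail moved by one sample and blurred.
previous = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0]
current = [0.0, 0.0, 2.0, 4.0, 2.0, 0.0]

sharpened = feed_back(current, previous, shift=1, gain=0.5)
print(sharpened)  # [0.0, 0.0, 1.0, 6.0, 1.0, 0.0]
```

The peak is raised and its neighbors are lowered, which corresponds to restoring part of the high-frequency content using information carried by the previous signal.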