In international standard video coding such as MPEG-1, MPEG-2, H.261, and H.263, the output time of each frame is encoded. This time information is called the TR (Temporal Reference), which is encoded at a fixed length for each frame. A time interval that serves as a reference is set in advance in the system, and the time from the top of the sequence is indicated by the product of that time interval and the TR. At the encoder, each frame is encoded by setting the time information of the input image as the TR, and at the decoder, the decoded image of each frame is output at the time specified by the TR.
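As a brief sketch of this relation, a decoder might map a TR value to an output time as follows (the 90 kHz clock and the tick count per frame are illustrative assumptions, not values taken from the standards):

```python
# Illustrative sketch (not from the standards): mapping a fixed-length
# TR (Temporal Reference) value to an output time, assuming a reference
# time interval agreed on in advance by the system.

def output_time_ticks(tr: int, ticks_per_frame: int) -> int:
    """Time from the top of the sequence = reference interval * TR."""
    return ticks_per_frame * tr

# Assumed example: a 90 kHz system clock and 30 frames per second give
# a reference interval of 3000 ticks per frame.
print(output_time_ticks(90, 3000))  # frame with TR=90 -> 270000 ticks (3 s)
```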
On the other hand, in video coding in general, inter-frame predictive coding is used in order to realize a high coding efficiency by exploiting the correlation in the time direction. The frame encoding modes include the I frame, which is encoded without using any correlation between frames; the P frame, which is predicted from an I frame or P frame encoded in the past; and the B frame, which can be predicted from two frames encoded in the past.
For the B frame, the decoded images of two frames must be stored in a reference image memory. In particular, in the video coding schemes H.263 and H.264, the decoded images of two or more frames are stored in advance in the reference image memory, and the prediction can be made by selecting a reference image from that memory.
The reference image can be selected for each block, and reference image specifying data for specifying the reference image is encoded. The reference image memory comprises a short-term memory (STRM) and a long-term memory (LTRM): the decoded image of each current frame is sequentially stored into the STRM, while images stored in the STRM are selected and stored into the LTRM. Note that a control method for the STRM and the LTRM is described in non-patent reference 1, for example.
Non-patent reference 1: Thomas Wiegand, Xiaozheng Zhang, and Bernd Girod, “Long-Term Memory Motion-Compensated Prediction”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 1, pp. 70-84, Feb. 1999.
In the B frame of MPEG-1 and MPEG-2, prediction from a frame further in the past is referred to as forward inter-frame prediction, and prediction from a frame further in the future is referred to as backward inter-frame prediction. The display time of the reference frame in backward inter-frame prediction is further in the future than that of the current frame; in this case, the reference frame of the backward inter-frame prediction is output after the current frame is displayed. In the case of predicting the B frame from two frames (bidirectional inter-frame prediction), one frame of image data is produced by interpolating the image data of the two frames, and this is set as the predicted image.
FIG. 16(A) shows an example of the prediction relationship of video images in the case where the display time of the reference frame in backward inter-frame prediction is in the future. (1)-(7) in FIG. 16 indicate frame numbers. When the first to seventh frames are encoded with the encoding modes in the order IBBPBBP, the prediction relationship of FIG. 16(A) holds, so that the frames are actually encoded in the order 1, 4, 2, 3, 7, 5, 6 as shown in FIG. 16(B). The TRs encoded in this case take values corresponding to 1, 4, 2, 3, 7, 5, 6, in the same order as the encoded frames.
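The reordering from display order to encoding order can be sketched as follows, under the simplifying assumption that each B frame is encoded immediately after the next I or P frame (its backward reference):

```python
# Illustrative sketch: reordering display-order frames into encoding
# order for an MPEG-style I/B/P structure. Assumption: each B frame is
# encoded after the following I or P frame that it references.

def encoding_order(modes: str) -> list:
    """modes: frame coding modes in display order, e.g. 'IBBPBBP'."""
    order, pending_b = [], []
    for num, mode in enumerate(modes, start=1):
        if mode == 'B':
            pending_b.append(num)   # held back until the next anchor
        else:                       # I or P frame: encode it first,
            order.append(num)       # then the deferred B frames
            order += pending_b
            pending_b = []
    return order + pending_b

print(encoding_order('IBBPBBP'))  # [1, 4, 2, 3, 7, 5, 6], as in FIG. 16(B)
```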
In the B frame of H.264, the concept of backward inter-frame prediction is expanded beyond that of MPEG-1 and MPEG-2, and the display time of the reference frame in backward inter-frame prediction may be further in the past than that of the current frame. In this case, the reference frame of the backward inter-frame prediction is output earlier than the current frame.
As noted above, in H.264, a plurality of decoded images can be stored in the reference image memory. For this reason, reference image specifying data L0 for forward inter-frame prediction and reference image specifying data L1 for backward inter-frame prediction are defined, and the reference image for forward inter-frame prediction and the reference image for backward inter-frame prediction are each specified independently.
In order to specify the reference image for each block, the prediction mode of the block (forward inter-frame prediction, backward inter-frame prediction, or bidirectional inter-frame prediction) is encoded first; then the reference image specifying data L0 is encoded in the case of forward inter-frame prediction, the reference image specifying data L1 is encoded in the case of backward inter-frame prediction, and both the reference image specifying data L0 and L1 are encoded in the case of bidirectional inter-frame prediction.
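This per-block rule can be summarized in a short sketch (the function and mode names are illustrative, not the actual H.264 syntax elements):

```python
# Illustrative sketch: which reference image specifying data are encoded
# for a block, given its prediction mode (names are not H.264 syntax).

def refs_to_encode(mode: str) -> list:
    if mode == 'forward':
        return ['L0']            # forward prediction: L0 only
    if mode == 'backward':
        return ['L1']            # backward prediction: L1 only
    if mode == 'bidirectional':
        return ['L0', 'L1']      # both specifying data are encoded
    raise ValueError(f'unknown prediction mode: {mode}')
```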
By defining it in this way, the display time of the reference frame in backward inter-frame prediction need not be further in the future than that of the current frame. In the B frame of H.264, a past frame can thus be specified as the reference image even in backward inter-frame prediction, and moreover the specification can be changed in block units, so that a predicted image similar to that of the P frame can be produced except in the case of bidirectional inter-frame prediction.
FIG. 17(A) shows an example of the prediction relationship of video images in the case where the display time of the reference frame in backward inter-frame prediction is in the past. Unlike the case of FIG. 16, even when the first to seventh frames are encoded with the encoding modes in the order IBBPBBP, the prediction relationship of FIG. 17(A) holds, so that the frames are encoded in the order 1, 4, 2, 3, 5, 6, 7 as shown in FIG. 17(B).
In the method of inter-frame coding that selects the reference image by storing a plurality of decoded images in the reference image memory in advance, there is no need to store the decoded images of all frames. By utilizing this, it is possible to realize a time scalable function.
For example, in the case where there is a prediction relationship such as that of FIG. 16(A) in MPEG-1 and MPEG-2, the B frames (frame numbers (2), (3), (5), (6)) will not be used as reference images by subsequent frames. For this reason, the decoding side can decode only the I frames and P frames and skip the B frames. Assuming the video is originally encoded at 30 frames per second, it is possible to output video at 10 frames per second by not decoding or outputting the B frames.
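A minimal sketch of this frame-dropping step, assuming the IBBPBBP pattern above (so that one frame in three is an I or P frame, and 30 fps becomes 10 fps):

```python
# Illustrative sketch: temporal scalability by skipping B frames that
# are never used as reference images for subsequent frames.

def decodable_frames(modes: str) -> list:
    """Keep only frames whose mode is I or P; B frames are skipped."""
    return [n for n, m in enumerate(modes, start=1) if m in ('I', 'P')]

kept = decodable_frames('IBBPBBP')
print(kept)   # [1, 4, 7]: one frame in three kept -> 30 fps becomes 10 fps
```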
Such a technique can also be applied to multiple layers. FIG. 1 shows an example of the prediction relationship in a three-layer configuration. In FIG. 1, (1)-(9) indicate frame numbers, and the numerals 1-9 described inside the frames indicate the encoding order of each frame.
For example, as shown in FIG. 1(C), suppose that the fifth frame (first layer) uses the first frame as the reference frame, the third frame (second layer) uses the first frame or the fifth frame as the reference frame, the second frame (third layer) uses the first frame or the third frame as the reference frame, and the fourth frame (third layer) uses the third frame and the fifth frame as the reference frames. Then, in the case where the video is 30 frames per second, it is possible to output video at 15 frames per second by not decoding the second frame and the fourth frame (third layer).
Also, by not decoding the second frame, the third frame, and the fourth frame (second and third layers), it is possible to output video at 7.5 frames per second. Note that, besides FIG. 1(C), the frame encoding order can be set in a plurality of patterns: it may be made the same as the input order as in FIG. 1(A), or the second layer may be encoded immediately after the first layer and the third layer encoded thereafter, as in FIG. 1(B), for example.
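The layered example of frames (1)-(5) above can be sketched as follows (the frame-to-layer mapping is taken from the description of FIG. 1(C); the function name is illustrative):

```python
# Illustrative sketch of the three-layer example: dropping the third
# layer halves the frame rate; dropping the second and third quarters it.

LAYER = {1: 1, 2: 3, 3: 2, 4: 3, 5: 1}   # frame number -> layer, per FIG. 1(C)

def frames_for(max_layer: int) -> list:
    """Frames that remain when only layers up to max_layer are decoded."""
    return sorted(n for n, layer in LAYER.items() if layer <= max_layer)

print(frames_for(3))   # [1, 2, 3, 4, 5] -> 30 fps
print(frames_for(2))   # [1, 3, 5]       -> 15 fps
print(frames_for(1))   # [1, 5]          -> 7.5 fps
```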
In the case where there are frames which will not be set as reference frames in this way, the mechanism for changing the time resolution may be executed by the decoding side, or may be executed at a relay point between the encoding side and the decoding side. In the case of delivering the encoded data unidirectionally, as in broadcasting, it is preferable to execute it on the decoding side.
Also, such a time scalable function can be applied to the coding of multiple viewpoint video by regarding the layers of FIG. 1 as viewpoints.
Also, even a plurality of frames in general that have no temporal relationship among them can be handled as a video image, by arranging the plurality of frames along a dimension set up in advance and regarding that dimension as time. It is also possible to apply the time scalable function by classifying such a plurality of frames into a smaller number of sets and regarding those sets as the layers of FIG. 1.
Also, as a method for realizing time scalable coding, there is MCTF coding. The MCTF coding method applies filtering (sub-band division) in the time direction to the video data, and compacts the energy of the video data by utilizing the correlation in the time direction. FIG. 18 shows a conceptual diagram of octave division of the lower band in the time direction. A GOP is set up and the filtering is applied in the time direction within the GOP. For the filter in the time direction, the Haar basis is generally proposed (see non-patent reference 2).
Non-patent reference 2: Jens-Rainer Ohm, “Three-Dimensional Subband Coding with Motion Compensation”, IEEE Trans. Image Proc., vol. 3, no. 5, pp. 559-571, 1994.
Also, in general, the Lifting Scheme shown in FIG. 19 can be applied to the Haar basis. By this scheme, the filtering can be performed with a smaller amount of computation. In this Lifting Scheme, "predict" is processing similar to ordinary predictive coding, namely the processing for obtaining the residual difference between the predicted image and the original image.
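A minimal per-pixel sketch of the Haar lifting steps follows; motion compensation and normalization details are omitted, and the "update" step is an assumption based on the standard lifting form (the source describes only "predict"):

```python
# Illustrative sketch of Haar lifting in the time direction, applied to
# one sample of an (even frame, odd frame) pair. Motion compensation is
# omitted; the update step is the standard lifting form, assumed here.

def haar_lift(even: float, odd: float):
    high = odd - even        # predict: residual against the even frame
    low = even + high / 2    # update: low band, equal to (even + odd) / 2
    return low, high

def haar_unlift(low: float, high: float):
    even = low - high / 2    # invert the update step
    odd = high + even        # invert the predict step
    return even, odd

# Perfect reconstruction: haar_unlift(*haar_lift(a, b)) returns (a, b).
print(haar_lift(4.0, 10.0))     # (7.0, 6.0)
print(haar_unlift(7.0, 6.0))    # (4.0, 10.0)
```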
Note that methods for obtaining a high-resolution image from a plurality of images are described in non-patent reference 3 and non-patent reference 4.
Non-patent reference 3: Sung Cheol Park, Min Kyu Park, and Moon Gi Kang, “Super-Resolution Image Reconstruction: A Technical Overview”, IEEE Signal Processing Magazine, pp. 21-36, May 2003.
Non-patent reference 4: C. Andrew Segall, Rafael Molina, and Aggelos K. Katsaggelos, “High-Resolution Images from Low-Resolution Compressed Video”, IEEE Signal Processing Magazine, pp. 37-48, May 2003.
When a reference image memory for a plurality of frames is provided, the coding efficiency improves as the maximum number of frames to be stored is made larger. Here, in order to realize the time scalable function, the reference image specifying data in the encoded data must specify the identical decoded image even when the number of layers to be decoded becomes smaller.
However, in conventional H.264, even though the STRM and the LTRM are provided, the LTRM is a memory for storing images already stored in the STRM, and all decoded images are stored into the STRM, so that the reference image specifying data is encoded with respect to the decoded images regardless of the layers of the time scalable function.
Consequently, in the case where the decoding side does not decode a particular frame of the encoded data, the same reference image specifying data will refer to a different frame. When the predicted image is produced from a different reference image in this way, the correct decoded image cannot be obtained at the decoding side.
In the case where, as in the B frame of MPEG-1 and MPEG-2, the decoded images are not stored in the reference image memory and the reference images are limited to the preceding or following I frame or P frame, rather than selecting the reference image from a plurality of frames by using the reference image specifying data, the reference images never differ even when a B frame is not decoded. Time scalable coding can thereby be realized. However, if the decoded image of the B frame is not stored in the reference image memory, the B frame has its reference images limited to the preceding or following I frame or P frame and no reference image memory for a plurality of frames is provided, so that the coding efficiency cannot be improved.
As described above, the conventional method for realizing time scalable coding cannot be equipped with a reference image memory for a plurality of frames so as to improve the coding efficiency, and conversely, the conventional method of storing a plurality of frames in the reference image memory cannot realize time scalable coding.