Field of the Invention
The present invention relates to a video encoding method, a video decoding method, a video encoding apparatus, a video decoding apparatus, a video processing system, a video encoding program, and a video decoding program.
Related Background Art
Conventionally, video signal encoding techniques have been used for the transmission, storage, and playback of video signals. Well-known techniques include, for example, the international standard video coding methods such as ITU-T Recommendation H.263 (hereinafter referred to as “H.263”) and ISO/IEC International Standard 14496-2 (MPEG-4 Visual, hereinafter referred to as “MPEG-4”).
Another, newer encoding system is a video coding method scheduled for joint international standardization by ITU-T and ISO/IEC: ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 (Joint Final Committee Draft of Joint Video Specification, hereinafter referred to as “H.26L”). Concerning the general coding techniques used in these video coding methods, reference should be made, for example, to Nonpatent Document 1 presented below.
[Nonpatent Document 1]
Basic Technologies on International Image Coding Standards
(co-authored by Fumitaka Ono and Hiroshi Watanabe and published Mar. 20, 1998 by CORONA PUBLISHING CO., LTD.)
A motion video signal consists of a series of images (frames) varying little by little with time. For this reason, it is common practice in these video coding methods to implement interframe prediction between a frame retrieved as a target for encoding (current frame) and another frame (reference frame) and thereby reduce temporal redundancy in the video signal.
In this case, where the interframe prediction is carried out between the current frame and a reference frame with a smaller difference from the current frame, the redundancy can be reduced more and the encoding efficiency increased. For this reason, the reference frame can be either a temporally previous frame or a temporally subsequent frame with respect to the current frame. Prediction with reference to the previous frame is referred to as forward prediction, while prediction with reference to the subsequent frame is referred to as backward prediction (cf. FIG. 1). Bidirectional prediction is defined as prediction in which one of the two methods is arbitrarily selected, or in which both methods are used simultaneously.
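The three prediction choices described above can be sketched as follows. This is a deliberately simplified illustration using one-dimensional "frames" of sample values, not an actual codec; the averaging used for the simultaneous (bidirectional) case is one common way of combining the two references.

```python
# Illustrative sketch of forward, backward, and bidirectional prediction.
# "Frames" are 1-D lists of luma sample values (a simplifying assumption).

def forward_prediction(prev_ref):
    # Predict the current frame from the temporally previous reference.
    return list(prev_ref)

def backward_prediction(next_ref):
    # Predict the current frame from the temporally subsequent reference.
    return list(next_ref)

def bidirectional_prediction(prev_ref, next_ref):
    # Use both references simultaneously by averaging them sample by sample.
    return [(a + b) // 2 for a, b in zip(prev_ref, next_ref)]

prev_ref = [100, 102, 104]   # previous reference frame
next_ref = [104, 106, 108]   # subsequent reference frame
print(bidirectional_prediction(prev_ref, next_ref))  # → [102, 104, 106]
```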
In general, with use of such bidirectional prediction, a temporally previous frame serving as the reference frame for forward prediction and a temporally subsequent frame serving as the reference frame for backward prediction are each stored in a frame buffer prior to the current frame.
For example, in decoding of MPEG-4, where the current frame is decoded by bidirectional interframe prediction, a temporally previous frame and a temporally subsequent frame with respect to the current frame are first decoded, prior to decoding of the current frame, either as frames decoded by intraframe prediction without use of interframe prediction or as frames decoded by forward interframe prediction, and they are stored into the frame buffer as reference frames. Thereafter, the current frame is decoded by bidirectional prediction using these two stored frames (cf. FIG. 2(a)).
In this case, therefore, the order of decoding times of the temporally subsequent reference frame and the current frame is the reverse of the order of output times of their decoded images. Each of these frames carries information indicating its output time, so the temporal order of the frames can be determined from this information. For this reason, the decoded images are output in the correct order (cf. FIG. 2(b)). In MPEG-4, the output times are described as absolute values.
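The reordering from decoding order to output order can be sketched as below. This is a minimal, hypothetical model (the class and its names are illustrative, not part of any standard): decoded frames enter in decoding order together with their attached output times, and are emitted in ascending output-time order.

```python
# Sketch of output reordering: decoding order may differ from display order,
# so decoded frames are held until they can be output by output time.
import heapq

class OutputReorderBuffer:
    """Holds decoded frames and emits them in ascending output-time order."""
    def __init__(self):
        self._heap = []  # min-heap of (output_time, frame) pairs

    def add(self, output_time, frame):
        heapq.heappush(self._heap, (output_time, frame))

    def pop_in_order(self):
        # Yield all buffered frames in output-time order.
        while self._heap:
            yield heapq.heappop(self._heap)

# Decoding order for bidirectional prediction: I0, then P2, then B1,
# even though the display order is I0, B1, P2.
buf = OutputReorderBuffer()
buf.add(0, "I0")
buf.add(2, "P2")   # subsequent reference frame, decoded before B1
buf.add(1, "B1")   # current frame, bidirectionally predicted
print([f for _, f in buf.pop_in_order()])  # → ['I0', 'B1', 'P2']
```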
Some of the recent video coding methods permit the foregoing interframe prediction to be carried out using multiple reference frames, instead of one reference frame in the forward direction and one reference frame in the backward direction, so as to enable prediction from a frame with a smaller change from the current frame (cf. FIG. 3).
For example, in decoding of H.26L, a plurality of reference frames, up to a predetermined maximum number, are retained in the frame buffer, and an optimal reference frame is arbitrarily designated out of them on the occasion of implementing interframe prediction. In this case, where the current frame is decoded as a bidirectionally predicted frame, reference frames are first decoded prior to decoding of the current frame; a plurality of temporally previous frames and a plurality of temporally subsequent frames with respect to the current frame are decoded and retained as reference frames in the frame buffer. The current frame can then be predicted from any frame arbitrarily designated for prediction out of those frames (cf. FIG. 4(a)).
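A buffer of this kind can be sketched as follows. This is an illustrative model only, not the normative H.26L buffer algorithm: the maximum count here is an assumed constant (in a real stream it is signaled), and the oldest-frame eviction policy is a simplification.

```python
# Sketch of a multi-reference frame buffer: up to MAX_REF_FRAMES frames are
# retained, and any retained frame may be designated for prediction.
from collections import deque

MAX_REF_FRAMES = 5  # assumed limit; the real maximum is signaled in the stream

class ReferenceBuffer:
    def __init__(self, max_frames=MAX_REF_FRAMES):
        # deque(maxlen=...) drops the oldest frame automatically when full
        # (a simplifying eviction policy for this sketch).
        self._frames = deque(maxlen=max_frames)

    def store(self, frame_id):
        self._frames.append(frame_id)

    def designate(self, frame_id):
        # Arbitrarily designate one retained frame as the prediction reference.
        if frame_id not in self._frames:
            raise KeyError(f"{frame_id} is not retained as a reference frame")
        return frame_id

buf = ReferenceBuffer()
for fid in ["F0", "F2", "F3", "F4"]:  # previous and subsequent references
    buf.store(fid)
print(buf.designate("F2"))  # → F2
```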
In this case, therefore, the order of decoding times of the temporally subsequent reference frames and the current frame becomes the reverse of the order of their output times. Each of these frames carries information indicating its output time or its output order, so the temporal order of the frames can be determined from this information. For this reason, the decoded images are output in the correct order (cf. FIG. 4(b)). The output times are often described as absolute values, while the output order is used where frame intervals are constant.
In the case where multiple reference frames are also used in backward prediction, as described above, the frames retained in the frame buffer are not always used in backward prediction for frames after the current frame. An example of this case will be described with reference to the predictive structure shown in FIG. 5. Let us assume that the current frame F1 is backward predicted from a temporally subsequent reference frame F2, F2 from F3, and F3 from F4, and that F4 is forward predicted from a temporally previous reference frame F0. Such a predictive structure is efficient, for example, where the change is large between the temporally previous reference frame F0 and the current frame F1, the changes are small between F1 and the temporally subsequent reference frames F2, F3, and F4, and the change is relatively small between F0 and F3.
In this case, the current frame F1 is predicted only from the temporally subsequent reference frame F2; thus F3 and F4 are frames that are not used for interframe prediction at the time of decoding the current frame F1. However, since F3 and F4 are temporally subsequent to the current frame F1, they need to be retained until they are output as decoded images at their respective output times.
When temporally subsequent frames are retained in the frame buffer for backward prediction in this way, such frames fall into two types: those used as reference frames and those not used as reference frames in interframe prediction after the current frame. In the description hereinafter, the frames not used as reference frames but retained in the frame buffer until their output times arrive will be referred to as “output queuing frames.”
To explain the difference between these frames, schematic illustrations of a configuration of a video decoding device are presented in FIG. 6(a) and FIG. 6(b). As shown in FIG. 6(a), the decoding device 1 is provided with a frame buffer 3 for retaining reference frames, and the frame buffer 3 outputs a reference frame to the decoding processor 2 in execution of interframe prediction. Where a plurality of reference frames are used in backward prediction as described above, the frame buffer retains both reference frames and output queuing frames. From a logical standpoint, as shown in FIG. 6(b), there exist an area for storing frames that are retained as reference frames for a fixed time and output to the decoding processor 2, and an area for storing frames that are not output to the decoding processor 2 but are retained until they are output as decoded images at their respective output times.
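The logical split described for FIG. 6(b) can be sketched as below. The class and field names are illustrative assumptions, not part of any standard: frames marked as references may be supplied to the decoding processor, while output queuing frames are retained only until their output times.

```python
# Sketch of the logical frame-buffer split: reference frames (usable by the
# decoding processor for interframe prediction) versus output queuing frames
# (retained only until they are output as decoded images).
class FrameBuffer:
    def __init__(self):
        self.reference_frames = {}  # frame_id -> frame data, used for prediction
        self.output_queue = {}      # frame_id -> frame data, awaiting output only

    def store(self, frame_id, frame, used_for_reference):
        if used_for_reference:
            self.reference_frames[frame_id] = frame
        else:
            self.output_queue[frame_id] = frame

    def get_reference(self, frame_id):
        # Only reference frames may be supplied to the decoding processor.
        return self.reference_frames[frame_id]

buf = FrameBuffer()
buf.store("F2", "pixels-F2", used_for_reference=True)   # used to predict F1
buf.store("F3", "pixels-F3", used_for_reference=False)  # output queuing only
buf.store("F4", "pixels-F4", used_for_reference=False)
print(sorted(buf.output_queue))  # → ['F3', 'F4']
```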
Incidentally, in the case where multiple reference frames are used, if, for example, a certain frame in a moving picture has a peculiar feature and differs greatly from the other frames, no effective prediction can be expected even if that frame is kept retained as a reference frame. In such cases, the interframe prediction can be performed more efficiently by no longer retaining such a frame as a reference frame and allowing the frame buffer to retain another frame in its place. Conversely, where a frame has a typical feature in a moving picture and differs little from the other frames, the interframe prediction can be expected to be carried out efficiently for many frames if such a frame is retained as a reference frame in the frame buffer for a long period, regardless of its temporal distance from the current frame.
To realize such selective management of reference frames, it is conceivable to signal reference frame selection information in the encoded data. For example, in H.26L, Memory Management Control Operation (MMCO) commands are defined. The MMCO commands include, among others, a Reset command capable of providing an instruction to eliminate use of all the reference frames retained in the frame buffer, and they make it possible to arbitrarily instruct which frames are to be retained as reference frames in the frame buffer as occasion demands.
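The effect of such commands can be sketched as below. This is a simplified illustration in the spirit of the MMCO commands, not the normative H.26L syntax or semantics; the command names and the set-based buffer model are assumptions of this sketch.

```python
# Sketch of memory management control operations on a set of retained
# reference frames (simplified, non-normative model).
def apply_mmco(reference_frames, command, frame_id=None):
    """Apply a simplified memory management control operation in place."""
    if command == "reset":
        # Eliminate use of all reference frames retained in the buffer.
        reference_frames.clear()
    elif command == "mark_unused":
        # Stop retaining one particular frame as a reference.
        reference_frames.discard(frame_id)
    return reference_frames

refs = {"F0", "F2", "F3"}
apply_mmco(refs, "mark_unused", "F0")  # drop a frame that predicts poorly
print(sorted(refs))  # → ['F2', 'F3']
apply_mmco(refs, "reset")
print(refs)  # → set()
```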
To start decoding from the middle of encoded data in order to enable random access into a moving picture, two conditions must be met: the start frame to be decoded must be a frame encoded by intraframe prediction without use of interframe prediction from another frame, and frames after the start frame must not use any frame preceding the decoding-start frame as a reference frame. That is, an instruction to eliminate use of all the reference frames retained in the frame buffer needs to be given prior to decoding of the decoding-start frame.
For example, in H.26L, an Instantaneous Decoder Refresh (IDR) picture is defined in order to clearly specify such a state. When an IDR picture is decoded, use of all previous reference frames is eliminated, and interframe predictions for subsequent frames do not refer to frames preceding the IDR picture. This permits decoding to be carried out from the middle of encoded data, as in random access, without the problem of presence or absence of reference frames, when decoding starts from an IDR picture.
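The random-access behavior at an IDR picture can be sketched as follows. This is a hypothetical, heavily simplified decoder loop (frame records, field names, and picture types here are illustrative, not H.26L syntax): decoding may begin at an IDR picture because the reference frame buffer is emptied there, so no frame before it is ever needed.

```python
# Sketch of random access at an IDR picture: the reference buffer is
# cleared at the IDR picture, so decoding can start there.
def decode_stream(frames, start_index):
    """Decode from start_index onward; the start frame must be an IDR picture."""
    if frames[start_index]["type"] != "IDR":
        raise ValueError("random access must start at an IDR picture")
    reference_frames = []  # an IDR picture implies the buffer starts empty
    decoded = []
    for frame in frames[start_index:]:
        if frame["type"] == "IDR":
            reference_frames.clear()  # eliminate use of all prior references
        decoded.append(frame["id"])
        reference_frames.append(frame["id"])  # retain for later prediction
    return decoded

stream = [
    {"id": "F0", "type": "I"},
    {"id": "F1", "type": "P"},
    {"id": "F2", "type": "IDR"},
    {"id": "F3", "type": "P"},  # may only reference frames from F2 onward
]
print(decode_stream(stream, 2))  # → ['F2', 'F3']
```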