International standards for moving picture encoding such as MPEG-1, MPEG-2, H.261 and H.263 encode the output time of each frame. This temporal information is called the temporal reference (TR), and it is fixed-length encoded on a frame-by-frame basis. A reference time interval is set in advance for the system, and the time elapsed from the start of the sequence is given by the product of this interval and the TR. In the encoder, each frame is encoded by setting the temporal information of the input picture in the TR, and in the decoder, the decoded picture of each frame is output at the time designated by the TR.
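The relation between the TR and the output time can be illustrated with a minimal sketch. The function name `output_time` and the reference interval of 1/30 second are assumptions chosen for illustration; the text above does not fix a particular interval.

```python
from fractions import Fraction


def output_time(tr: int, interval: Fraction) -> Fraction:
    """Return the time from the start of the sequence, in seconds.

    The time is the product of the preset reference interval and the TR.
    """
    return tr * interval


# Assumed reference interval of 1/30 s (hypothetical, for illustration only).
INTERVAL = Fraction(1, 30)

# Output times for a few TR values.
times = [output_time(tr, INTERVAL) for tr in (0, 1, 2, 30)]
```

Exact fractions are used here so that the product does not accumulate rounding error over a long sequence.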
Meanwhile, inter-frame prediction encoding is generally employed in moving picture encoding in order to achieve high encoding efficiency by using the correlation in the temporal domain. Frame encoding modes include I-frame encoding, which does not use inter-frame correlation, P-frame encoding, which uses one previously encoded frame to predict a future frame, and B-frame encoding, which can perform frame prediction from two previously encoded frames.
In B-frame encoding, it is therefore necessary to store a decoded picture of two frames in a reference picture memory. In particular, the video encoding schemes H.263 and H.264 can predict frames by storing a decoded picture of two or more frames in the reference picture memory, and selecting the reference picture from the memory. The reference picture can be selected for each block, and reference picture designation information that designates the reference picture is encoded. The reference picture memory includes short-term reference memory (STRM) and long-term reference memory (LTRM). STRM stores the decoded picture of the current frame, while LTRM selects and stores the picture stored in STRM. For example, Non-patent Document 1 given below can be cited as a document that discloses a control method of LTRM and STRM.
In the B-frame encoding of MPEG-1 and MPEG-2, a method that predicts from past frames is called forward inter-frame prediction, and a method that predicts from future frames is called backward inter-frame prediction. The display time of the reference frame in backward inter-frame prediction is further in the future than the present frame. In this case, after the display of the current frame, the reference frame of backward inter-frame prediction is output. In the case of predicting from two frames in B-frame encoding (bidirectional inter-frame prediction), the picture information of two frames is interpolated to create the picture information of one frame, which serves as the prediction picture.
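As a rough illustration of bidirectional inter-frame prediction, the sketch below forms a prediction picture from two reference pictures. It assumes that the interpolation is a sample-by-sample average rounded half up and that motion compensation has already been applied to both references; the text above states only that the two pictures are interpolated, so both points are assumptions, as is the function name.

```python
def bidirectional_prediction(forward_ref, backward_ref):
    """Average two equally sized pictures sample by sample (assumed rule)."""
    if len(forward_ref) != len(backward_ref):
        raise ValueError("reference pictures must have the same height")
    return [
        # Integer average, rounding half up.
        [(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(forward_ref, backward_ref)
    ]


# Tiny 2x2 luma pictures with made-up sample values.
past = [[100, 102], [98, 96]]
future = [[104, 103], [98, 101]]
prediction = bidirectional_prediction(past, future)
```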
FIG. 1 shows an example of the predictive relation of a moving picture in the case of the display time of the reference frame in backward inter-frame prediction being in the future. When performing encoding with the encoding modes of the first through seventh frames in the order of IBBPBBP, the predictive relation shown in the upper side of FIG. 1 (IBBPBBP) exists. Therefore, when actually encoding, the frames are encoded in the order of 1423756 as shown in the lower side of FIG. 1. The TR values are then encoded in the same order as the frames, that is, in an order corresponding to 1423756.
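The reordering described for FIG. 1 can be sketched as follows. The function `encoding_order` is a hypothetical helper, and it covers only the case in which every B-frame waits for the next I-frame or P-frame in display order to be encoded first.

```python
def encoding_order(frame_types: str) -> list[int]:
    """Map display-order frame types to encoding order (1-based numbers).

    Each B-frame is held back until the next I- or P-frame, which it may
    reference by backward inter-frame prediction, has been encoded.
    """
    order = []
    pending_b = []
    for number, frame_type in enumerate(frame_types, start=1):
        if frame_type == "B":
            pending_b.append(number)
        else:
            order.append(number)     # encode the I- or P-frame first
            order.extend(pending_b)  # then the B-frames that waited for it
            pending_b.clear()
    order.extend(pending_b)  # trailing B-frames with no later reference
    return order


order = encoding_order("IBBPBBP")
```

For the pattern IBBPBBP this yields the order 1, 4, 2, 3, 7, 5, 6, matching the lower side of FIG. 1.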
The concept of backward inter-frame prediction in the B-frame encoding of H.264 expands on that of MPEG-1 and MPEG-2, in that the display time of the reference frame in backward inter-frame prediction may be further in the past than the present frame. In this case, the reference frame in backward inter-frame prediction is output first. As noted above, in H.264 a plurality of decoded pictures can be stored in the reference picture memory. Therefore, the reference picture designation information L0 for forward inter-frame prediction and the reference picture designation information L1 for backward inter-frame prediction are defined so as to independently designate the reference picture for forward inter-frame prediction and the reference picture for backward inter-frame prediction.
To designate the reference picture for each block, first the block prediction mode (forward inter-frame prediction, backward inter-frame prediction, or bidirectional inter-frame prediction) is encoded. When the prediction mode is forward inter-frame prediction, the reference picture designation information L0 is encoded. When the prediction mode is backward inter-frame prediction, the reference picture designation information L1 is encoded. When the prediction mode is bidirectional inter-frame prediction, both the reference picture designation information L0 and the reference picture designation information L1 are encoded.
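The per-block signalling described above can be sketched as a list of symbols to be encoded. The function name, the mode strings, and the tuple representation are all hypothetical; the sketch shows only which designation information accompanies each prediction mode, not any actual bitstream syntax.

```python
def block_reference_symbols(mode, l0_index=None, l1_index=None):
    """List the symbols encoded for one block, prediction mode first."""
    if mode not in ("forward", "backward", "bidirectional"):
        raise ValueError(f"unknown prediction mode: {mode}")
    symbols = [("mode", mode)]
    if mode in ("forward", "bidirectional"):
        if l0_index is None:
            raise ValueError("L0 designation required for this mode")
        symbols.append(("L0", l0_index))
    if mode in ("backward", "bidirectional"):
        if l1_index is None:
            raise ValueError("L1 designation required for this mode")
        symbols.append(("L1", l1_index))
    return symbols


forward_block = block_reference_symbols("forward", l0_index=0)
bidirectional_block = block_reference_symbols(
    "bidirectional", l0_index=1, l1_index=0
)
```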
With this definition, the display time of the reference frame in backward inter-frame prediction need not be later than that of the present frame. In the B-frame encoding of H.264, backward inter-frame prediction can thus designate a past frame as a reference picture. Moreover, since the designation can be changed on a block-by-block basis, a prediction image identical to that of P-frame encoding can be created, except in the case of bidirectional inter-frame prediction.
FIG. 2 shows an example of the predictive relation of a moving picture in the case of the display time of the reference frame in backward inter-frame prediction being in the past. Unlike the case of FIG. 1, even when encoding is performed with the encoding modes of the first frame through the seventh frame in the order of IBBPBBP, since there is the predictive relation (IBBPBBP) shown on the upper side of FIG. 2, the frames are encoded in the order of 1423567 as shown in the lower side of FIG. 2.
As a method of B-frame motion vector encoding, the temporal direct mode scheme has been proposed. This technique is adopted in the H.264 international standard. It stores the motion vector of the latest P-frame in encoding order and scales that motion vector information by a time interval to compute the motion vector of the B-frame.
Frames a, b, and c shown in FIG. 3 are encoded in the order of frame a, frame b, and frame c, with frame a and frame c being P-frame encoded and frame b being B-frame encoded. When the motion vector of the block at the same position in the P-frame is mv, the forward prediction motion vector fmv and the backward prediction motion vector bmv of the current block in B-frame encoding are computed by Equation 1.

fmv = (mv × TRab) / TRac
bmv = (mv × TRbc) / TRac  (1)
TRab, TRbc, and TRac, respectively, indicate the time interval between the frame a and the frame b, the time interval between the frame b and the frame c, and the time interval between the frame a and the frame c. As technology that applies this, Non-patent Document 2 below proposes a method of storing the latest P-frame motion vector in the encoding order to be used as the current P-frame motion vector. According to such schemes, when there is continuity of motion between a plurality of frames to be continuously encoded, the motion vector can be efficiently encoded.
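Equation 1 can be illustrated with a short sketch. The function name and the representation of a motion vector as an (x, y) tuple are assumptions, and the values are made up. The sign of bmv follows Equation 1 as written; an actual codec may apply its own sign and rounding conventions.

```python
def temporal_direct(mv, tr_ab, tr_bc, tr_ac):
    """Scale the P-frame motion vector mv according to Equation 1."""
    fmv = tuple(component * tr_ab / tr_ac for component in mv)
    bmv = tuple(component * tr_bc / tr_ac for component in mv)
    return fmv, bmv


# Made-up example: frame b lies one interval after frame a and two
# intervals before frame c, so the interval between a and c is three.
fmv, bmv = temporal_direct(mv=(6, -3), tr_ab=1, tr_bc=2, tr_ac=3)
```

With these values, fmv is (2.0, -1.0) and bmv is (4.0, -2.0), so no motion vector needs to be encoded for the B-frame block when the motion is uniform across the three frames.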
With a configuration that does not store B-frame decoded pictures in the reference picture memory, the next frame can be decoded even if the B-frame is not decoded. The frame rate can therefore be lowered by skipping the decoding of B-frames, which achieves a temporal scalability function.
Also, in H.264, as shown in FIG. 4, the macroblock can be divided into two or four parts, and when it is divided into four parts, a tree structure can be formed in which each region of 8×8 pixels is further divided into two or four parts. Each divided region can have a different motion vector. The reference picture can be selected for each of the two or four divisions of the macroblock. This macroblock partition pattern is encoded as encoding mode information.
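The tree-structured partitioning can be sketched by listing the sizes of the resulting regions. The 16×16 macroblock size is that of H.264; the pattern names (`none`, `two_rows`, `two_cols`, `quad`) and the function names are hypothetical labels introduced only for this illustration.

```python
def split(width, height, pattern):
    """Return the (width, height) of each region after one split."""
    if pattern == "none":
        return [(width, height)]
    if pattern == "two_rows":   # two regions stacked vertically
        return [(width, height // 2)] * 2
    if pattern == "two_cols":   # two regions side by side
        return [(width // 2, height)] * 2
    if pattern == "quad":       # four equal regions
        return [(width // 2, height // 2)] * 4
    raise ValueError(f"unknown pattern: {pattern}")


def macroblock_regions(mb_pattern, sub_patterns=("none",) * 4):
    """List region sizes for a 16x16 macroblock.

    Only when the macroblock is split into four 8x8 regions can each
    region be split again into two or four parts.
    """
    if mb_pattern != "quad":
        return split(16, 16, mb_pattern)
    regions = []
    for sub_pattern in sub_patterns:
        regions.extend(split(8, 8, sub_pattern))
    return regions


regions = macroblock_regions("quad", ("none", "two_rows", "two_cols", "quad"))
```

Each entry in the returned list corresponds to one region that can carry its own motion vector.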
Also, motion compensated temporal filtering (MCTF) is a scheme for realizing temporally scalable encoding. The MCTF encoding method performs filtering (sub-band partitioning) on the video data in the time domain and uses the correlation of the video data in the time domain to compact the energy of the video data.
FIG. 5 is a conceptual diagram of octave partitioning of the low band region in the time domain. A group of pictures (GOP) is set, and filtering is performed in the time domain within the GOP. When applying a filter in the time domain, motion compensation may be performed. The Haar basis is commonly proposed for the time-domain filter (refer to Non-patent Document 3).
The lifting scheme can generally be applied to the Haar basis, as shown in FIGS. 6A and 6B. With this scheme, filtering can be performed with a small amount of computation. In this lifting scheme, “predict” is a process identical to normal prediction encoding, that is, a process that determines the residual between the prediction picture and the original picture.
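A minimal sketch of one level of Haar lifting on a pair of frames is given below. It assumes an unnormalized Haar filter, omits motion compensation, and treats each frame as a flat list of samples; the function names and the choice of which frame is predicted from which are assumptions and are not taken from FIGS. 6A and 6B.

```python
def haar_lifting_forward(frame_a, frame_b):
    """Split a frame pair into low-band and high-band frames."""
    # Predict: residual of frame_b when frame_a is used as its prediction.
    high = [b - a for a, b in zip(frame_a, frame_b)]
    # Update: add half the residual to frame_a, giving the pair average.
    low = [a + h / 2 for a, h in zip(frame_a, high)]
    return low, high


def haar_lifting_inverse(low, high):
    """Undo the lifting steps in reverse order."""
    frame_a = [l - h / 2 for l, h in zip(low, high)]
    frame_b = [h + a for h, a in zip(high, frame_a)]
    return frame_a, frame_b


# Made-up sample values for two consecutive frames.
a = [10.0, 20.0]
b = [14.0, 18.0]
low, high = haar_lifting_forward(a, b)
restored_a, restored_b = haar_lifting_inverse(low, high)
```

Because each lifting step is undone by subtracting exactly what was added, the inverse recovers the original pair, and each sample costs only one subtraction and one halving-and-add in the forward direction.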
Non-patent Document 1: Thomas Wiegand, Xiaozheng Zhang, Bernd Girod, “Long-Term Memory Motion-Compensated Prediction,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, pp. 70-84, February 1999.
Non-patent Document 2: Alexis Michael Tourapis, “Direct Prediction for Predictive (P) and Bidirectionally Predictive (B) Frames in Video Coding,” JVT-C128, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG Meeting, May 2002.
Non-patent Document 3: Jens-Rainer Ohm, “Three-Dimensional Subband Coding with Motion Compensation,” IEEE Trans. Image Proc., Vol. 3, No. 5, pp. 559-571, 1994.