Multi-viewpoint video images are a plurality of video images obtained by photographing the same object and background thereof using a plurality of cameras. Below, a video image obtained by a single camera is called a “two-dimensional video image”, and a set of multiple two-dimensional video images obtained by photographing the same object and background thereof is called a “multi-viewpoint video image”.
There is a strong temporal correlation in the two-dimensional video image of each camera, which is included in a multi-viewpoint video image. In addition, when the cameras are synchronized with each other, the images (taken by the cameras) corresponding to the same time capture the object and background thereof in entirely the same state from different positions, so that there is a strong correlation between the cameras. The encoding efficiency of video encoding can be improved using this correlation.
First, conventional techniques relating to the encoding of two-dimensional video images will be shown.
Many known methods of encoding two-dimensional video images, such as the international video encoding standards MPEG-2 and H.264, obtain high encoding efficiency by means of interframe prediction encoding, which uses temporal correlation.
The interframe prediction encoding executed for encoding two-dimensional video images uses a temporal variation in a video image, that is, a motion. The method used in the interframe prediction encoding is therefore generally called “motion compensation”, and below, interframe prediction encoding along the temporal axis is called “motion compensation”. In addition, a “frame” is an image which is a constituent of a video image and is obtained at a specific time.
Generally, two-dimensional video encoding has the following encoding modes for each frame: “I frame” encoded without using an interframe correlation, “P frame” encoded while performing motion compensation based on one already-encoded frame, and “B frame” encoded while performing motion compensation based on two already-encoded frames.
In order to further improve the efficiency of video image prediction, in H.263 and H.264, decoded images of a plurality of frames (i.e., two frames or more) are stored in a reference image memory, and a reference image is selected from the images of the memory to perform prediction.
The reference image can be selected for each block, and reference image specification information for specifying the reference image can be encoded to perform the corresponding decoding.
In motion compensation, in addition to the reference image specification information, a vector indicating the position in the reference image that is used for encoding the target block is also encoded. This vector is called a “motion vector”.
In encoding of the motion vector in MPEG-4 or H.264, a predicted vector is generated using a motion vector of a block adjacent to an encoding target block, and only a differential vector between the predicted vector and the motion vector used in motion compensation applied to the target block is encoded. In accordance with this method, when motion continuity is present between the relevant adjacent blocks, the motion vector can be encoded with a high level of encoding efficiency.
Non-Patent Document 1 discloses in detail a process of generating a predicted vector in H.264, and the general explanation thereof is presented below.
In H.264, as shown in FIG. 20A, based on motion vectors (mv_a, mv_b, and mv_c) used in a left side block (see “a” in FIG. 20A), an upper side block (see “b” in FIG. 20A), and an upper-right side block (see “c” in FIG. 20A) of an encoding target block, horizontal and vertical components are obtained by computing the median for each direction.
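The component-wise median described above can be illustrated with a minimal sketch. The function names (median3, predict_vector, differential_vector) and the sample vectors are hypothetical and chosen only for illustration; vectors are assumed to be integer (horizontal, vertical) pairs.

```python
from typing import Tuple

Vector = Tuple[int, int]  # (horizontal, vertical) components; an assumed representation


def median3(x: int, y: int, z: int) -> int:
    """Return the middle value of three numbers."""
    return sorted((x, y, z))[1]


def predict_vector(mv_a: Vector, mv_b: Vector, mv_c: Vector) -> Vector:
    """Predicted vector: the median is taken separately for each direction."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def differential_vector(mv: Vector, pmv: Vector) -> Vector:
    """Only this difference between the actual and predicted vectors is encoded."""
    return (mv[0] - pmv[0], mv[1] - pmv[1])


# Hypothetical vectors of the left (a), upper (b), and upper-right (c) blocks.
pmv = predict_vector((4, -2), (6, 0), (5, 7))
mvd = differential_vector((5, 1), pmv)
print(pmv, mvd)  # (5, 0) (0, 1)
```

Note that the predicted vector need not equal any single neighboring vector, because the horizontal and vertical medians may come from different blocks.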
As H.264 employs variable block size motion compensation, the block size for motion compensation may not be the same between the encoding target block and peripheral blocks thereof. In such a case, as shown in FIG. 20B, block “a” is set to the uppermost block among left side blocks adjacent to the target block, block “b” is set to the leftmost block among upper side blocks adjacent to the target block, and block “c” is set to the closest upper-right block.
As an exception, if the size of the target block is 8×16 pixels, as shown in FIG. 20C, instead of the median, block “a” and block “c” are respectively used for the left and right sides to perform prediction. Similarly, if the size of the target block is 16×8 pixels, as shown in FIG. 20D, instead of the median, block “a” and block “b” are respectively used for the lower and upper sides to perform prediction.
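The two exceptions above can be sketched as follows. This is only an illustration under simplifying assumptions: the function names and the boolean partition flags are hypothetical, and any additional conditions that an actual encoder applies (such as the availability of the neighboring blocks) are omitted.

```python
from typing import Tuple

Vector = Tuple[int, int]  # (horizontal, vertical) components; an assumed representation


def predict_8x16(is_left_partition: bool, mv_a: Vector, mv_c: Vector) -> Vector:
    """8x16 target block: block "a" predicts the left side, block "c" the right side."""
    return mv_a if is_left_partition else mv_c


def predict_16x8(is_upper_partition: bool, mv_a: Vector, mv_b: Vector) -> Vector:
    """16x8 target block: block "b" predicts the upper side, block "a" the lower side."""
    return mv_b if is_upper_partition else mv_a


# Hypothetical neighboring vectors.
mv_a, mv_b, mv_c = (4, -2), (6, 0), (5, 7)
print(predict_8x16(True, mv_a, mv_c), predict_8x16(False, mv_a, mv_c))
print(predict_16x8(True, mv_a, mv_b), predict_16x8(False, mv_a, mv_b))
```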
As described above, in H.264, a reference frame is selected for each block from among a plurality of already-encoded frames, and is used for motion compensation.
Generally, the motion of the imaged object is not uniform and depends on the reference frame. Therefore, in comparison with a motion vector in motion compensation using a reference frame different from that of the encoding target block, a motion vector in motion compensation using the same reference frame as the target block should be closer to a motion vector used for the target block.
Therefore, in H.264, if there is only one block (among the blocks a, b, and c) whose reference frame is the same as that of the encoding target block, then instead of the median, the motion vector of the relevant block is used as a predicted vector so as to generate a predicted vector having a relatively higher level of reliability.
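The rule above can be sketched as a selection step placed in front of the median. The names used here (Neighbor, predict_with_reference) and the representation of a reference frame as an integer index are assumptions made for illustration; handling of unavailable neighboring blocks is omitted.

```python
from typing import List, Tuple

Vector = Tuple[int, int]
Neighbor = Tuple[Vector, int]  # (motion vector, reference frame index); assumed layout


def median3(x: int, y: int, z: int) -> int:
    """Return the middle value of three numbers."""
    return sorted((x, y, z))[1]


def predict_with_reference(neighbors: List[Neighbor], target_ref: int) -> Vector:
    """Use the single neighbor sharing the target's reference frame, else the median."""
    if len(neighbors) != 3:
        raise ValueError("expected blocks a, b, and c")
    same_ref = [mv for mv, ref in neighbors if ref == target_ref]
    if len(same_ref) == 1:
        # Exactly one block uses the same reference frame: take its vector directly.
        return same_ref[0]
    xs = [mv[0] for mv, _ in neighbors]
    ys = [mv[1] for mv, _ in neighbors]
    return (median3(*xs), median3(*ys))


# Hypothetical blocks a, b, c: only block a refers to frame 0.
blocks = [((4, -2), 0), ((6, 0), 1), ((5, 7), 1)]
print(predict_with_reference(blocks, target_ref=0))  # (4, -2)
print(predict_with_reference(blocks, target_ref=1))  # (5, 0)
```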
When there is motion continuity through a plurality of frames, for example, when an object moves linearly at a constant speed, a method for encoding a motion vector with a high level of encoding efficiency is known, in which the motion vector of the frame immediately preceding in the encoding order is stored, and that motion vector is scaled in accordance with the relevant time interval so as to compute the motion vector.
The output time of each frame is used as the information for determining the time interval.
Generally, such time information is encoded for each frame because the time information is necessary when, for example, the input order and the encoding order of the taken images differ from each other, and the images are decoded in the order of the imaging time. That is, on the encoder side, each frame is encoded while setting time information assigned to each input image in accordance with the input order, and on the decoder side, the decoded image of each frame is output in the order designated by the set time information.
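As a small illustration of the role of the time information, the sketch below reorders decoded frames from the encoding order into the output order. The frame labels, the integer time values, and the function name output_order are hypothetical.

```python
from typing import List, Tuple

DecodedFrame = Tuple[str, int]  # (frame label, time information); assumed layout


def output_order(decoded_in_encoding_order: List[DecodedFrame]) -> List[DecodedFrame]:
    """Arrange decoded frames in the order designated by their time information."""
    return sorted(decoded_in_encoding_order, key=lambda frame: frame[1])


# Frames input as A, B, C but encoded in the order A, C, B.
encoded = [("A", 0), ("C", 2), ("B", 1)]
print([label for label, _ in output_order(encoded)])  # ['A', 'B', 'C']
```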
In H.264, a so-called “temporal direct mode” is a method for encoding a motion vector with a high level of encoding efficiency, by using motion continuity through a plurality of frames.
For example, for frames A, B, and C shown in FIG. 21, it is assumed here that frames A, C, and B are sequentially encoded in this order, and frame C has been encoded using frame A as a reference frame so as to perform motion compensation. In such a state, in the temporal direct mode, the motion vector of a block in frame B is computed as explained below.
First, a motion vector mv, that is used on a block which belongs to frame C and is at the same position as an encoding target block, is detected.
Next, in accordance with the following formulas, a motion vector fmv when regarding frame A as a reference frame and a motion vector bmv when regarding frame C as a reference frame are computed.
fmv = (mv × TAB) / TAC
bmv = (mv × TBC) / TAC
where TAB, TBC, and TAC are respectively the time interval between frames A and B, the time interval between frames B and C, and the time interval between frames A and C.
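The scaling in the temporal direct mode can be sketched as below, taking fmv = (mv × TAB)/TAC and bmv = (mv × TBC)/TAC, that is, both vectors scaled by the interval TAC over which mv was measured. The function name, the use of frame times as integers, and the use of exact fractions (instead of whatever rounding an actual codec applies) are assumptions; as in the description, only the scaled magnitudes are computed and no sign convention for the backward vector is modeled.

```python
from fractions import Fraction
from typing import Tuple

Vector = Tuple[Fraction, Fraction]


def temporal_direct(mv: Tuple[int, int], t_a: int, t_b: int, t_c: int) -> Tuple[Vector, Vector]:
    """Scale the co-located vector mv (frame C referring to frame A) for frame B."""
    t_ab = t_b - t_a  # interval between frames A and B
    t_bc = t_c - t_b  # interval between frames B and C
    t_ac = t_c - t_a  # interval between frames A and C
    if t_ac == 0:
        raise ValueError("frames A and C must have different times")
    fmv = (Fraction(mv[0] * t_ab, t_ac), Fraction(mv[1] * t_ab, t_ac))
    bmv = (Fraction(mv[0] * t_bc, t_ac), Fraction(mv[1] * t_bc, t_ac))
    return fmv, bmv


# Hypothetical times: A at 0, B at 1, C at 4, so B lies a quarter of the way from A to C.
fmv, bmv = temporal_direct((8, -4), t_a=0, t_b=1, t_c=4)
print(fmv, bmv)
```

Because TAB + TBC = TAC, the components of fmv and bmv always sum to those of mv.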
In H.264, the temporal direct mode can be used only for “B frame” (Bi-predictive frame) which uses two reference frames for each block.
Non-Patent Document 2, listed later, applies the above mode so that the motion vector can also be efficiently encoded in a P frame, which uses only one reference frame for each block.
Additionally, Non-Patent Document 3 discloses a method for efficiently encoding the motion vector by assuming both the motion continuity between adjacent blocks and the motion continuity through a plurality of frames.
FIGS. 22A to 22D show the general concept thereof. In this method, similar to H.264 and MPEG-4, a predicted vector is generated using a motion vector of a peripheral block around an encoding target block, and only the differential vector between the predicted vector and a motion vector that is used in actual motion compensation is encoded (see FIG. 22A).
In contrast with H.264, etc., the motion vector of a peripheral block is not used directly, but is used after being scaled in accordance with the relevant time interval, by using the following formula.
mv_k′ = mv_k × Tct / Tck
where mv_k is an original motion vector, mv_k′ is a motion vector after the scaling, Tct is the time interval between the encoding target frame and a frame to be referred to by the encoding target block, and Tck is the time interval between the encoding target frame and a frame referred to by a peripheral block of the target block (see FIGS. 22B to 22D).
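The scaling of peripheral motion vectors can be sketched as follows. The function names and sample values are hypothetical, exact fractions stand in for the rounding of an actual codec, and combining the scaled vectors with a component-wise median is an assumption made here by analogy with H.264 rather than a detail confirmed by the description.

```python
from fractions import Fraction
from typing import List, Tuple

Vector = Tuple[Fraction, Fraction]


def scale_vector(mv_k: Tuple[int, int], t_ct: int, t_ck: int) -> Vector:
    """mv_k' = mv_k * Tct / Tck, applied to each component."""
    if t_ck == 0:
        raise ValueError("Tck must be non-zero")
    return (Fraction(mv_k[0] * t_ct, t_ck), Fraction(mv_k[1] * t_ct, t_ck))


def predict_scaled(neighbors: List[Tuple[Tuple[int, int], int]], t_ct: int) -> Vector:
    """Scale each (mv_k, Tck) pair to the target interval, then take the median (assumed)."""
    scaled = [scale_vector(mv_k, t_ct, t_ck) for mv_k, t_ck in neighbors]
    xs = sorted(v[0] for v in scaled)
    ys = sorted(v[1] for v in scaled)
    mid = len(scaled) // 2
    return (xs[mid], ys[mid])


# Three hypothetical peripheral blocks referring to frames 2, 1, and 3 intervals away,
# while the target block refers to a frame 1 interval away (Tct = 1).
pmv = predict_scaled([((4, 2), 2), ((3, 0), 1), ((9, -3), 3)], t_ct=1)
print(pmv)
```

In this example the three vectors scale to (2, 1), (3, 0), and (3, -1), so vectors that look dissimilar before scaling become nearly identical once the time intervals are taken into account.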
Below, conventional encoding methods for multi-viewpoint video images will be explained.
Generally, multi-viewpoint video encoding uses a correlation between cameras, and a high level of encoding efficiency is obtained by using “disparity compensation” in which motion compensation is applied to frames which are obtained at the same time by using different cameras.
For example, the MPEG-2 Multiview profile and Non-Patent Document 4 employ such a method.
In the method disclosed in Non-Patent Document 4, either motion compensation or disparity compensation is selected for each block. That is, the one having the higher encoding efficiency is selected for each block, so that both the temporal correlation and the inter-camera correlation can be used. In comparison with a case of using only one type of correlation, a higher encoding efficiency is obtained.
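The per-block selection can be sketched as a comparison of two candidate predictions. The cost measure used here, the sum of absolute differences (SAD) of the prediction residual, is an assumption standing in for "encoding efficiency"; the description does not state which measure is used, and the function names and pixel values are hypothetical.

```python
from typing import Sequence, Tuple

Block = Sequence[Sequence[int]]  # rows of pixel values; an assumed representation


def sad(a: Block, b: Block) -> int:
    """Sum of absolute differences between two blocks of equal size."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def select_mode(target: Block, motion_pred: Block, disparity_pred: Block) -> Tuple[str, int]:
    """Choose, for one block, the compensation whose prediction has the lower cost."""
    costs = {
        "motion": sad(target, motion_pred),        # prediction from another time
        "disparity": sad(target, disparity_pred),  # prediction from another camera
    }
    mode = min(costs, key=costs.get)  # ties resolve to "motion", the first entry
    return mode, costs[mode]


target = [[10, 12], [11, 13]]
motion_pred = [[10, 11], [11, 15]]     # residual cost 3
disparity_pred = [[10, 12], [12, 13]]  # residual cost 1
print(select_mode(target, motion_pred, disparity_pred))  # ('disparity', 1)
```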
In disparity compensation, in addition to a prediction residual, a disparity vector is also encoded. The disparity vector corresponds to the motion vector for indicating a temporal variation between frames, and indicates a difference between positions on image planes, which are obtained by cameras arranged at different positions, and onto which a single position on the imaged object is projected.
FIG. 23 is a schematic view showing the concept of disparity generated between such cameras. In the schematic view of FIG. 23, image planes of cameras, whose optical axes are parallel to each other, are observed vertically from the upper side thereof.
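For cameras with parallel optical axes viewed from above, the disparity can be illustrated by projecting one point of the imaged object onto two image planes and taking the difference of the projected positions. This is a sketch under an idealized pinhole camera model, which is an assumption not stated in the description; the function names, the focal length in pixels, and all coordinates are hypothetical.

```python
def project(point_x: float, point_z: float, camera_x: float, focal_length: float) -> float:
    """Horizontal image position of a point at depth point_z for a camera at camera_x."""
    if point_z <= 0:
        raise ValueError("the point must lie in front of the cameras")
    return focal_length * (point_x - camera_x) / point_z


def disparity(point_x: float, point_z: float,
              camera1_x: float, camera2_x: float, focal_length: float) -> float:
    """Difference between the positions onto which a single point is projected."""
    return (project(point_x, point_z, camera1_x, focal_length)
            - project(point_x, point_z, camera2_x, focal_length))


# Two cameras 0.25 apart; the same point observed at depth 2.0 and at depth 4.0.
near = disparity(0.5, 2.0, camera1_x=0.0, camera2_x=0.25, focal_length=1000.0)
far = disparity(0.5, 4.0, camera1_x=0.0, camera2_x=0.25, focal_length=1000.0)
print(near, far)  # 125.0 62.5
```

Under this model the disparity depends on the distance between the cameras and on the depth of the point, and it shrinks as the point moves away from the cameras.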
In the encoding of the disparity vector, similar to the encoding of the motion vector, it is possible that a predicted vector is generated using a disparity vector of a block adjacent to the encoding target block, and only a differential vector between the predicted vector and the disparity vector used in disparity compensation applied to the target block is encoded. In accordance with such a method, when there is disparity continuity between the relevant adjacent blocks, the disparity vector can be encoded with a high level of encoding efficiency.
Non-Patent Document 1: ITU-T Rec. H.264/ISO/IEC 14496-10, “Editor's Proposed Draft Text Modifications for Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC), Draft 7”, Final Committee Draft, Document JVT-E022, pp. 63-64 and 117-121, September 2002.
Non-Patent Document 2: Alexis Michael Tourapis, “Direct Prediction for Predictive (P) and Bidirectionally Predictive (B) frames in Video Coding”, JVT-C128, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG Meeting, pp. 1-11, May 2002.
Non-Patent Document 3: Sadaatsu Kato and Choong Seng Boon, “Motion Vector Prediction for Multiple Reference Frame Video Coding Using Temporal Motion Vector Normalization”, PCSJ2004, Proceedings of the 19th Picture Coding Symposium of Japan, P-2.18, November 2004.
Non-Patent Document 4: Hideaki Kimata and Masaki Kitahara, “Preliminary results on multiple view video coding (3DAV)”, document M10976, MPEG Redmond Meeting, July 2004.