Multi-viewpoint video images are a plurality of video images obtained by photographing the same object and background thereof using a plurality of cameras. Below, a video image obtained by a single camera is called a “two-dimensional video image”, and a set of multiple two-dimensional video images obtained by photographing the same object and background thereof is called a “multi-viewpoint video image”.
The two-dimensional video image of each camera, which is included in a multi-viewpoint video image, has a strong temporal correlation. In addition, when the cameras are synchronized with each other, the images taken by the cameras at the same time capture the object and background thereof in entirely the same state from different positions, so that there is a strong correlation between the cameras. The encoding efficiency of video encoding can be improved by using these correlations.
First, conventional techniques relating to the encoding of two-dimensional video images will be described.
In many known methods of encoding two-dimensional video images, such as the international video encoding standards MPEG-2 and H.264, high encoding efficiency is obtained by means of interframe predictive encoding, which uses a temporal correlation.
The interframe predictive encoding executed for encoding two-dimensional video images uses a temporal variation in a video image, that is, a motion. Therefore, the method used in the interframe predictive encoding is generally called “motion compensation”. Accordingly, interframe predictive encoding along the temporal axis is called “motion compensation” below. In addition, a “frame” is an image which is a constituent of a video image and is obtained at a specific time.
Generally, two-dimensional video encoding has the following encoding modes for each frame: “I frame” encoded without using an interframe correlation, “P frame” encoded while performing motion compensation based on one already-encoded frame, and “B frame” encoded while performing motion compensation based on two already-encoded frames.
In order to further improve the efficiency of video image prediction, in H.263 and H.264, decoded images of a plurality of frames (i.e., two frames or more) are stored in a reference image memory, and a reference image is selected from the images of the memory to perform prediction.
The reference image can be selected for each block, and reference image designation information for designating the reference image can be encoded to perform the corresponding decoding.
For “P frame”, one piece of reference image designation information is encoded for each block; for “B frame”, two pieces of reference image designation information are encoded for each block.
In motion compensation, in addition to the reference image designation information, a vector is encoded which indicates the position in the reference image used for encoding the target block; this vector is called a “motion vector”. Similar to the reference image designation information, one motion vector is encoded for “P frame”, and two motion vectors are encoded for “B frame”.
In the encoding of the motion vector in MPEG-4 or H.264, a predicted vector is generated using a motion vector of a block adjacent to the encoding target block, and only the differential vector between the predicted vector and the motion vector used in the motion compensation applied to the target block is encoded. In accordance with this method, when motion continuity is present between the relevant adjacent blocks, the motion vector can be encoded with a high level of encoding efficiency.
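This differential encoding can be sketched as follows (an illustrative sketch in Python; the function and variable names are hypothetical, and the generation of the predicted vector itself is left to the caller):

```python
# Sketch of differential motion-vector encoding as used in MPEG-4 / H.264.
# Names and values are hypothetical; only the differencing step is shown.

def encode_motion_vector(mv, predicted_mv):
    """Return the differential vector that is actually encoded."""
    return (mv[0] - predicted_mv[0], mv[1] - predicted_mv[1])

def decode_motion_vector(diff_mv, predicted_mv):
    """Reconstruct the motion vector from the decoded differential."""
    return (diff_mv[0] + predicted_mv[0], diff_mv[1] + predicted_mv[1])

# With continuous motion, the differential vector is small and cheap to encode.
mv = (5, -2)    # motion vector found for the target block (hypothetical)
pmv = (4, -2)   # predicted vector derived from adjacent blocks (hypothetical)
dmv = encode_motion_vector(mv, pmv)
assert decode_motion_vector(dmv, pmv) == mv
```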
Non-Patent Document 1 discloses a process of generating a predicted vector in H.264, and the general explanation thereof is presented below.
In H.264, as shown in FIG. 13A, the predicted vector is obtained by computing, separately for the horizontal and vertical components, the median of the motion vectors (mv_a, mv_b, and mv_c) used in the left side block (see “a” in FIG. 13A), the upper side block (see “b” in FIG. 13A), and the upper-right side block (see “c” in FIG. 13A) of the encoding target block.
As H.264 employs variable-block-size motion compensation, the block size for motion compensation may differ between the target block and its peripheral blocks. In such a case, as shown in FIG. 13B, block “a” is set to the uppermost block among the left side blocks adjacent to the target block, block “b” is set to the leftmost block among the upper side blocks adjacent to the target block, and block “c” is set to the closest upper-left block.
As an exception, if the size of the target block is 8×16 pixels, as shown in FIG. 13C, block “a” and block “c” are used for predicting the left and right blocks, respectively, instead of the median. Similarly, if the size of the target block is 16×8 pixels, as shown in FIG. 13D, block “a” and block “b” are used for predicting the lower and upper blocks, respectively, instead of the median.
As described above, in H.264, a reference frame is selected for each block from among a plurality of already-encoded frames, and is used for motion compensation.
Generally, the motion of the imaged object is not temporally uniform, so the motion vector obtained depends on which reference frame is used. Therefore, a motion vector obtained in motion compensation that uses the same reference frame as the target block should be closer to the motion vector of the target block than a motion vector obtained using a different reference frame.
Therefore, in H.264, if there is only one block (among the blocks a, b, and c) whose reference frame is the same as that of the encoding target block, then instead of the median, the motion vector of the relevant block is used as a predicted vector so as to generate a predicted vector having a relatively higher level of reliability.
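The predicted-vector rule described above (a component-wise median, replaced by the vector of a single block that shares the target's reference frame) can be sketched as follows; the data layout and names are illustrative assumptions, and the 8×16/16×8 exceptions are omitted for brevity:

```python
# Sketch of the H.264 median predictor over blocks a, b, and c.
# Each neighbor is a hypothetical (motion_vector, reference_frame) pair.

def median3(x, y, z):
    """Median of three scalar values."""
    return sorted((x, y, z))[1]

def predict_vector(neighbors, target_ref):
    """Predicted vector from [(mv, ref) for blocks a, b, c]."""
    same_ref = [mv for mv, ref in neighbors if ref == target_ref]
    if len(same_ref) == 1:
        # Exactly one neighbor uses the same reference frame as the
        # target block: use its vector directly instead of the median.
        return same_ref[0]
    xs = [mv[0] for mv, _ in neighbors]
    ys = [mv[1] for mv, _ in neighbors]
    return (median3(*xs), median3(*ys))

# Only block a shares reference frame 0, so its vector is used directly.
assert predict_vector([((2, 0), 0), ((4, 1), 1), ((3, 2), 1)], 0) == (2, 0)
# All neighbors share the reference frame: component-wise median.
assert predict_vector([((2, 0), 0), ((4, 1), 0), ((3, 2), 0)], 0) == (3, 1)
```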
Next, conventional encoding methods for multi-viewpoint video images will be explained.
Generally, multi-viewpoint video encoding uses the correlation between cameras, and a high level of encoding efficiency is obtained by using “disparity compensation”, in which a prediction analogous to motion compensation is performed between frames obtained at the same time by different cameras.
For example, the MPEG-2 Multiview profile and the method of Non-Patent Document 2 employ such an approach.
In the method disclosed in Non-Patent Document 2, either motion compensation or disparity compensation is selected for each block. That is, the one having the higher encoding efficiency is selected for each block, so that both the temporal correlation and the inter-camera correlation can be used; in comparison with using only one type of correlation, a higher encoding efficiency is obtained.
In disparity compensation, in addition to a prediction residual, a disparity vector is also encoded. The disparity vector is the inter-camera counterpart of the motion vector, which indicates a temporal variation between frames: it indicates the difference between the positions, on the image planes of cameras arranged at different positions, onto which a single position on the imaged object is projected.
FIG. 14 is a schematic view showing the concept of disparity generated between such cameras. In the schematic view of FIG. 14, image planes of cameras, whose optical axes are parallel to each other, are observed vertically from the upper side thereof.
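For a parallel-camera arrangement such as that of FIG. 14, the magnitude of the disparity is commonly modeled by the relation d = f·B/Z, where f is the focal length, B is the baseline between the cameras, and Z is the depth of the point on the imaged object. This relation is a standard geometric assumption added here for illustration, not taken from the text above:

```python
# Disparity for two parallel cameras: d = f * B / Z.
# All parameter values below are hypothetical.

def disparity(focal_length, baseline, depth):
    """Horizontal disparity (in pixels) of a point at the given depth."""
    return focal_length * baseline / depth

# A nearer object (smaller depth Z) projects with a larger disparity.
near = disparity(focal_length=1000.0, baseline=0.1, depth=2.0)
far = disparity(focal_length=1000.0, baseline=0.1, depth=10.0)
assert near > far
```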
In the encoding of the disparity vector, similar to the encoding of the motion vector, a predicted vector may be generated using a disparity vector of a block adjacent to the encoding target block, and only the differential vector between the predicted vector and the disparity vector used in the disparity compensation applied to the target block is encoded. In accordance with such a method, when there is disparity continuity between the relevant adjacent blocks, the disparity vector can be encoded with a high level of encoding efficiency.
For each frame in multi-viewpoint video images, temporal redundancy and redundancy between cameras are present at the same time. Non-Patent Document 3 discloses a method for removing both redundancies simultaneously.
In the relevant method, temporal prediction of a differential image between an original image and a disparity-compensated image is performed so as to execute the relevant encoding. That is, after the disparity compensation, a residual of motion compensation in the differential image is encoded.
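A minimal sketch of this two-stage prediction, assuming the frames are NumPy arrays and that the disparity-compensated image and the motion-compensated differential image are produced by routines outside the sketch (all names and values are illustrative):

```python
import numpy as np

def residual_to_encode(original, disparity_compensated, motion_compensated_diff):
    """Residual remaining after disparity compensation followed by
    motion compensation of the differential image."""
    differential = original - disparity_compensated   # removes inter-camera redundancy
    return differential - motion_compensated_diff     # removes temporal redundancy

# Hypothetical 2x2 frames: when both predictions are good, the final
# residual is small, so fewer bits are needed to encode it.
original = np.array([[10.0, 12.0], [14.0, 16.0]])
disp_comp = np.array([[9.0, 11.0], [13.0, 15.0]])
mc_diff = np.array([[1.0, 1.0], [1.0, 1.0]])
res = residual_to_encode(original, disp_comp, mc_diff)
```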
In accordance with the above method, temporal redundancy, which cannot be removed by the disparity compensation that removes the inter-camera redundancy, can be removed using the motion compensation. Therefore, the prediction residual which is finally encoded is reduced, so that a high level of encoding efficiency can be achieved.
    Non-Patent Document 1: ITU-T Rec. H.264/ISO/IEC 14496-10, “Editor's Proposed Draft Text Modifications for Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC), Draft 7”, Final Committee Draft, Document JVT-E022, pp. 63-64, September 2002.
    Non-Patent Document 2: Hideaki Kimata and Masaki Kitahara, “Preliminary results on multiple view video coding (3DAV)”, document M10976 MPEG Redmond Meeting, July 2004.
    Non-Patent Document 3: Shinya Shimizu, Masaki Kitahara, Kazuto Kamikura and Yoshiyuki Yashima, “Multi-view Video Coding based on 3-D Warping with Depth Map”, In Proceedings of Picture Coding Symposium 2006, SS3-6, April 2006.