Multi-viewpoint video images are video images obtained by photographing the same subject and background thereof by using a plurality of cameras at different positions. Below, a video image obtained by a single camera is called a “two-dimensional video image”, and a set of two-dimensional video images obtained by photographing the same subject and background thereof is called a “multi-viewpoint video image”. There is a strong correlation between two-dimensional video images (of the different cameras) included in the multi-viewpoint video image. If the cameras are synchronized with each other, the frames (of the cameras) corresponding to the same time have captured the subject and background thereof in entirely the same state, so that there is a strong correlation between the cameras.
First, conventional techniques relating to the encoding of two-dimensional video images will be described. In many known methods of encoding two-dimensional video images, such as H.264, MPEG-4, and MPEG-2 (which are international encoding standards), highly efficient encoding is performed by means of motion compensation, orthogonal transformation, quantization, entropy encoding, and the like. For example, in H.264, each I frame can be encoded by means of intraframe correlation; each P frame can be encoded by means of interframe correlation with a plurality of past frames; and each B frame can be encoded by means of interframe correlation with a plurality of past or future frames.
Although Non-Patent Document 1 discloses the H.264 techniques in detail, an outline thereof will be described below. In each I frame, the frame is divided into blocks (called “macroblocks”, each having a size of 16×16 pixels), and intraframe prediction (intra-prediction) is performed in each macroblock. In intra-prediction, each macroblock is further divided into smaller blocks (called “sub-blocks” below), and an individual intra-encoding method can be applied to each sub-block.
In each P frame, intra-prediction or inter-prediction (interframe prediction) may be performed in each macroblock. The intra-prediction applied to a P frame is similar to that applied to an I frame. In the inter-prediction, motion compensation is performed. Also in the motion compensation, each macroblock is divided into smaller blocks, and each sub-block may have an individual motion vector and an individual reference image.
Also in each B frame, intra-prediction or inter-prediction can be performed. In the inter-prediction of a B frame, in addition to past frames, a future frame can be used as a reference image in motion compensation. For example, when encoding a frame sequence of “I→B→B→P”, the frames can be encoded in the order of “I→P→B→B”. Also in each B frame, motion compensation can be performed by referring to an I or P frame. Additionally, similar to the P frame, each sub-block (obtained by dividing a macroblock) can have an individual motion vector.
When performing intra or inter-prediction, a prediction residual is obtained. In each macroblock, the prediction-residual block is subjected to DCT (discrete cosine transform) and quantization. The obtained quantized values of the DCT coefficients are then subjected to variable-length encoding. In each P or B frame, the reference image can be selected for each sub-block; it is indicated by a numerical value called a “reference image index” and is subjected to variable-length encoding. In H.264, the smaller the reference image index, the shorter the code used in the variable-length encoding, and the reference image index can be explicitly changed for each frame. Accordingly, the higher the frequency of use of a reference image, the smaller the reference image index assigned to that reference image, thereby efficiently encoding the reference image indexes.
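The residual pipeline described above can be sketched in a few lines. This is a hedged illustration only: it uses a textbook floating-point 2-D DCT-II and a uniform scalar quantizer, not the exact H.264 integer transform and quantization tables, and the block size and step value are arbitrary assumptions.

```python
import math

# Sketch of the residual pipeline: a prediction-residual block is
# transformed (textbook 2-D DCT-II, not the H.264 integer transform)
# and the resulting coefficients are uniformly quantized.

def dct2(block):
    """Floating-point 2-D DCT-II of a square block."""
    n = len(block)
    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    coeff = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeff[u][v] = c(u) * c(v) * s
    return coeff

def quantize(coeff, step):
    # Uniform scalar quantization: round each coefficient to the nearest step.
    return [[round(c / step) for c in row] for row in coeff]

residual = [[4, 4, 4, 4]] * 4            # a flat 4x4 residual block
q = quantize(dct2(residual), step=2)
# A flat block concentrates all its energy in the DC coefficient,
# so after quantization only q[0][0] is non-zero.
```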
In a known method for encoding multi-viewpoint video images, the multi-viewpoint video images are highly efficiently encoded by means of “parallax compensation”, in which motion compensation is applied to images obtained by different cameras at the same time. Here, “parallax” is the difference between the positions, to which the same point on a subject is projected, on the image planes of cameras disposed at different positions.
FIG. 13 is a schematic view showing the concept of parallax generated between such cameras. In this schematic view, the image planes of cameras whose optical axes are parallel to each other are viewed vertically from above. Generally, the points on the image planes of different cameras, to which the same point on a subject is projected, are called “corresponding points”. In parallax compensation, the corresponding point on an image of a reference camera, which corresponds to a target pixel in an image of a target camera for the relevant encoding, is estimated using a reference image, and the pixel value of the target pixel is predicted by using a pixel value assigned to the corresponding point. Below, such “estimated parallax” is also called “parallax” for convenience of explanation, and in such a method, the parallax data and each prediction residual are encoded.
In many methods, parallax is represented by a vector (i.e., parallax (or disparity) vector) in an image plane. For example, in the method disclosed by Non-Patent Document 2, parallax compensation is performed for each block as a unit, where such parallax for each unit block is represented using a two-dimensional vector, that is, by using two parameters (i.e., x component and y component). FIG. 14 is a schematic view showing a parallax vector. That is, in this method, parallax data formed by two parameters and the relevant prediction residual are encoded. As this method does not use camera parameters in encoding, it is effective when camera parameters are unknown.
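The per-block search for a two-parameter disparity vector can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 2: images are plain 2-D lists of pixel values, and the exhaustive search window and SAD (sum of absolute differences) cost are assumptions chosen for brevity.

```python
# Sketch of block-based parallax compensation: for a block of the target
# image, exhaustively search a small window in the reference image and
# return the two-parameter disparity vector (dx, dy) minimizing the SAD.

def sad(target, reference, tx, ty, rx, ry, bs):
    """Sum of absolute differences between two bs x bs blocks."""
    return sum(abs(target[ty + j][tx + i] - reference[ry + j][rx + i])
               for j in range(bs) for i in range(bs))

def find_disparity(target, reference, tx, ty, bs=2, search=2):
    h, w = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = tx + dx, ty + dy
            if 0 <= rx <= w - bs and 0 <= ry <= h - bs:
                cost = sad(target, reference, tx, ty, rx, ry, bs)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best[1]  # the two-parameter disparity vector (dx, dy)

# Example: the 2x2 patch at (2, 2) in the target appears at (3, 2) in the
# reference, i.e. a disparity of (1, 0).
target = [[0] * 6 for _ in range(6)]
reference = [[0] * 6 for _ in range(6)]
for (y, x), v in {(2, 2): 9, (2, 3): 8, (3, 2): 7, (3, 3): 6}.items():
    target[y][x] = v
    reference[y][x + 1] = v
vector = find_disparity(target, reference, tx=2, ty=2)  # (1, 0)
```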
On the other hand, Non-Patent Document 3 discloses a method of encoding multi-viewpoint images (i.e., static images). In the method, camera parameters are used for encoding, and each parallax vector is represented by one-dimensional data based on the Epipolar geometry constraint, thereby efficiently encoding multi-viewpoint images.
FIG. 15 is a schematic view showing the concept of the Epipolar geometry constraint. In accordance with the Epipolar geometry constraint, given two images obtained by two cameras (i.e., “camera 1” and “camera 2”), the point m′ in one image, which corresponds to point m in the other image (both being projections of the same point M on the subject), is constrained to lie on a straight line called an “Epipolar line”. In the method of Non-Patent Document 3, the parallax with respect to each reference image is represented using one parameter, that is, the position on the one-dimensional Epipolar line. That is, in this method, parallax data represented by a single parameter and the relevant prediction residual are encoded.
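The constraint can be illustrated with a fundamental matrix F relating the two cameras: the corresponding point m′ must satisfy m′ᵀ F m = 0, i.e., it lies on the epipolar line l′ = F m. The matrix below is a contrived example for two rectified cameras with pure horizontal displacement; it is an assumption for illustration, not taken from the original document.

```python
# Hedged illustration of the Epipolar geometry constraint: the point m'
# in image 2 corresponding to m in image 1 lies on the line l' = F m,
# i.e. m'^T F m = 0 (all points in homogeneous pixel coordinates).

def matvec(F, m):
    return [sum(F[i][k] * m[k] for k in range(3)) for i in range(3)]

def on_epipolar_line(F, m, m_prime, tol=1e-9):
    line = matvec(F, m)                           # epipolar line in image 2
    value = sum(line[i] * m_prime[i] for i in range(3))
    return abs(value) < tol

# For two identical cameras translated horizontally, F = [t]_x with
# t = (1, 0, 0): the epipolar line is the row through m, so corresponding
# points must share their y coordinate.
F = [[0, 0, 0],
     [0, 0, -1],
     [0, 1, 0]]
m = [10.0, 5.0, 1.0]                              # pixel m in image 1
print(on_epipolar_line(F, m, [7.0, 5.0, 1.0]))    # True: same row
print(on_epipolar_line(F, m, [7.0, 6.0, 1.0]))    # False: off the line
```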
Even when there are two or more reference images (obtained by different cameras), the parallax for each reference image can be represented using a single parameter by means of the Epipolar geometry constraint. For example, when the parallax on the Epipolar line with respect to one reference image is known, the parallax for a reference image obtained by another camera can be reconstituted.
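The reason a single parameter suffices can be sketched as follows: one depth value along the ray of the target pixel fixes a 3-D point, and each reference camera's known projection then yields its own corresponding point. The camera model below (axis-aligned pinhole cameras with a common focal length, translated along the x axis) is an illustrative assumption.

```python
# Sketch: a single depth parameter determines the corresponding points in
# any number of reference cameras, given known camera parameters.

def project(point3d, f, cam_x):
    """Pinhole projection for a camera at (cam_x, 0, 0) looking along z."""
    X, Y, Z = point3d
    return ((X - cam_x) * f / Z, Y * f / Z)

def corresponding_points(pixel, depth, f, camera_xs):
    u, v = pixel
    # Back-project the target-camera pixel (camera at x = 0) to the given depth.
    point3d = (u * depth / f, v * depth / f, depth)
    # Project the reconstructed 3-D point into every reference camera.
    return [project(point3d, f, cx) for cx in camera_xs]

# One depth value yields corresponding points in both reference cameras,
# displaced horizontally in opposite directions (same row, per the
# Epipolar constraint for this camera arrangement).
pts = corresponding_points(pixel=(10.0, 4.0), depth=5.0, f=1.0,
                           camera_xs=[1.0, -1.0])
```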
In addition, when there are a plurality of reference images obtained by different cameras, parallax compensation may be performed using an arbitrary viewpoint image technique. Non-Patent Document 4 discloses parallax compensation using an arbitrary viewpoint image technique. More specifically, each pixel value of an image obtained by a target camera for the relevant encoding is predicted by means of interpolation using the pixel values of the corresponding points (belonging to different cameras) which correspond to the relevant pixel. FIG. 16 is a schematic view showing such interpolation. In the interpolation, the value of pixel m in a target image to be encoded is predicted by performing interpolation between pixels m′ and m″ of reference images 1 and 2, where the pixels m′ and m″ correspond to the pixel m.

Non-Patent Document 1: ITU-T Rec. H.264/ISO/IEC 11496-10, “Advanced Video Coding”, Final Committee Draft, Document JVT-E022, September 2002.
Non-Patent Document 2: Hideaki Kimata and Masaki Kitahara, “Preliminary results on multiple view video coding (3DAV)”, document M10976, MPEG Redmond Meeting, July 2004.
Non-Patent Document 3: Koichi Hata, Minoru Etoh, and Kunihiro Chihara, “Coding of Multi-Viewpoint Images”, IEICE Transactions, Vol. J82-D-II, No. 11, pp. 1921-1929, 1999.
Non-Patent Document 4: Masayuki Tanimoto and Toshiaki Fujii, “Response to Call for Evidence on Multi-View Video Coding”, document Mxxxxx, MPEG Hong Kong Meeting, January 2005.
In conventional methods of encoding multi-viewpoint video images, when the camera parameters are known, parallax data of each reference image can be represented by a single parameter regardless of the number of reference images, by means of the Epipolar geometry constraint, thereby efficiently encoding the parallax data.
However, when a multi-viewpoint video image obtained by actual cameras is the target image to be encoded, and parallax compensation is performed by constraining the parallax on an Epipolar line, the prediction efficiency may be degraded due to errors in the measured camera parameters. In addition, since each reference image contains a distortion due to encoding, the prediction efficiency may also be degraded when performing parallax compensation by constraining the parallax on an Epipolar line. Such degradation in the prediction efficiency causes an increase in the amount of code of the relevant prediction residual, so that the total encoding efficiency is degraded.