Multi-viewpoint video images are video images obtained by photographing the same subject and background thereof by using a plurality of cameras at different positions. Below, a video image obtained by a single camera is called a “two-dimensional video image”, and a set of two-dimensional video images obtained by photographing the same subject and background thereof is called a “multi-viewpoint video image”. There is a strong correlation between two-dimensional video images (of the different cameras) included in the multi-viewpoint video image. If the cameras are synchronized with each other, the frames (of the cameras) corresponding to the same time have captured the subject and background thereof in entirely the same state, so that there is a strong correlation between the cameras.
First, conventional techniques relating to the encoding of two-dimensional video images will be shown. In many known methods of encoding two-dimensional video images, such as H.264, MPEG-4, MPEG-2 (which are international encoding standards), and the like, highly-efficient encoding is performed by means of motion compensation, orthogonal transformation, quantization, variable-length encoding, or the like.
For example, in H.264, each I frame can be encoded by means of intraframe correlation; each P frame can be encoded by means of interframe correlation together with a plurality of past frames; and each B frame can be encoded by means of interframe correlation together with a plurality of past or future frames.
Non-Patent Document 1 discloses the H.264 techniques in detail; an outline thereof is given below. In each I frame, the frame is divided into blocks (called “macroblocks”, each 16×16 pixels), and intraframe prediction (intra-prediction) is performed in each macroblock. In intra-prediction, each macroblock is further divided into smaller blocks (called “sub-blocks”, below), and an individual intra-encoding method can be applied to each sub-block.
In each P frame, intra-prediction or inter-prediction (interframe prediction) may be performed in each macroblock. The intra-prediction applied to a P frame is similar to that applied to an I frame. In the inter-prediction, motion compensation is performed. Also in the motion compensation, each macroblock is divided into smaller blocks, and each sub-block may have an individual motion vector and an individual reference image.
Also in each B frame, intra-prediction or inter-prediction can be performed. In the inter-prediction of the B frame, in addition to a past frame, a future frame can be referred to as a reference image in motion compensation. For example, when encoding a frame sequence of “I→B→B→P”, the frames can be encoded in the order of “I→P→B→B”. Also in each B frame, motion compensation can be performed by referring to an I or P frame. Additionally, similar to the P frame, each sub-block (obtained by dividing a macroblock) can have an individual motion vector.
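The reordering described above can be sketched as follows. This is only an illustrative sketch: the function name `coding_order` is hypothetical, and it assumes that each B frame is simply held back until the next I or P frame has been emitted, which is one possible ordering rather than the rule defined by H.264.

```python
def coding_order(frame_types):
    """Return display indices in one possible coding order.

    Assumption: every B frame waits until the next I or P frame
    (its future reference) has been emitted.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "B":
            pending_b.append(index)      # defer until a future reference exists
        else:
            order.append(index)          # I or P frame is coded first
            order.extend(pending_b)      # then the B frames that precede it
            pending_b = []
    order.extend(pending_b)              # trailing B frames, if any
    return order


types = ["I", "B", "B", "P"]
print([types[i] for i in coding_order(types)])  # ['I', 'P', 'B', 'B']
```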
When intra-prediction or inter-prediction is performed, a prediction residual is obtained. In each macroblock, the prediction-residual block is subjected to DCT (discrete cosine transform), and the resulting DCT coefficients are quantized. The obtained quantized values of the DCT coefficients are then subjected to variable-length encoding.
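The transform and quantization steps can be illustrated with the following minimal sketch. It assumes a floating-point orthonormal DCT on a 4×4 residual block and a single uniform quantization step; the function names and the step size are hypothetical, and the actual H.264 transform and quantizer differ in detail. Variable-length encoding is omitted.

```python
import math


def dct_1d(values):
    """Orthonormal one-dimensional DCT-II."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values)
        )
        out.append(scale * total)
    return out


def dct_2d(block):
    """Apply the 1-D DCT to every row, then to every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def quantize(coeffs, step):
    """Uniform quantization with a single (assumed) step size."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


# A flat residual block concentrates all energy in the DC coefficient.
residual = [[8] * 4 for _ in range(4)]
print(quantize(dct_2d(residual), 8))
```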
In a known method for encoding multi-viewpoint video images, the multi-viewpoint video images are highly efficiently encoded by means of “parallax compensation” in which motion compensation is applied to images obtained by different cameras at the same time. Here, “parallax” is the difference between positions, to which the same point on a subject is projected, on an image plane of cameras which are disposed at different positions.
FIG. 9 is a schematic view showing the concept of parallax generated between such cameras. In the schematic view, an image plane of cameras, whose optical axes are parallel to each other, is viewed from directly above. Generally, the points on the image planes of different cameras to which the same point on a subject is projected are called “corresponding points”. As parallax can be represented as a positional difference on the relevant image plane, it can be represented as two-dimensional vector data.
In parallax compensation, the corresponding point on an image of a reference camera, which corresponds to a target pixel in an image of a target camera for the relevant encoding, is estimated using a reference image, and the pixel value of the target pixel is predicted by using a pixel value assigned to the corresponding point. Below, such “estimated parallax” is also called “parallax” for convenience of explanation.
Non-Patent Document 2 discloses an encoding method using parallax compensation, and in such a method, parallax data and each prediction residual are encoded with respect to the pixels of a target image to be encoded. More specifically, in the relevant method, parallax compensation is performed for each block as a unit, where such parallax for each unit block is represented using a two-dimensional vector. FIG. 10 is a schematic view showing a parallax vector. That is, in this method, parallax data as a two-dimensional vector and the relevant prediction residual are encoded. As this method does not use camera parameters in encoding, it is effective when camera parameters are unknown.
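Block-based parallax compensation with a two-dimensional vector can be sketched as below. This is an illustrative sketch only, not the method of Non-Patent Document 2: the function names, the (dy, dx) vector order, and the clamping of coordinates at the image border are all assumptions.

```python
def predict_block(reference, top, left, size, vector):
    """Predict a size x size block from the reference image,
    displaced by the parallax vector (dy, dx)."""
    dy, dx = vector
    height, width = len(reference), len(reference[0])
    block = []
    for y in range(top, top + size):
        row = []
        for x in range(left, left + size):
            ry = min(max(y + dy, 0), height - 1)   # clamp at the border (assumption)
            rx = min(max(x + dx, 0), width - 1)
            row.append(reference[ry][rx])
        block.append(row)
    return block


def residual_block(target, prediction, top, left):
    """Prediction residual: target block minus predicted block."""
    size = len(prediction)
    return [
        [target[top + y][left + x] - prediction[y][x] for x in range(size)]
        for y in range(size)
    ]


# The target view sees the reference content shifted by two pixels.
reference = [[10 * y + x for x in range(6)] for y in range(4)]
target = [[reference[y][x + 2] for x in range(4)] for y in range(4)]
prediction = predict_block(reference, 0, 0, 2, (0, 2))
residual = residual_block(target, prediction, 0, 0)
print(residual)  # [[0, 0], [0, 0]]
```

In this sketch, the vector (0, 2) and the residual are the two items that would be encoded for the block; a wrong vector leaves a larger residual.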
In addition, when there are a plurality of reference images obtained by different cameras, parallax compensation may be performed using an arbitrary viewpoint image technique. Non-Patent Document 3 discloses parallax compensation using an arbitrary viewpoint image technique. More specifically, each pixel value of an image obtained by a target camera for the relevant encoding is predicted by means of interpolation using the pixel values of corresponding points (belonging to different cameras) which correspond to the relevant pixel. FIG. 11 is a schematic view showing such interpolation. In the interpolation, the value of pixel m in a target image to be encoded is predicted by performing interpolation between pixels m′ and m″ of reference images 1 and 2, where the pixels m′ and m″ correspond to the pixel m.
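The interpolation between pixels m′ and m″ can be sketched as follows. The form of the interpolation is not specified above, so a weighted average is assumed here purely for illustration; the function names and the `weight_ref1` parameter are hypothetical.

```python
def interpolate_pixel(value_ref1, value_ref2, weight_ref1=0.5):
    """Blend two corresponding-point values (weighted average is an assumption)."""
    return weight_ref1 * value_ref1 + (1.0 - weight_ref1) * value_ref2


def predict_from_two_references(ref1, ref2, point1, point2, weight_ref1=0.5):
    """Predict target pixel m from m' = ref1[point1] and m'' = ref2[point2].

    Points are (row, column) positions of the corresponding points.
    """
    (y1, x1), (y2, x2) = point1, point2
    return interpolate_pixel(ref1[y1][x1], ref2[y2][x2], weight_ref1)


ref1 = [[100, 120]]
ref2 = [[110, 130]]
predicted = predict_from_two_references(ref1, ref2, (0, 1), (0, 0))
print(predicted)  # 115.0
```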
When there are two or more reference images obtained by different cameras (as disclosed in Non-Patent Document 3), parallax from each pixel of a target image (to be encoded) to each reference image can be estimated without using the target image. FIG. 12 is a schematic view for showing the concept of such parallax estimation.
As shown in FIG. 12, in true parallax, the pixel values of corresponding points between the relevant reference images should be almost identical to each other. Therefore, in many parallax estimation methods, with regard to each of various depths, the pixel values of corresponding points between the reference images are compared with each other, and parallax can be estimated based on the depth which brings the closest pixel values. Such a process can be applied to each pixel of a target image to be encoded.
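The estimation described above can be sketched on a single scanline. This sketch makes strong simplifying assumptions that are not taken from the cited documents: the cameras are parallel, the target camera lies midway between the two reference cameras (so a candidate parallax d maps target position x to x + d in reference image 1 and x − d in reference image 2), integer parallax candidates stand in for depths, and a small window of absolute differences is used as the matching cost. All names are hypothetical.

```python
def estimate_reference_parallax(ref1_row, ref2_row, x, candidates, half_window=1):
    """Pick the candidate parallax whose corresponding points in the two
    reference rows have the closest pixel values. The target image is not used."""
    width = len(ref1_row)
    best, best_cost = None, None
    for d in candidates:
        cost = 0
        valid = True
        for offset in range(-half_window, half_window + 1):
            x1 = x + offset + d          # corresponding point in reference 1
            x2 = x + offset - d          # corresponding point in reference 2
            if not (0 <= x1 < width and 0 <= x2 < width):
                valid = False            # candidate falls outside the image
                break
            cost += abs(ref1_row[x1] - ref2_row[x2])
        if valid and (best_cost is None or cost < best_cost):
            best, best_cost = d, cost
    return best


# Synthetic scene seen with a true parallax of 2 in both reference views.
scene = [3, 17, 8, 42, 25, 61, 12, 90, 33, 54, 7, 70]
ref1_row = [0, 0] + scene[:10]           # ref1_row[x] == scene[x - 2]
ref2_row = scene[2:] + [0, 0]            # ref2_row[x] == scene[x + 2]
print(estimate_reference_parallax(ref1_row, ref2_row, 5, range(0, 5)))  # 2
```

Because only the reference rows are read, the same routine could run identically on the encoding side and the decoding side.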
As described above, when there are two or more reference images obtained by different cameras, and parallax estimation is possible on the decoding side, parallax compensation can be performed on the decoding side by using parallax data for each pixel, without the encoding side explicitly encoding parallax data and providing it to the decoding side.
Non-Patent Document 1: ITU-T Rec.H.264/ISO/IEC 14496-10, “Advanced Video Coding”, Final Committee Draft, Document JVT-E022, September 2002.
Non-Patent Document 2: Hideaki Kimata and Masaki Kitahara, “Preliminary results on multiple view video coding (3DAV)”, document M10976 MPEG Redmond Meeting, July, 2004.
Non-Patent Document 3: Masayuki Tanimoto, Toshiaki Fujii, “Response to Call for Evidence on Multi-View Video Coding”, document Mxxxxx MPEG Hong Kong Meeting, January, 2005.
In conventional techniques, when there are two or more reference images obtained by different cameras, and parallax estimation is possible on the decoding side, parallax compensation can be performed on the decoding side by using parallax data for each pixel, without the encoding side explicitly encoding parallax data and providing it to the decoding side. Such parallax with regard to a target image to be encoded or decoded, which can be estimated on the encoding or decoding side without using the target image (for the decoding, without decoding the relevant image), is called “reference parallax”.
However, reference parallax, which is estimated on the decoding side, is not optimal in terms of prediction efficiency, and the amount of code assigned to the relevant prediction residual may be increased. When parallax for maximizing the prediction efficiency is computed on the encoding side, and the difference (called “parallax displacement” below) between the computed parallax and the reference parallax is encoded for each pixel, the prediction efficiency can be improved, thereby improving the encoding efficiency with respect to the prediction residual.
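The per-pixel parallax displacement can be illustrated with the following sketch. It assumes a single scanline, a single reference row, a reference parallax that is simply given, and a small hypothetical search range around it; the function names are likewise hypothetical. The encoder searches for the parallax that minimizes the prediction error against the target pixel and keeps only its difference from the reference parallax.

```python
def best_parallax(target_row, ref_row, x, candidates):
    """Encoder-side search: parallax minimizing the absolute prediction error."""
    width = len(ref_row)
    valid = [d for d in candidates if 0 <= x + d < width]
    return min(valid, key=lambda d: abs(target_row[x] - ref_row[x + d]))


def parallax_displacement(target_row, ref_row, x, reference_parallax, search_range=2):
    """Difference between the encoder-optimal parallax and the reference parallax."""
    candidates = range(reference_parallax - search_range,
                       reference_parallax + search_range + 1)
    optimal = best_parallax(target_row, ref_row, x, candidates)
    return optimal - reference_parallax


ref_row = [10, 20, 30, 40, 50, 60]
target_row = [0, 0, 50, 0, 0, 0]
reference_parallax = 1                   # assumed output of decoder-side estimation
displacement = parallax_displacement(target_row, ref_row, 2, reference_parallax)
print(displacement)                                   # 1
print(ref_row[2 + reference_parallax + displacement])  # 50
```

On the decoding side, adding the decoded displacement to the locally estimated reference parallax recovers the parallax used for prediction.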
However, in such a readily conceivable technique, the parallax displacement is encoded for each pixel, so that the amount of code of parallax data increases and the total encoding efficiency cannot be high.