Multi-viewpoint images are images obtained by photographing the same object and background thereof by using a plurality of cameras, and multi-viewpoint video images are video images of the multi-viewpoint images. Below, a video image obtained by a single camera is called a “two-dimensional video image”, and a set of multiple two-dimensional video images obtained by photographing the same object and background thereof is called a “multi-viewpoint video image”.
As there is a strong temporal correlation between the frames of a two-dimensional video image, the encoding efficiency thereof is improved by using this correlation. On the other hand, when the cameras for obtaining multi-viewpoint images or multi-viewpoint video images are synchronized with each other, the images (of the cameras) corresponding to the same time capture the imaged object and background thereof in exactly the same state from different positions, so that there is a strong correlation between the cameras. The encoding efficiency of the multi-viewpoint images or the multi-viewpoint video images can be improved using this correlation.
First, conventional techniques relating to the encoding of two-dimensional video images will be shown.
In many known methods of encoding two-dimensional video images, such as H.264, MPEG-2, and MPEG-4 (which are international encoding standards), highly efficient encoding is performed by means of motion compensation, orthogonal transformation, quantization, entropy encoding, and the like. For example, in H.264, encoding can exploit the temporal correlation between the target frame and a plurality of past or future frames.
For example, Non-Patent Document 1 discloses detailed techniques of motion compensation used in H.264. General explanations thereof follow.
In the motion compensation of H.264, a target frame for encoding can be divided into blocks of various sizes, and each block can have its own motion vector and reference image. In addition, the reference image is filtered so as to generate sample values at half-pixel and quarter-pixel positions, which enables motion compensation with quarter-pixel accuracy and thus provides a higher encoding efficiency than the earlier international encoding standards.
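The generation of sub-pixel samples can be illustrated by the following one-dimensional sketch. The six-tap filter (1, -5, 20, 20, -5, 1)/32 and the rounding rules follow the luma interpolation commonly described for H.264; the restriction to a single row of samples, the edge padding, and the function names are assumptions made only for illustration, and the sketch is not the interpolation procedure of the standard itself.

```python
def clip8(value):
    """Clamp a value to the 8-bit sample range 0..255."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1].

    Uses the six-tap filter (1, -5, 20, 20, -5, 1) / 32.
    Positions outside the row repeat the nearest edge sample (assumption).
    """
    taps = (1, -5, 20, 20, -5, 1)
    acc = 0
    for k, tap in enumerate(taps):
        j = min(max(i - 2 + k, 0), len(row) - 1)  # edge padding
        acc += tap * row[j]
    return clip8((acc + 16) >> 5)  # divide by 32 with rounding


def quarter_pel(row, i):
    """Quarter-pixel sample between row[i] and the half-pixel sample to its right,
    obtained by averaging the two neighbours with upward rounding."""
    return (row[i] + half_pel(row, i) + 1) >> 1


ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(half_pel(ramp, 3), quarter_pel(ramp, 3))
```

On the linear ramp above, the half-pixel sample between the values 30 and 40 is 35, and the quarter-pixel sample between 30 and 35 is 33.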
Next, a conventional encoding method of multi-viewpoint images or multi-viewpoint video images will be explained.
The difference between the encoding of multi-viewpoint images and the encoding of multi-viewpoint video images is that multi-viewpoint video images have not only a correlation between cameras but also a temporal correlation. However, the same method using the correlation between cameras can be applied to both the multi-viewpoint images and the multi-viewpoint video images. Therefore, methods used in the encoding of multi-viewpoint video images will be explained below.
As the encoding of multi-viewpoint video images uses a correlation between cameras, a known method encodes the multi-viewpoint video images with high efficiency by using “parallax (or disparity) compensation”, in which motion compensation is applied to images obtained by different cameras at the same time. Here, “parallax” (or disparity) is the difference between the positions, on the image planes of cameras disposed at different positions, to which the same point on an imaged object is projected.
FIG. 8 is a schematic view showing the concept of parallax generated between such cameras. In the schematic view of FIG. 8, the image planes of cameras whose optical axes are parallel to each other are viewed from directly above. Generally, the points on the image planes of different cameras to which the same point on an imaged object is projected are called “corresponding points”.
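For the parallel-axis arrangement of FIG. 8, the relationship between parallax and distance can be written down as a short worked equation. The symbols below are assumptions introduced only for this illustration: two identical pinhole cameras A and B with focal length f (in pixels), camera B displaced by a baseline B_l along the horizontal axis, and a point on the imaged object at horizontal position X and depth Z in the coordinate system of camera A.

```latex
\[
  x_A = \frac{f\,X}{Z}, \qquad
  x_B = \frac{f\,(X - B_l)}{Z}, \qquad
  d = x_A - x_B = \frac{f\,B_l}{Z}
\]
% Example: f = 1000 pixels, B_l = 0.1 m, Z = 5 m gives d = 20 pixels.
```

Under these assumptions the parallax d is inversely proportional to the depth Z, so that objects closer to the cameras produce a larger parallax.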
In parallax compensation, based on the above corresponding relationship, each pixel value of a target frame for encoding is predicted using a reference frame, and the relevant prediction residual and parallax data which indicates the corresponding relationship are encoded.
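The prediction and residual described above can be sketched for a single row of pixels as follows. The sketch assumes cameras with parallel optical axes, so that corresponding points lie on the same row, and it finds the parallax by an exhaustive search that minimizes the sum of absolute differences (SAD); the search criterion, the function names, and the sample values are assumptions for illustration and are not taken from any particular encoder.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def disparity_compensate(target, reference, x0, size, max_disp):
    """Predict target[x0:x0 + size] from the reference row.

    Tries every horizontal shift d in 0..max_disp that stays inside the
    reference row and returns (d, residual) for the shift with minimum SAD.
    Assumes at least the shift d = 0 fits inside the reference row.
    """
    block = target[x0:x0 + size]
    best = None
    for d in range(max_disp + 1):
        start = x0 + d
        if start + size > len(reference):
            break
        candidate = reference[start:start + size]
        cost = sad(block, candidate)
        if best is None or cost < best[0]:
            best = (cost, d, candidate)
    _, d, prediction = best
    residual = [t - p for t, p in zip(block, prediction)]
    return d, residual


def reconstruct(reference, x0, size, d, residual):
    """Decoder side: prediction from the reference row plus the residual."""
    prediction = reference[x0 + d:x0 + d + size]
    return [p + r for p, r in zip(prediction, residual)]


reference = [10, 10, 10, 50, 80, 50, 10, 10, 10, 10]  # row seen by the reference camera
target = [10, 52, 79, 50, 10, 10, 10, 10, 10, 10]     # same row seen by the target camera
d, residual = disparity_compensate(target, reference, x0=1, size=3, max_disp=4)
print(d, residual)
```

In this example the best shift is 2 pixels and the residual is [2, -1, 0]; only these two items need to be encoded for the block, and the decoder recovers the original samples from the reference row.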
By using camera parameters and the Epipolar geometry constraint, the above corresponding relationship can be represented by a one-dimensional quantity, such as the distance from one of the cameras (used as a reference) to the imaged object, instead of a two-dimensional vector.
FIG. 9 is a schematic view showing the concept of the Epipolar geometry constraint. In accordance with the Epipolar geometry constraint, a point in the image of one camera can correspond only to a point in the image of another camera that lies on a straight line called an “Epipolar line”. In such a case, if the distance from the camera to the imaged object is obtained for the relevant pixel, the corresponding point on the Epipolar line is uniquely determined.
For example, as shown in FIG. 9, a point of the imaged object, which is projected onto the position “m” in an image of camera A, is projected (in an image of camera B) onto (i) the position m′ on the Epipolar line when the corresponding point of the imaged object in the actual space is the position M′, (ii) the position m″ on the Epipolar line when the corresponding point of the imaged object in the actual space is the position M″, and (iii) the position m′″ on the Epipolar line when the corresponding point of the imaged object in the actual space is the position M′″.
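The reason why the candidate positions m′, m″, and m′″ all lie on one straight line can be seen from the following short derivation. The notation is an assumption introduced for illustration only: pinhole cameras with intrinsic matrices K_A and K_B, a rotation R and translation t that map coordinates of camera A into coordinates of camera B, the homogeneous pixel position \tilde{m} = (u, v, 1)^T in the image of camera A, and the depth z of the imaged point along the optical axis of camera A.

```latex
\[
  \tilde{m}'(z) \;\simeq\; K_B \left( z\, R\, K_A^{-1} \tilde{m} + t \right)
  \;=\; z\,a + b,
  \qquad a = K_B R K_A^{-1} \tilde{m}, \quad b = K_B\, t
\]
\[
  l = b \times a
  \quad\Longrightarrow\quad
  l^{\mathsf{T}} \left( z\,a + b \right) = 0 \quad \text{for every } z
\]
```

Here \simeq denotes equality up to a non-zero scale factor. Under these assumptions every candidate position is a combination of the two fixed points a and b, and therefore lies on the line l, which does not depend on z. The single value z then selects one position on that line.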
FIG. 10 is a diagram for explaining that corresponding points can be obtained between a plurality of cameras when the distance from one of the cameras to the imaged object is provided.
Generally, parallax varies depending on the target frame for encoding, and thus parallax data must be encoded for each target frame. However, the distance from a camera to the imaged object is determined in accordance with physical states of the imaged object, and thus the corresponding points on images of the plurality of cameras can be represented using only data of the distance from a camera to the imaged object.
For example, as shown in FIG. 10, both the corresponding point mb in an image of camera B and the corresponding point mc in an image of camera C, which each correspond to the point ma in an image of camera A, can be represented using only data of the distance from the position of the viewpoint of camera A to the point M on the imaged object.
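The determination of the corresponding points mb and mc from a single distance value can be sketched as follows. The sketch assumes ideal pinhole cameras described by a focal length in pixels, a principal point, and a rotation R and translation t such that camera coordinates equal R times world coordinates plus t; it also takes the “distance” as the depth along the optical axis of camera A. The parameter values, the dictionary layout, and the function names are hypothetical and serve only as an illustration.

```python
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]


def make_camera(f, cx, cy, R, t):
    """Hypothetical camera record: focal length, principal point, pose."""
    return {"f": f, "cx": cx, "cy": cy, "R": R, "t": t}


def back_project(pixel, depth, cam):
    """Pixel (u, v) of `cam` plus depth along its optical axis -> world point."""
    u, v = pixel
    pc = [(u - cam["cx"]) * depth / cam["f"],
          (v - cam["cy"]) * depth / cam["f"],
          depth]                                   # point in camera coordinates
    R, t = cam["R"], cam["t"]
    diff = [pc[i] - t[i] for i in range(3)]
    # world = R^T (pc - t), because pc = R * world + t
    return [sum(R[r][c] * diff[r] for r in range(3)) for c in range(3)]


def project(point, cam):
    """World point -> pixel (u, v) in the image of `cam`."""
    R, t = cam["R"], cam["t"]
    pc = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    return (cam["f"] * pc[0] / pc[2] + cam["cx"],
            cam["f"] * pc[1] / pc[2] + cam["cy"])


# Three cameras on a horizontal line, 0.1 apart, with parallel optical axes.
cam_a = make_camera(1000.0, 320.0, 240.0, IDENTITY, [0.0, 0.0, 0.0])
cam_b = make_camera(1000.0, 320.0, 240.0, IDENTITY, [-0.1, 0.0, 0.0])
cam_c = make_camera(1000.0, 320.0, 240.0, IDENTITY, [-0.2, 0.0, 0.0])

ma = (400.0, 260.0)
M = back_project(ma, 5.0, cam_a)   # point M on the imaged object
mb = project(M, cam_b)
mc = project(M, cam_c)
print(mb, mc)
```

With these hypothetical parameters, mb is approximately (380, 260) and mc is approximately (360, 260). Both corresponding points are obtained from the single depth value 5.0 given for the point ma, without any separate parallax vector for camera B or camera C.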
In accordance with the above characteristics, when the parallax data is represented by the distance from a camera of the relevant reference image to the imaged object, it is possible to implement parallax compensation from the reference image to all frames obtained at the same time by other cameras whose positional relationships to the reference camera have been obtained. In Non-Patent Document 2, the number of parallax data items which must be encoded is decreased using the above characteristics, so as to perform highly efficient encoding of multi-viewpoint video images.
Non-Patent Document 3 is a prior-art document which discloses a technique referred to in an embodiment (explained later) of the present invention, and which contains explanations relating to parameters for indicating positional relationships between a plurality of cameras, and parameters for indicating data of projection (by a camera) onto an image plane.
Non-Patent Document 1: ITU-T Rec. H.264/ISO/IEC 14496-10, “Editor's Proposed Draft Text Modifications for Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC), Draft 7”, Final Committee Draft, Document JVT-E022, pp. 10-13, and 62-68, September 2002.
Non-Patent Document 2: Shinya SHIMIZU, Masaki KITAHARA, Kazuto KAMIKURA and Yoshiyuki YASHIMA, “Multi-view Video Coding based on 3-D Warping with Depth Map”, In Proceedings of Picture Coding Symposium 2006, SS3-6, April, 2006.
Non-Patent Document 3: Olivier Faugeras, Three-Dimensional Computer Vision, MIT Press; BCTC/UFF-006.37 F259 1993, ISBN: 0-262-06158-9, pp. 33-68.