Conventionally, multi-view images each including a plurality of images obtained by photographing the same object and background using a plurality of cameras are known. A moving image captured by the plurality of cameras is referred to as a multi-view moving image (multi-view video). In the following description, an image (moving image) captured by one camera is referred to as a “two-dimensional image (moving image),” and a group of two-dimensional images (two-dimensional moving images) obtained by photographing the same object and background using a plurality of cameras differing in a position and/or direction (hereinafter referred to as a view) is referred to as a “multi-view image (multi-view moving image).”
A two-dimensional moving image has a high correlation in the time direction, and coding efficiency can be improved using this correlation. On the other hand, when the cameras are synchronized, the frames (images) of the videos of the cameras corresponding to the same time in a multi-view image or a multi-view moving image are frames (images) obtained by photographing the object and background in completely the same state from different positions, and thus there is a high correlation between the cameras (between different two-dimensional images of the same time). It is possible to improve coding efficiency by using this correlation in coding of a multi-view image or a multi-view moving image.
Here, conventional technology relating to encoding technology of two-dimensional moving images will be described. In many conventional two-dimensional moving-image encoding schemes including H.264, MPEG-2, and MPEG-4, which are international coding standards, highly efficient encoding is performed using technologies of motion-compensated prediction, orthogonal transform, quantization, and entropy encoding. For example, in H.264, encoding using a temporal correlation with a plurality of past or future frames is possible.
Details of the motion-compensated prediction technology used in H.264, for example, are disclosed in Non-Patent Document 1. An outline of the motion-compensated prediction technology used in H.264 will be described. The motion-compensated prediction of H.264 enables an encoding target frame to be divided into blocks of various sizes and enables each block to have a different motion vector and a different reference image. By using a different motion vector for each block, highly precise prediction that compensates for the different motion of each object is realized. On the other hand, by using a different reference frame for each block, highly precise prediction that takes into account occlusion caused by temporal change is realized.
Next, a conventional encoding scheme for multi-view images or multi-view moving images will be described. A difference between the multi-view image coding scheme and the multi-view moving-image coding scheme is that a correlation in the time direction is simultaneously present in a multi-view moving image in addition to the correlation between the cameras. However, the same method using the correlation between the cameras can be used in both cases. Therefore, a method to be used in coding multi-view moving images will be described here.
In order to use the correlation between the cameras in coding of multi-view moving images, there is a conventional scheme of encoding a multi-view moving image with high efficiency through “disparity-compensated prediction,” in which the motion-compensated prediction is applied to images captured by different cameras at the same time. Here, the disparity is the difference between the positions at which the same portion on an object is projected on the image planes of cameras arranged at different positions. FIG. 15 is a conceptual diagram illustrating the disparity occurring between the cameras. In the conceptual diagram illustrated in FIG. 15, the image planes of cameras having parallel optical axes are viewed vertically from above. In this manner, the positions at which the same portion on the object is projected on the image planes of different cameras are generally referred to as corresponding points.
In the disparity-compensated prediction, each pixel value of an encoding target frame is predicted from a reference frame based on this corresponding relationship, and the prediction residual and the disparity information representing the corresponding relationship are encoded. Because the disparity differs depending on the pair of target cameras and their positions, it is necessary to encode disparity information for each region in which the disparity-compensated prediction is performed. In fact, in the multi-view moving-image coding scheme of H.264, a vector representing the disparity information is encoded for each block using the disparity-compensated prediction.
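As an illustrative sketch only (not the H.264 syntax or implementation; the helper names are hypothetical), block-based disparity-compensated prediction copies a displaced block from the reference view, so that only the residual and the disparity vector remain to be encoded:

```python
# Hypothetical sketch of block-based disparity-compensated prediction.
# Frames are row-major lists of lists; (bx, by) is the block's top-left pixel.

def predict_block(reference, bx, by, size, disparity):
    """Predict a size x size block at (bx, by) by copying the block
    displaced by the disparity vector (dx, dy) from the reference frame."""
    dx, dy = disparity
    return [[reference[by + j + dy][bx + i + dx] for i in range(size)]
            for j in range(size)]

def residual_block(target, predicted, bx, by, size):
    """Prediction residual; this (plus the disparity vector) is what
    would actually be entropy-encoded."""
    return [[target[by + j][bx + i] - predicted[j][i] for i in range(size)]
            for j in range(size)]
```

When the prediction matches the target block exactly, the residual is all zeros and only the disparity vector carries information for that block.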
The corresponding relationship provided by the disparity information can be represented as a one-dimensional amount representing a three-dimensional position of an object, rather than as a two-dimensional vector, based on epipolar geometric constraints by using camera parameters. Although there are various representations of information indicating a three-dimensional position of the object, the distance from a reference camera to the object or a coordinate value on an axis that is not parallel to the image plane of the camera is normally used. The reciprocal of the distance may be used instead of the distance. In addition, because the reciprocal of the distance is information proportional to the disparity, two reference cameras may be set and a three-dimensional position may be represented as the amount of disparity between the images captured by those cameras. Because there is no essential difference regardless of which representation is used, information representing a three-dimensional position is hereinafter referred to as a depth, without distinguishing among these representations.
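For the common case of rectified cameras with parallel optical axes, the proportionality between the reciprocal of the distance and the disparity mentioned above can be sketched as follows (illustrative only; `focal_length` in pixels and `baseline` are assumed camera parameters):

```python
def disparity_from_depth(focal_length, baseline, depth):
    """Disparity (in pixels) between two parallel cameras separated by
    `baseline`, for a point at distance `depth`: d = f * B / Z.
    Doubling the depth halves the disparity, i.e. disparity is
    proportional to the reciprocal of the distance."""
    return focal_length * baseline / depth
```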
FIG. 16 is a conceptual diagram of epipolar geometric constraints. According to the epipolar geometric constraints, a point on an image of another camera corresponding to a point on an image of a certain camera is constrained to a straight line called an epipolar line. At this time, when a depth for a pixel of the image is obtained, a corresponding point is uniquely defined on the epipolar line. For example, as illustrated in FIG. 16, a corresponding point in an image of a second camera for the object projected at a position m in an image of a first camera is projected at a position m′ on the epipolar line when the position of the object in a real space is M′ and projected at a position m″ on the epipolar line when the position of the object in the real space is M″.
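The depth-to-corresponding-point computation described above can be sketched under simplifying assumptions (pinhole cameras with identical intrinsics, parallel optical axes, and a purely horizontal baseline; all names are hypothetical):

```python
# Hypothetical sketch: back-project a pixel using its depth, then
# re-project the 3-D point into the second camera.

def backproject(u, v, depth, f, cx, cy):
    """Lift pixel (u, v) with known depth to a 3-D point in camera-1 coordinates."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)

def project(point, f, cx, cy, baseline):
    """Project a 3-D point into camera 2, displaced by `baseline` along x."""
    x, y, z = point
    return (f * (x - baseline) / z + cx, f * y / z + cy)

def corresponding_point(u, v, depth, f, cx, cy, baseline):
    """Corresponding point in camera 2 for pixel (u, v) of camera 1."""
    return project(backproject(u, v, depth, f, cx, cy), f, cx, cy, baseline)
```

Note that as the assumed depth varies, only the horizontal coordinate of the corresponding point changes in this rectified setup: the point slides along a horizontal epipolar line, matching the behavior of m′ and m″ in FIG. 16.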
In Non-Patent Document 2, a highly precise predicted image is generated and efficient multi-view moving-image coding is realized by using this property and synthesizing a predicted image for an encoding target frame from a reference frame in accordance with three-dimensional information of each object given by a depth map (distance image) for the reference frame. Also, the predicted image generated based on the depth is referred to as a view-synthesized image, a view-interpolated image, or a disparity-compensated image.
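A minimal sketch of depth-based view synthesis by forward warping, assuming the same rectified parallel-camera setup (hypothetical names; hole filling and z-buffering for occlusions are omitted, with `None` marking unfilled pixels):

```python
def synthesize_view(reference, depth_map, f, baseline):
    """Forward-warp each reference pixel to the target view using its
    depth. In practice a z-buffer would keep the nearest pixel where
    several map to the same location, and holes would be inpainted."""
    h, w = len(reference), len(reference[0])
    synthesized = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            # Disparity from depth for parallel cameras: d = f * B / Z.
            d = round(f * baseline / depth_map[v][u])
            if 0 <= u - d < w:
                synthesized[v][u - d] = reference[v][u]
    return synthesized
```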
Further, in Patent Document 1, it is possible to generate a view-synthesized image only for a necessary region by initially converting a depth map for a reference frame into a depth map for an encoding target frame and obtaining corresponding points using the converted depth map. Thereby, when an image or moving image is encoded or decoded while the method of generating the predicted image is switched for each region of the frame serving as an encoding or decoding target, the amount of processing for generating the view-synthesized image and the amount of memory for temporarily storing the view-synthesized image are reduced.
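The two steps above can be sketched roughly as follows: first forward-warp the reference-view depth map into the target view, then look up corresponding points only for the pixels of the regions that actually need a view-synthesized predictor. This is an illustrative sketch under the rectified parallel-camera assumption, not the method of Patent Document 1 itself; all names are hypothetical.

```python
def warp_depth_map(ref_depth, f, baseline):
    """Forward-warp the reference-view depth map into the target view,
    keeping the nearest (smallest-depth) value where pixels collide.
    Unseen target pixels keep depth infinity."""
    h, w = len(ref_depth), len(ref_depth[0])
    tgt_depth = [[float('inf')] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = ref_depth[v][u]
            d = round(f * baseline / z)  # disparity for parallel cameras
            if 0 <= u - d < w and z < tgt_depth[v][u - d]:
                tgt_depth[v][u - d] = z
    return tgt_depth

def synthesize_region(reference, tgt_depth, region, f, baseline):
    """Generate the view-synthesized image only for the requested pixels,
    using the depth map already converted to the target view."""
    out = {}
    for (v, u) in region:
        z = tgt_depth[v][u]
        if z != float('inf'):
            d = round(f * baseline / z)
            if 0 <= u + d < len(reference[0]):
                out[(v, u)] = reference[v][u + d]
    return out
```

Because `synthesize_region` touches only the pixels in `region`, both the per-frame processing and the buffer holding the synthesized samples scale with the size of the regions that actually use view synthesis, which is the saving the paragraph above describes.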