A distance image is an image in which the distance from a camera to an object is expressed as pixel values. Since the distance from the camera to the object can be regarded as the depth of the scene, the distance image is also called a depth image or a depth map. In the field of computer graphics, because the depth corresponds to the information stored in a Z buffer, such an image is also called a Z image or a Z map. Moreover, in addition to the distance from the camera to the object, the coordinate value along the Z axis of a three-dimensional coordinate system defined in the space to be expressed is also used as the distance (depth). Since the horizontal direction of a captured image is generally taken as the X axis and the vertical direction as the Y axis, the Z axis usually coincides with the orientation of the camera; however, as in the case in which a common coordinate system is used for a plurality of cameras, there are also cases in which the Z axis does not coincide with the orientation of a camera. Hereinafter, the distance, the depth, and the Z value are collectively called distance information without distinguishing among them, and an image in which the distance information is expressed as pixel values is called a distance image.
The distance information is expressed as pixel values by, for example, a method in which values corresponding to the physical quantities are used as pixel values as they are, a method in which the range between a minimum value and a maximum value is quantized into a certain number of levels, or a method in which differences from the minimum value are quantized with a certain step width. When the range to be expressed is limited, additional information such as the minimum value can be used, so that the distance information is expressed with higher accuracy.
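The second of these methods can be sketched as follows. This is an illustrative Python sketch, not the mapping defined by any particular standard; the function names and the 8-bit default are assumptions for the example, and the minimum and maximum values `z_min` and `z_max` correspond to the additional information mentioned above.

```python
def quantize_depth(z, z_min, z_max, bits=8):
    """Linearly quantize a distance z in [z_min, z_max] to an integer
    pixel value with 2**bits levels (illustrative sketch only)."""
    levels = (1 << bits) - 1
    z = min(max(z, z_min), z_max)          # clamp to the expressed range
    return round((z - z_min) / (z_max - z_min) * levels)

def dequantize_depth(v, z_min, z_max, bits=8):
    """Recover an approximate distance from a pixel value."""
    levels = (1 << bits) - 1
    return z_min + v / levels * (z_max - z_min)
```

Restricting the range via `z_min` and `z_max` spends all quantization levels on distances that actually occur in the scene, which is why transmitting such additional information improves accuracy.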
Furthermore, quantization is performed at a regular interval using either a method in which the physical quantity is quantized as it is, or a method in which the reciprocal of the physical quantity is quantized. In general, since the reciprocal of the distance is a value proportional to parallax, the former method is mainly used when the distance information must be expressed with high accuracy, and the latter method is mainly used when the parallax information must be expressed with high accuracy. Hereinafter, regardless of the method of expressing the distance information as pixel values and of the quantization method, all images in which distance information is expressed as an image are called distance images.
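The relation between the reciprocal of the distance and parallax can be illustrated with the standard pinhole stereo model, in which the disparity d satisfies d = f·B/Z for focal length f (in pixels), baseline B, and distance Z. The sketch below, assuming that model, quantizes the reciprocal of the distance uniformly; the function names are hypothetical.

```python
def disparity_from_depth(z, focal_px, baseline):
    """Stereo disparity in pixels: d = f * B / Z (pinhole model)."""
    return focal_px * baseline / z

def quantize_inverse_depth(z, z_min, z_max, bits=8):
    """Quantize 1/z uniformly between 1/z_max and 1/z_min.

    Because disparity is proportional to 1/z, uniform steps in 1/z are
    uniform steps in disparity, so parallax is represented with uniform
    accuracy (and near objects receive finer distance resolution).
    """
    levels = (1 << bits) - 1
    inv, inv_min, inv_max = 1.0 / z, 1.0 / z_max, 1.0 / z_min
    inv = min(max(inv, inv_min), inv_max)
    return round((inv - inv_min) / (inv_max - inv_min) * levels)
```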
One of the purposes for which distance images are used is the 3D image. A general stereoscopic image is a stereo image pair consisting of an image for the right eye and an image for the left eye of an observer; however, a 3D image can also be expressed using the image of a certain camera together with its distance image (for more details, refer to Non-Patent Document 1).
As a scheme for encoding a 3D image expressed using a video and a distance image at a single viewpoint, MPEG-C Part.3 (ISO/IEC 23002-3) is available (for more details, refer to Non-Patent Document 2).
By holding the video and the distance image for multiple viewpoints, it is possible to express a 3D image having larger parallax than a 3D image that can be expressed from a single viewpoint (for more details, refer to Non-Patent Document 3).
Furthermore, in addition to expressing such a 3D image, the distance image is also used as data for generating a free viewpoint image, with which a viewer can freely move the viewpoint without regard to the arrangement of the capturing cameras. An image synthesized as if the scene were viewed from a camera other than the actual capturing cameras is called a virtual viewpoint image, and methods for generating it are actively studied in the field of image-based rendering. A representative technique for generating a virtual viewpoint image from multi-viewpoint video and distance images is disclosed in Non-Patent Document 4.
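The warping step at the core of such depth-based view synthesis can be illustrated with a toy one-scanline example: each pixel is shifted horizontally by its disparity d = f·B/Z, and where several pixels land on the same target column, the nearer one wins. This is a minimal sketch under a rectified pinhole assumption, not the method of Non-Patent Document 4; the function name and hole representation (`None`) are assumptions for the example.

```python
def warp_to_virtual_view(image_row, depth_row, focal_px, baseline):
    """Forward-warp one scanline to a horizontally shifted virtual camera.

    Each pixel moves by its disparity d = f * B / Z; nearer pixels move
    further, and occlusion conflicts are resolved by keeping the smallest
    depth.  Disoccluded positions (holes) remain None.
    """
    out = [None] * len(image_row)
    out_z = [float("inf")] * len(image_row)
    for x, (c, z) in enumerate(zip(image_row, depth_row)):
        d = focal_px * baseline / z
        xt = round(x - d)                  # target column in the virtual view
        if 0 <= xt < len(out) and z < out_z[xt]:
            out[xt], out_z[xt] = c, z
    return out
```

In a full renderer the holes would be filled by inpainting or by blending warps from several reference viewpoints.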
Since the distance image consists of a single component, it can be regarded as a grayscale image. Furthermore, since objects exist continuously in real space and cannot instantaneously move to a distant position, the distance image, like an image signal, has both spatial correlation and temporal correlation.
Consequently, a distance image or a distance video can be efficiently encoded, with its spatial and temporal redundancy removed, by the image-encoding or video-encoding schemes used for ordinary image and video signals. In fact, MPEG-C Part.3 performs encoding using an existing video-encoding scheme.
Hereinafter, a conventional general video-signal encoding scheme will be described. In general, since an object has spatial and temporal continuity in real space, its appearance has high spatial and temporal correlation. Video-signal encoding achieves high coding efficiency by exploiting this correlation.
In detail, the video signal of an encoding target block is predicted from an already encoded video signal, and only the prediction residue is encoded, so that the amount of information that must be encoded is reduced and high coding efficiency is achieved. Representative prediction techniques for a video signal are intra-frame prediction, in which a predicted signal is generated spatially from adjacent blocks, and motion-compensated prediction, in which the motion of an object is estimated from already encoded frames captured at different times and a predicted signal is generated temporally.
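The predict-and-encode-the-residue idea can be sketched with the simplest intra predictor, DC prediction, in which every sample of a block is predicted as the mean of the already encoded neighboring samples. This is a simplified illustration of the principle (loosely modeled on the DC intra mode found in codecs such as H.264), and the helper names are hypothetical.

```python
def dc_intra_predict(samples_above, samples_left):
    """DC intra prediction: predict every sample of the current block as
    the mean of the already-encoded neighbouring samples (simplified)."""
    neighbours = list(samples_above) + list(samples_left)
    return sum(neighbours) // len(neighbours)

def prediction_residual(block, pred):
    """The prediction residue that is actually encoded: original minus
    prediction, sample by sample."""
    return [[s - pred for s in row] for row in block]
```

For a block similar to its neighbors the residual values are small and cluster around zero, which is precisely why encoding the residue instead of the original samples reduces the amount of information.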
Furthermore, in order to exploit spatial correlation and the characteristics of the human visual system, the prediction error, called a prediction residual signal, is transformed into data in the frequency domain using the DCT or the like, so that the energy of the residual signal is concentrated in a low-frequency region and efficient encoding is performed. For more details, refer to MPEG-2 and H.264/MPEG-4 AVC (Non-Patent Document 5), which are international standards for video coding.
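The energy-concentration property can be demonstrated with a naive orthonormal 2-D DCT-II, written here in plain Python for illustration (actual codecs use fast, integer-approximated transforms; the function name is hypothetical).

```python
import math

def dct2(block):
    """Naive orthonormal 2-D DCT-II of an N x N block (O(N^4); for
    illustration only -- real codecs use fast integer transforms)."""
    n = len(block)

    def c(k):  # orthonormal scale factors
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = c(u) * c(v) * s
    return out
```

Applying `dct2` to a smooth (spatially correlated) residual block places most of the signal energy in the top-left, low-frequency coefficients, so the many near-zero high-frequency coefficients can be quantized and entropy-coded cheaply.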