A distance image is an image in which the distance from a camera to an object (or subject) is represented by a pixel value. Since the distance from a camera to an object can be defined as the depth of a scene, the distance image is often called a “depth image”. In addition, it is sometimes called a “depth map”. In the technical field of computer graphics, since the depth is information stored in a Z buffer (i.e., a memory region for storing depth values of the entire image), the distance image is often called a “Z image” or a “Z map”. Additionally, instead of the distance from a camera to an object, coordinate values for the Z axis in a three-dimensional coordinate system in space may be used to represent a distance (or depth).
Generally, in an obtained image, the X and Y axes are respectively defined as the horizontal and vertical directions, and the Z axis is defined in the direction of the relevant camera. However, when, for example, a common coordinate system is used between a plurality of cameras, the Z axis may not be defined in the direction of a camera.
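For the common case in which the Z axis points along the camera's viewing direction, the difference between a camera-to-object distance and a Z coordinate can be illustrated with a minimal sketch. It assumes an ideal pinhole camera; the function name and the parameters focal_px (focal length in pixels) and cx, cy (principal point) are illustrative assumptions and do not come from the text above.

```python
import math

def range_to_z(distance, u, v, focal_px, cx, cy):
    """Convert a camera-to-object distance at pixel (u, v) into a Z coordinate.

    Assumes a pinhole camera whose Z axis is the optical axis (hypothetical model).
    """
    # Direction of the viewing ray through the pixel, scaled so its Z component is 1.
    rx = (u - cx) / focal_px
    ry = (v - cy) / focal_px
    # The Z coordinate is the distance divided by the length of that ray direction.
    return distance / math.sqrt(rx * rx + ry * ry + 1.0)
```

At the principal point the two quantities coincide; away from it, the Z coordinate is smaller than the distance along the viewing ray.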
Below, distance, depth, and Z values (depth information) are not distinguished from each other, and are commonly called “distance information”. Additionally, an image in which distance information is represented by pixel values is called a “distance image”.
There are three methods of representing distance information by using pixel values: (i) a method in which values corresponding to physical quantities are used directly as pixel values, (ii) a method that quantizes the interval between the minimum and maximum values into discrete values, and (iii) a method that quantizes the difference from the minimum value by using a specific step width. When the range to be represented is sufficiently limited, distance information can be represented with high accuracy by using additional information such as the minimum value.
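The three methods above can be sketched as follows. This is a minimal illustration only: the function names, the choice of millimetres for method (i), and the default of 256 levels are assumptions made for the example.

```python
def direct_value(distance_m):
    """Method (i): store the physical quantity itself (here, assumed millimetres)."""
    return round(distance_m * 1000)

def quantize_range(d, d_min, d_max, levels=256):
    """Method (ii): map the interval [d_min, d_max] onto `levels` integer codes."""
    d = min(max(d, d_min), d_max)  # clamp to the representable range
    return round((d - d_min) / (d_max - d_min) * (levels - 1))

def dequantize_range(q, d_min, d_max, levels=256):
    """Inverse of method (ii); needs d_min and d_max as additional information."""
    return d_min + q * (d_max - d_min) / (levels - 1)

def quantize_step(d, d_min, step):
    """Method (iii): code the offset from d_min in units of a fixed step width."""
    return round((d - d_min) / step)

def dequantize_step(q, d_min, step):
    """Inverse of method (iii); needs d_min and step as additional information."""
    return d_min + q * step
```

In methods (ii) and (iii) the decoder can recover a distance only if it also receives the minimum value (and the maximum value or the step width), which is the additional information mentioned above.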
In addition, when quantization is performed at regular intervals, there are two methods: a first method of directly quantizing the physical values, and a second method of quantizing the reciprocals of the physical values. Generally, the reciprocal of the distance is proportional to disparity. Therefore, the former method is often used when distance information must be represented with high accuracy, and the latter method is often used when disparity information must be represented with high accuracy.
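The effect of quantizing reciprocals can be sketched as follows. The example assumes two parallel cameras, for which disparity equals focal length times baseline divided by distance; the names z_near, z_far, focal_px, and baseline are illustrative assumptions.

```python
def quantize_inverse(z, z_near, z_far, levels=256):
    """Uniformly quantize 1/z: code 0 is z_far, code levels-1 is z_near."""
    inv, inv_near, inv_far = 1.0 / z, 1.0 / z_near, 1.0 / z_far
    return round((inv - inv_far) / (inv_near - inv_far) * (levels - 1))

def dequantize_inverse(q, z_near, z_far, levels=256):
    """Recover the distance represented by code q."""
    inv_near, inv_far = 1.0 / z_near, 1.0 / z_far
    return 1.0 / (inv_far + q * (inv_near - inv_far) / (levels - 1))

def disparity(z, focal_px, baseline):
    """Disparity for parallel cameras (assumed model): proportional to 1/z."""
    return focal_px * baseline / z

# Distance covered by one code step far from and near to the camera.
far_step = dequantize_inverse(0, 1.0, 100.0) - dequantize_inverse(1, 1.0, 100.0)
near_step = dequantize_inverse(254, 1.0, 100.0) - dequantize_inverse(255, 1.0, 100.0)
```

With these assumed values, one code step spans a much larger distance far from the camera than near it, while the corresponding disparity step stays constant. This is why reciprocal quantization suits disparity and direct quantization suits distance.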
Below, regardless of the method of representing distance information using pixel values or the quantization method, any image that represents distance information is called a “distance image”.
The distance image may be applied to 3D images. In a generally known 3D image representation, a stereoscopic image consists of a right-eye image and a left-eye image for an observer. A 3D image may also be represented using an image obtained by a certain camera and a distance image therefor (refer to Non-Patent Document 1 for a detailed explanation thereof).
In order to encode a 3D image represented using a video image at a specific viewpoint and a distance image, the method defined by MPEG-C Part 3 (ISO/IEC 23002-3) can be used (refer to Non-Patent Document 2 for a detailed explanation thereof).
In addition, when such a video and a distance image are obtained for a plurality of viewpoints, a 3D image having a disparity larger than that obtained by a single viewpoint can be represented (refer to Non-Patent Document 3 for a detailed explanation thereof).
In addition to representing the above-described 3D images, the distance image is also used as one of the data items for generating a free-viewpoint image, in which the observer's viewpoint can be shifted freely regardless of the camera arrangement. A synthetic image that shows a scene as it would be observed from a camera other than those actually used for imaging may be called a “virtual viewpoint image”. Methods for generating virtual viewpoint images have been actively studied in the technical field of image-based rendering. Non-Patent Document 4 discloses a representative method for generating a virtual viewpoint image based on a multi-viewpoint video and a distance image.
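As a rough illustration of how a distance image can drive virtual viewpoint generation, the following sketch warps one scanline to a camera shifted sideways. It is not the method of Non-Patent Document 4; it is a simple forward warp under assumed conditions (parallel cameras, a virtual camera displaced by baseline along the X axis), and all names are hypothetical.

```python
def warp_scanline(colors, depths, focal_px, baseline):
    """Forward-warp one scanline to a virtual camera shifted along X.

    Returns the warped scanline; None marks holes where no source pixel landed.
    """
    width = len(colors)
    out = [None] * width
    zbuf = [float("inf")] * width  # nearest depth seen so far at each output pixel
    for x, (color, z) in enumerate(zip(colors, depths)):
        # Nearer pixels move further: the shift is proportional to 1/z.
        xv = x - round(focal_px * baseline / z)
        # Keep the pixel only if it is nearer than what is already stored there.
        if 0 <= xv < width and z < zbuf[xv]:
            out[xv] = color
            zbuf[xv] = z
    return out
```

For example, warp_scanline(list("abcdef"), [4, 4, 4, 2, 2, 4], 4.0, 1.0) moves the two near pixels by two positions and the far pixels by one, so a near pixel overwrites a far one and positions that no source pixel reaches remain None.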
Since a distance image is formed using a single component, it can be regarded as a gray-scale image. Additionally, objects exist continuously in real space and cannot move instantaneously to a distant position. Therefore, similar to image signals, the distance image has spatial and temporal correlation. Accordingly, it is possible to efficiently encode a distance image or a distance video by using an image or video encoding method used for encoding an ordinary image or video signal, which removes spatial or temporal redundancy. In fact, MPEG-C Part 3 assumes that distance video encoding is performed by an existing video encoding method.
Below, a known method of encoding an ordinary video signal will be explained.
Since each object generally has spatial and temporal continuity in real space, the appearance of the object has high spatial and temporal correlation. Video signal encoding achieves high efficiency by utilizing such correlation.
More specifically, the video signal of an encoding target block is predicted based on a previously-encoded video signal, and only the residual signal thereof is encoded, thereby reducing the amount of information which should be encoded and implementing a high degree of encoding efficiency.
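The prediction and residual idea can be sketched in a few lines. This is a minimal, lossless illustration with hypothetical function names and sample values; a real encoder would additionally transform, quantize, and entropy-code the residual.

```python
def encode_residual(target, prediction):
    """Encoder side: keep only the difference between the block and its prediction."""
    return [t - p for t, p in zip(target, prediction)]

def decode_block(residual, prediction):
    """Decoder side: form the same prediction and add the residual back."""
    return [r + p for r, p in zip(residual, prediction)]

target = [100, 101, 103, 104]      # samples of the block being encoded
prediction = [100, 100, 102, 104]  # built from previously-encoded samples
residual = encode_residual(target, prediction)
```

When the prediction is good, the residual values are small and cost far fewer bits than the original samples, and the decoder can still rebuild the block exactly because it forms the same prediction.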
Representative methods of predicting a video signal are (i) intra frame prediction, which spatially generates a predicted signal based on neighboring blocks, and (ii) motion compensation prediction, which estimates the movement of an object from previously-encoded frames obtained at different times so as to temporally generate a predicted signal.
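Both kinds of prediction can be sketched on small frames stored as lists of rows. The sketch below is illustrative only and does not reproduce any standard: the spatial predictor is a simple mean of neighboring pixels, the motion search is an exhaustive search using the sum of absolute differences, and the fallback value 128 and all function names are assumptions.

```python
def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame."""
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def intra_dc_predict(frame, top, left, size):
    """Spatial prediction: fill the block with the mean of the pixels
    directly above and directly to the left of it."""
    neighbours = []
    if top > 0:
        neighbours += frame[top - 1][left:left + size]
    if left > 0:
        neighbours += [frame[r][left - 1] for r in range(top, top + size)]
    dc = round(sum(neighbours) / len(neighbours)) if neighbours else 128
    return [[dc] * size for _ in range(size)]

def motion_search(ref, cur, top, left, size, radius):
    """Temporal prediction: find the displacement (dy, dx) into the reference
    frame that best matches the current block. Returns (cost, dy, dx)."""
    target = get_block(cur, top, left, size)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidates that would fall outside the reference frame.
            if 0 <= y <= len(ref) - size and 0 <= x <= len(ref[0]) - size:
                cost = sad(target, get_block(ref, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best
```

The block found by motion_search, or the block returned by intra_dc_predict, serves as the predicted signal, and only its difference from the actual block needs to be encoded.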
In addition, in order to utilize spatial correlation and the characteristics of the human visual system, the prediction error, called a prediction residual signal, is transformed into frequency-domain data by using the DCT or the like. The energy of the residual signal is thereby concentrated in the low-frequency region, which enables efficient encoding.
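The energy concentration can be illustrated with a one-dimensional orthonormal DCT-II written directly from its definition. This is a minimal sketch with hypothetical sample values; practical encoders apply fast two-dimensional transforms to blocks.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

signal = [4, 4, 5, 5, 6, 6, 7, 7]  # smooth, spatially correlated samples
coeffs = dct_1d(signal)
low_energy = coeffs[0] ** 2 + coeffs[1] ** 2
total_energy = sum(c * c for c in coeffs)
```

For a smooth signal such as this one, nearly all of the energy lands in the first few coefficients, so the remaining high-frequency coefficients can be coarsely quantized or dropped at little cost.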
Detailed explanations of each method can be found in international standards for video encoding, such as MPEG-2 or H.264/MPEG-4 AVC (see Non-Patent Document 5).