1. The Field of the Invention
The present invention relates to a method and a device for generating, storing, transmitting, receiving and reproducing depth maps by using the color components of an image belonging to a three-dimensional video stream.
2. The Relevant Technology
The development of stereoscopic video applications largely depends on the availability of efficient formats for representing and compressing the three-dimensional video signal. Moreover, in television broadcast applications (3D-TV) it is necessary to maintain the highest possible degree of backward compatibility with existing 2D systems.
For distribution (or transmission), the currently most widespread technical solutions are based on the so-called “frame compatible arrangement”, wherein the two stereoscopic views, relating to the same time instant, are re-scaled and composed to form a single image compatible with the existing formats. Among these solutions, the top-and-bottom, side-by-side and tile formats are known. These solutions allow using the entire existing video signal distribution infrastructure (terrestrial, satellite or cable broadcasting, or streaming over IP network), and do not require new standards for compression of the video stream. In addition, the current AVC/H.264 coding standard (Advanced Video Coding) and the future HEVC standard (High Efficiency Video Coding) already include the possibility of signalling this type of organization to allow for proper reconstruction and visualization by the receiver.
For display, the two currently most widespread technical solutions are based either on the “frame alternate” principle (i.e., the two views are presented in time succession on the screen) or on the “line alternate” principle, i.e., the two views are arranged on the screen with alternate rows (i.e., they are “interlaced”). In both cases, for each eye to receive the corresponding view, the spectator needs to use a pair of glasses, which may be either “active” ones, i.e., shutter glasses, in the frame alternate case, or “passive” ones, i.e., with differently polarized lenses, in the line alternate case.
The future of three-dimensional visualization will be determined by the diffusion of new auto-stereoscopic screens that do not require the user to wear any glasses, whether passive or active. These 3D display devices, which are currently still at the prototype stage, are based on the use of lenticular lenses or parallax barriers, which allow the viewer to perceive a different pair of stereoscopic views for each viewpoint from which the screen is observed while moving angularly around it. These devices can therefore improve the 3D viewing experience, but they require the generation of a large number of views (several tens of them).
As regards 3D video representation, managing the production and distribution of a large number of views is a very exacting task. In recent years, the scientific community has evaluated the possibility of creating an arbitrarily large number of intermediate views by using known Depth Image Based Rendering (DIBR) techniques, which exploit the so-called scene depth map. The corresponding formats are known as “Video+Depth” (V+D), wherein each view is accompanied by a dense depth map. A dense depth map is an image in which each pixel in planar coordinates (x,y), i.e., column and row, represents the depth value (z) of the pixel of the respective view having the same coordinates. The values of the depth maps can be calculated starting from the two views obtained by a stereoscopic video camera, or else they can be measured by suitable sensors. Such values are generally represented by using images with 256 grayscale levels, which are compressed by using standard techniques. Depth Image Based Rendering techniques exploit the fact that, given the coordinates (x,y,z), i.e., the position in the image plane plus the depth associated with each pixel, it is possible to re-project the pixel onto another image plane relating to a new viewpoint. The most widespread application context is that of a stereoscopic video camera system, wherein the two video cameras are positioned horizontally at a distance b between their optical centres, with parallel optical axes and co-planar image planes. In such a configuration, there is a simple relation between the depth z associated with a pixel and the so-called disparity d, i.e., the horizontal translation that must be applied to a pixel of the image of the right (or left) video camera in order to obtain the position of the corresponding pixel in the image plane of the left (or right) video camera. Disparity may be either positive or negative (translation to the left or to the right), depending on the video camera taken into account.
With f indicating the focal length of the two video cameras, the following relation exists between depth z and disparity d: d = f·b/z.
For further details, see: Paradiso, V.; Lucenteforte, M.; Grangetto, M., “A novel interpolation method for 3D view synthesis,” 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), pp. 1-4, 15-17 Oct. 2012.
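By way of illustration only, the depth-to-disparity relation d = f·b/z and the resulting horizontal re-projection can be sketched as follows for a rectified stereo pair; the focal length and baseline values are hypothetical examples, not values taken from the present description:

```python
# Illustrative sketch of the relation d = f*b/z for a rectified stereo
# pair with parallel optical axes and co-planar image planes.
# f (focal length, in pixels) and b (baseline, in metres) are assumed
# example values, not part of the described system.

def depth_to_disparity(z, f=1000.0, b=0.1):
    """Convert depth z (metres) to horizontal disparity d = f*b/z (pixels)."""
    return f * b / z

def reproject_x(x, z, f=1000.0, b=0.1):
    """Shift a pixel's column x by its disparity to obtain the column of
    the corresponding pixel in the other view's image plane."""
    return x - depth_to_disparity(z, f, b)

# An object 2 m away yields a disparity of f*b/z = 1000*0.1/2 = 50 pixels.
d = depth_to_disparity(2.0)
```

Note that halving the depth doubles the disparity: nearby objects shift more between the two views, which is what makes dense depth maps sufficient for synthesizing intermediate viewpoints.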
Since, according to the above-described hypotheses, disparity is a simple function of depth, the depth map and the disparity map carry the same information and are therefore interchangeable. In addition, it must be pointed out that the images referred to as depth maps within the MPEG context represent the values of 1/z, rather than z, mapped onto the 0-255 interval. In the following, the term “depth map” will be used to denote any representation of depth or disparity.
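A minimal sketch of the inverse-depth convention just mentioned, in which 1/z is mapped linearly onto the 0-255 interval; the near and far clipping depths z_near and z_far are hypothetical parameters introduced here purely for illustration:

```python
# Hedged sketch of an MPEG-style 8-bit depth map value: 1/z is mapped
# linearly onto [0, 255] between assumed clipping depths z_near and
# z_far (hypothetical example values, not from the present description).
# 255 corresponds to the nearest depth, 0 to the farthest.

def quantize_inverse_depth(z, z_near=1.0, z_far=100.0):
    """Map depth z to an integer grayscale level in 0..255 via 1/z."""
    v = (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return round(255 * v)

def dequantize_inverse_depth(v, z_near=1.0, z_far=100.0):
    """Recover (approximate) depth z from the 8-bit map value v."""
    inv = v / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv
```

Storing 1/z instead of z allocates more of the 256 grayscale levels to nearby depths, where disparity (and hence view-synthesis accuracy) is most sensitive to depth errors.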
It should be noted that the video signal made up of a pair of (left and right) images and the respective depth maps has also been chosen as a use case by the MPEG standardization committee for evaluating the techniques that will be introduced in the future 3D coding standards.
This leads to the need for efficiently managing the storage, transmission, reception and reproduction of television signals comprising depth maps.