The area of 3D video (3DTV) is gaining momentum and is touted as the next logical step in consumer electronics, mobile devices, computers and the movies. The additional dimension on top of 2D video offers multiple different directions for displaying the content and improves the potential for interaction between viewers and the content.
The content can be viewed using glasses, e.g. anaglyphic, polarized and shutter glasses, or without glasses, e.g. by using auto-stereoscopic displays. In the case of a 2-view auto-stereoscopic display, two slightly different images are shown to the user using a display with a specific optical system such as lenticular lenses or a parallax barrier. The viewer needs to position herself in a specific location in front of the device so that different images arrive at her left and right eye respectively, as an “angular cone”. An extension of the 2-view auto-stereoscopic display is the n-view auto-stereoscopic display, where multiple viewers can experience the stereo effect without glasses. The content may also be viewed by using a face tracking device or some other means for selecting the proper set of views to display.
Stereoscopic displays with two views, e.g. displays with 3D glasses, typically display two views such that the two views that are being observed by the user correspond to a stereo video pair as captured by a stereo camera with a stereo baseline, i.e. distance between cameras, of 6-7 cm, which corresponds to a typical human eye distance.
Auto-stereoscopic multiview displays present a comparatively large number of views from slightly different viewing positions at the same time. Thus, when a user looks at the auto-stereoscopic multiview display, he/she will see two different views from the range that is being displayed. The view pairs that the viewer gets to see should be such that they provide a good stereoscopic viewing perception. Typically, a good stereoscopic viewing perception is provided if the two views that are being observed by the user correspond to a stereo video pair as captured by a stereo camera with a stereo baseline of 6-7 cm. Typically, auto-stereoscopic multiview displays display a total range of several, e.g. 4, stereo camera baselines, while at the same time presenting a single stereo baseline when a user looks at the display. Hence the user can move within a defined viewing area without losing the stereoscopic perception.
As becomes apparent from the description of 2-view displays and auto-stereoscopic multiview displays, the latter require displaying a larger range of views, e.g. 4 stereoscopic baselines, than 2-view displays, which display 1 stereoscopic baseline.
The benefits of 3D video come with extra costs for content production, distribution and management. Firstly, the producer needs to record from additional sources, which increases the amount of information for compression, transport (wired or wireless) and storage, e.g. file servers, disks, etc. Additionally, there are physical limitations on how many video sources (views) can be captured. Usually, the number of cameras is 2 or 3, although there are cases where bigger camera rigs, with up to 80 cameras, have been built. Given the predominance of 2-view stereoscopic displays in 3D cinemas and 3DTVs, almost all 3D content is captured such that it suits 2-view stereoscopic displays, i.e. using 1 stereoscopic baseline during capture.
Moreover, there are two forms of interaction: 1) pre-defined number of existing views, i.e. a finite number, or 2) an arbitrary view, i.e. an infinite number. Case 1 exhibits a jitter effect when we move from one viewing angle to another. This is alleviated in case 2, thanks to synthesis with interpolation or extrapolation of available views.
Among the view synthesis techniques, depth image based rendering (DIBR) has a prominent position. DIBR typically uses two views and their corresponding depth maps. A depth map contains information regarding the distance of objects from the camera and allows for realistic view warping from an existing position into a new one.
Depth maps may be acquired using infra-red depth cameras, computed for computer generated content or derived from one or more texture images (henceforth referred to as textures) using various techniques.
Any system with view synthesis capabilities that relies on DIBR requires n input views (textures) and m depth maps. Usually n=m≥2. Due to that constraint, it is evident that the bit-rate for 3DTV is higher than for 2D TV. To quantify the added cost, we need to take into consideration the resolution of the depth maps (usually similar to the resolution of the textures) and their spatial and temporal characteristics. The theoretical bit-rate boundaries for 3DV with n=m=2 lie somewhere between 1× and 4× the 2D bit-rate. However, due to the nature of both textures and depth maps, the final rate is somewhere between 1.4× and 2.5× the 2D bit-rate.
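The bit-rate figures above can be turned into a small worked calculation. The sketch below is illustrative only: it assumes that the 4× upper bound corresponds to coding each of the n textures and m depth maps independently at the full 2D rate (an interpretation, not stated in the text), and the 8 Mbit/s 2D bit-rate is a hypothetical input. The function names are likewise made up for this example.

```python
def theoretical_upper_bound(n_textures, n_depth_maps):
    """Upper bound as a multiple of the 2D bit-rate.

    Assumption: every texture and every depth map is coded
    independently and costs as much as one 2D video stream.
    """
    return n_textures + n_depth_maps


def expected_rate_range(bitrate_2d, low_factor=1.4, high_factor=2.5):
    """Expected 3DV bit-rate range using the 1.4x-2.5x factors quoted in the text."""
    return bitrate_2d * low_factor, bitrate_2d * high_factor


if __name__ == "__main__":
    # n = m = 2, as in the text
    print("upper bound factor:", theoretical_upper_bound(2, 2))
    # Hypothetical 2D bit-rate of 8 Mbit/s
    low, high = expected_rate_range(8.0)
    print("expected 3DV range in Mbit/s:", low, "to", high)
```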
In FIG. 7, two input views, 701 and 702, are used to synthesize a new one, a virtual view or synthesized view 703. If the synthesized view 703 resulted only from warping the left view 701, then the two gray areas, 704a and 704b, next to the objects in 703 would be domains where information is lacking, so-called disocclusions, i.e. areas which are hidden in the left view, but which are revealed in the synthesized view. In this case, the right view 702 may be used to fill in the missing details. Otherwise, missing details, i.e. the details that should appear or be disoccluded in a synthesized view, need to be estimated, which can be difficult, e.g. when no information about the missing details is available. This may lead to visual artifacts in the synthesized view.
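The warping and hole-filling idea described above can be sketched on a single image row. This is a minimal illustration, not the method of any particular codec or renderer: it assumes rectified, parallel cameras so that warping reduces to a horizontal shift, it uses per-pixel disparities (in pixels between the left and right view) instead of raw depth values, and all names and sample values are hypothetical.

```python
def warp_scanline(texture, disparity, shift_scale):
    """Forward-warp one image row by shift_scale * disparity.

    Returns a row of the same width; positions where no source pixel
    lands stay None and correspond to disocclusions (holes).
    """
    width = len(texture)
    out = [None] * width
    out_disp = [-1.0] * width  # disparity of the pixel currently stored
    for x in range(width):
        target = x + int(round(shift_scale * disparity[x]))
        # Larger disparity means closer to the camera, so it wins.
        if 0 <= target < width and disparity[x] > out_disp[target]:
            out[target] = texture[x]
            out_disp[target] = disparity[x]
    return out


def fill_holes(primary, secondary):
    """Fill holes (None) in the primary warp from the secondary warp."""
    return [p if p is not None else s for p, s in zip(primary, secondary)]


if __name__ == "__main__":
    # Background pixels "bN" have disparity 0; foreground "F" has disparity 4.
    left_tex = ["b0", "b1", "b2", "b3", "b4", "F", "F", "b7", "b8", "b9"]
    left_disp = [0, 0, 0, 0, 0, 4, 4, 0, 0, 0]
    right_tex = ["b0", "F", "F", "b3", "b4", "b5", "b6", "b7", "b8", "b9"]
    right_disp = [0, 4, 4, 0, 0, 0, 0, 0, 0, 0]

    # Virtual view halfway between the two cameras.
    from_left = warp_scanline(left_tex, left_disp, -0.5)
    from_right = warp_scanline(right_tex, right_disp, +0.5)
    print(from_left)   # holes at indices 5 and 6 (hidden behind "F" in the left view)
    print(fill_holes(from_left, from_right))
```

Warping only the left row leaves two holes exactly where the foreground object used to cover the background; the right row supplies those pixels. For an extrapolated view there would be no second view on the far side, and the holes would have to be estimated instead.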
As becomes apparent from the description above, synthesis of an intermediate view between two views (interpolation) is easier than synthesis of a view to the left or right of the leftmost or rightmost available view (extrapolation). Extrapolation becomes more difficult the larger the distance of the extrapolated view from the closest reference view, i.e. existing view or input view, used for synthesis.
The above indicates that typical 3D content, produced for 2-view displays with a single stereoscopic baseline, has to be extrapolated in order to be displayed on an auto-stereoscopic multiview display, which requires e.g. 4 stereoscopic baselines. This extrapolation can lead to visual artifacts.
For various reasons, the number of input views available for 3DTV needs to be limited. Moreover, in order to achieve the compression ratio mentioned earlier, temporal and spatial redundancies between the textures and depths respectively need to be removed. This can be achieved in various ways. Multiview video coding (MVC), for example, is capable of reducing spatio-temporal redundancies and also inter-view redundancies. However, some of the redundancies are difficult to eliminate. For example, in the example in FIG. 1, the only part strictly necessary from the right view is the so-called “disocclusion area”, i.e. the area which is hidden in the left view, but is revealed/visible in the synthesized view.
MVC and texture+depth formats such as multiview plus depth (MVD) do not address the issue of disocclusions directly. These systems are designed with data compression of multiple views in mind. They are not designed to directly reduce redundancies by detecting disocclusions. In both MVC and MVD the resulting disocclusions are treated as holes that need to be filled from respective other views.
MPEG is working on standardizing a 3D video codec (MPEG 3DV) capable of compressing 3D video in the MVD format. The work is divided into several branches, each branch handling the legacy/backwards compatibility with existing/approaching codecs. These branches are 3DV-AVC, 3DV-MVC and 3DV-HEVC. HEVC is the next generation 2D video codec, expected to take market share in upcoming high quality video services, including broadcast Ultra HDTV with 4K resolution.
In order to facilitate DIBR view synthesis, a number of parameters need to be signalled to the device or program module that performs the view synthesis. First among those parameters are z near and z far, which represent the closest and the farthest depth values in the depth maps for the frame under consideration. These values are needed in order to map the quantized depth map samples to the real depth values that they represent, using one of the two formulas below. The upper formula is used if all the depth values measured from the origin of the space are positive or all are negative. Otherwise, the lower formula is used.
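\[
Z = \frac{1.0}{\dfrac{v}{255.0} \cdot \left( \dfrac{1.0}{Z_{near}} - \dfrac{1.0}{Z_{far}} \right) + \dfrac{1.0}{Z_{far}}}
\]

\[
Z = T_z + \frac{1.0}{\dfrac{v}{255.0} \cdot \left( \dfrac{1.0}{Z_{near}} - \dfrac{1.0}{Z_{far}} \right) + \dfrac{1.0}{Z_{far}}}
\]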
These formulas are used for translating a quantized depth value into a real depth value. The variable v represents the luminance value of each pixel in a grey-scale depth image (for an 8-bit depth map, between 0 and 255). Tz represents the z component of a translation vector.
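The mapping from quantized depth samples to real depth values can be sketched in a few lines. This is a minimal illustration of the two formulas for an 8-bit depth map; the function name, the parameter names and the sample values of z near and z far are assumptions made for this example.

```python
def depth_from_sample(v, z_near, z_far, tz=None, max_level=255.0):
    """Convert a quantized depth sample v into a real depth value Z.

    v          quantized sample (0..255 for an 8-bit depth map)
    z_near     closest depth value in the frame
    z_far      farthest depth value in the frame
    tz         optional z component of the translation vector; when given,
               the second formula (with the Tz offset) is applied
    """
    if not 0 <= v <= max_level:
        raise ValueError("depth sample outside the quantization range")
    inverse_z = (v / max_level) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    z = 1.0 / inverse_z
    return z if tz is None else tz + z


if __name__ == "__main__":
    # Hypothetical scene with depths between 2 and 10 units.
    for sample in (0, 128, 255):
        print(sample, depth_from_sample(sample, z_near=2.0, z_far=10.0))
```

Note that the sample value 255 maps to z near and the value 0 maps to z far, and that the quantization is uniform in 1/Z rather than in Z, so nearby objects are represented with finer depth steps than distant ones.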
Another set of parameters needed for the view synthesis is the camera parameters. Camera parameters for 3D video are usually split into two parts. The first part, called the intrinsic (internal) camera parameters, represents the optical characteristics of the camera for the image taken, such as the focal length, the coordinates of the image's principal point and the radial distortion. The second part, the extrinsic (external) camera parameters, represents the camera position and the direction of its optical axis in the chosen real world coordinates (the important aspect here is the position of the cameras relative to each other and to the objects in the scene). Both internal and external camera parameters are required in a view synthesis process that is based on depth information (such as DIBR).
An alternative solution to sending the key cameras is layered depth video (LDV), which uses multiple layers for scene representation. These layers can be: foreground texture, foreground depth, background texture and background depth.
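The role of the two parameter groups can be illustrated with a simple pinhole projection. The sketch below makes several assumptions that are not taken from the text: radial distortion is ignored, the extrinsic parameters are expressed as a rotation matrix R and translation vector t with the convention X_cam = R · X_world + t, and the focal length, principal point and 6.5 cm baseline are hypothetical values.

```python
def project_point(point_world, focal, principal_point, rotation, translation):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    Extrinsic part: X_cam = R * X_world + t   (assumed convention)
    Intrinsic part: u = fx * x / z + cx,  v = fy * y / z + cy
    """
    xc, yc, zc = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point is not in front of the camera")
    fx, fy = focal
    cx, cy = principal_point
    return fx * xc / zc + cx, fy * yc / zc + cy


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    focal = (1000.0, 1000.0)          # hypothetical focal length in pixels
    principal = (960.0, 540.0)        # hypothetical principal point
    point = (0.0, 0.0, 2.0)           # 2 units in front of the left camera

    # Left camera at the world origin; right camera 0.065 units to its right,
    # which gives t = -R * C = (-0.065, 0, 0) under the convention above.
    left = project_point(point, focal, principal, identity, (0.0, 0.0, 0.0))
    right = project_point(point, focal, principal, identity, (-0.065, 0.0, 0.0))
    print("left pixel:", left)
    print("right pixel:", right)
    print("disparity in pixels:", left[0] - right[0])
```

The horizontal difference between the two projections equals focal length × baseline / depth, which is why both the camera parameters and the real depth values are needed before a pixel can be warped to a new view.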
There exist standardized ways of sending the camera parameters to the decoder. One of them is defined in the multi-view video coding (MVC) standard, which is specified in Annex H of the well-known advanced video coding (AVC) standard, also known as H.264. The scope of MVC covers joint coding of stereo or multiple views representing the scene from several viewpoints. The standard exploits correlation between these views of the same scene in order to achieve better compression efficiency compared to compressing the views independently. The MVC standard also covers sending the camera parameter information to the decoder. The camera parameters are sent in a supplemental enhancement information (SEI) message.
Camera parameters are typically sent in floating point representation. The floating point representation supports a higher dynamic range of the parameters and facilitates sending the camera parameters with higher precision. The higher precision of the camera parameters has been shown to be important for the view synthesis.
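The trade-off between dynamic range and precision can be illustrated with a generic sign/exponent/mantissa encoding. The sketch below is not the SEI syntax of any standard; it is a simplified, hypothetical scheme built on Python's `math.frexp` and `math.ldexp`, intended only to show how the number of mantissa bits affects the precision of a transmitted parameter.

```python
import math


def encode_float(value, mantissa_bits):
    """Split a value into (sign, exponent, mantissa) with a given mantissa width."""
    sign = 1 if value < 0 else 0
    if value == 0:
        return sign, 0, 0
    # abs(value) == fraction * 2**exponent with 0.5 <= fraction < 1
    fraction, exponent = math.frexp(abs(value))
    mantissa = round(fraction * (1 << mantissa_bits))
    return sign, exponent, mantissa


def decode_float(sign, exponent, mantissa, mantissa_bits):
    """Rebuild the value from its (sign, exponent, mantissa) triple."""
    magnitude = math.ldexp(mantissa / (1 << mantissa_bits), exponent)
    return -magnitude if sign else magnitude


if __name__ == "__main__":
    focal_length = 1234.5678  # hypothetical camera parameter
    for bits in (8, 16, 23):
        triple = encode_float(focal_length, bits)
        restored = decode_float(*triple, bits)
        print(bits, "mantissa bits -> error", abs(restored - focal_length))
```

The exponent lets the same format carry both very small values (such as a baseline of a few centimetres) and large ones (such as a focal length in pixels), while more mantissa bits reduce the reconstruction error that would otherwise propagate into the synthesized views.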