The area of three-dimensional (3D) video, also referred to as 3DTV, is gaining momentum and is touted as the next logical step in consumer electronics, mobile devices, computers and the movies. The additional dimension on top of 2D video offers multiple different directions for displaying the content and improves the potential for interaction between viewers and the content.
The content may be viewed using glasses (anaglyphic, polarized and shutter) or without glasses using auto-stereoscopic displays. In the case of a 2-view auto-stereoscopic display, two slightly different images are shown to the user using a display with a specific optical system such as lenticular lenses or a parallax barrier. The viewer needs to position herself in a specific location in front of the device (the angular cone) so that different images arrive at her left and right eyes respectively. An extension of the 2-view auto-stereoscopic display is the n-view auto-stereoscopic display, where multiple viewers may experience the stereo effect without glasses.
The benefits of 3D video come with extra costs for content production, distribution and management. Firstly, the producer needs to record from additional sources, which increases the amount of information to be compressed, transported (wired or wireless) and stored (file servers, disks, etc.). Additionally, there are physical limitations on how many video sources (views) may be captured. Usually the number of cameras, or sets of cameras, is 2 or 3, although there are cases where bigger camera rigs have been built (e.g. up to 80 cameras).
Moreover, there are two forms of interaction: Case 1) a pre-defined, finite number of existing views, or Case 2) an arbitrary view (an infinite number of views). Case 1 exhibits a jitter effect when the viewer moves from one viewing angle to another. This is alleviated in Case 2 thanks to view synthesis by interpolation or extrapolation of the available views.
Among the view synthesis techniques, Depth Image Based Rendering (DIBR) has a prominent position. DIBR typically uses two views and their corresponding depth maps. A depth map contains information regarding the distance of objects from the camera and allows for realistic view warping from an existing position into a new one.
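As an illustration of the warping step, the following is a minimal sketch and not the method of any particular system. It assumes rectified, horizontally aligned cameras, a depth map given as metric distance per pixel, and the usual pinhole relation disparity = focal length x baseline / depth. The names warp_row, focal_px and baseline are hypothetical.

```python
def warp_row(texture, depth, focal_px, baseline, hole=None):
    """Forward-warp one image row to a virtual camera shifted right by `baseline`.

    texture  : list of pixel values for the row
    depth    : list of distances (same length), larger means farther away
    focal_px : focal length in pixels (assumed known)
    baseline : horizontal camera shift, in the same unit as depth
    """
    width = len(texture)
    out = [hole] * width                  # pixels nobody maps to stay as holes
    out_depth = [float("inf")] * width    # z-buffer: nearest surface wins
    for x in range(width):
        z = depth[x]
        disparity = int(round(focal_px * baseline / z))
        xv = x - disparity                # near objects shift more than far ones
        if 0 <= xv < width and z < out_depth[xv]:
            out[xv] = texture[x]
            out_depth[xv] = z
    return out


row = ["b", "b", "F", "F", "b", "b", "b", "b"]   # F = foreground, b = background
dist = [10, 10, 2, 2, 10, 10, 10, 10]
print(warp_row(row, dist, focal_px=10, baseline=0.4))
# ['F', 'F', None, None, 'b', 'b', 'b', 'b']
```

The two None entries are positions that were hidden behind the foreground object in the source view and become visible from the new position, so the source view holds no texture for them.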
Any system with view synthesis capabilities that relies on DIBR requires n views (textures) and m depth maps. Usually n = m ≥ 2. Because of this requirement, the bit-rate for 3DTV is evidently higher than for 2D TV. To quantify the added cost, we need to take into consideration the resolution of the depth maps (usually similar to the resolution of the textures) and their spatial and temporal characteristics.
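To give a rough feeling for the added cost, the sketch below compares uncompressed data rates. All figures are illustrative assumptions and are not taken from this document: 1920x1080 resolution, 25 frames per second, 12 bits per texture pixel (8-bit 4:2:0) and 8 bits per depth pixel at full texture resolution. Compression changes the ratio considerably, so this is only an upper-level estimate of raw volume.

```python
def raw_bits_per_second(n_views, m_depths, width=1920, height=1080,
                        fps=25, texture_bpp=12, depth_bpp=8):
    """Uncompressed data rate for n textures plus m depth maps (assumed formats)."""
    pixels_per_second = width * height * fps
    return pixels_per_second * (n_views * texture_bpp + m_depths * depth_bpp)


rate_2d = raw_bits_per_second(1, 0)   # single texture, no depth
rate_3d = raw_bits_per_second(2, 2)   # n = m = 2
print(rate_2d, rate_3d, round(rate_3d / rate_2d, 2))
# 622080000 2073600000 3.33
```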
In FIG. 1, two views are used to synthesise a new one. If the synthesised view resulted only from warping the left view, the two grey areas next to the objects would be regions where information is missing, also referred to as dis-occlusions. In this case the right view is used to fill in the missing details.
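The hole filling described above can be sketched as follows, assuming both views have already been warped to the same virtual position and that missing pixels are marked with a placeholder value. The function name fill_disocclusions and the hole marker are hypothetical, and real systems typically blend and inpaint rather than copy pixels directly.

```python
def fill_disocclusions(from_left, from_right, hole=None):
    """Fill holes in the left-warped row with pixels from the right-warped row.

    Returns the merged row and the indices that had to come from the right view.
    """
    merged = []
    filled = []
    for i, (left_px, right_px) in enumerate(zip(from_left, from_right)):
        if left_px is hole:
            merged.append(right_px)   # may still be a hole if both views lack it
            if right_px is not hole:
                filled.append(i)
        else:
            merged.append(left_px)
    return merged, filled


left_warp = ["F", "F", None, None, "b"]
right_warp = ["F", "x", "c", "d", "b"]
print(fill_disocclusions(left_warp, right_warp))
# (['F', 'F', 'c', 'd', 'b'], [2, 3])
```

In this toy row only two of the five pixels of the second view are actually used, which is the kind of redundancy discussed next.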
For various reasons the number of input views available for 3DTV needs to be limited. Moreover, in order to achieve the compression ratio mentioned earlier, temporal and spatial redundancies within the textures and within the depth maps need to be removed. This may be achieved in various ways. Multiview Video Coding (MVC), for example, is capable of reducing spatio-temporal redundancies as well as redundancies between views. But some redundancies are difficult to eliminate. For example, in FIG. 1, the only part of the right view that is strictly necessary is the dis-occlusion area.
MVC and image+depth formats such as Multiview plus Depth (MVD) do not address the issue of dis-occlusions directly. These systems are designed for compressing data from multiple views. They are not designed to reduce redundancies directly by detecting dis-occlusions.
A solution to data redundancy comes from Layered Depth Video (LDV), which uses multiple layers for scene representation. The layers are: texture, depth, dis-occlusion texture and dis-occlusion depth. In LDV, the way the layers are created can give rise to the cardboard effect, where different objects in the scene give the impression of being flat and there are arbitrary transitions at their edges. As with MVC and MVD, the depth discontinuities between foreground and background distort the objects and their background in synthesised views.
In some variants of LDV, such as the one described in WO 2009/001255 A1, the amount of data may be reduced by filtering out redundant parts of the dis-occlusion map. But in that case the dis-occlusion map is difficult to estimate and is anchored to the central view, thus making synthesis of adjacent views problematic. Moreover, distortions in discontinuity areas are still present.
An extension of LDV is Depth Enhanced Stereo (DES), which consists of two LDV streams, one for the left view and one for the right view. DES has an increased bitrate compared with LDV and added complexity due to its layered nature.
Another approach is LDV plus right view, which provides additional dis-occlusion information but at the cost of redundant information on top of LDV and additional complexity.
Existing solutions for synthesising images at the decoder side have serious drawbacks when it comes to the quality of certain image parts at particular virtual camera positions. It is therefore desirable to find a way to improve the quality of synthesised images at the decoder side.