Research in 3D has gained considerable momentum in recent years, and there is a lot of interest from industry, academia and consumers. Several 3D movies are produced every year, providing compelling stereoscopic effects to their audiences. It is, however, already possible to enjoy a 3D experience at home, and in the very near future, mobile phones will be 3D-enabled.
The term 3D is usually associated with a stereoscopic experience, in which the user's eyes are provided with slightly different images of a scene, which the brain fuses to create an impression of depth. However, there is much more to 3D. For example, free viewpoint television (FTV) is a TV system that allows users to have a 3D visual experience while freely changing their position in front of a 3D display. Unlike typical stereoscopic television, which provides a 3D experience only to users sitting at a fixed position in front of a screen, FTV allows the scene to be observed from many different angles, as if the viewer were there.
The FTV functionality is enabled by multiple components. The 3D scene is captured by many cameras from different views or angles, yielding the so-called multiview video. Different camera arrangements are possible, depending on the application. For example, the arrangement may be as simple as parallel cameras on a 1D line, whereas in more complex scenarios it may include 2D camera arrays forming an arc structure.
Multiview video can be relatively efficiently encoded by exploiting both temporal and spatial similarities that exist in different views. The first version of multiview video coding (MVC) was standardized in July 2008. (MVC is an extension to H.264/AVC.) However, even with MVC, the transmission cost remains prohibitively high. This is why only a subset of the captured multiple views is actually being transmitted. To compensate for the missing information, depth and disparity maps can be used instead. A depth map is a simple greyscale image, wherein each pixel of the map indicates the distance between the corresponding pixel from a video object and the capturing camera. Disparity, on the other hand, is the apparent shift of a pixel which is a consequence of moving from one viewpoint to another. Depth and disparity are mathematically related. The main property of depth/disparity maps is that they contain large smooth surfaces of constant grey levels. This makes them much easier to compress with current video coding technology.
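The relation between depth and disparity mentioned above can be illustrated with a minimal sketch. It assumes a rectified, parallel pinhole camera pair (the simple 1D arrangement), for which disparity d, depth Z, focal length f (in pixels) and baseline B satisfy d = f·B/Z. The function names and the numeric values are hypothetical and are chosen only for illustration; other camera arrangements require a more general projection.

```python
import math


def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Disparity in pixels, assuming rectified parallel pinhole cameras."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Inverse relation: depth in metres from a positive disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Illustrative values only: f = 1000 px, B = 0.1 m, object at 2 m.
d = depth_to_disparity(2.0, focal_px=1000.0, baseline_m=0.1)
z = disparity_to_depth(d, focal_px=1000.0, baseline_m=0.1)
```

Under this assumption, disparity is inversely proportional to depth, so near objects shift more between views than distant ones.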
From the multiview video and depth/disparity information, it is possible to generate virtual views at an arbitrary viewing position, for example by projection. A view synthesized from texture and depth usually has some unassigned pixels, commonly called holes. Holes can result from rounding errors, in which case they can usually be fixed easily, e.g. by median filtering. Another cause is that some pixels/regions in the virtual view may not be visible in the existing view(s), and vice versa. These pixels/regions are called occluded or disoccluded regions, respectively. Such occlusion information can be used, in addition to texture and depth, to improve the quality of the synthesized view.
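How holes arise from rounding, and how a median filter can fill them, can be sketched on a single scanline. The sketch below is an illustration under stated assumptions, not the synthesis method of any particular system: it assumes a horizontal shift of `x - disparity` toward the virtual view, lets the pixel with the larger disparity (the nearer one) win when two pixels land on the same target, and uses the hypothetical names `warp_row` and `fill_holes_median`.

```python
from statistics import median

HOLE = None  # marker for an unassigned pixel in the virtual view


def warp_row(texture, disparity):
    """Forward-warp one scanline; on collisions the nearer pixel wins."""
    width = len(texture)
    out = [HOLE] * width
    best = [-1.0] * width  # largest disparity written so far per target
    for x, (value, d) in enumerate(zip(texture, disparity)):
        target = x - int(round(d))  # rounding to the pixel grid
        if 0 <= target < width and d > best[target]:
            out[target] = value
            best[target] = d
    return out


def fill_holes_median(row, radius=1):
    """Replace each hole by the median of the assigned pixels in its window."""
    filled = list(row)
    for x, value in enumerate(row):
        if value is HOLE:
            window = row[max(0, x - radius): x + radius + 1]
            neighbours = [v for v in window if v is not HOLE]
            if neighbours:  # leave the hole if the window has no data
                filled[x] = median(neighbours)
    return filled


# Rounding 1.4, 0.6, 0.4, 0.4 gives shifts 1, 1, 0, 0, leaving pixel 1 empty.
warped = warp_row([10, 20, 30, 40], [1.4, 0.6, 0.4, 0.4])
repaired = fill_holes_median(warped)
```

A small window suffices for one-pixel cracks of this kind. Larger disoccluded regions contain no nearby data from the same view, which is why occlusion information from other views is useful.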
Hence, texture, depth maps, disparity maps and occlusions, referred to herein as 3D components, are used to enable the FTV functionality. Alternatively, they can be used, for example, to build a 3D model of a scene. The main problem that arises in practice is that these 3D components are rarely perfectly consistent. For example, the colors in multiview textures can be slightly unbalanced, which may create an annoying stereo impression.
The problem becomes even more evident for depth/disparity/occlusion maps, which are usually estimated rather than measured, due to the cost of the measuring equipment. Thus, in addition to being inconsistent, these 3D components often suffer from poor, or at least unacceptable, quality. There is a wealth of depth/disparity estimation algorithms in the literature, but they still suffer from many problems, such as noise, temporal or spatial inconsistency, and an inability to estimate depth/disparity for uniform texture regions. Even measured depth maps can be noisy or may fail on dark objects in the scene. This is the problem with infrared cameras, for example, where dark regions absorb most of the light.
It is clear that inconsistent and poor quality 3D components create many artifacts in rendered views of a 3D scene, leading to unacceptable quality in 3D experience. For example, using inconsistent depth maps in view synthesis creates ghost images, which are especially visible at object boundaries. This is called ghosting. On the other hand, depth map(s) may be temporally unstable, which leads to flickering in the synthesized view. These are only some of the examples which make the stereo impression annoying.
In WO2011/129735, a method for improving 3D representation was proposed. That was achieved by combining multiple available 3D components, captured at different views.
The available 3D components, denoted s1, . . . , sN, N≥3, and captured at positions v1, . . . , vN in a common (or global) coordinate system, were projected to a virtual position vF in that coordinate system, resulting in p1, . . . , pN. This is depicted in FIG. 1 for N=4. The 3D components are exemplified as texture (image/video), depth (range) data, disparity map, occlusion data or any other form which can describe a 3D scene. Projection can be done, for example, with an image warping algorithm. The projections p1-pN are then segmented. The segments can be single pixels, groups of pixels, regular square or rectangular blocks, irregular areas, foreground/background objects etc. The projected components can also be transformed to another space before a distance function is applied to pairs of projected components. For each segment k, a distance matrix Dk=[Dk]ij was defined, reflecting the similarity of the projected values of each segment at the virtual position vF. That is, segment k captured at view i is compared with segment k captured at view j, which implies that segments from the captured views are compared and the results inserted in the distance matrix Dk. The calculated distances [Dk]ij may be compared to a set of given threshold values, Tk=[Tij]k, where k,i,j ∈ {1,2, . . . , N}. Based on the level of consistency of the 3D components, as well as on which and how many 3D components are consistent, a unique 3D representation at the virtual position can be determined, which will be used when representing the 3D scene. Instructions on how to calculate this unique representation, based on e.g. the consistency, are described in WO2011/129735.
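The per-segment comparison described above can be sketched as follows. This is only an illustration under assumptions: the distance function is not specified here, so mean absolute difference is used as a stand-in; each projection is represented as a flat list of values at the virtual position; and the names `distance_matrix` and `consistent_pairs` are hypothetical. How the consistent components are combined into the unique representation is described in WO2011/129735 and is not reproduced here.

```python
def distance_matrix(projections, segment):
    """D_k for one segment: pairwise distances between projected views.

    projections: list of N projected components p1..pN (flat value lists).
    segment: list of pixel indices belonging to segment k.
    The mean absolute difference is an assumed distance function.
    """
    n = len(projections)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(abs(projections[i][p] - projections[j][p])
                        for p in segment)
            dist[i][j] = dist[j][i] = total / len(segment)
    return dist


def consistent_pairs(dist, thresholds):
    """View pairs (i, j), i < j, whose distance does not exceed T_ij."""
    n = len(dist)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i][j] <= thresholds[i][j]]


# Three projected depth rows (N = 3); the third disagrees on pixels 0 and 1.
p = [[10, 10, 50], [11, 9, 50], [30, 30, 50]]
D = distance_matrix(p, segment=[0, 1])
T = [[2.0] * 3 for _ in range(3)]  # illustrative thresholds only
pairs = consistent_pairs(D, T)
```

In this example only the first two views agree on the segment, so a downstream rule could, for instance, disregard the third view for that segment.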