Research in three-dimensional (3D) media has gained considerable momentum in recent years, attracting strong interest from industry, academia and consumers. A number of 3D movies are produced every year, providing striking stereoscopic effects to spectators. However, this is only part of the story: the 3D experience is already available at home, and 3D-enabled mobile phones are expected in the very near future.
The term 3D is usually associated with the stereoscopic experience, where the user's eyes are provided with slightly different images of a scene, which are then fused by the brain to create an impression of depth. However, there is much more to 3D. Free viewpoint television (FTV) is a novel audio-visual system that allows users to have a 3D visual experience while freely changing their position in front of a 3D display. Unlike typical stereoscopic television, which provides a 3D experience only to users sitting at a fixed position in front of a screen, FTV allows the scene to be observed from many different angles, thus providing a more realistic impression.
The FTV functionality is enabled by multiple components. The 3D scene is captured by many cameras from different views or angles, yielding so-called multiview video. Different camera arrangements are possible, depending on the application. For example, the arrangement may be as simple as parallel cameras on a one-dimensional (1D) line, whereas in more complex scenarios it may include two-dimensional (2D) camera arrays forming an arc structure.
Multiview video can be encoded relatively efficiently by exploiting both the temporal and the spatial similarities that exist across the different views. The first version of multiview video coding (MVC) was standardized on Jul. 30, 2008. However, even with MVC, the transmission cost remains prohibitively high. This is why only a subset of the captured views is actually transmitted, in combination with additional 3D components.
To compensate for the missing information, depth and disparity maps can be used instead. A depth map is a simple grayscale image in which each pixel indicates the distance between the corresponding point of a video object and the capturing camera. Disparity, on the other hand, is the apparent shift of a pixel that results from moving from one viewpoint to another. Depth and disparity are mathematically related and can be used interchangeably.
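The relation between depth and disparity can be illustrated with a minimal sketch. It assumes the simplest case mentioned above, namely two rectified, parallel cameras on a 1D line, for which disparity is commonly written as focal length times baseline divided by depth. The function and parameter names (`depth_to_disparity`, `focal_length_px`, `baseline`) are illustrative and do not come from the text.

```python
def depth_to_disparity(depth, focal_length_px, baseline):
    """Disparity (in pixels) of a point at distance `depth`.

    Assumes rectified, parallel cameras: d = f * B / Z, with `depth`
    and `baseline` expressed in the same length unit.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline / depth


def disparity_to_depth(disparity, focal_length_px, baseline):
    """Inverse mapping: Z = f * B / d (same assumptions as above)."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline / disparity


# Hypothetical setup: 1000-pixel focal length, 0.1 m baseline.
near = depth_to_disparity(2.0, 1000.0, 0.1)   # point 2 m from the camera
far = depth_to_disparity(10.0, 1000.0, 0.1)   # point 10 m from the camera
```

Under these assumptions nearer points shift more between views than distant ones, which is why an error in a depth value turns directly into an incorrect pixel shift in a synthesized view.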
From the multiview video and depth/disparity information we can generate virtual views at an arbitrary viewing position as depicted in FIG. 1. In this way we compensate for the unsent multiview video, but we can also synthesize additional virtual views.
Having good quality depth maps is of crucial importance. Namely, errors in a depth map translate to incorrect shifts of texture pixels in a synthesized view. This is especially visible around object boundaries, where we can see pixels from foreground objects being incorrectly copied to the background, and vice versa. This results in an annoying viewing experience.
Depth maps are usually estimated, and a wealth of algorithms is available in the art for that purpose. However, the quality of depth maps estimated in this way may be far from acceptable, for several reasons. Firstly, depth cannot be correctly estimated for pixels in occluded regions, i.e. regions visible in one of the images but not in the other one(s). Secondly, images used for depth estimation are always affected by some level of sensor noise, which reduces the accuracy of the depth maps. Finally, the brightness constraints that depth estimation algorithms impose on their input images are difficult to meet in practice.
Alternatively, depth maps can be obtained with specialized cameras, e.g. infrared or time-of-flight (ToF) cameras. This typically gives accurate, high-quality depth maps. However, ToF cameras are still not widely deployed commercially because of their high cost and their inability to provide resolutions competitive with those of video cameras.
Depth maps may be transmitted at a reduced resolution. Being simpler than regular video signals, they can be downsampled without too much loss of information. Thus, not only is the bitrate reduced, but a constraint imposed by display manufacturers is also met. This motivates the search for new, effective depth upsampling concepts.
Standard image or video upsampling methods such as nearest neighbor, linear, bilinear or bicubic interpolation provide only limited quality when applied to depth maps. Unlike in their common use, where they are applied to textures directly, these filters may introduce incorrect distance information for the pixels. This in turn causes incorrect shifts of texture pixels in a synthesized view. FIG. 2 illustrates this “smearing” effect, visible all around the foreground object boundaries, where pixels from the clothes and heads are copied into the background. This may result in a very annoying viewing experience. Thus, the prior art depth and disparity map upsampling methods have significant limitations and can produce an undesired smearing effect.
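How an interpolating filter introduces incorrect distance information can be seen in a small one-dimensional sketch. The depth row, the upsampling factor and the function names below are hypothetical and chosen only for illustration; the code is a simplified 1D stand-in for the nearest neighbor and linear filters named above, not the method of any particular codec or display.

```python
def upsample_nearest(row, factor):
    """Nearest-neighbor upsampling: repeat each low-resolution sample."""
    return [row[i // factor] for i in range(len(row) * factor)]


def upsample_linear(row, factor):
    """Linear interpolation between neighboring low-resolution samples."""
    n = len(row)
    out = []
    for i in range(n * factor):
        x = i / factor               # position in low-resolution coordinates
        x0 = min(int(x), n - 1)      # left neighbor
        x1 = min(x0 + 1, n - 1)      # right neighbor, clamped at the border
        t = x - x0                   # fractional offset between neighbors
        out.append((1 - t) * row[x0] + t * row[x1])
    return out


# Hypothetical depth row: background at depth 10, foreground at depth 50.
low_res = [10, 10, 50, 50]
nearest = upsample_nearest(low_res, 2)
linear = upsample_linear(low_res, 2)

# Depth values that exist in the output but not in the input.
invented = sorted(set(linear) - set(low_res))
```

In this sketch the linear filter produces a depth value at the object boundary that belongs neither to the foreground nor to the background. Nearest-neighbor upsampling keeps only the original values, but it can place the boundary only on the coarse low-resolution grid, so the depth edge may still be misaligned with the texture edge.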
Different solutions have been proposed, such as the use of Markov Random Fields (MRF) or joint-bilateral upsampling (JBU). JBU in particular has gained a lot of interest and has led to several extensions, such as the noise-aware filter for depth upsampling (NAFDU), which switches between bilateral and joint-bilateral filtering depending on a pre-filtered depth map. However, the use of JBU leads to problems such as texture copying, as depicted in FIG. 3. Moreover, the performance of JBU is parameter dependent. For complexity reasons, the parameters are usually chosen at the frame or sequence level. This is clearly suboptimal, since even a single frame may contain both very smooth regions and regions with many edges, each requiring a different set of parameters.
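The parameter dependence of JBU can be made concrete with a simplified one-dimensional sketch. It follows the commonly described form of the filter, in which each high-resolution depth value is a normalized sum of low-resolution depth samples weighted by a spatial Gaussian and by a range Gaussian computed on a high-resolution guidance signal (e.g. texture luminance). The function name `jbu_1d`, the parameters `sigma_s`, `sigma_r` and `radius`, and the sample data are assumptions made for illustration, not values from the text.

```python
import math


def jbu_1d(depth_low, guide_high, factor, sigma_s=1.0, sigma_r=10.0, radius=2):
    """Simplified 1D joint-bilateral upsampling of a depth row.

    depth_low  : low-resolution depth samples
    guide_high : high-resolution guidance signal
    factor     : integer upsampling factor
    sigma_s    : spatial Gaussian width, in low-resolution samples
    sigma_r    : range Gaussian width, in guidance intensity units
    """
    if len(guide_high) != len(depth_low) * factor:
        raise ValueError("guide length must equal len(depth_low) * factor")
    n_low = len(depth_low)
    out = []
    for p in range(len(guide_high)):
        p_low = p / factor                  # position on the low-res grid
        center = int(p_low)
        num = 0.0
        den = 0.0
        lo = max(0, center - radius)
        hi = min(n_low, center + radius + 1)
        for q_low in range(lo, hi):
            q_high = q_low * factor         # guidance sample matching q_low
            w_s = math.exp(-((p_low - q_low) ** 2) / (2 * sigma_s ** 2))
            diff = guide_high[p] - guide_high[q_high]
            w_r = math.exp(-(diff ** 2) / (2 * sigma_r ** 2))
            num += w_s * w_r * depth_low[q_low]
            den += w_s * w_r
        out.append(num / den)
    return out


# Hypothetical data: the guidance edge coincides with the depth edge.
depth_low = [10, 10, 50, 50]
guide_high = [0, 0, 0, 0, 200, 200, 200, 200]

sharp = jbu_1d(depth_low, guide_high, 2, sigma_r=10.0)    # edge-preserving
blurred = jbu_1d(depth_low, guide_high, 2, sigma_r=1e6)   # range term disabled
```

With a small `sigma_r` the guidance edge keeps the depth edge sharp in this sketch, whereas a very large `sigma_r` effectively reduces the filter to plain spatial smoothing and blurs the boundary. A single setting chosen for a whole frame or sequence therefore cannot suit smooth and edge-rich regions equally well.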
Thus, there is a need for an efficient upsampling that can be applied to at least depth and/or disparity maps.