Research in 3D video has gained considerable momentum in recent years, with strong interest from industry, academia and consumers. A number of 3D movies are produced every year, providing great stereoscopic effects to spectators. However, this is only part of the story. The 3D experience can already be enjoyed at home, and in the very near future 3D-enabled mobile phones will be available as well.
The term “3D” is usually connected to the stereoscopic experience, where the eyes of a user are provided with slightly different images of a scene, which are then fused by the brain to create a depth impression. However, there is much more to 3D. Free viewpoint television (FTV) is a novel audio-visual system that allows users to have a 3D visual experience while freely changing their position in front of a 3D display. Unlike typical stereoscopic television, which provides a 3D experience only to users sitting at a fixed position in front of a screen, FTV allows users to observe the scene from many different angles, as if they were there.
The FTV functionality is enabled by multiple components. The 3D scene is captured by many cameras from different views (angles), which is referred to as “multiview video”. Different camera arrangements are possible, depending on the application. For example, the camera arrangement may be relatively simple, comprising a parallel set of cameras on a 1D line, or, in more complex scenarios, it may include e.g. 2D camera arrays forming an arc structure. Multiview video is almost without exception considered in combination with other 3D scene components, such as depth maps, disocclusion maps or similar. The main reason for this is the transmission cost of the huge amount of data that multiview video carries. For example, if a subset of the views of a multiview video, e.g. 2-3 views, is transmitted together with the corresponding depth maps, the other views may be reconstructed at a receiver based on this information. The required bandwidth is then significantly reduced compared to transmitting all views.
Multiview video can be encoded relatively efficiently by exploiting both the temporal and the spatial similarities that exist between different views. The first version of multiview video coding (MVC) was standardized in July 2008. However, even with MVC, the transmission cost remains prohibitively high. This is why often only a subset of the captured views is actually transmitted. To compensate for the missing information, depth and disparity maps can be used instead. A depth map is a simple grayscale image, wherein each pixel indicates the distance between the corresponding pixel of a video object and the capturing camera. Disparity, on the other hand, is the apparent shift of a pixel that results from moving from one viewpoint to another. Depth and disparity are mathematically related and can be used interchangeably. The main property of depth/disparity maps is that they contain large smooth surfaces of constant grey levels. This makes them much easier to compress with current video coding technology than regular video images.
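The relation between depth and disparity can be illustrated for the simplest case of two rectified, parallel cameras, where the disparity d (in pixels) of a point at depth Z is d = f·B/Z, with f the focal length in pixels and B the camera baseline. The following is only a minimal sketch under that assumption. The function names and the parameters focal_px and baseline_m are hypothetical and do not come from this description, and other camera arrangements require a different mapping.

```python
def depth_to_disparity(depth_m, focal_px, baseline_m):
    """Disparity in pixels of a point at depth_m, assuming two rectified,
    parallel cameras: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Inverse mapping under the same assumption: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Illustrative values only (assumed, not taken from the description).
focal_px = 1000.0    # focal length expressed in pixels
baseline_m = 0.125   # distance between the two camera centres in metres

disparity = depth_to_disparity(2.5, focal_px, baseline_m)
depth = disparity_to_depth(disparity, focal_px, baseline_m)
```

Because the mapping is one-to-one for positive values, a map stored in one representation can be converted to the other without loss, apart from quantization to grey levels.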
Henceforth in this description, the terms “depth” and “depth map” will be used for simplicity. However, it should be noted, and would be clear to a person skilled in the art, that the technical solution described herein also applies to disparity and disparity maps.
From the multiview video and the depth information, a virtual view at an arbitrary viewing position can be generated, as depicted in FIG. 1a.
Having good quality depth maps is of crucial importance for the quality of generated or reconstructed 3D views. For example, errors in a depth map translate to incorrect shifts of texture pixels in a synthesized view. This is especially visible around object boundaries, where pixels from foreground objects may be incorrectly copied to the background, and vice versa. This results in a very annoying experience for a viewer of the 3D video.
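How a depth error becomes a misplaced texture pixel can be sketched with a one-dimensional forward warp. The function warp_row below is a hypothetical illustration, not the synthesis method of this description. It assumes integer disparities, a shift of each pixel to the left by its disparity, and that the pixel with the larger disparity (the nearer object) wins when two pixels land on the same position.

```python
def warp_row(texture, disparity):
    """Forward-warp one image row into a virtual view.

    Pixel x moves to x - disparity[x]. When several pixels land on the same
    target, the one with the largest disparity (nearest to the camera) is
    kept. Positions that receive no pixel stay None (disoccluded holes).
    """
    width = len(texture)
    out = [None] * width
    best = [-1] * width  # largest disparity written to each target so far
    for x in range(width):
        target = x - disparity[x]
        if 0 <= target < width and disparity[x] > best[target]:
            out[target] = texture[x]
            best[target] = disparity[x]
    return out


# 'F' marks foreground texture, 'b' background texture.
texture = ["b", "b", "b", "F", "F", "b", "b", "b"]
correct_disparity = [0, 0, 0, 2, 2, 0, 0, 0]
wrong_disparity = [0, 0, 0, 2, 0, 0, 0, 0]  # one boundary pixel mislabelled

good_view = warp_row(texture, correct_disparity)
# -> ['b', 'F', 'F', None, None, 'b', 'b', 'b']
bad_view = warp_row(texture, wrong_disparity)
# -> ['b', 'F', 'b', None, 'F', 'b', 'b', 'b']
```

In the second result, a single wrong depth value at the object boundary leaves one foreground pixel stranded in the background region, which is the kind of artifact described above.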
Depth maps are usually estimated, and there is a wealth of algorithms available for that purpose. However, the quality of depth maps estimated this way is still far from acceptable. There are a number of reasons for this. Firstly, the depth of pixels in occluded regions, i.e. regions visible in one of the images but not in the other one(s), cannot be correctly estimated. Secondly, images used for depth estimation are always affected by some level of sensor noise, which affects the accuracy of the depth maps. Further, the brightness constraints imposed on the images by depth estimation algorithms, such as the assumption that the brightness value of a pixel does not change between the views, are difficult to meet in practice.
As an alternative to using an estimation algorithm, depth maps can be obtained by specialized cameras, e.g., infrared or time-of-flight (ToF) cameras. Unfortunately, current ToF sensors do not yet provide resolutions that are competitive with those of video cameras.
Transmission of depth maps at a reduced resolution seems to be a valid and desirable solution. Being simpler than regular video signals, depth maps can be downsampled without too much loss of information. This not only reduces the bitrate, but also meets a constraint imposed by the display manufacturers. This motivates the search for new, effective depth upscaling concepts.
Standard depth upscaling methods such as nearest neighbor, linear, bilinear or bicubic interpolation provide results of only limited quality. For example, one common artifact when using these methods is a smearing of object borders in synthesized views, as can be seen in FIG. 1b.
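The limitation of these standard methods can be seen on a single row of depth values that crosses an object border. The sketch below is a simplified one-dimensional illustration with hypothetical function names, not an implementation used in this description. It upscales a low-resolution row by an integer factor with nearest-neighbor and with linear interpolation, assuming sample-centre alignment between the two grids.

```python
def upscale_nearest(row, factor):
    """Nearest-neighbor upscaling: repeat each low-resolution sample."""
    return [row[i // factor] for i in range(len(row) * factor)]


def upscale_linear(row, factor):
    """Linear interpolation between neighboring low-resolution samples,
    assuming sample centres of both grids are aligned."""
    n = len(row)
    out = []
    for i in range(n * factor):
        # Position of output sample i expressed in input coordinates,
        # clamped to the valid range at the row ends.
        x = (i + 0.5) / factor - 0.5
        x = min(max(x, 0.0), n - 1.0)
        x0 = int(x)
        x1 = min(x0 + 1, n - 1)
        t = x - x0
        out.append((1.0 - t) * row[x0] + t * row[x1])
    return out


# Background at depth level 50, foreground at depth level 200.
low_res = [50, 50, 200, 200]

nearest = upscale_nearest(low_res, 2)
# -> [50, 50, 50, 50, 200, 200, 200, 200]
linear = upscale_linear(low_res, 2)
# -> [50.0, 50.0, 50.0, 87.5, 162.5, 200.0, 200.0, 200.0]
```

Linear interpolation introduces the depth levels 87.5 and 162.5, which belong to neither the background nor the foreground, so the affected pixels are shifted by an intermediate amount in a synthesized view. Nearest-neighbor keeps the two original levels but can only place the border on the coarse low-resolution grid.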
Attempts have been made to solve the problems of these standard upscaling methods by taking all available data into account and utilizing the full-resolution texture image in the upscaling process. There are several different approaches for this, such as the use of Markov Random Fields (MRF) or joint-bilateral upscaling (JBU). JBU in particular has gained a lot of interest and led to several extensions. For example, a noise-aware filter for depth upsampling (NAFDU) has been suggested (by Chan et al. [1]), which switches between bilateral and joint-bilateral filtering depending on a pre-filtered depth map. Further, JBU filtering has been extended (by Garcia et al. [2]) with a credibility map, weighting every pixel based on the ToF depth map.
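The basic idea of joint-bilateral upscaling can be sketched in one dimension: each high-resolution depth value is a weighted average of nearby low-resolution depth samples, where the weight combines spatial closeness with similarity of the full-resolution texture (guide) values. The code below is a simplified sketch of that general principle only. It is not the NAFDU filter, the credibility-map extension, or the solution of this description, and the function name, the Gaussian weights and all parameter values are assumptions chosen for illustration.

```python
import math


def jbu_1d(depth_low, guide_high, factor, radius=2, sigma_s=1.0, sigma_r=10.0):
    """Simplified 1D joint-bilateral upscaling.

    depth_low  : low-resolution depth row
    guide_high : full-resolution texture row (len(depth_low) * factor values)
    factor     : integer upscaling factor
    """
    n_low = len(depth_low)
    n_high = len(guide_high)
    out = []
    for p in range(n_high):
        p_low = p / factor                      # p in low-resolution coordinates
        centre = min(int(p_low), n_low - 1)
        num = 0.0
        den = 0.0
        for q_low in range(max(0, centre - radius), min(n_low, centre + radius + 1)):
            q = min(q_low * factor, n_high - 1)  # guide sample matching q_low
            # Spatial weight: distance measured on the low-resolution grid.
            w_s = math.exp(-((p_low - q_low) ** 2) / (2.0 * sigma_s ** 2))
            # Range weight: similarity of the texture values at p and q.
            diff = guide_high[p] - guide_high[q]
            w_r = math.exp(-(diff ** 2) / (2.0 * sigma_r ** 2))
            num += w_s * w_r * depth_low[q_low]
            den += w_s * w_r
        # Fall back to the nearest sample if all weights underflow to zero.
        out.append(num / den if den > 0.0 else float(depth_low[centre]))
    return out


depth_low = [50, 50, 200, 200]
guide_high = [10, 10, 10, 10, 240, 240, 240, 240]  # texture edge at the object border
upscaled = jbu_1d(depth_low, guide_high, factor=2)
```

With a texture edge that coincides with the depth edge, low-resolution samples from the other side of the border receive a negligible range weight, so the upscaled row stays close to 50 on one side and 200 on the other instead of being smeared. The same dependence on the texture is what allows texture detail to leak into the depth map when the texture varies inside a region of smooth depth.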
These methods, however, introduce other errors. For example, one of the main sources of error in JBU-based approaches is the copying of texture information into smooth depth areas, as shown in FIG. 2a-c.