Free viewpoint television (FTV), also sometimes called multiview video or 3DTV, is a novel audio-visual system that allows users to have a 3D visual experience while freely changing their position in front of a 3D display. Unlike typical stereoscopic television, which provides a 3D experience only to users sitting at a fixed position in front of a screen, FTV allows users to observe the scene from many different angles, as if they were there. FTV consequently allows the user to interactively control the viewpoint and generates new views of a dynamic scene from any 3D position.
There are two main FTV formats, namely the multiview+depth format, also known as 2D+Z, and the layered depth video (LDV) format, the former being more common. In the multiview+depth representation, the scene is captured by many cameras from different angles. The multiple views are then jointly compressed by exploiting both the temporal and the spatial similarities that exist between the different views. In order to further enable the FTV functionality, each camera view should carry additional information, namely a depth map. The depth map is a simple grayscale image, wherein each pixel value indicates the distance between the capturing camera and the point of the video object shown at the corresponding pixel of the camera view. From the multiview video and the depth information, virtual views can be generated at an arbitrary viewing position.
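As an illustration of how a grayscale depth map can drive view generation, the following is a minimal Python sketch, not the method of any particular FTV system. It assumes an 8-bit depth map quantized linearly in inverse depth between two hypothetical clipping distances `z_near` and `z_far` (a convention often used for multiview+depth material), and rectified, horizontally aligned cameras, so that a pixel moves only along its row when the viewpoint changes. The names `depth_from_gray`, `disparity`, `warp_row`, `focal_px` and `baseline`, as well as the sign of the shift, are assumptions made for this example.

```python
def depth_from_gray(v, z_near, z_far):
    """Map an 8-bit depth-map value v (0..255) to a distance.

    Assumed convention: 255 is the nearest plane (z_near), 0 the farthest
    (z_far), with linear quantization of inverse depth 1/z.
    """
    inv_z = (v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z


def disparity(z, focal_px, baseline):
    """Horizontal shift in pixels of a point at distance z between two
    rectified cameras separated by `baseline` (same unit as z)."""
    return focal_px * baseline / z


def warp_row(colors, gray_depths, z_near, z_far, focal_px, baseline):
    """Forward-warp one image row into a virtual view.

    Returns a row of the same width; None marks holes, i.e. positions
    that no source pixel maps to (typically disoccluded areas).
    """
    width = len(colors)
    out = [None] * width
    out_z = [float("inf")] * width   # z-buffer: the nearest point wins
    for x in range(width):
        z = depth_from_gray(gray_depths[x], z_near, z_far)
        x_new = x - int(round(disparity(z, focal_px, baseline)))
        if 0 <= x_new < width and z < out_z[x_new]:
            out[x_new] = colors[x]
            out_z[x_new] = z
    return out


if __name__ == "__main__":
    row = ["a", "b", "c", "d"]
    depth_row = [0, 0, 0, 255]       # last pixel belongs to a near object
    print(warp_row(row, depth_row, z_near=1.0, z_far=2.0,
                   focal_px=2.0, baseline=1.0))
```

In this sketch the near pixel shifts further than the background and overwrites it in the virtual view, while the positions it leaves behind remain as holes that a real system would have to fill from other views or by inpainting.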
The depth map can be obtained by specialized cameras, e.g. infrared or time-of-flight cameras. However, because of their price, such cameras are not yet widely deployed commercially. A common alternative is to estimate the depth maps from a number of neighboring camera views.
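Estimation from neighboring views can be illustrated with a deliberately simple block-matching sketch in Python. It is an assumption-laden toy example rather than any specific published estimator: it works on a single row of grayscale values from two rectified views, tries every candidate disparity up to a hypothetical `max_disp`, and keeps the one with the lowest sum of absolute differences (SAD) over a small window. The function name `estimate_disparity_row` and its parameters are made up for this example.

```python
def estimate_disparity_row(left, right, max_disp, half_window=1):
    """Estimate a per-pixel disparity for one row of the left view.

    left, right: lists of grayscale values from two rectified views.
    For each pixel x in `left`, the candidate d in 0..max_disp that
    minimizes the SAD between the window around left[x] and the window
    around right[x - d] is selected. Indices are clamped at the borders.
    """
    width = len(left)
    result = []
    for x in range(width):
        best_d, best_cost = 0, float("inf")
        for d in range(max_disp + 1):
            cost = 0
            for k in range(-half_window, half_window + 1):
                xl = min(max(x + k, 0), width - 1)
                xr = min(max(x + k - d, 0), width - 1)
                cost += abs(left[xl] - right[xr])
            if cost < best_cost:     # the first minimum wins on ties
                best_d, best_cost = d, cost
        result.append(best_d)
    return result


if __name__ == "__main__":
    left = [10, 20, 30, 40, 50, 60, 70, 80]
    right = [30, 40, 50, 60, 70, 80, 90, 100]   # same content shifted by 2
    print(estimate_disparity_row(left, right, max_disp=3))
```

The estimated disparity can then be converted to depth under the usual rectified-camera relation, depth = focal length × baseline / disparity. Note that the SAD cost implicitly assumes that corresponding pixels have the same brightness in both views and that every pixel has a counterpart in the other view.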
Having a good quality depth map is of crucial importance. Errors in depth maps translate to misplacement of pixels in the synthesized view. This is especially visible around object boundaries, where a noisy cloud around the borders becomes visible. Even the best available depth estimation algorithms still generally produce depth maps whose quality is far from acceptable. The comparatively low quality of estimated depth maps depends on a number of factors. Firstly, depth cannot be correctly estimated for pixels in occluded regions, i.e. regions visible from one of the camera views but not from the other(s).
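The link between a depth error and a misplaced pixel can be made concrete with a short Python sketch. It assumes rectified cameras, for which the horizontal shift of a pixel between two views is focal length × baseline / depth; the names `disparity_px` and `misplacement_px` and the numeric values are hypothetical and chosen only for illustration.

```python
def disparity_px(z, focal_px, baseline):
    """Pixel shift between two rectified views for a point at distance z."""
    return focal_px * baseline / z


def misplacement_px(z_true, z_est, focal_px, baseline):
    """Horizontal error, in pixels, made in the synthesized view when the
    estimated depth z_est is used instead of the true depth z_true."""
    return (disparity_px(z_est, focal_px, baseline)
            - disparity_px(z_true, focal_px, baseline))


if __name__ == "__main__":
    focal_px, baseline = 1000.0, 0.1     # assumed camera parameters
    # The same absolute depth error of 0.5 applied to a near and a far point.
    print(misplacement_px(2.0, 2.5, focal_px, baseline))
    print(misplacement_px(10.0, 10.5, focal_px, baseline))
```

Under these assumptions the same absolute depth error displaces a near point by many more pixels than a far point, which is one way to see why depth errors on foreground object boundaries are so visible in the synthesized view.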
Secondly, the neighboring views used for depth estimation are always affected by some level of sensor noise from the recording and processing equipment, which affects the accuracy of the depth maps. Furthermore, brightness constraints imposed on the video frames from the neighboring views used in depth map estimation are difficult to meet in practice.
Furthermore, the problems with low-quality depth maps are not limited to estimated depth maps. The specialized cameras used for generating depth maps also have limitations and introduce noise that propagates into errors in the depth maps.
There is, thus, a need for a technique allowing identification of incorrect portions in estimated or generated depth maps that can be used for the purpose of improving the accuracy and quality of the depth maps.
Document [5] discloses dynamic scene generation with interactive viewpoint control. In the image processing, an image is initially segmented to compute an initial disparity space distribution (DSD) for each segment. In a second step, the DSD of each segment is refined using neighboring segments. Finally, image matting is used for pixels along disparity discontinuities to reduce artifacts during view synthesis.