For stereoscopic display in 3D-TV, 3D-video and 3D-cinema, a real world scene is captured by two or more cameras. In most practical cases, a scene is captured from two different viewpoints using stereo camera equipment. An exemplary object in a real world scenario is projected onto different positions within the corresponding camera images. If the parameters of the stereo camera setup are known and the displacement between corresponding points in the stereo images belonging to one and the same object in the real world can be determined, the distance between the real world object and the camera, i.e. the depth of the object, may be calculated by triangulation. The displacement between corresponding points in the stereo images is commonly referred to as disparity, and the task of finding point correspondences between the two input or basic images is typically called stereo matching. The real world 3D-structure of the captured scene can be reconstructed from the disparities.
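By way of a non-limiting illustration, the triangulation mentioned above may be sketched as follows. The sketch assumes a rectified stereo pair with parallel optical axes, for which depth equals focal length times baseline divided by disparity. The function name, the parameter names and the numeric values are hypothetical and serve illustration only.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth Z = f * B / d for a rectified, parallel stereo camera setup.

    disparity_px    -- horizontal displacement between corresponding points (pixels)
    focal_length_px -- focal length expressed in pixels (assumed known from calibration)
    baseline_m      -- distance between the two camera centers (meters)
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative values
        # are not meaningful under the convention assumed here.
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical setup: 1000 px focal length, 0.5 m baseline, 20 px disparity.
depth_m = depth_from_disparity(20.0, 1000.0, 0.5)
print(depth_m)
```

Since depth is inversely proportional to disparity, a fixed error in the disparity estimate causes a larger depth error for distant objects than for near ones.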
The disparity information is usually integrated into a disparity map containing the results of the matching calculations. However, the performance of the matching process inherently depends on the underlying image content of the basic images. In an ideal situation, the basic images are captured by two pinhole cameras showing no lens distortion and no color difference, which is, however, not realistic. By using calibration and rectification, this situation may be approximated when using conventional cameras for capturing the scene. But even under ideal conditions several problems remain that make the matching process a challenging task, e.g. occluded areas in one of the input pictures, perspective deformations, specular reflections, depth discontinuities, or missing texture in some of the objects or the background.
A further obstacle for stereo matching is the fact that many surfaces of real world objects may not be assumed to be truly Lambertian reflectors, and specular reflections typically look different from different viewing angles. Another problem is that at object borders the neighborhood area comprises two different, conflicting depth values, and accordingly the matching process faces a conflict as to which depth it has to assign to the respective point in the surrounding area. Other problems for the matching process result from either a lack of sufficient texture in the object's surface or from quasi-periodic texture.
Consequently, for some parts of a basic image it is inherently more difficult to determine accurate disparity values, also referred to as disparity estimates, than for others. Moreover, for occluded regions it is only possible to extrapolate the depth information from their surroundings. Occlusions result from the different viewpoints of the cameras so that for some areas in the basic stereo pictures it is always impossible to find a point-to-point correspondence.
The aforementioned problems during disparity estimation lead to varying levels of accuracy and reliability for the disparity values. However, the requirements of applications differ with regard to density, accuracy and reliability of the depth values in the disparity maps. Some applications, e.g. multi-view interpolation for multi-view displays or user-adjustable depth, require dense disparity maps, while other applications require only a few but highly reliable disparity estimates. This is, for example, the case for stereoscopic positioning of text and graphics for 3D-menus or 3D-subtitles. Multi-view coding as a further application has requirements that are in-between those of multi-view interpolation and stereoscopic positioning, as an exact estimate may be less important than an accurate labeling of occlusions.
Apart from these differences, the level of reliability of the depth information plays an important role. Confidence information can be helpful in subsequent steps of the 3D processing chain, especially for refinement steps aiming to improve the quality of the disparity estimates.
A first approach to a confidence measure is the similarity function employed during the stereo matching. In global optimization schemes, a cost function typically comprises a data term for the similarity and an outlier term for consistency information. An additional smoothness term fosters the piece-wise smoothness of the disparity map by enforcing that depth discontinuities may only occur at color edges.
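For illustration only, a cost function of the kind outlined above may be sketched for a single scanline as follows. The way the three terms are combined, the exponential edge weight and all names and default values are assumptions made for this sketch and are not taken from any particular global optimization scheme.

```python
import math


def scanline_energy(disparity, data_cost, is_outlier, intensity,
                    outlier_penalty=2.0, smooth_weight=1.0, edge_sigma=10.0):
    """Evaluate a toy cost function for one scanline.

    disparity  -- list of integer disparities, one per pixel
    data_cost  -- data_cost[x][d] is the dissimilarity of pixel x at disparity d
    is_outlier -- list of booleans from a consistency check
    intensity  -- list of gray values used to locate color edges
    """
    energy = 0.0
    for x, d in enumerate(disparity):
        if is_outlier[x]:
            # Outlier term: a constant penalty replaces the unreliable data cost.
            energy += outlier_penalty
        else:
            # Data term: similarity-based matching cost.
            energy += data_cost[x][d]
    for x in range(len(disparity) - 1):
        # Smoothness term: disparity jumps are cheap across strong color edges
        # and expensive inside regions of uniform color.
        color_diff = intensity[x] - intensity[x + 1]
        edge_weight = math.exp(-(color_diff ** 2) / (2.0 * edge_sigma ** 2))
        energy += smooth_weight * edge_weight * abs(disparity[x] - disparity[x + 1])
    return energy
```

In this sketch, a depth discontinuity that coincides with a color edge adds almost nothing to the cost, whereas the same discontinuity inside a uniformly colored region is penalized in full.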
In most cases, the outlier term is a simple binary variable and all pixels that do not meet the left-right consistency are marked as unreliable. A further approach is described by L. Xu and J. Jia: “Stereo Matching: An Outlier Confidence Approach”, European Conference on Computer Vision (ECCV) (2008), pp. 775-787, introducing a soft outlier estimate that is assigned to each pixel according to its matching cost if the pixel passes the consistency test. A similar approach is described by Q. Yang et al.: “Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation and Occlusion Handling”, IEEE Trans. Pattern Anal. Mach. Intell. Vol. 31 (2009), pp. 492-504, wherein a continuous confidence value is determined by evaluating the uniqueness of the similarity function in terms of the ratio between the first and the second best match.
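The two kinds of measure discussed above, a binary left-right consistency flag and a continuous ratio-based confidence, may be illustrated by the following minimal sketch. It assumes the convention that a pixel at position x in the left image corresponds to position x minus its disparity in the right image, and that lower matching costs indicate better matches. The formulas are generic and are not the exact formulations of the cited publications; all names are hypothetical.

```python
def left_right_consistent(disp_left, disp_right, tolerance=0):
    """Binary consistency flag per pixel of one scanline.

    A left-image pixel is consistent if the right-image pixel it points to
    carries (nearly) the same disparity. Pixels pointing outside the image
    are flagged as inconsistent.
    """
    width = len(disp_left)
    flags = []
    for x, d in enumerate(disp_left):
        x_right = x - d  # assumed correspondence convention
        if 0 <= x_right < width:
            flags.append(abs(d - disp_right[x_right]) <= tolerance)
        else:
            flags.append(False)
    return flags


def ratio_confidence(costs):
    """Continuous confidence from the best and second best matching cost.

    Returns a value in [0, 1] for non-negative costs: close to 1 for a
    unique minimum, 0 when the two best candidates are equally good.
    """
    best, second = sorted(costs)[:2]
    if second == 0:
        return 0.0  # both candidates match perfectly, so the match is ambiguous
    return 1.0 - best / second


print(left_right_consistent([0, 1, 1, 2], [1, 1, 2, 9]))  # [False, True, True, False]
print(ratio_confidence([4.0, 1.0, 2.0]))                  # 0.5
```

The binary flag discards a pixel entirely, whereas the ratio yields a graded value that subsequent processing steps may use as a weight.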
The quality of a disparity map may be improved by a refining process. Typically, bilateral filters are employed for refining a disparity map. These filters are edge-preserving smoothing filters that employ a domain/spatial filter kernel and a range filter kernel, as exemplarily disclosed by C. Tomasi and R. Manduchi: “Bilateral Filtering for Gray and Color Images”, Sixth International Conference on Computer Vision (1998), pp. 839-846. Typically, the spatial filter kernel evaluates spatial distances, while the range filter kernel evaluates color or intensity differences.