Three dimensional (3D) displays add a third dimension to the viewing experience by providing a viewer's two eyes with different views of the scene being watched. This can be achieved by having the user wear glasses to separate two views that are displayed. However, as this may be considered inconvenient to the user, it is in many scenarios preferred to use autostereoscopic displays that use means at the display (such as lenticular lenses or barriers) to separate views, and to send them in different directions where they individually may reach the user's eyes. For stereo displays, two views are required whereas autostereoscopic displays typically require more views (e.g. nine views).
In order to fulfill the desire for 3D image effects, content is created to include data that describes 3D aspects of the captured scene. For example, for computer generated graphics, a three dimensional model can be developed and used to calculate the image from a given viewing position. Such an approach is for example frequently used for computer games which provide a three dimensional effect.
As another example, video content, such as films or television programs, is increasingly generated to include some 3D information. Such information can be captured using dedicated 3D cameras that capture two simultaneous images from slightly offset camera positions. In some cases, more simultaneous images may be captured from further offset positions. For example, nine cameras offset relative to each other could be used to generate images corresponding to the nine viewpoints of a nine view autostereoscopic display.
However, a significant problem is that the additional information results in substantially increased amounts of data, which is impractical for the distribution, communication, processing and storage of the video data. Accordingly, the efficient encoding of 3D information is critical. Therefore, efficient 3D image and video encoding formats have been developed that may reduce the required data rate substantially.
A popular approach for representing three dimensional images is to use one or more layered two dimensional images with associated depth data. For example, a foreground and background image with associated depth information may be used to represent a three dimensional scene, or a single image and associated depth map can be used.
The encoding formats allow a high quality rendering of the directly encoded images, i.e. they allow high quality rendering of images corresponding to the viewpoint for which the image data is encoded. The encoding format furthermore allows an image processing unit to generate images for viewpoints that are displaced relative to the viewpoint of the captured images. Similarly, image objects may be shifted in the image (or images) based on depth information provided with the image data. Further, areas not represented by the image may be filled in using occlusion information if such information is available.
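The generation of an image for a displaced viewpoint by shifting image objects according to their depth may be illustrated by the following minimal sketch for a single image row. It is an illustration under stated assumptions rather than a rendering method taken from any particular encoding format: the function name `shift_row`, the parameter `view_offset`, the sign convention of the shift, and the rule that a larger disparity denotes a nearer object are all assumptions made for this example.

```python
def shift_row(pixels, disparities, view_offset):
    """Forward-warp one image row to a displaced viewpoint.

    pixels       -- list of pixel values for the row
    disparities  -- list of per-pixel disparities, in pixels
    view_offset  -- hypothetical scale factor for the viewpoint displacement

    Assumption: a larger disparity means a nearer object, so it wins when two
    source pixels land on the same target position. Positions that receive no
    pixel stay None; these are the de-occluded areas that would need to be
    filled in, e.g. from occlusion information if it is available.
    """
    width = len(pixels)
    shifted = [None] * width
    winning_disparity = [None] * width
    for x in range(width):
        target = x + int(round(view_offset * disparities[x]))
        if not 0 <= target < width:
            continue  # pixel moves outside the image
        current = winning_disparity[target]
        if current is None or disparities[x] > current:
            shifted[target] = pixels[x]
            winning_disparity[target] = disparities[x]
    return shifted
```

In such a sketch, any error in the disparity of a pixel translates directly into a positional error in the generated view, which is why the quality of the depth information matters for the rendered result.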
However, whereas an encoding of 3D scenes using one or more images with associated depth maps providing depth information allows for a very efficient representation, the resulting three dimensional experience is highly dependent on sufficiently accurate depth information being provided by the depth map(s).
Furthermore, much content is generated or provided as stereo images without associated depth information. For many operations, it is accordingly desirable to determine depth information for the scene and image objects based on depth estimation. In practice, the disparity between images directly reflects the depth of an object, and the terms depth and disparity are often used interchangeably. Specifically, a disparity value is also a depth value, and a depth value is also a disparity value.
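The correspondence between disparity and depth may be illustrated with the standard relation for a rectified stereo pair of pinhole cameras with parallel optical axes, in which depth equals focal length times baseline divided by disparity. The following sketch assumes exactly that camera model; the function and parameter names are hypothetical and chosen for illustration only.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity to a depth for a rectified, parallel stereo pair.

    Assumed model (not stated in the text above): Z = f * B / d, with
      disparity_px     -- horizontal offset d between matching points, in pixels
      focal_length_px  -- focal length f, in pixels
      baseline_m       -- distance B between the two camera centers, in meters
    Returns the depth Z in meters.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px
```

Under this model the mapping is one-to-one and monotonic, with larger disparities corresponding to nearer objects, which is consistent with the two terms being used interchangeably.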
Many different techniques are known for estimating depth/disparity information. Disparity estimation may be used for various 3D-related applications including for example multi-view rendering from stereo, disparity adjustment for stereo viewing, machine vision for robot navigation, etc.
In disparity estimation, a distance between corresponding points in two or more images is estimated, usually with the intention to infer depth via triangulation using known camera parameters. For example, if two images corresponding to different viewing angles are provided, matching image regions may be identified in the two images and the depth/disparity may be estimated by the relative offset between the positions of the regions. Thus, algorithms may be applied to estimate disparities between two images with the disparities directly indicating a depth of the corresponding objects. The detection of matching regions may for example be based on a cross-correlation of image regions across the two images. An example of disparity estimation may be found in D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms”, International Journal of Computer Vision, 47(1/2/3):7-42, April-June 2002.
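The identification of matching image regions may be illustrated by the following minimal block-matching sketch. It is not the method of the cited reference: for brevity it uses a sum of absolute differences as the matching cost instead of a cross-correlation, it assumes rectified grayscale images given as lists of rows, and it assumes that a point at column x in the left image appears at column x - d in the right image. All function and parameter names are hypothetical.

```python
def block_sad(left, right, y0, x0, block, d):
    """Sum of absolute differences between a left block and the right block shifted by d."""
    total = 0
    for y in range(y0, y0 + block):
        for x in range(x0, x0 + block):
            total += abs(left[y][x] - right[y][x - d])
    return total


def estimate_block_disparity(left, right, block=2, max_disp=3):
    """Return one disparity value per block of the left image.

    left, right -- grayscale images of equal size, as lists of rows
    block       -- block size in pixels (assumed square)
    max_disp    -- largest candidate disparity that is searched
    """
    height, width = len(left), len(left[0])
    disparity = []
    for y0 in range(0, height - block + 1, block):
        row = []
        for x0 in range(0, width - block + 1, block):
            best_d, best_cost = 0, None
            # Only search shifts that keep the right block inside the image.
            for d in range(0, min(max_disp, x0) + 1):
                cost = block_sad(left, right, y0, x0, block, d)
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            row.append(best_d)
        disparity.append(row)
    return disparity
```

Because each block is assigned a single value, a map produced in this way is blocky, and blocks with little texture or with repeated structure can match equally well at several offsets, which is one source of the errors in estimated disparity maps.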
However, although disparity estimation may be useful for determining depth information in many situations, it tends to not provide ideal performance and the generated depth information may be quite noisy and comprise inaccuracies.
US2011/234765 discloses an apparatus capable of suppressing erroneous correction which easily occurs in the vicinity of the boundary between the foreground and the background and generating a parallax map with high accuracy.
US2013/308826 discloses that when a peak of the frequency distribution appears discretely on the histogram where the parallax (distance information) is a variable, and the distribution width of the distance information is wide, a target region expressed as a histogram is normally a region where a closer object and a farther object whose distances from the stereo camera are discrete coexist and is called “perspective conflict region”.
In many cases, a color-adaptive (bi-lateral) filter with a large filter kernel is applied to either up-scale a low-resolution disparity estimate or more often to reduce errors/noise in the disparity estimates. When applied to image based rendering of 3D video, this filter ensures stable and often smooth disparity maps. However, it also results in new artifacts caused by the filtering operation. If an object, and in particular its edge, has a varying color profile, the disparity values will also tend to (incorrectly) vary over the edge. Such a varying color profile can be caused for example by lighting changes or shadows. This causes disparity variations over the object, and results in distorted edges in synthesized views. These distortions are disturbing for a human observer, as our human visual system is particularly sensitive to (distortions in) straight edges.
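The principle of such color-adaptive filtering may be illustrated by the following minimal sketch, in which each disparity value is replaced by a weighted average of its neighbors and the weights are derived from the similarity of the corresponding image values. The sketch makes simplifying assumptions: a grayscale guide image, a Gaussian weight on the intensity difference, and a uniform (box) spatial window rather than a spatial Gaussian. The function name and the parameters `radius` and `sigma_color` are hypothetical.

```python
import math


def color_adaptive_filter(disparity, guide, radius=2, sigma_color=10.0):
    """Filter a disparity map using weights taken from a guide image.

    disparity   -- disparity map, as a list of rows
    guide       -- grayscale image of the same size, as a list of rows
    radius      -- half-size of the square filter kernel, in pixels
    sigma_color -- spread of the Gaussian weight on intensity differences
    """
    height, width = len(disparity), len(disparity[0])
    filtered = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weight_sum = 0.0
            value_sum = 0.0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    diff = guide[ny][nx] - guide[y][x]
                    # Neighbors with a similar intensity contribute more.
                    weight = math.exp(-(diff * diff) / (2.0 * sigma_color ** 2))
                    weight_sum += weight
                    value_sum += weight * disparity[ny][nx]
            # The center pixel always has weight 1, so weight_sum is never zero.
            filtered[y][x] = value_sum / weight_sum
    return filtered
```

Since the weights depend only on the image values, a change of intensity within a single object (for example a shadow across an edge) changes which neighbors contribute to the average, so the filtered disparity may vary over the object even though its true depth does not.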
Such distortions may cause a significant perceived quality reduction by a human observer, e.g. when graphics overlays are present.
To illustrate this, the stereo images of FIG. 1 may be considered. In the example, a textured image is overlaid by a graphics overlay. FIG. 2 illustrates the left image and an estimated block-based disparity. Errors are clearly visible in the disparity map. FIG. 3 illustrates the left image and estimated disparity after color adaptive filtering has been applied to the disparity map of FIG. 2. Although the disparity map is less blocky and appears smoother, there are still substantial disparity errors in the area around the text.
Similar depth and disparity artefacts may arise for other approaches of generating depth information for three dimensional images, and may degrade perceived quality of the resulting three dimensional images that are presented to a user.
Hence, generation of improved disparity data would be advantageous and in particular generation or determination of disparity values allowing increased flexibility, reduced complexity, facilitated implementation, improved perceived depth, improved performance, reduced perceived depth artifacts, and/or an improved three dimensional image would be advantageous.