Most visual content today is still in the form of two-dimensional (2D) images or videos, the latter being sequences of 2D images. Generally, these conventional images and videos do not support a change in the vantage point or viewpoint of the observer, other than magnification/scaling or simple shifting. However, new display technologies that provide stereo or three-dimensional (3D) images are becoming more widely available. These generally rely on either active shutter or passive polarized eyeglasses.
More recently, high-resolution autostereoscopic displays, which do not require eyeglasses, have also become more available. The input to such autostereoscopic displays is typically i) a video image plus a depth map which describes the depth of each pixel in the video or ii) a set of videos at adjacent viewpoints, sometimes called a multi-view video, wherein the adjacent views are multiplexed onto an image frame in a certain format. A lenticular lens or parallax barriers of the autostereoscopic displays perform a spatial filtering so that a user at a certain viewing position will be able to see two different images in his/her left and right eyes respectively, thus creating 3D perception.
Displaying conventional 2D images or videos on a 3D display requires the generation of another view of the scene. The display of 3D videos on autostereoscopic displays, in turn, requires either the generation of a depth map or the creation of appropriate multi-view videos that are to be multiplexed into the desired frame format.
One method to facilitate the generation of these additional views is to augment the videos with corresponding depth maps or their approximated versions. Depth maps are images (or videos if taken at regular time intervals) that record the distances of observable scene points from the optical point of a camera. They provide additional information to the associated color pixels in the color image or video taken at the same position by specifying their depths in the scene. One application of depth maps is to synthesize new views of the scene from the color image or videos (also referred to as texture). Depth maps can also be taken at adjacent spatial locations to form multi-view depth images or videos. Together with the texture or color videos, new virtual views around the imaging locations can be synthesized. See, S. C. Chan et al., “Image-based rendering and synthesis,” IEEE Signal Processing Magazine, vol. 24, pp. 22-33, (2007) and S. C. Chan, Z. F. Gan et al., “An object-based approach to image-based synthesis and processing for 3-D and multiview televisions,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 6, pp. 821-831, (June 2009), which are incorporated herein by reference in their entirety. These synthesized views, if appropriately generated, can support the display of the content in conventional 2D, stereo or autostereoscopic displays and provide limited viewpoint changes.
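As an illustration of how a depth map can drive view synthesis, the following minimal sketch forward-warps a single scanline to a horizontally shifted virtual camera. It is not taken from the cited references. It assumes rectified cameras, a disparity of focal length times baseline divided by depth, and a leftward shift for a camera moved to the right; the names `warp_row`, `focal_px` and `baseline` are hypothetical.

```python
def warp_row(colors, depths, focal_px, baseline):
    """Forward-warp one scanline to a horizontally shifted virtual camera.

    colors   : list of pixel values (the texture)
    depths   : list of positive distances, one per pixel
    focal_px : focal length in pixels (assumed parameter)
    baseline : camera shift in the same unit as the depths (assumed parameter)
    Returns a list of the same width; None marks a pixel that no source
    pixel landed on (a dis-occluded hole).
    """
    width = len(colors)
    out = [None] * width
    out_depth = [float("inf")] * width  # z-buffer for the virtual view
    for x in range(width):
        # Assumed pinhole relation: nearer points shift more.
        disparity = focal_px * baseline / depths[x]
        tx = x - int(round(disparity))
        # Keep the nearest surface when several pixels map to one target.
        if 0 <= tx < width and depths[x] < out_depth[tx]:
            out[tx] = colors[x]
            out_depth[tx] = depths[x]
    return out


row = warp_row(["b0", "b1", "f2", "f3", "b4", "b5"],
               [10, 10, 2, 2, 10, 10], focal_px=4, baseline=1)
print(row)
```

In this example the two foreground pixels move two positions to the left and cover background pixels, while the positions they vacate receive no data and remain `None`.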
For conventional videos, augmenting each image frame with a depth map results in an additional depth video, and the format is sometimes referred to as the 2D plus depth representation. How the video and depth are put together leads to different formats. In the white paper, 3D Interface Specifications, Philips 3D Solutions, http://www.business-sites.philips.com/shared/assets/global/Downloadablefile/Philips-3D-Interface-White-Paper-13725.pdf, the 2D plus depth format packs the video and depth image side by side in a frame as a physical input interface to the autostereoscopic displays. There is an extended version called the “WOWvx declipse” format, where the input frame is further split into four quadrants, with two additional sub-frames storing the background occluded by the foreground objects and its depth values, respectively. There is no shape information, and hence the format is likely to rely on an accurate depth map to locate the boundaries of the foreground objects so that the occluded areas can be filled in during rendering. This may be prone to errors due to the acquisition or compression of the depth map. Also, the whole occluded background of the objects is required, which is usually unnecessary as the number of occlusion areas depends on the depth and the maximum viewing range. Usually, occlusion data are required only for the important objects with large depth discontinuities, and the minor occlusion can be handled by “inpainting.” Inpainting (also known as image interpolation or video interpolation) refers to the application of sophisticated algorithms to replace lost or corrupted parts of the image data (mainly replacing small regions or removing small defects). The most significant limitation of this representation is that it cannot handle semi-transparent objects, as objects or backgrounds are assumed to be fully occluded. The four-quadrant representation also significantly limits the resolution of both the principal video and its depth.
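The side-by-side packing described above can be pictured with the following minimal sketch. It only illustrates the general idea of placing the texture in the left half of a frame and the depth image in the right half; it does not reproduce the header, subsampling or other details of the Philips specification, and the function names are hypothetical.

```python
def pack_side_by_side(texture, depth):
    """Place a texture image and its depth image side by side in one frame.

    Both inputs are lists of rows with identical dimensions (an assumption
    of this sketch). Each output row is the texture row followed by the
    depth row.
    """
    if len(texture) != len(depth):
        raise ValueError("texture and depth must have the same height")
    frame = []
    for t_row, d_row in zip(texture, depth):
        if len(t_row) != len(d_row):
            raise ValueError("texture and depth must have the same width")
        frame.append(list(t_row) + list(d_row))
    return frame


def unpack_side_by_side(frame):
    """Split a packed frame back into its texture and depth halves."""
    half = len(frame[0]) // 2
    texture = [row[:half] for row in frame]
    depth = [row[half:] for row in frame]
    return texture, depth


frame = pack_side_by_side([[1, 2], [3, 4]], [[9, 8], [7, 6]])
print(frame)
print(unpack_side_by_side(frame))
```

Because the packed frame has a fixed size at the display interface, each sub-image can occupy only part of it. Splitting the frame into four quadrants leaves each sub-frame a quarter of the frame area, which is the resolution limitation noted above.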
Here, the 2D plus depth format or representation refers to the use of both video plus depth for view synthesis and is not limited to the physical format used in the Philips White Paper.
For 3D videos, each video in the stereo video can be augmented with a depth video. In HEVC-3D, two or more videos are coded together with their respective depth maps. See, G. Tech et al., “3D-HEVC draft text 1,” Proceedings of the 5th Meeting of Joint Collaborative Team on 3D Video Coding Extensions (JCT-3V), Document JCT3V-E1001, Vienna, Austria, (August 2013), which is incorporated herein by reference in its entirety. The main motivation for using such a multiple video-plus-depth format is to synthesize new views from two adjacent video-plus-depth videos. The videos and depths have to be compressed and decompressed using the HEVC-3D codec. Videos coded in other formats have to be transcoded together with the depth map to the new HEVC-3D format. One of the applications of the view synthesis functionality is to generate multiple views for supporting autostereoscopic displays, which generally require five or more views.
Currently, there are two important problems in such applications, i.e., (1) texture and depth consistency at depth discontinuities, and (2) artifacts from inpainting dis-occluded areas.
The quality of view synthesis using multiple videos and depth maps is highly dependent on the quality of the depth maps. Inaccurate alignment of depth discontinuities between views and inconsistency between the texture and depth discontinuities usually yield severe rendering artifacts around object boundaries. The accuracy required to avoid these artifacts is generally hard to achieve because of the limited accuracy of depth maps and the distortion introduced by data compression. The consistency of texture and depth discontinuities is also crucial to the general 2D plus depth representation, since significant artifacts will result if the discontinuities are not properly handled.
Artifacts can arise from inpainting dis-occluded areas and from the image-plus-depth representation during view synthesis. Due to dis-occlusion, holes will appear at sharp depth discontinuities when the new view is generated from the texture and depth map. The conventional method to address this problem is to inpaint the holes from neighboring pixels. Though the WOWvx declipse format provides the occlusion data at the physical level, it is unsuitable for transmission and storage where bandwidth or storage is limited. The occlusion data are generally larger than what the required viewpoint change calls for. Since the format does not have precise shape information, its use will rely heavily on the depth map, which may be subject to estimation or compression errors for natural videos. Also, it does not support multiple views or semi-transparent objects. For stereo videos, data have to be appropriately extracted from the other views to inpaint these holes in real time.
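The conventional approach of inpainting holes from neighboring pixels can be illustrated with the following minimal sketch for one scanline of a synthesized view. It is only one simple possibility and not a method described in the cited works. It assumes that holes are marked with `None` and that, when a hole has valid neighbors on both sides, copying from the farther (background) neighbor is preferable, which is a common heuristic; the name `fill_holes_row` is hypothetical.

```python
def fill_holes_row(colors, depths):
    """Fill None entries of a warped scanline from a neighboring pixel.

    colors : warped pixel values, None where a hole appeared
    depths : warped depths, valid wherever colors is not None
    Each run of holes is filled with the value of the adjacent valid pixel
    that is farther from the camera (assumed to be background).
    """
    width = len(colors)
    out = list(colors)
    x = 0
    while x < width:
        if out[x] is not None:
            x += 1
            continue
        start = x
        while x < width and out[x] is None:
            x += 1  # x stops at the first valid pixel after the run
        left = start - 1 if start > 0 else None
        right = x if x < width else None
        if left is None and right is None:
            break  # the whole row is a hole; nothing to copy from
        if left is None:
            src = right
        elif right is None:
            src = left
        else:
            src = left if depths[left] >= depths[right] else right
        for i in range(start, x):
            out[i] = colors[src]
    return out


filled = fill_holes_row(["f2", "f3", None, None, "b4", "b5"],
                        [2, 2, None, None, 10, 10])
print(filled)
```

Copying a single neighbor into the whole hole stretches the background and is visibly wrong for textured or complicated backgrounds, which is one reason more sophisticated inpainting is often needed.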
Sometimes, sophisticated inpainting algorithms or even human intervention are needed to reduce rendering artifacts. The reasons include i) the viewpoint change between the two views, ii) a complicated dis-occluded background, and iii) inconsistency between the depth and color videos, especially at significant depth discontinuities. Sophisticated inpainting makes real-time and reliable view synthesis with low artifacts extremely difficult. Moreover, mismatches in color, edge locations and depth discontinuities between the two views will result in significant ghosting or “double images.”