Embodiments of the present invention relate generally to three-dimensional images, and more specifically to improved three-dimensional image synthesizing using depth image-based rendering (DIBR) and hierarchical hole-filling.
An increasing number of movies and TV programs are being produced and/or presented in stereoscopic 3D format. This trend is being driven, at least in part, by noticeable advances in stereoscopic display technologies. Three-dimensional television (3DTV) and 3D mobile TV are widely considered to be the future of multimedia broadcasting. 3DTV and other technologies can bring a more life-like and visually immersive experience to viewers.
In the future, viewers may have, for example, the freedom to navigate through a scene and choose among multiple viewpoints. This is known as free-viewpoint TV (FTV). This technology can also be desirable and applicable to, for example and not limitation, movie theaters, presentations, still pictures, computer generated images (CGI), and animation, where viewers view a 3D printed or projected image or motion picture.
Producing an FTV image can be complex. To produce stereoscopic 3D videos, for example, each individual viewpoint requires two videos corresponding to the left and right camera views. In addition, true multi-viewpoint video, such as true FTV, can require up to 32 viewpoints (or possibly more). Consequently, capturing and broadcasting arbitrary viewpoints for FTV can require an unrealistically high number of cameras, extremely complex coding, and expensive processors. In addition, advances in 3D display technologies, such as autostereoscopic displays, require flexibility in the number of views and/or the ability to resize each view to match the display resolution. Hence, generating FTV from the multi-camera capture of a large number of views can be cumbersome and expensive.
One alternative is to generate, or synthesize, the intermediate views using view synthesis. One method of view synthesis is the aforementioned DIBR. In DIBR, two or more views for 3D display can be generated from a single 2D image and a corresponding depth map (i.e., an image or image data that contains information relating to the distance of the surfaces in a scene from a particular viewpoint).
DIBR has several advantages including, but not limited to, high bandwidth-efficiency, interactivity, easy 2D to 3D switching, and high computational and cost efficiency. These advantages make it possible for a TV, or other multimedia display device, to receive a 2D image and a depth map, and to convert the 2D image and depth map into a 3D image. In addition, through DIBR, a TV or other multimedia display device can receive a series of 2D images and depth maps, and convert the 2D images and depth maps into 3D images, which can be shown in succession to form 3D video.
In addition, DIBR can be accomplished using one or two cameras (fewer than would be required if each viewpoint were captured by its own camera or set of cameras). DIBR also eliminates photometric asymmetries between the left and right views because both views are generated from and based on the same original image. The inherent advantages of DIBR have led the Moving Picture Experts Group (“MPEG”) to include it in its standard for coding the video-plus-depth format, which is known as MPEG-C Part 3. As shown in FIG. 1, the process of producing 3D images from captured content (with a 3DTV system, for example) can comprise six main steps: 3D video capturing and depth content generation 101; 3D content video coding 102; transmission 103; decoding the received sequences 104; generating virtual views 105; and displaying the stereoscopic images on the screen 106.
With DIBR, virtual views can be generated from the reference image and the corresponding depth map using a process known as 3D wrapping. The 3D wrapping technique allows mapping of a pixel at a reference view to a corresponding pixel at a virtual view at a desired location. This can be accomplished by first projecting the pixels at the reference view into world coordinates using explicit geometric information from the depth map and camera parameters. The pixels in the world coordinates can then be projected into the estimated virtual image coordinates to yield a 3D wrapped image.
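The two projection steps described above can be sketched for the simplest case: an idealized pinhole camera pair with parallel optical axes, identical focal lengths, and a virtual camera displaced only along the horizontal axis. The function names, the principal point (`cx`, `cy`), the sign convention for the camera offset `cam_x`, and all numeric values are assumptions made for illustration only.

```python
def unproject(x, y, z, focal, cx, cy):
    """Back-project pixel (x, y) with depth z into 3D coordinates of the reference camera."""
    return ((x - cx) * z / focal, (y - cy) * z / focal, z)


def project(point, focal, cx, cy, cam_x=0.0):
    """Project a 3D point into a camera shifted by cam_x along the horizontal axis."""
    px, py, pz = point
    return (focal * (px - cam_x) / pz + cx, focal * py / pz + cy)


def warp_pixel(x, y, z, focal, cx, cy, cam_x):
    """Map a reference-view pixel to the virtual view: unproject, then reproject."""
    return project(unproject(x, y, z, focal, cx, cy), focal, cx, cy, cam_x)


# Hypothetical pixel at depth 2.0, viewed by a virtual camera offset by 0.1 units.
xv, yv = warp_pixel(x=400.0, y=300.0, z=2.0, focal=500.0, cx=320.0, cy=240.0, cam_x=0.1)
```

Under these assumptions the pixel keeps its row and moves horizontally by `focal * cam_x / z` (25 pixels in this example), so pixels closer to the camera move farther than distant ones.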
To better understand DIBR, consider a reference camera Cr and a virtual camera Cv, as shown in FIG. 2. Further consider that Fr and Fv are the focal lengths of the reference and the virtual cameras, respectively (for simplicity, Fr and Fv are assumed to be equal, but do not have to be). Additionally, B is the baseline distance that separates the two cameras, and Zc is the convergence distance of the two camera axes. The horizontal coordinate vector Xv of the virtual camera, as a function of the horizontal coordinate vector Xr of the reference camera, is given by:
$$\underline{X}_v = \underline{X}_r + s\,\frac{F_v B}{\underline{Z}} + h$$
where s=−1 when the estimated view is to the left of the reference view and s=+1 when the estimated view is to the right of the reference view, Z is a vector of the depth values at pixel location (xr, yr), and h is the horizontal shift in the camera axis, which can be estimated as:
$$h = -s\,\frac{F_v B}{Z_c}$$
In some applications the depth value is presented in terms of disparity maps. In such cases, the depth vector Z at a certain pixel location can be obtained from the disparity vector D as:
$$\underline{Z} = \frac{F_r b}{\underline{D}}$$
where b is the original baseline distance of the stereo camera pair used in the disparity calculation. Finally, the wrapping equation can be expressed in terms of disparity as:
$$\underline{X}_v = \underline{X}_r + s\,\frac{F_v B\,\underline{D}}{F_r b} - s\,\frac{F_v B}{Z_c}$$
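As a numeric check of the relations above, the following minimal sketch evaluates the depth form and the disparity form for a single pixel and confirms that they agree. The function and parameter names (`baseline` for B, `stereo_baseline` for b) and all numeric values are hypothetical and chosen only to keep the arithmetic simple.

```python
def shift_from_depth(x_r, z, f_v, baseline, z_c, s):
    """Virtual-view coordinate from depth: x_r + s*F_v*B/Z + h, with h = -s*F_v*B/Z_c."""
    h = -s * f_v * baseline / z_c
    return x_r + s * f_v * baseline / z + h


def depth_from_disparity(d, f_r, stereo_baseline):
    """Depth from disparity: Z = F_r * b / D."""
    return f_r * stereo_baseline / d


def shift_from_disparity(x_r, d, f_v, f_r, baseline, stereo_baseline, z_c, s):
    """Virtual-view coordinate from disparity, obtained by substituting Z = F_r*b/D."""
    return (x_r
            + s * f_v * baseline * d / (f_r * stereo_baseline)
            - s * f_v * baseline / z_c)


# Hypothetical values: equal focal lengths, view to the right of the reference (s = +1).
params = dict(f_v=1000.0, baseline=0.05, z_c=5.0, s=1)
z = depth_from_disparity(d=40.0, f_r=1000.0, stereo_baseline=0.1)  # depth of 2.5
x_depth = shift_from_depth(x_r=100.0, z=z, **params)
x_disp = shift_from_disparity(x_r=100.0, d=40.0, f_r=1000.0, stereo_baseline=0.1, **params)
```

With these values both forms place the pixel about 10 positions to the right of its reference location. A pixel lying exactly at the convergence distance (Z equal to Zc) is not shifted at all, because the two terms cancel.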
3D wrapping does not always result in a perfect image. Synthesized views using 3D wrapping may contain holes for a variety of reasons. Often, the holes are caused by disocclusion, which is primarily caused by two factors. Disocclusion can be caused, for example, by uniform sampling in the reference image becoming non-uniform in the desired image due to the virtual viewing angle. In other cases, holes can be caused simply because formerly occluded areas in the reference image become visible in the virtual image. In other words, as the image is manipulated, features come into, and go out of, view. Holes can also be the result of, for example and not limitation, inaccurate depth maps, errors in transmission, or noise in the depth map or image signal. FIGS. 3a-3c show several examples of synthesized images immediately after 3D wrapping. The holes in these figures tend to appear as black areas and/or black lines.
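How disocclusion produces holes can be illustrated on a single image row. The sketch below is a simplified forward mapping under assumed conditions (integer rounding of target positions, a depth test so that nearer pixels win, and `None` as a hole marker); the function name and the toy row are hypothetical.

```python
def warp_scanline(colors, depths, f_v, baseline, z_c, s=1):
    """Forward-map one image row; positions that no source pixel lands on stay None."""
    width = len(colors)
    out = [None] * width
    out_depth = [float("inf")] * width
    h = -s * f_v * baseline / z_c
    for x_r, (color, z) in enumerate(zip(colors, depths)):
        x_v = round(x_r + s * f_v * baseline / z + h)
        # Keep the pixel only if it falls inside the row and is nearer than
        # anything already written at that position.
        if 0 <= x_v < width and z < out_depth[x_v]:
            out[x_v] = color
            out_depth[x_v] = z
    return out


# Toy row: background "b" at depth 10, a three-pixel foreground object "F" at depth 2.
colors = ["b"] * 4 + ["F"] * 3 + ["b"] * 5
depths = [10.0] * 4 + [2.0] * 3 + [10.0] * 5
row = warp_scanline(colors, depths, f_v=100.0, baseline=0.1, z_c=10.0, s=1)
# row == ['b', 'b', 'b', 'b', None, None, None, 'b', 'F', 'F', 'F', 'b']
```

In this toy case the background lies at the convergence distance and does not move, while the foreground object shifts four positions to the right. The three positions it vacates were hidden behind it in the reference row, so nothing maps onto them and they remain holes.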
The presence of holes as the result of DIBR is a challenging problem because there is little or no information that can be derived from the depth map or the reference camera about disoccluded areas. One method that has been used to attempt to cure this problem is Gaussian filtering of the depth map, which is generally exemplified in FIG. 4. In this method, a Gaussian filter smooths sharp transitions in the depth map, thereby reducing the size of holes in the 3D wrapped image. The smaller holes can then be filled using an average filter. A problem with this approach is that smoothing the depth map results in geometric distortions, as can be seen in the imperfections in FIGS. 5a and 5b.
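The effect of depth-map smoothing can be sketched on one row of depth values. This is a simplified one-dimensional illustration, not the filter used in any particular prior method: the kernel width, the border handling (clamping), and the function names are all assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_row(values, sigma=1.5, radius=3):
    """Convolve a row with a Gaussian kernel, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += w * values[j]
        out.append(acc)
    return out


def largest_step(values):
    """Largest depth difference between horizontally adjacent samples."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


depth_row = [10.0] * 6 + [2.0] * 6   # sharp edge between background and foreground
smoothed = smooth_row(depth_row)
```

After smoothing, `largest_step(smoothed)` is smaller than `largest_step(depth_row)`, so neighbouring pixels receive more similar shifts and the gaps between them shrink. The same change also alters depth values near the edge, which is the source of the geometric distortion noted above.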
To remedy these distortions, Zhang et al. proposed using a symmetric Gaussian filter followed by average filtering of the image. L. Zhang & W. J. Tam, Stereoscopic Image Generation Based On Depth Images For 3DTV, IEEE TRANS. ON BROADCASTING, vol. 51, no. 2, pp. 191-199 (June 2005). One drawback of this approach is that it changes the original depth values, resulting in a loss of depth cues after wrapping. This loss tends to lead to image distortion.
Criminisi et al. developed an inpainting technique for hole-filling in DIBR. A. Criminisi et al., Object Removal by Exemplar-Based Inpainting, IEEE TRANSACTIONS ON IMAGE PROCESSING, p. 13 (2004). Criminisi's method fills the holes by using a texture synthesis algorithm that gives higher weights to linear structures in an attempt to reduce image distortion. Unfortunately, the subjective results from Criminisi's technique have shown only a very slight improvement over the quality obtained by other methods and the resulting videos tend to suffer from severe flicker as a result of temporal inconsistencies.
Vazquez et al. developed a technique of horizontal interpolation to reduce holes. C. Vazquez et al., Stereoscopic Imaging: Filling Disoccluded Areas in Depth Image-Based Rendering, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 6392 (October 2006). However, this method tends to cause severe and undesirable distortion to the texture of the background.
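The general idea of horizontal interpolation can be sketched on one row of intensity values. This is a generic linear interpolation between the nearest valid pixels on either side of a hole, offered only as an illustration; it is not claimed to reproduce the specific algorithm of Vazquez et al., and the function name and the `None` hole marker are assumptions.

```python
def fill_row_by_interpolation(row):
    """Fill runs of None by linear interpolation between the nearest valid neighbours."""
    filled = list(row)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        left = filled[start - 1] if start > 0 else None
        right = filled[i] if i < n else None
        if left is None and right is None:
            continue  # the whole row is a hole; nothing to interpolate from
        if left is None:
            left = right   # hole touches the left border: repeat the right value
        if right is None:
            right = left   # hole touches the right border: repeat the left value
        span = i - start + 1
        for k in range(start, i):
            t = (k - start + 1) / span
            filled[k] = (1 - t) * left + t * right
    return filled


row = [10.0, 10.0, None, None, None, 50.0, 50.0]
filled = fill_row_by_interpolation(row)
# filled == [10.0, 10.0, 20.0, 30.0, 40.0, 50.0, 50.0]
```

The filled values are a blend of whatever lies on either side of the hole. When one side is a foreground object and the other is textured background, the blend matches neither, which is consistent with the background distortion noted above.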
Thus, there is a need to produce high quality 3D images in a manner that is computationally efficient and requires reduced bandwidth. There is also a need for a method of removing holes from a view synthesized by DIBR that does not distort the image, does not lead to flickering, and results in a high quality image. There is also a need for a method of removing holes from a view synthesized by DIBR when the image is distorted or corrupted due to transmission errors and/or distorted or corrupted signals. It is to these issues that embodiments of the present invention are primarily directed.