1. Field of the Invention
This invention relates generally to image and video synthesis, and more particularly to the synthesis of light field image data used as input for light field 3D imaging systems. The term “light field” describes the transmission and modulation of light, including its direction, amplitude, frequency and phase, and therefore encompasses imaging systems that utilize techniques such as holography, integral imaging, stereoscopy, multi-view imaging, Free-viewpoint TV (FTV) and the like.
2. Prior Art
Light field displays modulate the light's intensity and direction to reconstruct the 3D objects of a scene without requiring specialized glasses for viewing. In order to accomplish this, light field displays usually utilize a large number of views, which imposes several challenges in the acquisition and transmission stages of the 3D processing chain. Compression is a necessary tool to cope with the huge data sizes involved, and systems commonly sub-sample the views at the generation stage and reconstruct the absent views at the display. For example, in Yan et al., “Integral image compression based on optical characteristic,” Computer Vision, IET, vol. 5, no. 3, pp. 164, 168 (May 2011) and Yan Piao et al., “Sub-sampling elemental images for integral imaging compression,” 2010 International Conference on Audio Language and Image Processing (ICALIP), pp. 1164, 1168 (23-25 Nov. 2010), the authors perform sub-sampling of elemental images based on the optical characteristics of the display system. A more formal approach to light field sampling can be found in the works of Jin-Xiang Chai et al., (2000) “Plenoptic sampling,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques (SIGGRAPH '00) and Gilliam, C. et al., “Adaptive plenoptic sampling”, 2011 18th IEEE International Conference on Image Processing (ICIP), pp. 2581, 2584 (11-14 Sep. 2011). In order to reconstruct the views at the display side, several different methods can be used, ranging from computer graphics methods to image-based rendering.
In computer graphics, the act of creating a scene or a view of a scene is known as view rendering. Usually, a complex 3D geometrical model incorporating lighting and surface properties from the camera point of view is used. This view rendering generally requires multiple complex operations and a detailed knowledge of the scene geometry. Alternatively, Image-Based Rendering (IBR) replaces the use of complex 3D geometrical models with the use of multiple surrounding viewpoints to synthesize views directly from input images that oversample the light field. Although IBR generates more realistic views, it requires a more intensive data acquisition process, data storage, and redundancy in the light field. To reduce the data handling penalty, Depth Image-Based Rendering (DIBR) uses depth information from the 3D geometrical model to reduce the number of required IBR views. (See U.S. Pat. No. 8,284,237, “View Synthesis Reference Software (VSRS) 3.5,” wg11.sc29.org, March 2010, and C. Fehn, “3D-TV Using Depth-Image-Based Rendering (DIBR),” in Proceedings of Picture Coding Symposium, San Francisco, Calif., USA, December 2004.) Each view has a depth value associated with each pixel position, collectively known as a depth map, which is then used to synthesize the absent views.
DIBR methods, like the ones depicted in FIG. 1, usually have three distinct stages: namely, view warping (or view projection), view merging 105 and hole filling 107. View warping is the reprojection of a scene captured by one camera to the image plane of another camera. This process utilizes the geometry of the scene, provided by the per-pixel depth information within the reference view, and the characteristics of the capturing device, i.e., the intrinsic (focal length, principal point) and extrinsic (rotation, 3D position) parameters of the camera (C. Fehn, “3D-TV Using Depth-Image-Based Rendering (DIBR),” in Proceedings of Picture Coding Symposium, San Francisco, Calif., USA, December 2004). The projection can be done in two separate stages: a forward warping 103 stage, projecting only the disparity values, and a backward warping stage 106, fetching the color value from the references. Since disparity warping can be affected by rounding and depth quantization, an optional disparity filtering 104 block can be added to the system to correct erroneous warped disparity values.
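By way of illustration only, the two-stage projection described above (forward warping of disparity followed by backward warping of color) may be sketched as follows. The sketch assumes rectified cameras displaced only horizontally, a disparity expressed in pixels per unit of camera displacement, and a sign convention in which a pixel at column x moves to column x + shift * disparity. The function names, the hole marker of -1 and the nearest-pixel rounding are assumptions of this sketch and are not taken from the cited references.

```python
import numpy as np

def forward_warp_disparity(ref_disp, shift):
    """Project the disparity map of a reference view to a target view.

    ref_disp: (H, W) non-negative disparities, in pixels per unit displacement.
    shift:    signed reference-to-target camera displacement (assumed convention).
    Returns the warped disparity map; -1 marks positions that received no pixel.
    """
    h, w = ref_disp.shape
    warped = np.full((h, w), -1.0)
    for y in range(h):
        for x in range(w):
            d = ref_disp[y, x]
            xt = int(round(x + shift * d))  # rounding is one source of cracks
            # When two pixels land on the same position, keep the larger
            # disparity, i.e. the surface closer to the camera.
            if 0 <= xt < w and d > warped[y, xt]:
                warped[y, xt] = d
    return warped

def backward_warp_color(ref_color, warped_disp, shift):
    """Fetch the color of each target pixel from the reference view."""
    h, w = warped_disp.shape
    out = np.zeros_like(ref_color)
    for y in range(h):
        for x in range(w):
            d = warped_disp[y, x]
            if d < 0:
                continue  # hole: left for the merging and hole filling stages
            xs = int(round(x - shift * d))  # invert the forward mapping
            if 0 <= xs < w:
                out[y, x] = ref_color[y, xs]
    return out
```

In this sketch a foreground pixel that moves away from its reference position leaves a hole behind it, which is the disocclusion that the later stages are meant to handle.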
After one reference view is warped, parts of the target image might still be unknown. Since objects at different depths move with different apparent speeds, part of the scene hidden by one object in the reference view may be disoccluded in the target view, while the color information of this part of the target view is not available from the reference. Typically, multiple references are used to try to cover the scene from multiple view points, so that disoccluded parts of one reference can be obtained from another reference image. With multiple views, not only the disoccluded parts of the scene can come from different references, but also parts of the scene can be visualized by multiple references at the same time. Hence, the warped views of the references may be complementary and overlapping at the same time. View merging 105 is the operation of bringing these multiple views together into one single view. If pixels from different views are mapped to the same position, the depth value is used to determine the dominant view, which will be given by either the closest view or an interpolation of several views.
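As a non-limiting sketch of the merging operation just described, the following selects, for every target position, the warped view whose depth is smallest (the closest surface). It implements only the "closest view" alternative, not the interpolation alternative. The function name, the use of np.inf to mark unknown depth, and the single-channel color arrays are assumptions made for illustration.

```python
import numpy as np

def merge_views(colors, depths):
    """Merge already-warped views by keeping the closest surface per pixel.

    colors: list of (H, W) color arrays, one per warped reference.
    depths: list of (H, W) depth arrays; np.inf marks unknown positions.
    Returns (merged_color, merged_depth, hole_mask).
    """
    color_stack = np.stack(colors)
    depth_stack = np.stack(depths)
    # Index of the view with the smallest depth at each pixel.
    nearest = np.argmin(depth_stack, axis=0)
    merged_depth = np.take_along_axis(depth_stack, nearest[None], axis=0)[0]
    merged_color = np.take_along_axis(color_stack, nearest[None], axis=0)[0]
    # A position is still a hole if no view provided a depth there.
    hole_mask = np.isinf(merged_depth)
    merged_color = np.where(hole_mask, 0, merged_color)
    return merged_color, merged_depth, hole_mask
```

Positions where the views are complementary are filled from whichever view covers them, while overlapping positions are resolved by depth.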
Even with multiple views, the possibility exists that part of the scene visualized at the target view has no correspondence to any color information in the reference views. Those positions lacking color information are called holes, and several hole filling 107 methods have been proposed to fill these holes with color information from surrounding pixel values. Usually holes are generated from object disocclusion, and the missing color is highly correlated to the background color. Several methods to fill in the holes according to the background information have been proposed (Kwan-Jung Oh et al., “Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-D video,” Picture Coding Symposium, 2009. PCS 2009, pp. 1, 4, 6-8, May 2009).
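The observation that missing color is highly correlated with the background may be illustrated by the following minimal sketch. It is a simplified stand-in and not the depth-based in-painting method of the cited reference: each hole is filled from the nearest valid pixel on the same row, preferring the side with the larger depth (the background). The function name and the row-wise search are assumptions of this sketch.

```python
import numpy as np

def fill_holes_from_background(color, depth, hole_mask):
    """Fill holes from the nearest valid row neighbour that is farther away.

    color, depth: (H, W) arrays of the merged view.
    hole_mask:    (H, W) boolean array, True where color is missing.
    """
    filled = color.copy()
    h, w = hole_mask.shape
    for y in range(h):
        for x in range(w):
            if not hole_mask[y, x]:
                continue
            # Nearest valid pixels to the left and right of the hole.
            left = x - 1
            while left >= 0 and hole_mask[y, left]:
                left -= 1
            right = x + 1
            while right < w and hole_mask[y, right]:
                right += 1
            candidates = [c for c in (left, right) if 0 <= c < w]
            if not candidates:
                continue  # the whole row is a hole; nothing to copy from
            # Prefer the neighbour with the larger depth (background).
            src = max(candidates, key=lambda c: depth[y, c])
            filled[y, x] = color[y, src]
    return filled
```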
Due to the limited resolution of display devices, DIBR methods have not been satisfactorily applied to full parallax light field images. However, with the advent of high resolution display devices having very small pixel pitch (U.S. Pat. No. 8,567,960), view synthesis of full parallax light fields using DIBR techniques is feasible.
Levoy et al. used light ray interpolation between two parallel planes to capture a light field and reconstruct its view points (Marc Levoy et al., (1996) “Light field rendering” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques (SIGGRAPH '96)). However, to achieve realistic results, this approach requires huge amounts of data to be generated and processed. If the geometry of the scene, specifically depth, is taken into account, then a significant reduction in data generation and processing can be realized.
In Steven J. Gortler et al., (1996) “The lumigraph” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques (SIGGRAPH '96), the authors propose the use of depth to correct the ray interpolation, and in Jin-Xiang Chai et al., (2000) “Plenoptic sampling” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques (SIGGRAPH '00) it was shown that the rendering quality is proportional to the number of views and the available depth. When more depth information is used, fewer references are needed. Disadvantageously, though, depth image based rendering methods have been error prone due to inaccurate depth values and the precision limitation of the synthesis methods.
Depth acquisition is a complicated problem by itself. Systems usually utilize an array of cameras, and the depth of an object can be estimated by matching corresponding object features at different camera positions. This approach is prone to errors due to occlusions or smooth surfaces. Lately, several active methods for depth acquisition have been used, such as depth cameras and time-of-flight cameras. Nevertheless, the captured depth maps still present noise levels that, despite their low amplitude, adversely affect the view synthesis procedure.
In order to cope with inaccurate geometry information, many methods apply a pre-processing step to filter the acquired depth maps. For example, in Kwan-Jung Oh et al., “Depth Reconstruction Filter and Down/Up Sampling for Depth Coding in 3-D Video,” Signal Processing Letters, IEEE, vol. 16, no. 9, pp. 747, 750 (September 2009), a filtering method is proposed that smoothes the depth map while enhancing its edges. In Shujie Liu et al., “New Depth Coding Techniques With Utilization of Corresponding Video”, IEEE Transactions on Broadcasting, vol. 57, no. 2, pp. 551, 561, (June 2011), the authors propose a trilateral filter, which adds the corresponding color information to the traditional bilateral filter to improve the matching between color and depth. Nevertheless, the pre-processing of depth information does not eliminate synthesis artifacts and can be computationally intensive and impractical for low-latency systems.
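For illustration, a generic trilateral depth filter of the kind mentioned above may be sketched as a bilateral filter whose weights are additionally multiplied by a color-similarity term. This is not necessarily the exact formulation of the cited reference; the Gaussian weights, the parameter names and their default values are assumptions of this sketch.

```python
import numpy as np

def trilateral_depth_filter(depth, color, radius=2,
                            sigma_s=2.0, sigma_d=1.0, sigma_c=10.0):
    """Smooth a depth map using spatial, depth and color similarity weights.

    depth, color: (H, W) arrays (single-channel color assumed for brevity).
    """
    depth = depth.astype(float)
    color = color.astype(float)  # avoid unsigned wrap-around in differences
    h, w = depth.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            d_win = depth[y0:y1, x0:x1]
            c_win = color[y0:y1, x0:x1]
            # Bilateral part: spatial closeness and depth similarity.
            w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            w_d = np.exp(-((d_win - depth[y, x]) ** 2) / (2 * sigma_d ** 2))
            # Third term: similarity of the co-located color samples.
            w_c = np.exp(-((c_win - color[y, x]) ** 2) / (2 * sigma_c ** 2))
            weights = w_s * w_d * w_c
            out[y, x] = np.sum(weights * d_win) / np.sum(weights)
    return out
```

The per-pixel window evaluation makes the cost of such pre-processing apparent, which is consistent with the latency concern noted above.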
A problem for view merging is the color mismatch between views. In Yang L et al., (2010) “Artifact reduction using reliability reasoning for image generation of FTV” J Vis Commun Image Represent, vol 21, pp 542-560 (July-August 2010), the authors propose the warping of a reference view to another reference view position in order to verify the correspondence between the two references. Unreliable pixels, that is, pixels that have a different color value in the two references, are not used during warping. In order not to reduce the number of reference pixels, H. Furihata et al., in “Novel view synthesis with residual error feedback for FTV,” in Proc. Stereoscopic Displays and Applications XXI, vol. 7524, January 2010, pp. 75240L-1-12, propose the use of a color correcting factor obtained from the difference between the corresponding pixels in the two reference views. Although the proposed method improved rendering quality, the improvement came at the cost of increased computational time and memory resources to check pixel color and depth.
Since prior-art synthesis methods are optimized for reference views close to each other, DIBR methods are less effective for light field sub-sampling, wherein reference views are further apart from each other. Furthermore, to reduce the data handling load, prior-art methods for view synthesis usually target horizontal parallax views only; vertical parallax information is left unprocessed.
In the process of 3D coding standardization (ISO/IEC JTC1/SC29/WG11, Call for Proposals on 3D Video Coding Technology, Geneva, Switzerland, March 2011), view synthesis is being considered as part of the 3D display processing chain, since it allows the decoupling of the capturing and the display stages. By incorporating view synthesis at the display side, fewer views need to be captured.
While the synthesis procedure is not part of the norm, the MPEG group provides a View Synthesis Reference Software (VSRS, U.S. Pat. No. 8,284,237) to be used in the evaluation of 3D video systems. The VSRS software implements state-of-the-art techniques for view synthesis, including all three stages: view warping, view merging and hole filling. Since VSRS can be used with any kind of depth (ranging from ground-truth depth maps obtained from computer graphics models to estimated depth maps from stereo pair images), many sophisticated techniques were incorporated to adaptively deal with depth map imperfections and synthesis inaccuracies. For example, FIG. 2 shows the flowchart of the adaptive merging operation adopted by VSRS. For the synthesis, only two views are used to determine the output 201, a left view and a right view. First, the absolute value of the difference between left and right depths is compared to a pre-determined threshold 202. If this difference is larger than the threshold (indicating that the depth values are very different from each other, and possibly related to objects in different depth layers), then the smallest depth value 203 determines the object that is closer to the camera, and the output view is taken to be either the left view 207 or the right view 208. In case the depth values are close to each other, then the number of holes is used to determine the output view. The absolute difference between the number of holes in the left and right views is compared 205 to a pre-determined threshold. In case both views have a similar number of holes, then an average 209 of the pixels coming from both views is used. Otherwise, the view with fewer holes 206 is selected as the output view.
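The decision flow described above for FIG. 2 may be sketched per pixel as follows. This is an illustrative reading of the description and not the VSRS source code; the function name, the argument names and the choice of a non-strict comparison for "similar number of holes" are assumptions.

```python
def adaptive_merge_pixel(left_color, right_color,
                         left_depth, right_depth,
                         left_holes, right_holes,
                         depth_threshold, hole_threshold):
    """Select the output pixel from a left and a right warped view.

    left_holes / right_holes are the hole counts of the two views.
    """
    # Depths differ strongly: the pixels likely belong to different depth
    # layers, so keep the one closer to the camera (smaller depth).
    if abs(left_depth - right_depth) > depth_threshold:
        return left_color if left_depth < right_depth else right_color
    # Depths are similar: decide from the number of holes in each view.
    if abs(left_holes - right_holes) <= hole_threshold:
        return (left_color + right_color) / 2  # similar quality: average
    # Otherwise trust the view with fewer holes.
    return left_color if left_holes < right_holes else right_color
```

Because the function is evaluated for every output pixel, the two comparisons and the hole counting recur across the whole image.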
This procedure is effective for unreliable warped pixels, since it detects wrong values and rejects them, but it incurs a high computational cost, because a complicated view analysis (depth comparison and hole counting) is performed for each pixel separately.
VSRS uses a horizontal camera arrangement and utilizes only two references. It is optimized for synthesis of views with small baselines (that is, views that are close to each other). It does not use the vertical camera information and is not suited to be used in light field synthesis. In Graziosi et al., “Depth assisted compression of full parallax light fields”, IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics (Mar. 17, 2015), a synthesis method that targets light fields and uses both the horizontal and vertical information was introduced. The method, called MR-DIBR (Multiple Reference Depth-Image Based Rendering), is depicted in FIG. 3 and utilizes multiple references 321, 322 and 323 with associated disparities 301, 302 and 303 to render the light field. At first, the disparities are forward warped 305 to the target position. Next, a filtering method 310 is applied to the warped disparities to mitigate artifacts such as cracks caused by inaccurate pixel displacement. The following step is to merge 315 all the filtered warped disparities. Pixels with smaller depth (closest to the viewer) are selected. VSRS blends color information from two views with similar depth values and obtains a blurred synthesized view; in contrast, the method of Graziosi et al. utilizes only one view after merging to preserve the high resolution of the reference view. Moreover, rendering time is reduced due to simple copying of the color information from only one reference rather than interpolating several references. Finally, the merged elemental image disparity 308 is used to backward warp 320 the color from the references' colors 321, 322 or 323 and generate the final synthesized elemental image 326.
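The sequence of MR-DIBR steps described above (forward warping, filtering, merging, backward warping from a single reference) may be sketched as follows. The sketch is illustrative only and is not the implementation of the cited reference: the sign convention of the offsets, the hole marker, the assumption of non-negative disparities, and in particular the simple one-pixel crack filter are all assumptions made for this example.

```python
import numpy as np

HOLE = -1.0  # marks target positions that no reference pixel maps to

def forward_warp(disp, dx, dy):
    """Warp a reference disparity map by a horizontal/vertical offset."""
    h, w = disp.shape
    out = np.full((h, w), HOLE)
    for y in range(h):
        for x in range(w):
            d = disp[y, x]
            xt, yt = int(round(x + dx * d)), int(round(y + dy * d))
            if 0 <= xt < w and 0 <= yt < h and d > out[yt, xt]:
                out[yt, xt] = d  # larger disparity (closer surface) wins
    return out

def fill_cracks(warped):
    """Hypothetical crack filter: fill one-pixel-wide horizontal holes."""
    out = warped.copy()
    h, w = warped.shape
    for y in range(h):
        for x in range(1, w - 1):
            l, r = warped[y, x - 1], warped[y, x + 1]
            if warped[y, x] == HOLE and l != HOLE and r != HOLE:
                out[y, x] = min(l, r)
    return out

def mr_dibr(ref_colors, ref_disps, offsets):
    """Synthesize one view from several references.

    offsets: list of (dx, dy) reference-to-target displacements.
    Returns (synthesized_color, merged_disparity).
    """
    warped = [fill_cracks(forward_warp(d, dx, dy))
              for d, (dx, dy) in zip(ref_disps, offsets)]
    stack = np.stack(warped)
    best = np.argmax(stack, axis=0)  # reference with the closest surface
    merged = np.max(stack, axis=0)
    h, w = merged.shape
    out = np.zeros((h, w), dtype=ref_colors[0].dtype)
    for y in range(h):
        for x in range(w):
            d = merged[y, x]
            if d == HOLE:
                continue
            k = best[y, x]
            dx, dy = offsets[k]
            xs, ys = int(round(x - dx * d)), int(round(y - dy * d))
            if 0 <= xs < w and 0 <= ys < h:
                # Color is copied from exactly one reference; no blending.
                out[y, x] = ref_colors[k][ys, xs]
    return out, merged
```

Copying from a single selected reference, rather than averaging, is what avoids the blur attributed above to blending.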
The view merging algorithm exhibits quality degradation when the depth values from the reference views are inaccurate. Methods for filtering depth values have been proposed (U.S. Pat. No. 8,284,237; C. Fehn, “3D-TV Using Depth-Image-Based Rendering (DIBR),” in Proceedings of Picture Coding Symposium, San Francisco, Calif., USA, (December 2004); and Kwan-Jung Oh et al., “Depth Reconstruction Filter and Down/Up Sampling for Depth Coding in 3-D Video”, Signal Processing Letters, IEEE, vol. 16, no. 9, pp. 747, 750, (September 2009)), but they increase the computational requirements of the system and can increase the latency of the display system.