The present invention relates to a virtual viewpoint image synthesizing method and a virtual viewpoint image synthesizing system, in which a synthetic image viewed from a virtual viewpoint is obtained based on images captured at a plurality of locations.
A virtual viewpoint image is an image which appears as if captured by a real camera at a virtual location. For example, when an object and its background are captured by two cameras, an image is generated which appears as if captured from a position between the two real cameras. Such an image is referred to as the “virtual viewpoint image”.
A process of generating the virtual viewpoint image is referred to as “rendering” or “view synthesis”. Hereinafter, a “viewpoint image” refers to an image viewed from a specified viewpoint, and is either captured by an actual camera or generated through a process of view synthesis. Furthermore, the term “image” as used herein refers to a digital image composed of image pixels.
Humans perceive depth because each eye sees a slightly different view. State-of-the-art 3D video systems (such as 3D-TV or free-viewpoint TV) are therefore based on generating two viewpoint images, one for each eye. To provide freedom in viewpoint, many viewpoint images are needed. Information about a 3D scene can be obtained and represented in many ways.
A popular 3D scene representation is based on N-view and N-depth images, in which depth images represent scene geometry. FIG. 10 shows a generalized system diagram of a multi-viewpoint video system based on a plurality of views and geometry.
A plurality of color views are generally captured by a plurality of synchronized cameras. Geometry information can be represented by, for example, 3D models or per-pixel depth images. When depth-image-based rendering is used, an unlimited number of virtual viewpoint images, which appear as if captured by actual cameras, can be synthesized within a given range (see Non-Patent Literature 1, for example).
Depth-image-based rendering is a virtual view synthesis process that projects image pixels of a given viewpoint image onto another viewpoint image using per-pixel depth values. This projection is generally referred to as 3D warping.
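As an illustration of 3D warping, the following is a minimal sketch under simplifying assumptions that are not taken from the cited literature: the reference camera and the virtual camera are rectified and differ only by a horizontal translation, so that each pixel moves by a disparity of focal length times baseline divided by depth. The function name warp_to_virtual_view and its parameters are hypothetical. When several pixels land on the same target position, the nearest one is kept; positions that receive no pixel remain as holes.

```python
def warp_to_virtual_view(color, depth, focal_px, baseline):
    """Forward-warp a reference view to a virtual camera shifted to the right.

    Assumptions (illustrative only): rectified cameras, pure horizontal shift.
    color    : list of rows of pixel values
    depth    : list of rows of per-pixel depth Z (> 0, same unit as baseline)
    focal_px : focal length in pixels
    baseline : distance from the reference camera to the virtual camera
    """
    height, width = len(color), len(color[0])
    warped = [[None] * width for _ in range(height)]        # None marks a hole
    zbuf = [[float("inf")] * width for _ in range(height)]  # nearest depth so far

    for y in range(height):
        for x in range(width):
            z = depth[y][x]
            disparity = focal_px * baseline / z   # nearer pixels move further
            xv = int(round(x - disparity))        # column in the virtual view
            # Keep the pixel only if it is in range and nearer than any
            # pixel already warped to the same position (occlusion handling).
            if 0 <= xv < width and z < zbuf[y][xv]:
                zbuf[y][xv] = z
                warped[y][xv] = color[y][x]
    return warped


row = warp_to_virtual_view(
    color=[[10, 20, 30, 40]],
    depth=[[4.0, 4.0, 2.0, 4.0]],
    focal_px=4.0,
    baseline=1.0,
)
```

In this small example the pixel with depth 2.0 is warped onto the same position as a background pixel and replaces it, while two positions in the virtual view are left empty. Such holes are typical of forward warping and have to be filled by a separate step.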
One of the advantages of the N-view and N-depth representation is that the processing required at the receiver side is relatively low. Furthermore, the required transmission and storage bandwidth can be reduced. For example, if a 3D display requires 20 viewpoint images, it can be sufficient to transmit only two or three views and the depth maps corresponding thereto, instead of transmitting all 20 viewpoint images.
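The saving can be made concrete with a rough count of uncompressed data per frame. The sketch below is only an estimate under assumed values that do not come from the source: a resolution of 1920x1080, three bytes per color pixel, and one byte per depth pixel. The function name raw_frame_bytes is hypothetical, and a real system would compress both views and depth maps.

```python
def raw_frame_bytes(width, height, num_views, num_depth_maps,
                    color_bytes_per_pixel=3, depth_bytes_per_pixel=1):
    """Uncompressed bytes per time instant for views plus depth maps.

    The per-pixel sizes are assumptions chosen for illustration.
    """
    pixels = width * height
    return pixels * (num_views * color_bytes_per_pixel
                     + num_depth_maps * depth_bytes_per_pixel)


# Transmitting all 20 viewpoint images directly.
all_views = raw_frame_bytes(1920, 1080, num_views=20, num_depth_maps=0)

# Transmitting 3 views and their 3 depth maps, synthesizing the rest.
views_plus_depth = raw_frame_bytes(1920, 1080, num_views=3, num_depth_maps=3)

ratio = views_plus_depth / all_views
```

Under these assumptions, three views with depth maps occupy one fifth of the raw data needed for 20 views.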
Generally, in multi-viewpoint video systems, a plurality of depth maps and views are compressed for storage or transmission. Efficient compression of both depth maps and views, and reliable, high-quality virtual view synthesis are important in such systems.
In a conventional approach, depth maps are down-sampled before compression and up-sampled after decompression. The up-sampling methods used in this approach estimate the interpolated samples from the low-resolution depth maps alone (see Non-Patent Literatures 2 and 3, for example).
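To show what estimating samples from the low-resolution depth map alone means, the following minimal sketch down-samples a depth map by decimation and restores it by nearest-neighbor replication. It is a generic illustration and not the specific method of Non-Patent Literature 2 or 3; the function names downsample and upsample_nearest are hypothetical.

```python
def downsample(depth, factor):
    """Keep every factor-th sample in both directions (plain decimation)."""
    return [row[::factor] for row in depth[::factor]]


def upsample_nearest(low, factor, height, width):
    """Restore full resolution using only the low-resolution samples."""
    last_row, last_col = len(low) - 1, len(low[0]) - 1
    return [
        [low[min(y // factor, last_row)][min(x // factor, last_col)]
         for x in range(width)]
        for y in range(height)
    ]


# A 4x4 depth map with a depth edge between columns 0 and 1.
depth = [[1, 9, 9, 9] for _ in range(4)]

low = downsample(depth, 2)              # 2x2 map
restored = upsample_nearest(low, 2, 4, 4)
```

Here each restored row is 1, 1, 9, 9 rather than the original 1, 9, 9, 9: the depth edge has moved by one pixel, because the low-resolution map carries no information about where the edge was located within each block.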
Non-Patent Literature 1: C. Fehn, “Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV”, Proc. SPIE Stereoscopic Displays and Virtual Reality Systems XI, pp. 93-104, January 2004
Non-Patent Literature 2: S. Shimizu, M. Kitahara, H. Kimata, K. Kamikura, and Y. Yashima, “View scalable multiview video coding using 3-D warping with depth map”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, pp. 1485-1495, November 2007
Non-Patent Literature 3: K-J. Oh, S. Yea, A. Vetro, and Y-S. Ho, “Depth Reconstruction Filter and Down/Up Sampling for Depth Coding in 3-D Video”, IEEE Signal Processing Letters, vol. 16, no. 9, pp. 747-750, September 2009