1. Technical Field
The invention is related to layered representations of digital or digitized images, and more particularly to a system and process for generating a two-layer, 3D representation of a scene.
2. Background Art
For several years now, viewers of TV commercials and feature films have been seeing the “freeze frame” effect used to create the illusion of stopping time and changing the camera viewpoint. The earliest commercials were produced by using a film-based system, which rapidly jumped between different still cameras arrayed along a rail to give the illusion of moving through a frozen slice of time.
When it first appeared, the effect was fresh and looked spectacular, and soon it was being emulated in many productions, the most famous of which is probably the “bullet time” effect seen in the movie “The Matrix”. Unfortunately, this effect is a one-time, pre-planned affair. The viewpoint trajectory is planned ahead of time, and many man-hours are expended to produce the desired interpolated views. Newer systems are based on video camera arrays, but still rely on having many cameras to avoid software view interpolation.
Thus, existing systems would not allow a user to interactively change to any desired viewpoint while watching a dynamic image-based scene. Most of the work on image-based rendering (IBR) in the past involves rendering static scenes, with two of the best-known techniques being Light Field Rendering [11] and the Lumigraph [7]. Their success in high quality rendering stems from the use of a large number of sampled images and has inspired a large body of work in the field. One exciting potential extension of this groundbreaking work involves interactively controlling viewpoint while watching a video. The ability of a user to interactively control the viewpoint of a video enhances the viewing experience considerably, enabling such diverse applications as new viewpoint instant replays, changing the point of view in dramas, and creating “freeze frame” visual effects at will.
However, extending IBR to dynamic scenes is not trivial because of the difficulty (and cost) of synchronizing so many cameras as well as acquiring and storing the images. Not only are there significant hurdles to overcome in capturing, representing, and rendering dynamic scenes from multiple points of view, but being able to do this interactively provides a significant further complication. To date, attempts to realize this goal have not been very satisfactory.
In regard to the video-based rendering aspects of an interactive viewpoint video system, one of the earliest attempts at capturing and rendering dynamic scenes was Kanade et al.'s Virtualized Reality system [10], which involved 51 cameras arranged around a 5-meter geodesic dome. The resolution of each camera is 512×512 and the capture rate is 30 fps. They extract a global surface representation at each time frame, using a form of voxel coloring [14] based on the scene flow equation [17]. Unfortunately, the results look unrealistic because of low resolution, matching errors, and improper handling of object boundaries.
Carranza et al. [3] used seven synchronized cameras distributed around a room looking towards its center to capture 3D human motion. Each camera is at CIF resolution (320×240) and captures at 15 fps. They use a 3D human model as a prior to compute 3D shape at each time frame.
Yang et al. [18] designed an 8×8 grid of cameras (each 320×240) for capturing a dynamic scene. Instead of storing and rendering the data, they transmit only the rays necessary to compose the desired virtual view. In their system, the cameras are not genlocked; instead, they rely on internal clocks across six PCs. The camera capture rate is 15 fps, and the interactive viewing rate is 18 fps.
Common to the foregoing systems is that many images are required for realistic rendering, partially because the scene geometry is either unknown or known to only a rough approximation. If geometry is known accurately, it is possible to reduce the requirement for images substantially [7]. One practical way of extracting the scene geometry is through stereo, and many stereo algorithms have been proposed for static scenes [13]. However, there have been few attempts at employing stereo techniques with dynamic scenes. As part of the Virtualized Reality work [10], Vedula et al. [17] proposed an algorithm for extracting 3D motion (i.e., correspondence between scene shape across time) using 2D optical flow and 3D scene shape. In their approach, they use a voting scheme similar to voxel coloring [14], where the measure used is how well a hypothesized voxel location fits the 3D flow equation.
Zhang and Kambhamettu [19] also integrated 3D scene flow and structure in their framework. Their 3D affine motion model is used locally, with spatial regularization, and discontinuities are preserved using color segmentation. Tao et al. [16] assume the scene is piecewise planar. They also assume constant velocity for each planar patch in order to constrain the dynamic depth map estimation.
In a more ambitious effort, Carceroni and Kutulakos [2] recover piecewise continuous geometry and reflectance (Phong model) under non-rigid motion with known lighting positions. They discretize the space into surface elements (“surfels”), and perform a search over location, orientation, and reflectance parameters to maximize agreement with the observed images.
In an interesting twist to the conventional local window matching, Zhang et al. [20] use matching windows that straddle space and time. The advantage of this method is that there is less dependence on brightness constancy over time.
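The idea of a matching window that straddles space and time can be sketched as follows. This is only an illustrative sum-of-squared-differences cost under assumed conditions (rectified grayscale videos, integer disparities); it is not the cost function of reference [20], and the function name, the parameters, and the synthetic data are hypothetical.

```python
def spacetime_ssd(left, right, x, y, t, d, r_s=1, r_t=1):
    """Sum of squared differences over a window spanning x, y, and t.

    left, right: grayscale videos indexed as video[t][y][x], assumed
                 rectified so that matches lie on the same row.
    d:           candidate disparity; left pixel (x, y) is compared
                 with right pixel (x - d, y).
    r_s, r_t:    spatial and temporal half-widths of the window.
    The caller must keep the whole window inside both videos.
    """
    cost = 0.0
    for dt in range(-r_t, r_t + 1):          # frames before and after t
        for dy in range(-r_s, r_s + 1):      # rows around y
            for dx in range(-r_s, r_s + 1):  # columns around x
                a = left[t + dt][y + dy][x + dx]
                b = right[t + dt][y + dy][x + dx - d]
                cost += (a - b) ** 2
    return cost


# Synthetic test data (assumption): the left video is the right video
# shifted by a true disparity of 2 pixels.
FRAMES, HEIGHT, WIDTH, TRUE_D = 3, 5, 10, 2
right = [[[(7 * x + x * x + 3 * y + 5 * t) % 11 for x in range(WIDTH)]
          for y in range(HEIGHT)] for t in range(FRAMES)]
left = [[[right[t][y][x - TRUE_D] if x >= TRUE_D else 0
          for x in range(WIDTH)]
         for y in range(HEIGHT)] for t in range(FRAMES)]

cost_true = spacetime_ssd(left, right, x=5, y=2, t=1, d=TRUE_D)
cost_wrong = spacetime_ssd(left, right, x=5, y=2, t=1, d=0)
```

In this sketch the cost at the true disparity is lower than at the wrong one because the window aggregates evidence from neighboring frames as well as neighboring pixels.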
Active rangefinding techniques have also been applied to moving scenes. Hall-Holt and Rusinkiewicz [8] use projected boundary-coded stripe patterns that vary over time. There is also a commercial system on the market called ZCam™ manufactured by 3DV Systems of Israel, which is a range sensing video camera add-on used in conjunction with a broadcast video camera. However, it is an expensive system, and provides single viewpoint depth only, which makes it less suitable for multiple viewpoint video.
However, despite all the advances in stereo and image-based rendering, it is still very difficult to render high-quality, high-resolution views of dynamic scenes. One approach, as suggested in the Light Field Rendering paper [11], is to simply resample rays based only on the relative positions of the input and virtual cameras. As demonstrated in the Lumigraph [7] and subsequent work, however, using a 3D impostor or proxy for the scene geometry can greatly improve the quality of the interpolated views. Another approach is to create a single texture-mapped 3D model [10], but this generally produces inferior results to using multiple reference views. Yet another approach is geometry-assisted image-based rendering, which requires a 3D proxy. One possibility is to use a single global polyhedral model, as in the Lumigraph and Unstructured Lumigraph papers [1]. Another possibility is to use per-pixel depth, as in Layered Depth Images [15], offset depth maps in Facade [5], or sprites with depth [15]. In general, using different local geometric proxies for each reference view [12, 6, 9] produces higher quality results.
However, even multiple depth maps still exhibit rendering artifacts when generating novel views, i.e., aliasing (jaggies) due to the abrupt nature of the foreground to background transition and contaminated colors due to mixed pixels, which become visible when compositing over novel backgrounds or objects.
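The color contamination caused by mixed pixels can be illustrated with the standard compositing equation, C = αF + (1 − α)B, where F is the foreground color, B the background color, and α the fraction of the pixel covered by the foreground. The sketch below uses hypothetical colors and a hypothetical α value chosen only for illustration; it does not describe the representation of the present invention.

```python
def blend(alpha, fg, bg):
    """Compositing equation C = alpha * F + (1 - alpha) * B, per channel."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Hypothetical values chosen only for illustration (RGB in [0, 1]).
foreground = (1.0, 0.0, 0.0)      # red object
old_background = (0.0, 0.0, 1.0)  # blue backdrop in the captured image
new_background = (0.0, 1.0, 0.0)  # green backdrop in the novel view
alpha = 0.5                       # boundary pixel half covered by the object

# Color actually recorded at the boundary pixel: a mix of red and blue.
observed = blend(alpha, foreground, old_background)

# Naive reuse: the mixed pixel is pasted as if it were pure foreground,
# so the old blue background leaks into the novel view.
naive = observed

# If the foreground color and alpha are known separately, the pixel can
# be re-blended against the new background without contamination.
matted = blend(alpha, foreground, new_background)
```

Here the naively reused pixel still carries half of the old blue background, whereas the re-blended pixel contains none of it.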
This problem is addressed in the present invention via a unique two-layer, 3D representation of input images. It is noted that not only can this two-layer, 3D representation be used to resolve the foregoing aliasing problem in connection with rendering novel views in an interactive viewpoint video system, but it can also be employed advantageously in other contexts. In general, any digital or digitized image can be represented using this two-layer, 3D representation.
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. Multiple references will be identified by a pair of brackets containing more than one designator, for example, [2, 3]. A listing of references including the publications corresponding to each designator can be found at the end of the Detailed Description section.