The task of generating a photo-realistic 3D representation of a visual scene is an important and challenging problem. Debevec et al. demonstrated in their Campanile movie that it is possible, using a user-assisted 3D modeling program and a handful of photos of a college campus, to produce a digital model of the scene that, when rendered, yields images of stunning photorealism from novel viewpoints (see, e.g., P. Debevec et al., “Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach,” SIGGRAPH, pp. 11–20 (1996)). Since this work, there has been much interest in producing results of similar quality using methods that are automatic and work on scenes composed of surfaces of arbitrary geometry.
A standard approach to reconstructing a 3D object using multi-view images is to compute the visual hull. For each reference view, a silhouette is generated by segmenting the photograph into foreground and background. Foreground pixels correspond to points to which the 3D object projects. Everything else is background. Each silhouette constrains the 3D space in which the object is located. If a 3D point projects to background in any of the images, it cannot be part of the 3D object being reconstructed. After eliminating such points, the surface of the region of space that remains is the visual hull. The visual hull is guaranteed to contain the 3D object. Using more reference views produces a visual hull that more closely resembles the geometric shape of the true 3D object. However, even with an infinite number of photographs, the visual hull cannot model surface concavities that are not apparent in the silhouettes.
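The membership rule above (a 3D point is rejected as soon as it projects to background in any reference view) can be sketched as follows. This is a minimal illustration, not an implementation from the cited work. It assumes each reference view is supplied as a hypothetical pair of a 3x4 projection matrix and a binary silhouette mask indexed as mask[row][column], and it further assumes that points projecting outside the image or behind the camera count as background.

```python
def project(matrix, point):
    """Project a 3D point with a 3x4 matrix; return homogeneous (x, y, w)."""
    X = (point[0], point[1], point[2], 1.0)
    return tuple(sum(matrix[r][c] * X[c] for c in range(4)) for r in range(3))


def is_foreground(mask, x, y):
    """Look up a pixel in a binary silhouette mask (True means foreground)."""
    row, col = int(round(y)), int(round(x))
    inside_image = 0 <= row < len(mask) and 0 <= col < len(mask[0])
    # Assumption: anything outside the image is treated as background.
    return inside_image and bool(mask[row][col])


def in_visual_hull(point, views):
    """Return True if the point projects to foreground in every view."""
    for matrix, mask in views:
        x, y, w = project(matrix, point)
        # Assumption: points at infinity or behind the camera are rejected.
        if w <= 0:
            return False
        if not is_foreground(mask, x / w, y / w):
            return False  # background in one view is enough to reject
    return True


# Toy setup: two orthographic views and a 5x5 mask with a 3x3 foreground block.
MASK = [[1 <= r <= 3 and 1 <= c <= 3 for c in range(5)] for r in range(5)]
VIEW_ALONG_Z = ([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]], MASK)
VIEW_ALONG_X = ([[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], MASK)
VIEWS = [VIEW_ALONG_Z, VIEW_ALONG_X]
```

With these toy views, a call such as in_visual_hull((2, 2, 2), VIEWS) tests the center of the region that both silhouettes allow.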
A variety of methods have been developed to compute the visual hull of a scene. Perhaps the most common approach is to operate in a volumetric framework. A volume that contains the scene being reconstructed is defined. The volume is then tessellated into voxels. All the voxels that project solely to background pixels in one or more reference views are removed (carved). The remaining voxels represent the visual hull and its interior. Such a method can adopt a multi-resolution strategy to achieve faster results.
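The volumetric carving step described above can be sketched as follows. This is a simplified, single-resolution illustration under stated assumptions: each reference view is modeled as a hypothetical callable that takes a 3D point and reports whether it projects to a foreground pixel, and a voxel's projection is approximated by testing its eight corners and its center rather than its full footprint. The function names and the toy views are illustrative only.

```python
from itertools import product


def voxel_samples(origin, size):
    """Return the 8 corners and the center of an axis-aligned cubic voxel."""
    ox, oy, oz = origin
    corners = [(ox + dx * size, oy + dy * size, oz + dz * size)
               for dx, dy, dz in product((0, 1), repeat=3)]
    center = (ox + size / 2, oy + size / 2, oz + size / 2)
    return corners + [center]


def carve(grid_origin, voxel_size, dims, views):
    """Return the set of (i, j, k) voxel indices that survive carving.

    A voxel is carved if, in at least one view, none of its sample points
    projects to foreground (it projects solely to background in that view).
    """
    kept = set()
    for i, j, k in product(range(dims[0]), range(dims[1]), range(dims[2])):
        origin = (grid_origin[0] + i * voxel_size,
                  grid_origin[1] + j * voxel_size,
                  grid_origin[2] + k * voxel_size)
        samples = voxel_samples(origin, voxel_size)
        if all(any(view(s) for s in samples) for view in views):
            kept.add((i, j, k))
    return kept


# Toy views (assumptions): orthographic silhouettes that are open squares.
def view_along_z(point):
    x, y, _ = point
    return 1 < x < 3 and 1 < y < 3


def view_along_x(point):
    _, y, z = point
    return 1 < y < 3 and 1 < z < 3


kept = carve((0.0, 0.0, 0.0), 1.0, (4, 4, 4), [view_along_z, view_along_x])
```

Sampling only corners and the center can miss a silhouette that is thinner than a voxel, so a practical implementation would test the voxel's whole projected footprint.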
Recently, researchers have become interested in reconstructing time-varying scenes. Unfortunately, most standard approaches to the 3D scene reconstruction problem such as multi-baseline stereo, structure from motion, and shape from shading are too slow for real-time application on current computer hardware. When working with multi-view video data, most techniques perform the 3D reconstruction offline after the images have been acquired. Once the reconstruction is complete, it is rendered in real-time.
A notable exception is the Image-Based Visual Hulls (IBVH) method that was developed at MIT by Matusik et al. (see, e.g., W. Matusik et al., “Image-based visual hulls,” SIGGRAPH, pp. 369–374 (2000)). This method is efficient enough to reconstruct and render views of the scene in real-time. The key to this method's efficiency is the use of epipolar geometry for computing the geometry and visibility of the scene. By taking advantage of epipolar relationships, all of the steps of the method function in the image space of the photographs (also called “reference views”) taken of the scene.
Referring to FIG. 1, one of the unique properties of the IBVH method is that the geometry it reconstructs is view-dependent. A user moves a virtual camera about the scene. For each virtual camera placement (also called a desired view 10), the IBVH method computes the extent to which back-projected rays from the center of projection Cd intersect the visual hull 12 in 3D space. Thus, the representation of the geometry is specified for the desired view, and changes as the user moves the virtual camera.
Consider an individual ray 14, as shown in FIG. 2. The ray is back-projected from the desired view's center of projection Cd, through a pixel 16 in the image plane, and into 3D space. This ray projects to an epipolar line 18, 20 in each reference view 22, 24. The IBVH method determines the 2D intervals where the epipolar lines 18, 20 cross the silhouettes 25, 27. These 2D intervals are then “lifted” back onto the 3D ray 14 using a simple projective transformation. The intervals along the 3D ray from all reference views 22, 24 are intersected. The resultant set of intervals describes where the ray pierces the visual hull 12. These are called “visual hull intervals” herein. In FIG. 2, one visual hull interval 26 is found along the back-projected ray. Once this procedure has been performed on all rays back-projected from the desired view 10, the reconstruction of the view-dependent geometry of the visual hull 12 is complete.
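The final step above, intersecting the lifted intervals from all reference views, can be sketched as follows. The sketch assumes that the 2D silhouette crossings have already been lifted onto the ray and are given, per reference view, as a sorted list of disjoint (start, end) pairs measured as distance along the ray. The function names are hypothetical, and returning an empty list when no views are supplied is a convention chosen for this example.

```python
def intersect_two(a, b):
    """Intersect two sorted lists of disjoint (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            result.append((lo, hi))  # the two intervals overlap on (lo, hi)
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def visual_hull_intervals(per_view_intervals):
    """Intersect the lifted ray intervals from all reference views."""
    if not per_view_intervals:
        return []  # assumption: no views means nothing is reconstructed
    result = per_view_intervals[0]
    for intervals in per_view_intervals[1:]:
        result = intersect_two(result, intervals)
    return result


# Example: intervals along one ray as seen from two reference views.
view_a = [(1.0, 4.0), (6.0, 9.0)]
view_b = [(2.0, 7.0)]
hull = visual_hull_intervals([view_a, view_b])  # [(2.0, 4.0), (6.0, 7.0)]
```

Because each list is already sorted, the pairwise intersection runs in time linear in the number of intervals.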
In order to color a point on the visual hull, it is necessary to determine which cameras have an unoccluded view of the point. Thus, visibility must be computed before texture-mapping the reconstructed geometry. In the following discussion, a point in 3D space is represented in homogeneous coordinates by a boldface capital letter, such as P, where P=[x y z w]^T. The projection of this point into an image is a 2D point represented in homogeneous coordinates by a boldface lowercase letter, such as p=[x y w]^T. To convert a homogeneous image point to inhomogeneous coordinates (i.e., pixel coordinates), one simply divides p by its w component. Thus, a pixel will have coordinates p=[x/w y/w 1]^T.
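The coordinate conventions above can be illustrated with a short sketch. The 3x4 projection matrix used here is a hypothetical example chosen for its simple arithmetic, and raising an error for w = 0 is a choice made for this example.

```python
def project(matrix, P):
    """Multiply a 3x4 projection matrix by a homogeneous 3D point [x y z w]."""
    return tuple(sum(matrix[r][c] * P[c] for c in range(4)) for r in range(3))


def to_pixel(p):
    """Divide a homogeneous image point [x y w] by w to get pixel coordinates."""
    x, y, w = p
    if w == 0:
        raise ValueError("a point with w = 0 has no finite pixel coordinates")
    return (x / w, y / w, 1.0)


# Hypothetical camera at the origin looking down the z axis, unit focal length.
M = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

P = (2.0, 4.0, 2.0, 1.0)   # homogeneous 3D point
p = project(M, P)          # homogeneous image point (2.0, 4.0, 2.0)
pixel = to_pixel(p)        # inhomogeneous form (1.0, 2.0, 1.0)
```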
Referring to FIG. 3A, at a pixel p in the desired view, the first point (if any) along the first visual hull interval 28 indicates a point P in 3D space that projects to p and is visible in the desired view. To compute visibility, one must determine, for each reference view, whether P is visible. P must be visible in the reference view if the line segment PCr between P and the reference view's center of projection Cr does not intersect any visual hull geometry.
As shown in FIG. 3B, the layered depth image representation of the visual hull makes this easy to determine. In the desired view, the segment PCr projects to an epipolar line segment pe, where e is the epipole, found by projecting Cr into the desired view. For each pixel along the segment pe, the visual hull intervals can be checked to see if they contain geometry that intersects the segment PCr. If an intersection occurs, point P is not visible in the reference view, and no more pixels along the segment pe need be evaluated. Otherwise, one continues evaluating pixels along the segment pe, until there are no more pixels to evaluate. If no visual hull interval has intersected the segment PCr, then the point P is visible in the reference view. The above-mentioned IBVH paper by Matusik et al. also discusses discretization issues in computing visibility using this approach, as well as occlusion-compatible orderings to improve its efficiency.
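The underlying visibility condition (the segment between P and Cr must not intersect visual hull geometry) can be illustrated with a brute-force check. Note that this sketch is not the image-space epipolar traversal described above: it substitutes uniform sampling of the 3D segment, and it assumes a hypothetical predicate inside_hull that reports whether a 3D point lies inside the reconstructed geometry. The step count and the small offset eps, which keeps the surface point P from occluding itself, are arbitrary choices.

```python
def is_visible(P, C_r, inside_hull, steps=200, eps=1e-3):
    """Return True if no sampled point between P and C_r lies inside the hull.

    P is a point on the hull surface and C_r is a reference view's center of
    projection, both given as (x, y, z). Sampling starts a fraction eps along
    the segment so that P itself is not counted as an occluder.
    """
    for i in range(steps + 1):
        t = eps + (1.0 - eps) * i / steps
        sample = tuple(p + t * (c - p) for p, c in zip(P, C_r))
        if inside_hull(sample):
            return False  # occluding geometry found; stop early
    return True


# Toy hull (assumption): the open unit ball centered at the origin.
def inside_unit_ball(point):
    x, y, z = point
    return x * x + y * y + z * z < 1.0


surface_point = (0.0, 0.0, 1.0)
front_camera = (0.0, 0.0, 5.0)    # on the same side as the surface point
back_camera = (0.0, 0.0, -5.0)    # on the far side of the ball
```

Uniform sampling can miss geometry thinner than the sample spacing, which is one reason an interval-based test such as the one described above is preferable in practice.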
Referring to FIG. 4, once visibility has been computed, one can color the visual hull using the reference views. The IBVH paper employs view-dependent texture mapping, which retains view-dependent effects present in the photos, and works well with the inaccurate geometry of the visual hull. To color a point p in the desired view, the closest point P on the hull is found. Then, for each reference view that has visibility of P, the angle between the segments PCd and PCr is found. The reference view with the smallest angle is chosen to color the visual hull. This is the reference view that has the “best” view of P for the virtual camera's location. For example, in FIG. 4, reference view 2 would be chosen since θ1>θ2.
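The view-selection rule above can be sketched as follows. The sketch assumes the visibility of P in each reference view has already been computed and is passed in as a list of booleans. The function names are hypothetical, and returning None when no reference view sees P is a convention chosen for this example. It also assumes that P does not coincide with any camera center.

```python
import math


def angle_between(u, v):
    """Return the angle in radians between two nonzero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def best_reference_view(P, C_d, reference_centers, visible):
    """Return the index of the visible reference view with the smallest angle
    between the directions from P to C_d and from P to that view's center."""
    to_desired = tuple(c - p for c, p in zip(C_d, P))
    best_index, best_angle = None, math.inf
    for index, (C_r, is_vis) in enumerate(zip(reference_centers, visible)):
        if not is_vis:
            continue  # views that cannot see P are never used to color it
        to_reference = tuple(c - p for c, p in zip(C_r, P))
        theta = angle_between(to_desired, to_reference)
        if theta < best_angle:
            best_index, best_angle = index, theta
    return best_index


# Example: P at the origin, desired view above it, three reference views.
P = (0.0, 0.0, 0.0)
C_d = (0.0, 0.0, 5.0)
centers = [(5.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, 5.0)]
chosen = best_reference_view(P, C_d, centers, [True, True, False])
```

In this example the third view is best aligned with the desired view but is excluded because it does not see P, so the second view is selected.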
While the IBVH method is exceptionally efficient, the geometry it reconstructs is not very accurate. This is because the IBVH method only reconstructs the visual hull of the scene. The visual hull is a conservative volume that contains the scene surfaces being reconstructed. When photographed by only a few cameras, the scene's visual hull is much larger than the true scene. Even if photographed by an infinite number of cameras, many scenes with concavities will not be modeled correctly by a visual hull. One can only partially compensate for such geometric inaccuracies by view-dependent texture-mapping (VDTM), as done in the IBVH approach.
As shown in FIGS. 5A and 5B, however, artifacts resulting from the inaccurate geometry of the visual hull are still apparent in newly synthesized views of the scene. In FIGS. 5A and 5B, a person's head photographed from four viewpoints is reconstructed. A new view of the scene, placed half-way between two reference views, is rendered from the reconstruction. The top row shows the visual hull reconstruction. At this viewpoint, the right side of the face is texture-mapped with one reference image, while the left side of the face is texture-mapped with another. Due to the geometric inaccuracy of the visual hull, there is a salient seam along the face where there is a transition between the two images used to texture-map the surface.