1. Technical Field
The invention is related to stereo reconstruction, and more particularly, to a system and process for extracting 3D structure from plural, stereo, 2D images of a scene by representing the scene as a group of image layers.
2. Background Art
Extracting structure from stereo has long been an active area of research in the imaging field. However, the recovery of pixel-accurate depth and color information from multiple images remains largely unsolved. Existing stereo algorithms work well when matching feature points or the interiors of textured objects, but most techniques are not sufficiently robust and perform poorly around occlusion boundaries and in untextured regions.
For example, a common theme in recent attempts to solve these problems has been the explicit modeling of the 3D volume of the scene. The volume of the scene is first discretized, usually in terms of equal increments of disparity. The goal is then to find, using a stereo algorithm, the so-called voxels which lie on the surfaces of the objects in the scene. The potential benefits of these approaches include the equal and efficient treatment of a large number of images, the explicit modeling of occluded regions, and the modeling of mixed pixels at occlusion boundaries to obtain sub-pixel accuracy. However, discretizing space volumetrically introduces a huge number of degrees of freedom. Moreover, modeling surfaces by a discrete collection of voxels can lead to sampling and aliasing artifacts.
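The size of such a discretized volume can be illustrated with a minimal sketch. The function names and the example image size below are assumptions chosen for illustration only and are not taken from any particular volumetric technique.

```python
def disparity_levels(d_min, d_max, num_levels):
    """Return num_levels disparities in equal increments from d_min to d_max."""
    if num_levels < 2:
        raise ValueError("need at least two disparity levels")
    step = (d_max - d_min) / (num_levels - 1)
    return [d_min + i * step for i in range(num_levels)]


def voxel_count(width, height, num_levels):
    """Count the voxels (one per pixel per disparity level) in the volume."""
    return width * height * num_levels
```

Under these assumptions, a 640 by 480 image sampled at 64 disparity levels already yields 19,660,800 voxels, each of which is a separate unknown.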
Another active area of research directed toward solving the aforementioned problems is the detection of multiple parametric motion transformations within image sequence data. The overall goal is the decomposition of the images into sub-images (or “layers”) such that the pixels within each sub-image move consistently with a single parametric transformation. Different sub-images are characterized by different sets of parameter values for the transformation. A transformation of particular importance is the 8-parameter homography (collineation), because it describes the motion of points on a rigid planar patch as either it or the camera moves. The 8 parameters of the homography are functions of the plane equations and camera matrices describing the motion.
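By way of illustration, the following sketch applies a homography to a single pixel. The function name and the nested-list matrix layout are assumptions made for this example. A 3x3 homography is defined only up to scale, so fixing one entry (conventionally the lower-right) leaves the 8 free parameters referred to above.

```python
def apply_homography(h, x, y):
    """Map pixel (x, y) through a 3x3 homography h given as nested lists.

    The result is obtained by a projective divide, so scaling every entry
    of h by the same non-zero factor leaves the mapping unchanged.
    """
    xp = h[0][0] * x + h[0][1] * y + h[0][2]
    yp = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ZeroDivisionError("point maps to infinity")
    return xp / w, yp / w
```

For instance, the matrix `[[1, 0, 2], [0, 1, 3], [0, 0, 1]]` is a pure translation and maps the pixel (1, 2) to (3, 5).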
While existing layer extraction techniques have been successful in detecting multiple independent motions, the same cannot be said for scene modeling. For instance, the fact that the plane equations are constant in a static scene (or a scene imaged by several cameras simultaneously) has not been exploited. This is a consequence of the fact that, for the most part, existing approaches have focused on the two-frame problem. Even when multiple frames have been considered, it has been primarily for the purpose of using past segmentation data to initialize future frames. Another important omission is the proper treatment of transparency. With a few exceptions, the decomposition of an image into layers that are partially transparent (translucent) has not been attempted.
The present invention relates to stereo reconstructions that recover pixel-accurate depth and color information from multiple images, including around occlusion boundaries and in untextured regions. This is generally accomplished using an approach to the stereo reconstruction that represents the 3D scene as a collection of approximately planar layers, where each layer has an explicit 3D plane equation and a layer sprite image, and may also be characterized by a residual depth map. The layer sprite refers to a colored image with a defined per-pixel opacity (transparency). The residual depth map refers to a per-pixel depth value relative to the plane. Segregating the scene into planar components allows a wider range of scenes to be modeled. To recover the structure of the scene, standard techniques from parametric motion estimation, image alignment, and mosaicing can be employed.
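One possible in-memory form of such a layer is sketched below. The class name, field names, the four-coefficient plane convention, and the nested-list image layout are all assumptions made for illustration and are not prescribed by the foregoing description.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical pixel type: red, green, blue, and opacity, each in [0, 1].
RGBA = Tuple[float, float, float, float]


@dataclass
class Layer:
    # Assumed plane convention: (a, b, c, d) with a*X + b*Y + c*Z + d = 0.
    plane: Tuple[float, float, float, float]
    # Sprite image: rows of per-pixel color and opacity.
    sprite: List[List[RGBA]]
    # Residual depth map: per-pixel depth offset relative to the plane.
    residual_depth: List[List[float]]

    def is_consistent(self) -> bool:
        """Check that the sprite and residual depth map share dimensions."""
        if len(self.sprite) != len(self.residual_depth):
            return False
        return all(
            len(sprite_row) == len(depth_row)
            for sprite_row, depth_row in zip(self.sprite, self.residual_depth)
        )
```

A layer without a residual depth map could be represented in this sketch by an all-zero map, which corresponds to every pixel lying exactly on the plane.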
More specifically, the approach to the stereo reconstruction based on representing the 3D scene as a collection of approximately planar layers involves estimating the desired parameters (e.g. plane equation, sprite image and depth map) by processing several computer program modules, some of which are optional depending on the desired accuracy of the result. The full approach, which is believed to provide the best estimate of the layer parameters, and so the structure of the 3D scene, includes:
(a) inputting plural 2D images as well as camera projection matrices defining the location and orientation of the camera(s) responsible for creating each image, respectively;
(b) assigning each pixel making up each 2D image to one of the plural layers;
(c) estimating a plane equation for each layer that defines the orientation and position of that layer in 3D space;
(d) estimating a sprite image for each layer characterized by a per-pixel color and a per-pixel opacity;
(e) estimating a residual depth map for each layer wherein each residual depth map defines the distance each pixel of the associated layer is offset from the estimated plane of that layer;
(f) re-estimating each layer's sprite image based on the residual depth map associated with the layer;
(g) re-assigning pixels assigned to a particular layer to another layer by using the estimates for the plane equation, sprite image, and residual depth map for each layer as a guide;
(h) iteratively repeating steps (c) through (g) for each layer until the change in the value of at least one layer parameter relative to its value in an immediately preceding iteration falls below a prescribed threshold assigned to the parameter; and
(i) outputting data representative of the plane equation, sprite image and residual depth map estimates for each layer.
Only the input, pixel assignment, plane equation and sprite image estimation, and output modules (less the residual depth map) are necessary to produce a useable layered representation of the scene. However, the accuracy of the layered representation can be progressively improved with the respective addition of each of the remaining modules, i.e. the depth map estimation, sprite image re-estimation, and pixel re-assignment and iteration modules.
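The iteration of steps (c) through (g) and the stopping test of step (h) can be sketched as follows. This is a simplified illustration only: the function name, the `update` callable standing in for one pass of steps (c) through (g), the reduction of each layer parameter to a single scalar, and the `max_iters` safeguard are assumptions that do not appear in the steps above.

```python
def refine_until_converged(params, update, thresholds, max_iters=100):
    """Repeat update(params) until a parameter changes by less than its threshold.

    params and thresholds are dicts keyed by parameter name (scalars here,
    as a simplification). update is a hypothetical callable performing one
    pass of re-estimation and returning a new params dict.
    """
    for iteration in range(1, max_iters + 1):
        new_params = update(params)
        # Step (h): stop once the change in at least one parameter
        # falls below the threshold assigned to that parameter.
        converged = any(
            abs(new_params[name] - params[name]) < thresholds[name]
            for name in thresholds
        )
        params = new_params
        if converged:
            return params, iteration
    # Safeguard (an assumption): give up after max_iters passes.
    return params, max_iters
```

The test uses `any` to mirror the "at least one layer parameter" wording of step (h); a stricter variant requiring every parameter to settle would substitute `all`.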
The layered approach to stereo reconstruction shares many of the advantages of the previously described volumetric approaches because the 3D information contained in the layers is used to reason about occlusion and mixed pixels. However, the layered approach according to the present invention offers a number of additional advantages, including:
A combination of the global model (the plane) and the local correction to it (the per-pixel depth map) that results in very robust performance and extremely accurate depth maps;
A layered approach that enables the recovery of scene structure in untextured regions because there is an implicit assumption that untextured regions are planar; this is not an unreasonable assumption, especially in man-made environments;
A form of the output (i.e., a collection of approximately planar regions with per-pixel depth offsets) that is more suitable than a discrete collection of voxels for many applications, including view interpolation and interactive scene modeling for rendering and video parsing.
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.