1. Technical Field
The invention relates to a system and process for extracting 3D structure from plural, stereo, 2D images of a scene by representing the scene as a group of image layers, each characterized by estimated parameters including the layer's orientation and position, per-pixel color, per-pixel opacity, and optionally a residual depth map, and more particularly, to such a system and process for refining the estimates of these layer parameters.
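As a concrete illustration only (not the patent's actual implementation, and all names here are hypothetical), the per-layer parameters listed above can be sketched as a small data structure: a plane equation for orientation and position, a per-pixel color image, a per-pixel opacity (alpha) map, and an optional residual depth map for off-plane detail.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

# Hedged sketch of the layer parameters the text lists; field names and
# shapes are assumptions, not taken from the patent.
@dataclass
class Layer:
    plane: np.ndarray                            # (4,) plane equation n.X + d = 0
    color: np.ndarray                            # (H, W, 3) per-pixel color
    opacity: np.ndarray                          # (H, W) per-pixel alpha in [0, 1]
    residual_depth: Optional[np.ndarray] = None  # (H, W) depth offsets from the plane

h, w = 4, 6
layer = Layer(
    plane=np.array([0.0, 0.0, 1.0, -2.0]),  # the plane z = 2
    color=np.zeros((h, w, 3)),
    opacity=np.ones((h, w)),
)
print(layer.opacity.shape)  # (4, 6)
```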
2. Background Art
Extracting structure from stereo has long been an active area of research in the imaging field. However, the recovery of pixel-accurate depth and color information from multiple images still remains largely unsolved. Existing stereo algorithms work well when matching feature points or the interiors of textured objects, but most techniques are not sufficiently robust: they perform poorly around occlusion boundaries and in untextured regions.
For example, a common theme in recent attempts to solve these problems has been the explicit modeling of the 3D volume of the scene. The volume of the scene is first discretized, usually in terms of equal increments of disparity. The goal is then to find the so-called voxels which lie on the surfaces of the objects in the scene using a stereo algorithm. The potential benefits of these approaches include the equal and efficient treatment of a large number of images, the explicit modeling of occluded regions, and the modeling of mixed pixels at occlusion boundaries to obtain sub-pixel accuracy. However, discretizing space volumetrically introduces a huge number of degrees of freedom. Moreover, modeling surfaces by a discrete collection of voxels can lead to sampling and aliasing artifacts.
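The discretization described above can be sketched minimally as follows. This is an illustrative toy example, not any particular prior-art system: a matching-cost volume is built over equal integer disparity increments, and the "voxel" (pixel, disparity) with the lowest cost at each pixel is taken to lie on a surface.

```python
import numpy as np

def cost_volume(left, right, num_disp):
    """Sum-of-absolute-differences matching cost at each integer disparity."""
    h, w = left.shape
    cost = np.full((num_disp, h, w), np.inf)
    for d in range(num_disp):
        if d == 0:
            cost[d] = np.abs(left - right)
        else:
            # A point at column x in the left image matches column x - d
            # in the right image; columns x < d have no valid match.
            cost[d, :, d:] = np.abs(left[:, d:] - right[:, :-d])
    return cost

def surface_voxels(left, right, num_disp=8):
    """Winner-take-all: pick the minimum-cost disparity at each pixel."""
    return np.argmin(cost_volume(left, right, num_disp), axis=0)

# Toy scene: a single surface shifted by 3 pixels between the two views.
rng = np.random.default_rng(0)
right = rng.random((16, 32))
left = np.roll(right, 3, axis=1)  # left view is the right view shifted by d = 3
disp = surface_voxels(left, right)
print(np.all(disp[:, 3:] == 3))  # True: interior pixels recover disparity 3
```

Note that even this toy version exposes the drawback the text identifies: the cost volume has num_disp times as many entries as the image, i.e. a huge number of degrees of freedom relative to the surfaces actually present.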
Another active area of research directed toward solving the aforementioned problems is the detection of multiple parametric motion transformations within image sequence data. The overall goal is the decomposition of the images into sub-images (or "layers") such that the pixels within each sub-image move consistently with a single parametric transformation. Different sub-images are characterized by different sets of parameter values for the transformation. A transformation of particular importance is the 8-parameter homography (collineation), because it describes the motion of points on a rigid planar patch as either it or the camera moves. The 8 parameters of the homography are functions of the plane equations and camera matrices describing the motion.
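The dependence of the homography on the plane equation and camera matrices can be made concrete with a short sketch. This is a standard textbook construction, not text from the patent, and the function names are hypothetical: for a plane n·X = d seen by two cameras related by rotation R and translation t (with intrinsics K1, K2), the induced homography is H = K2 (R + t nᵀ / d) K1⁻¹, which after normalization has 8 free parameters.

```python
import numpy as np

def plane_homography(K1, K2, R, t, n, d):
    """Homography induced by the plane n.X = d between two camera views."""
    H = K2 @ (R + np.outer(t, n) / d) @ np.linalg.inv(K1)
    return H / H[2, 2]  # scale is arbitrary, so only 8 parameters remain

def apply_homography(H, x):
    """Map a 2D image point through H using homogeneous coordinates."""
    p = H @ np.array([x[0], x[1], 1.0])
    return p[:2] / p[2]

# Example: identity intrinsics, pure translation along x, frontoparallel
# plane z = 2. Points on the plane then shift by t_x / d = 0.05 in x.
K = np.eye(3)
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])
n = np.array([0.0, 0.0, 1.0])
H = plane_homography(K, K, R, t, n, d=2.0)
print(apply_homography(H, (0.3, 0.4)))  # [0.35 0.4]
```

Because the plane equation and camera geometry fully determine H, points on the same rigid planar patch move consistently under one homography as the camera moves, which is what makes this transformation useful for grouping pixels into layers.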
While existing layer extraction techniques have been successful in detecting multiple independent motions, the same cannot be said for scene modeling. For instance, the fact that the plane equations are constant in a static scene (or a scene imaged by several cameras simultaneously) has not been exploited. This is a consequence of the fact that, for the most part, existing approaches have focused on the two-frame problem. Even when multiple frames have been considered, it has primarily been for the purpose of using past segmentation data to initialize future frames. Another important omission is the proper treatment of transparency. With a few exceptions, the decomposition of an image into layers that are partially transparent (translucent) has not been attempted.