1. Technical Field:
The invention is related to a computer-implemented system and process for estimating a motion or depth map for multiple images of a 3D scene, and more particularly, to a system and process for estimating motion or depth maps for more than one image of the multiple images of the 3D scene.
2. Background Art:
Stereo and motion have long been central research problems in computer vision. Early work was motivated by the desire to recover depth maps and coarse shape and motion models for robotics and object recognition applications. More recently, depth maps obtained from stereo (or alternately dense correspondence maps obtained from motion) have been combined with texture maps extracted from input images in order to create realistic 3-D scenes and environments for virtual reality and virtual studio applications. Similarly, these maps have been employed for motion-compensated prediction in video processing applications. Unfortunately, the quality and resolution of most of today's algorithms fall quite short of that demanded by these new applications, where even isolated errors in correspondence become readily visible when composited with synthetic graphical elements.
One of the most common errors made by these algorithms is a mis-estimation of depth or motion near occlusion boundaries. Traditional correspondence algorithms assume that every pixel has a corresponding pixel in all other images. Obviously, in occluded regions, this is not so. Furthermore, if only a single depth or motion map is used, it is impossible to predict the appearance of the scene in regions which are occluded. This point is illustrated in FIG. 1. FIG. 1 depicts a slice through a motion sequence spatio-temporal volume. A standard estimation algorithm only estimates the motion at the center frame designated by the (⇄) symbol, and ignores other frames such as those designated by the (→) symbols. As can be seen, some pixels that are occluded in the center frame are visible in some of the other frames. Other problems with traditional approaches include dealing with untextured or regularly textured regions, and with viewpoint-dependent effects such as specularities or shading.
One popular approach to tackling these problems is to build a 3D volumetric model of the scene [15, 18]. The scene volume is discretized, often in terms of equal increments of disparity. The goal is then to find the voxels which lie on the surfaces of the objects in the scene. The benefits of such an approach include the equal and efficient treatment of a large number of images [5], the possibility of modeling occlusions [9], and the detection of mixed pixels at occlusion boundaries [18]. Unfortunately, discretizing space volumetrically introduces a large number of degrees of freedom and leads to sampling and aliasing artifacts. To prevent a systematic "fattening" of depth layers near occlusion boundaries, variable window sizes [10] or iterative evidence aggregation [14] can be used. Sub-pixel disparities can be estimated by finding the analytic minimum of the local error surface [13] or using gradient-based techniques [12], but this requires going back to a single depth/motion map representation.
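As an illustration of the analytic-minimum idea mentioned above, the following minimal sketch fits a parabola through the matching costs at the best integer disparity and its two neighbors, and returns the refined disparity. The function names and the three-sample parabola model are assumptions made for this example; they are not taken from reference [13].

```python
def subpixel_offset(c_prev, c_min, c_next):
    """Offset of the parabola minimum relative to the best integer disparity.

    c_min is the cost at the best integer disparity d; c_prev and c_next are
    the costs at d - 1 and d + 1. When c_min is the smallest of the three,
    the result lies in [-0.5, 0.5].
    """
    denom = c_prev - 2.0 * c_min + c_next
    if denom <= 0.0:
        # Flat or non-convex samples: no well-defined minimum, keep integer d.
        return 0.0
    return 0.5 * (c_prev - c_next) / denom


def refine_disparity(costs):
    """Return a sub-pixel disparity from a list of costs indexed by disparity."""
    d = min(range(len(costs)), key=costs.__getitem__)
    if d == 0 or d == len(costs) - 1:
        # No neighbor on one side, so the parabola cannot be fitted.
        return float(d)
    return d + subpixel_offset(costs[d - 1], costs[d], costs[d + 1])


if __name__ == "__main__":
    # Hypothetical quadratic error surface whose true minimum is at 3.3.
    costs = [(d - 3.3) ** 2 for d in range(7)]
    print(refine_disparity(costs))
```

Because the refinement is computed independently for each pixel of one disparity map, it fits naturally with a single depth/motion map rather than with a discretized volume.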
Another active area of research is the detection of parametric motions within image sequences [19, 3, 20]. Here, the goal is to decompose the images into sub-images, commonly referred to as layers, such that the pixels within each layer move with a parametric transformation. For rigid scenes, the layers can be interpreted as planes in 3D being viewed by a moving camera, which results in fewer unknowns. This representation facilitates reasoning about occlusions, permits the computation of accurate out-of-plane displacements, and enables the modeling of mixed or transparent pixels [1]. Unfortunately, initializing such an algorithm and determining the appropriate number of layers is not straightforward, and may require sophisticated optimization algorithms to resolve.
Thus, all current correspondence algorithms have their limitations. Single depth or motion maps cannot represent occluded regions not visible in the reference image and usually have problems matching near discontinuities. Volumetric techniques have an excessively large number of degrees of freedom and have limited resolution, which can lead to sampling or aliasing artifacts. Layered motion and stereo algorithms require combinatorial search to determine the correct number of layers and cannot naturally handle true three-dimensional objects (they are better at representing "cutout" scenes). Furthermore, none of these approaches can easily model the variation of scene or object appearance with respect to the viewing position.
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references will be identified by a pair of brackets containing more than one designator, for example, [15, 18]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention relates to a new approach to computing dense motion or depth estimates from multiple images that overcomes the problems of current depth and motion estimation methods. In general terms this is accomplished by associating a depth or motion map with each input image (or some subset of the images equal to or greater than two), rather than computing a single map for all the images. In addition, consistency between the estimates associated with different images is ensured by using a motion compatibility constraint and reasoning about occlusion relationships by computing pixel visibilities. This system of cross-checking estimates between images produces richer, more accurate estimates for the desired motion and depth maps.
More particularly, a preferred process according to the present invention involves using a multi-view framework that generates dense depth or motion estimates for the input images (or a subset thereof). This is accomplished by minimizing a three-part cost function, which consists of an intensity compatibility constraint, a motion or depth compatibility constraint, and a motion smoothness constraint. The motion smoothness term uses the presence of color/brightness discontinuities to modify the probability of motion smoothness violations. In addition, a visibility term is added to the intensity compatibility and motion/depth compatibility constraints to prevent the matching of pixels into areas that are occluded. In operation, the cost function is minimized in two phases. During an initializing phase, the motion or depth values for each image being examined are estimated independently. Since there are not yet any motion/depth estimates for other frames to employ in the calculation, the motion/depth compatibility term is ignored. In addition, no visibilities are computed and it is assumed all pixels are visible. Once an initial set of motion/depth estimates has been computed, the visibilities are computed and the motion/depth estimates are recalculated using the visibility terms and the motion/depth compatibility constraint. The foregoing process can then be repeated several times, using the revised motion/depth estimates from the previous iteration as the initializing estimates for the new iteration, to obtain better estimates of motion/depth and visibility.
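The two-phase procedure just described can be sketched, purely for illustration, on a toy problem with two one-dimensional images and integer disparities. Everything specific in the sketch is an assumption made for this example and not part of the described process: the function and parameter names, the squared-difference costs, the per-pixel winner-take-all search, the visibility test (a pixel is treated as occluded when the other image's current map reports a larger disparity at the matched location), the constant occlusion penalty, and the weights. As a further simplification, the smoothness term is evaluated against the previous iteration's estimates, so the initialization pass in this toy omits it.

```python
def estimate(src, dst, sign, max_d, d_dst=None, d_prev=None,
             lam_compat=1.0, lam_smooth=0.1, occlusion_penalty=10.0):
    """Per-pixel winner-take-all disparity map for `src` matched against `dst`.

    sign   : -1 maps pixel x to x - d in dst, +1 maps it to x + d.
    d_dst  : current disparity map of dst (None during initialization, in
             which case compatibility is ignored and all pixels are visible).
    d_prev : previous disparity map of src, used by the smoothness term.
    """
    n = len(src)
    result = []
    for x in range(n):
        best_d, best_cost = 0, float("inf")
        for d in range(max_d + 1):
            xt = x + sign * d
            if xt < 0 or xt >= n:
                continue  # candidate maps outside the other image
            intensity = (src[x] - dst[xt]) ** 2
            if d_dst is None:
                data = intensity  # initialization: intensity term only
            elif d_dst[xt] > d:
                data = occlusion_penalty  # assumed occluded: do not match
            else:
                data = intensity + lam_compat * (d - d_dst[xt]) ** 2
            smooth = 0.0
            if d_prev is not None:
                for xn in (x - 1, x + 1):
                    if 0 <= xn < n:
                        # Weaker smoothness across brightness discontinuities.
                        w = 1.0 / (1.0 + abs(src[x] - src[xn]))
                        smooth += w * abs(d - d_prev[xn])
            cost = data + lam_smooth * smooth
            if cost < best_cost:
                best_d, best_cost = d, cost
        result.append(best_d)
    return result


def multi_view_estimate(img0, img1, max_d, iterations=2):
    """Return one disparity map per image using the two-phase schedule."""
    # Initializing phase: each map is estimated independently.
    d0 = estimate(img0, img1, -1, max_d)
    d1 = estimate(img1, img0, +1, max_d)
    # Refinement: re-estimate each map using the other map for visibility
    # and compatibility, starting from the previous iteration's estimates.
    for _ in range(iterations):
        new_d0 = estimate(img0, img1, -1, max_d, d_dst=d1, d_prev=d0)
        new_d1 = estimate(img1, img0, +1, max_d, d_dst=d0, d_prev=d1)
        d0, d1 = new_d0, new_d1
    return d0, d1


if __name__ == "__main__":
    # Hypothetical data: img1 is img0 shifted by two pixels, padded at the end.
    img0 = [4, 8, 1, 9, 3, 7, 2, 6, 5, 0]
    img1 = [1, 9, 3, 7, 2, 6, 5, 0, 11, 12]
    print(multi_view_estimate(img0, img1, max_d=3))
```

In this toy, pixels that have a true match in the other image settle on a disparity of two in both maps, while the border pixels without a counterpart are the ones whose candidates are affected by the visibility test.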
The foregoing new approach is motivated by several target applications. One application is view interpolation, where it is desired to generate novel views from a collection of images with associated depth maps. The use of multiple depth maps and images allows partially occluded regions to be modeled, and allows view-dependent effects (such as specularities) to be modeled by blending images taken from nearby viewpoints [6]. Another application is motion-compensated frame interpolation (e.g., for video compression, rate conversion, or de-interlacing), where the ability to predict bi-directionally (from both previous and future keyframes) yields better prediction results [11]. A third application is as a low-level representation from which segmentation and layer extraction (or 3D model construction) can take place.
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.