In order to reconstruct a 3D scene from a stereo-video sequence it is necessary to know the underlying camera poses and camera parameters. They can be obtained, for example, with the help of a structure from motion (SFM) algorithm. The problem of creating a dense model from this information and the available images is generally referred to as multi-view stereo (MVS).
Consider the simplest case of two images of a static scene taken by two cameras with known camera poses and camera parameters, i.e. a stereo-frame. From this data a dense model can be created as follows. The light from a 3D point in the scene hits the two camera sensors at different locations. If these locations are known, the depth of the point can be computed by triangulation. The process of finding such a pixel correspondence is referred to as disparity estimation. Applying this principle to all pixels leads to a dense 3D point cloud. In the following, one image together with the camera pose and parameters and the depth information are referred to as a “view”. Note that the depth estimates of a view need not necessarily be derived from a stereo frame, but could also be obtained from a time-of-flight sensor or a structured-light sensor, for example.
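The triangulation step can be illustrated for the special case of a rectified stereo frame, in which both cameras share the same focal length and their image rows are aligned. This rectified setup is an assumption made for the example and is not required by the text above; the function and parameter names are hypothetical. Under this assumption the depth Z of a point follows from its disparity d as Z = f·B/d, where f is the focal length in pixels and B is the baseline between the two cameras.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a point seen in a rectified stereo frame (illustrative sketch).

    disparity_px: horizontal offset between the two pixel locations, in pixels
    focal_px:     focal length in pixels (assumed equal for both cameras)
    baseline_m:   distance between the two camera centers, in meters
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative values
        # indicate a wrong correspondence and are treated the same way here.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Example: 800 px focal length, 0.5 m baseline, 20 px disparity.
example_depth = depth_from_disparity(20.0, 800.0, 0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error of one pixel causes a larger depth error for distant points than for near ones.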
Typically, work in this field focuses on the 3D reconstruction from a video sequence of a single camera or from several still images. The 3D reconstruction from a video sequence of stereo images has, so far, not received much attention. Here, the relative pose of the two cameras comprising the stereo camera is fixed. The relative pose can be precisely estimated together with the camera parameters in a calibration phase. Therefore, for each stereo image, depth estimates can be computed without errors stemming from the pose estimation. However, the disparity estimation process is prone to errors. In addition, the pose of the stereo camera at different times still needs to be estimated.
Outliers among the views are a problem in 3D reconstruction. Few authors have considered this problem. Simple approaches were proposed by E. Tola et al.: “Efficient large-scale multi-view stereo for ultra-high resolution image sets”, Machine Vision and Applications Vol. 23 (2012), pp. 903-920, and S. Shen: “Depth-Map Merging for Multi-View Stereo with High Resolution Images”, 21st International Conference on Pattern Recognition (ICPR) (2012), pp. 788-791. In both publications each 3D point from a main view is projected into each of N neighboring views. In each neighboring view, this yields a pixel location. From the depth information recorded for such a pixel, another 3D point is obtained. If the distance between this point and the original 3D point, taken relative to the depth of the pixel in the neighboring view, is below some threshold, the neighboring view is considered to be in agreement with the main view.
The depth information of the corresponding pixel in the main view is kept if there is agreement for n≧δ neighboring views, where δ is a free parameter. This approach does not distinguish between conflicts and possible occlusions. If δ<N, depth estimates may be kept when there is no agreement due to an occlusion, but also if there is strongly contradicting information from one or more side views. Furthermore, it is questionable whether or not the distance computed relative to a depth is the best measure for the comparison.
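The agreement test described above can be sketched as follows. This is a minimal illustration and not the implementation of either cited publication. It assumes a simplified pinhole camera whose axes are aligned with the world axes, so that a pose reduces to a camera center. The class `View`, the functions `agrees` and `keep_depth`, and the parameter `rel_tol` are hypothetical names introduced for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class View:
    f: float        # focal length in pixels (assumed pinhole model)
    cx: float       # principal point, x
    cy: float       # principal point, y
    center: tuple   # camera center in world coordinates (no rotation assumed)
    depth: list     # depth[v][u]; None where no estimate exists

    def project(self, p):
        """Pixel (u, v) hit by world point p, or None if p is not visible."""
        x, y, z = (p[i] - self.center[i] for i in range(3))
        if z <= 0:
            return None
        u = round(self.f * x / z + self.cx)
        v = round(self.f * y / z + self.cy)
        if 0 <= v < len(self.depth) and 0 <= u < len(self.depth[0]):
            return u, v
        return None

    def backproject(self, u, v):
        """World point obtained from the depth recorded at pixel (u, v)."""
        z = self.depth[v][u]
        if z is None:
            return None
        return ((u - self.cx) * z / self.f + self.center[0],
                (v - self.cy) * z / self.f + self.center[1],
                z + self.center[2])


def agrees(point, neighbor, rel_tol):
    """True if the neighbor's own 3D point lies close to `point`,
    measured relative to the depth stored in the neighboring view."""
    pix = neighbor.project(point)
    if pix is None:
        return False
    u, v = pix
    q = neighbor.backproject(u, v)
    if q is None:
        return False
    return math.dist(point, q) / neighbor.depth[v][u] < rel_tol


def keep_depth(main, u, v, neighbors, rel_tol, delta):
    """Keep the main-view depth at (u, v) if at least delta neighbors agree."""
    p = main.backproject(u, v)
    if p is None:
        return False
    n = sum(agrees(p, nb, rel_tol) for nb in neighbors)
    return n >= delta
```

Note that `agrees` returns False both when the neighboring view contradicts the main view and when the point is not visible there, which mirrors the limitation stated above: the count n does not distinguish conflicts from occlusions.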
A further related publication is P. Merrell et al.: “Real-Time Visibility-Based Fusion of Depth Maps”, IEEE 11th International Conference on Computer Vision (ICCV) (2007), pp. 1-8. In this publication the authors also consider one main view and N neighboring views, and two algorithms are presented for improving the depth estimates of the main view. In contrast to the previous works, they first project all pixels from the N neighboring views into the main view, leading to several depth estimates for each pixel of the main view. Both algorithms compute a new depth estimate from this information.
In the first approach, for each pixel of the main view, the algorithm starts with the smallest depth estimate and evaluates a stability function related to occlusions and free-space violations of the corresponding 3D point. For finding free-space violations, the 3D point related to the current depth hypothesis needs to be projected into all N neighboring views. The underlying idea of the stability function is that a free-space violation suggests that the depth of the pixel of the main view is underestimated, while an occlusion suggests that it is overestimated. The depth of a pixel is called stable when there is as much evidence that the depth is overestimated as there is that it is underestimated. The minimal stable depth hypothesis is chosen, and support for it is computed from the confidence of depth estimates which agree with it. Here agreement is based on the relative distance as above.
In the second approach a depth estimate is fused with all depth estimates agreeing with it. The confidence is the sum of the confidences of all agreeing depth estimates minus the confidences of the conflicting ones.
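The confidence rule of the second approach can be sketched as follows. The text above does not state how the agreeing depths are combined, so the confidence-weighted mean used here is an assumption. Treating every non-agreeing estimate as conflicting is a further simplification made for the example. The function `fuse` and the parameter `rel_tol` are hypothetical names.

```python
def fuse(candidate_depth, candidate_conf, others, rel_tol):
    """Fuse one depth estimate with the estimates that agree with it.

    others: list of (depth, confidence) pairs that were projected into the
            same main-view pixel from the neighboring views.
    Returns (fused_depth, fused_confidence).
    """
    agreeing = [(candidate_depth, candidate_conf)]
    conflicting_conf = 0.0
    for depth, conf in others:
        # Agreement based on the relative distance between the two depths.
        if abs(depth - candidate_depth) / candidate_depth < rel_tol:
            agreeing.append((depth, conf))
        else:
            # Simplification: every non-agreeing estimate counts as a conflict.
            conflicting_conf += conf
    agreeing_conf = sum(conf for _, conf in agreeing)
    # Assumed fusion rule: confidence-weighted mean of the agreeing depths.
    fused_depth = sum(depth * conf for depth, conf in agreeing) / agreeing_conf
    # Sum of agreeing confidences minus the conflicting ones.
    return fused_depth, agreeing_conf - conflicting_conf
```

With this rule a fused estimate can end up with a low or even negative confidence when it is contradicted by confident neighboring estimates, which allows such estimates to be discarded by thresholding.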
This second approach was extended recently in X. Hu et al.: “Least Commitment, Viewpoint-based, Multi-view Stereo”, Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission (3DIMPVT) (2012), pp. 531-538. The authors considered the possibility of having more than one depth estimate per pixel in the main view and defined agreement not based on the relative distance as above but on the geometrical uncertainty of each depth estimate.
A problem of the first algorithm is that it is questionable whether free-space violations and occlusions really indicate under- and overestimation of the original pixel's depth. Furthermore, a depth might be called stable even when there are strong conflicts with respect to some neighboring views. A disadvantage of both algorithms is that collecting all pixels from the neighboring views which project into the same pixel in the main view is computationally much more demanding than projecting from the main view into the neighboring views. Also, in contrast to the approaches of E. Tola et al. and S. Shen, generally many more than N projections from one view into another (up to N²) are required, and these projections are computationally costly.