Given two views of a scene, it is possible to estimate the binocular disparity between corresponding image features. The disparity of a scene-point is determined by its distance from the cameras used to capture images of the scene. The disparity can be used to predict the position of the corresponding image feature in a synthetic view. A synthetic view may be an image of the scene for which no camera position is available. In practice, the quality of the new image is limited by missing or inaccurate disparity information. For this reason, it is desirable to make repeated estimates of the scene structure, resulting in a disparity map for each of several pairs of views. A disparity map defines the position in each of the two source images of at least one given feature whose position varies with viewpoint. If the scene remains fixed, then it should be possible to combine the different depth estimates.
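By way of illustration only, the relationship between depth, disparity and the position of a feature in a synthetic view can be sketched as follows. The sketch assumes rectified, parallel cameras, a focal length expressed in pixels, and the sign convention that a feature at horizontal position x in the left image appears at x minus the disparity in the right image. The function names and parameters are hypothetical and are not taken from any of the cited works.

```python
def disparity_from_depth(depth, focal_px, baseline):
    """Disparity in pixels of a scene point at the given depth.

    Assumes rectified, parallel cameras separated by `baseline`,
    with focal length `focal_px` measured in pixels.
    """
    return focal_px * baseline / depth


def predict_position(x_left, disparity, t):
    """Horizontal position of a feature in a synthetic view.

    `t` is the fraction of the way from the left camera (t = 0)
    to the right camera (t = 1); intermediate values give views
    for which no physical camera was present.
    """
    return x_left - t * disparity


# A point 4 units away, seen by cameras 0.25 units apart.
d = disparity_from_depth(depth=4.0, focal_px=400.0, baseline=0.25)
x_mid = predict_position(x_left=100.0, disparity=d, t=0.5)
```

Where the disparity is missing or wrong, the predicted position is wrong by a proportional amount, which is why the accuracy of the disparity map limits the quality of the new image.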
Video input from a single, moving camera can be used to estimate the structure of a scene. For example, Matthies, Kanade & Szeliski, “Kalman Filter-based Algorithms for Estimating Depth from Image Sequences”, International Journal of Computer Vision 3, pp. 209-236, 1989, show that if the camera motion is known, optical flow information can be used to make reliable estimates of the scene depths. Matthies et al. use a Kalman filter to combine estimates based on successive pairs of frames. The uncertainty of each estimate is obtained from the residual error of the optical flow matching procedure. This information is used to make an optimal combination of the individual estimates, subject to an appropriate model of the image noise. If the camera motion is limited to horizontal translation, the video stream can be treated as a series of stereo image pairs with very small separations.
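The combination of successive estimates according to their uncertainties can be illustrated by the scalar measurement update of a Kalman filter for a static quantity. This is a minimal sketch, assuming a single depth value with Gaussian noise of known variance; it is not the full procedure of Matthies et al., which also has to propagate the depth map from frame to frame. All names are hypothetical.

```python
def fuse(estimate, variance, measurement, meas_variance):
    """Combine a running estimate with a new measurement.

    Each value is weighted by the inverse of its variance, so the
    less certain of the two has less influence on the result.
    """
    gain = variance / (variance + meas_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


# Two equally uncertain estimates are averaged, and the
# variance of the combined estimate is halved.
depth, var = fuse(10.0, 4.0, 14.0, 4.0)
```

The variance of the fused estimate is never larger than that of the running estimate, which is why repeated estimates from successive frame pairs tend to improve the result.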
Okutomi & Kanade, “A Multiple Baseline Stereo”, IEEE Trans. Pattern Analysis and Machine Intelligence 15(4), 353-363, 1993, disclose a depth estimation algorithm that uses multiple stereo image pairs with different separations. One fixed view is paired with a series of other images, each taken from a different position. This arrangement produces an input which is qualitatively different from the paired video frames used by Matthies et al., as the latter do not contain a fixed view. Another difference is that, rather than combining disparity estimates, the Okutomi & Kanade algorithm combines the evidence for such estimates from the different image pairs. The integrated evidence is then used to make a final decision about the scene structure.
The Okutomi & Kanade algorithm is based on a simple search procedure. The disparity of a given ‘source’ point is estimated by matching it to the most similar ‘target’ point in the other image. In more detail, regions of pixels are defined around the source point and around each candidate target point. The sum of squared colour differences between the source region and each target region is then computed. The underlying disparity is estimated by searching for the lowest value of this function, assuming that the correct target point minimizes the squared difference. Since the cameras are parallel, the search is only performed in the horizontal direction, resulting in a 1-D function at each image point. This approach has two well-known problems. Firstly, the correct target point may not be associated with the lowest matching error, meaning that there may be false minima in the disparity function. Secondly, it may be impossible to determine the precise location of the true match, meaning that the minimum of the disparity function may be poorly defined.
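The search procedure can be sketched as follows. This is a minimal illustration, assuming single-channel intensities rather than colours, a single image row, integer disparities, and the convention that a feature at position x in the source row appears at x minus the disparity in the target row. The function names are hypothetical, and the caller is assumed to keep every window inside the row.

```python
def window_ssd(source_row, target_row, x, d, half):
    """Sum of squared differences between the window centred at x in
    the source row and the window centred at x - d in the target row."""
    src = source_row[x - half : x + half + 1]
    tgt = target_row[x - d - half : x - d + half + 1]
    return sum((p - q) ** 2 for p, q in zip(src, tgt))


def best_disparity(source_row, target_row, x, half, max_disparity):
    """Return the disparity with the lowest matching error at x.

    Only disparities whose target window lies inside the row are tried.
    """
    candidates = [d for d in range(max_disparity + 1) if x - d - half >= 0]
    return min(candidates,
               key=lambda d: window_ssd(source_row, target_row, x, d, half))


# The target row is the source row shifted two pixels to the left.
source = [0, 0, 0, 0, 0, 9, 5, 1, 0, 0]
target = [0, 0, 0, 9, 5, 1, 0, 0, 0, 0]
d = best_disparity(source, target, x=6, half=1, max_disparity=5)
```

In a row containing repeated or uniform texture, several disparities can give equally low errors, which corresponds to the false and poorly defined minima described above.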
Okutomi & Kanade show that these problems can be countered by using a range of different camera separations. The point matching costs for each image pair are computed with respect to ‘inverse depth’, which can be defined as disparity divided by camera separation. It follows that the resulting functions, one for each stereo image pair, will share a single parameterisation. This means that the errors can be added together, and that the true inverse depth of a given point can be estimated from the minimum of the composite function. Okutomi & Kanade show that this procedure has two important consequences. Firstly, false minima in the individual matching functions tend to be suppressed in the composite function. Secondly, the true minimum tends to become better defined as the individual functions are added.
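The summation of matching errors over a shared inverse-depth parameter can be sketched as follows. This is an illustrative simplification, assuming single-channel image rows, one fixed source row, and baselines expressed in units such that baseline multiplied by inverse depth gives the disparity in pixels (rounded to the nearest integer). The names are hypothetical, and the caller is assumed to choose candidates that keep every window inside the row.

```python
def window_ssd(source_row, target_row, x, d, half):
    """Sum of squared differences between the window centred at x in
    the source row and the window centred at x - d in the target row."""
    src = source_row[x - half : x + half + 1]
    tgt = target_row[x - d - half : x - d + half + 1]
    return sum((p - q) ** 2 for p, q in zip(src, tgt))


def best_inverse_depth(source_row, targets, x, half, candidates):
    """Return the candidate inverse depth with the lowest summed error.

    `targets` is a list of (baseline, target_row) pairs. Each pair has
    its own disparity for a given inverse depth, but all of the error
    functions share the inverse-depth parameter and can be added.
    """
    def total_error(inv_depth):
        return sum(
            window_ssd(source_row, row, x, round(baseline * inv_depth), half)
            for baseline, row in targets
        )
    return min(candidates, key=total_error)


source = [0, 0, 0, 0, 0, 0, 0, 9, 5, 1, 0, 0]
near = [0, 0, 0, 0, 0, 9, 5, 1, 0, 0, 0, 0]   # baseline 1, shifted by 2
far = [0, 0, 0, 9, 5, 1, 0, 0, 0, 0, 0, 0]    # baseline 2, shifted by 4
z = best_inverse_depth(source, [(1, near), (2, far)], x=8, half=1,
                       candidates=[0, 1, 2, 3])
```

A false minimum generally falls at a different inverse depth for each baseline, so it is raised by the other terms of the sum, whereas the true minimum coincides in every term.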
Szeliski & Golland, Microsoft Corp: “Method for Performing Stereo Matching to Recover Depths, Colors and Opacities of Surface Elements”, 1997, U.S. Pat. No. 5,917,937, disclose another multi-view image representation. This involves mapping a collection of images (typically three or more) into a common coordinate system. The necessary projective transformations can be derived from the positions, orientations and internal parameters of the original cameras. Each point in the common image coordinates is associated with a range of possible scene depths. Each scene depth is in turn associated with a colour from each input view. This representation is a generalized disparity space which extends the two-view structure used by Marr & Poggio, “Cooperative computation of stereo disparity”, Science 194, 283-287, 1976.
Rather than using the different images to estimate a disparity map in the common coordinate system, Szeliski & Golland render a novel view directly. This is achieved by measuring the mean and variance of the colours at each point in the disparity space. The appearance of each scene point is expected to be consistent across the different input images and so the corresponding variances should be low. The mean colour at each point is associated with an opacity which is inversely proportional to the variance. Szeliski & Golland show that pixels in a new view can be estimated by compositing the opacity-weighted mean colours along each disparity ray.
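The idea of weighting mean colours by a variance-derived opacity can be sketched for a single disparity ray as follows. This is an illustration under several assumptions that are not taken from Szeliski & Golland: grey values stand in for colours, the opacity is the reciprocal of the variance clamped to the range 0 to 1, and the samples are combined front to back with the standard "over" compositing rule. All names are hypothetical.

```python
from statistics import mean, pvariance


def opacity_from_variance(variance, scale=1.0):
    """Hypothetical mapping: opacity falls as variance rises, clamped to [0, 1]."""
    if variance <= 0:
        return 1.0
    return min(1.0, scale / variance)


def composite_ray(samples, scale=1.0):
    """Composite opacity-weighted mean values along one disparity ray.

    `samples` is ordered front to back; each entry holds the values
    observed in the input views at one point of the disparity space.
    """
    value = 0.0
    transmittance = 1.0  # fraction of the ray not yet blocked
    for views in samples:
        alpha = opacity_from_variance(pvariance(views), scale)
        value += transmittance * alpha * mean(views)
        transmittance *= 1.0 - alpha
    return value


# The first sample is inconsistent across views (high variance), so it
# is nearly transparent; the second is consistent and fully opaque.
pixel = composite_ray([[0.0, 10.0], [6.0, 6.0], [3.0, 3.0]])
```

Points at which the input views disagree contribute little to the rendered pixel, while a consistent point dominates it and hides everything behind it.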
Leclerc, Luong and Fua, “Measuring the Self-Consistency of Stereo Algorithms”, Proc. European Conference on Computer Vision 2000, pp. 282-298, disclose a procedure for measuring the self-consistency of stereo disparity maps. This is intended as a means of evaluating binocular correspondence algorithms, and of determining appropriate parameter settings. It is assumed that if a number of disparity maps are estimated from different images of the same scene, their mutual consistency will be representative of their accuracy. This assumption means that no ground-truth data is required by the evaluation procedure. As in the Szeliski & Golland rendering scheme, the camera parameters are used to map the images into a common coordinate system. A matched pair of points, one from each of two images, defines a single point in the scene. A subsequent match between one point of the pair and another point from a third image should define the same scene point. The Leclerc, Luong & Fua algorithm evaluates this consistency condition over a set of disparity maps obtained from images of a single scene.
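A much simplified form of such a consistency test can be sketched as follows. The sketch assumes rectified, collinear cameras and two disparity maps that share the same reference image, so that a scene point should yield the same inverse depth (disparity divided by camera separation) from either pair. This is not the measure of Leclerc, Luong & Fua, which handles general camera geometry; the names and the tolerance are hypothetical.

```python
def consistent_fraction(disp_a, baseline_a, disp_b, baseline_b, tol):
    """Fraction of reference pixels whose two disparity estimates
    imply the same inverse depth to within `tol`.

    `disp_a` and `disp_b` are equal-length sequences of disparities
    for the same reference pixels, from pairs with different baselines.
    """
    agree = sum(
        1
        for da, db in zip(disp_a, disp_b)
        if abs(da / baseline_a - db / baseline_b) <= tol
    )
    return agree / len(disp_a)


# The last pixel implies inverse depth 8 from one pair and 10 from the other.
score = consistent_fraction([2.0, 4.0, 6.0, 8.0], 1.0,
                            [4.0, 8.0, 12.0, 20.0], 2.0, tol=0.5)
```

No ground-truth depth appears anywhere in the computation; the score depends only on agreement between the two maps.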
Viola & Wells, “Alignment by Maximisation of Mutual Information”, International Journal of Computer Vision 24(2), pp. 137-154, 1997, describe an algorithm that can be used to bring images and 3-D models into registration. This is achieved by optimizing the mutual information between the data sets that are being aligned. The advantage of the mutual information measure is that each data set can measure a different function of the underlying structure. For example, a 3-D model can be aligned to an image by maximizing the mutual information between the surface-normal vectors and the pixel intensities. This can be achieved despite the lack of a clear definition of distance between the normals and the intensities.
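The mutual information between two sets of paired observations can be sketched as follows. This is a minimal illustration using discrete values and simple counts; Viola & Wells instead estimate the required densities from samples of continuous data, and that estimation is not reproduced here. The function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information, in bits, between paired discrete observations.

    Only the co-occurrence counts are used, so the two sequences may
    hold values of entirely different kinds.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Labels and numbers that always co-occur share one full bit.
aligned = mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])
# Values that co-occur at chance level share no information.
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares numbers with labels, between which no distance is defined, and still returns a meaningful score. An alignment procedure of this kind would search for the transformation that maximizes the score.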