3D perception is a very important aspect of human vision. While human beings can perceive 3D information effectively and effortlessly, it is still quite hard for a computer to extract a 3D model out of natural scenes automatically. Furthermore, whilst using 3D perception to imagine scenes from a slightly different angle is also effortless for a human being, the similar operation for a machine is fundamentally dependent upon the extraction of a suitable 3D model which the computer may then use to generate another image of the scene from the different angle.
Previous attempts to extract the 3D structure of a scene from images of that scene have used various kinds of cues: stereo, motion, shading, focus/defocus, zoom, contours, texture, range data, and even X-ray. Among these, stereo vision has been studied most extensively, mainly because of its effectiveness, applicability, and similarity to the human vision system.
FIG. 1 shows a typical stereo configuration. IPL (IPR) is the image plane of the left (right) camera. OL (OR), called the optical centre, is the centre of the focus of the projection. The line LOL (LOR), through OL (OR) and perpendicular to the image plane IPL (IPR), is called the optical axis. The intersection of the optical axis LOL (LOR) and the image plane IPL (IPR) is oL (oR), which is called the principal point or image centre. The distance between OL (OR) and oL (oR) is the focal length fL (fR) of the left (right) camera. The line Le goes through both OL and OR. The left (right) epipole eRL (eLR) is the projection of OR (OL) into the left (right) camera. For a 3D scene point P, its projection in the left (right) camera is pL (pR). The plane determined by P, OL, and OR is called the epipolar plane of P. The intersection EpL (EpR) of this plane with IPL (IPR) is called the epipolar line. It is easy to check that the epipolar line EpL (EpR) must go through the epipole eRL (eLR).
FIG. 2 shows a typical videoconferencing arrangement which embodies the stereo set-up of FIG. 1. A user is seated upon a chair at a table 20, directly facing a screen 22 which displays an image of the other video-conference participants at the other end of a communications link. Disposed around the edge of the screen 22 are a plurality of cameras 24 facing the user, and arranged to capture images of the user. The images from any two or more of the cameras can be used as the stereo images required to extract 3D information.
Employing a converging stereo set-up as shown in FIG. 1 or 2, traditionally the problem of 3D structure reconstruction is solved by following three typical steps:
        1. Stereo calibration: Calculating the stereo internal physical characteristics (the intrinsic parameters) and the 3D position and orientation (the extrinsic parameters) of the two cameras with respect to a world coordinate system or with respect to each other using some predefined objects (the passive calibration) or auto-detected features (the self-calibration);
        2. Correspondence estimation: Determining for each pixel in each image the corresponding pixels in the other images of the scene which represent the same 3D scene point at the same point in time; and
        3. 3D reconstruction: By triangulation, each 3D point can be recovered from its two projections into the left and right camera.
Out of these three steps the most challenging has proven to be the step of correspondence estimation. There are several main difficulties in obtaining correspondence estimation to a suitable accuracy:
        1. Inherent ambiguity due to the 2D search within the whole image space;
        2. Occlusions: Some parts of the 3D scene cannot be seen by both cameras, and hence there will be no corresponding matching pixel in the other image;
        3. Photometric distortion: The projection of a single 3D point into the two or more cameras appears with different image properties. An example of such a distortion is specular reflection of the scene light source into one of the cameras but not any of the others. In such a case the apparent intensity of light reflected from the 3D scene point would be much greater in the view which was suffering from specular reflections than in the other view(s), and hence matching of corresponding pixels between the images is made almost impossible; and
        4. Projective distortion: The shape of the same 3D object changes between the stereo images, e.g. a circular object will appear circular to a camera directly facing it, but elliptical to another camera at an oblique angle thereto.
Fortunately, the first difficulty of inherent ambiguity can be avoided to a certain degree by using the epipolar geometry, which means that, for a given pixel (e.g. pL in FIG. 1), its corresponding pixel (e.g. pR) in another image must lie on the epipolar line (e.g. EpR). The position of this epipolar line can be accurately computed from the parameters of the camera set-up, e.g. by intersecting the epipolar plane (e.g. formed by pL, OL, and OR) with the other image plane (e.g. IPR). Thus the 2D search is simplified to a 1D search problem. More conveniently, in the stereo set-up, it is possible to rectify the pair of stereo images so that the conjugate epipolar lines are collinear and parallel to the horizontal axis of the image plane, as described in A. Fusiello, E. Trucco and A. Verri. A Compact Algorithm for Rectification of Stereo Pairs. Machine Vision and Applications. 12. pp. 16-22. 2000. In this case, the two cameras share the same image plane and the line connecting their optical centres is parallel to the horizontal axis. This stereo set-up is called the parallel stereo set-up. After the rectification, the 2D correspondence problem is further simplified into a 1D search along the epipolar line, which coincides with a scanline. This searching process is commonly referred to as disparity estimation.
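As a minimal sketch of the epipolar constraint, the Python below computes the epipolar line in the right image for a given left pixel. It uses a fundamental matrix F, a standard representation of two-view geometry which the text above does not name and which is introduced here purely as an assumption. F_RECTIFIED is the textbook form for an ideal parallel set-up (identical cameras displaced only along the horizontal axis); the function names are hypothetical.

```python
def epipolar_line(F, x, y):
    """Return (a, b, c) such that a*x' + b*y' + c = 0 is the line in the
    other image on which the match of pixel (x, y) must lie (line = F * p)."""
    p = (x, y, 1.0)  # homogeneous pixel coordinates
    return tuple(sum(F[r][k] * p[k] for k in range(3)) for r in range(3))


def on_line(line, x, y, tol=1e-9):
    """True if pixel (x, y) lies on the line (a, b, c) within a tolerance."""
    a, b, c = line
    return abs(a * x + b * y + c) <= tol


# Assumed fundamental matrix of an ideal parallel (rectified) set-up:
# the skew-symmetric matrix of the horizontal direction (1, 0, 0).
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

# Left pixel (5, 3): its epipolar line is -y' + 3 = 0, i.e. the scanline y' = 3.
line = epipolar_line(F_RECTIFIED, 5.0, 3.0)
same_row = on_line(line, 9.0, 3.0)    # a candidate on the same scanline
other_row = on_line(line, 9.0, 4.0)   # a candidate on a different scanline
```

Under these assumptions the line has no dependence on x', which is the rectified case in which the search runs along a single scanline.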
For solving the correspondence (or disparity) estimation problem, three issues should be addressed:
        1. What kind of elements are used for matching;
        2. What form of measurements should be employed;
        3. How should the image searching process be performed.
Various kinds of matching elements have been used, including sparse image features, intensity blocks centred at a pixel, individual pixels, and phase information. The form of similarity measurement used depends largely on the matching elements: for example, correlation is usually applied in block matching, while the distance between feature descriptors has been used for judging feature similarity. With respect to the searching process, two types have previously been used. One is global optimisation, by minimising a certain cost function. The optimisation techniques employed include dynamic programming, graph cut, and radial basis functions. The other is the “winner-take-all” strategy within a given limited range. For a detailed discussion about classification of stereo matching, please refer to B. J. Lei, Emile A. Hendriks, and M. J. T. Reinders. Reviewing Camera Calibration and Image Registration Techniques. Technical report on “Camera Calibration” for MCCWS, Information and Communication Theory Group. Dec. 27, 1999.
In the stereo vision case, the correspondence estimation problem is usually called stereo matching. With the parallel stereo set-up described previously (whether obtained by image rectification or by the geometry of the image capture apparatus), stereo matching is simplified into a 1D disparity estimation problem. That is, given a pair of stereo views IL(x,y) and IR(x,y) coming from a parallel set-up, the disparity estimation task aims at estimating two disparity maps dLR(x,y) and dRL(x,y) such that:
        IL(x,y) = IR(x + dLR(x,y), y)   Eq. 1
        IR(x,y) = IL(x + dRL(x,y), y)   Eq. 2
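The following Python is a minimal sketch of 1D disparity estimation by block matching with a “winner-take-all” choice, one of the search strategies mentioned earlier. It is not the method of any cited work: the sum-of-absolute-differences cost, the border clamping, the window size, the search range, and all function names are assumptions chosen for illustration. The returned map follows the convention of Eq. 1, so that src(x,y) is matched to dst(x+d,y).

```python
def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def sad_cost(src, dst, x, y, d, half_win):
    """Sum of absolute differences between the block centred at (x, y) in src
    and the block centred at (x + d, y) in dst; coordinates are clamped."""
    h, w = len(src), len(src[0])
    total = 0
    for dy in range(-half_win, half_win + 1):
        yy = clamp(y + dy, 0, h - 1)
        for dx in range(-half_win, half_win + 1):
            xs = clamp(x + dx, 0, w - 1)
            xd = clamp(x + d + dx, 0, w - 1)
            total += abs(src[yy][xs] - dst[yy][xd])
    return total


def disparity_wta(src, dst, d_min, d_max, half_win=1):
    """For each pixel keep the displacement in [d_min, d_max] with the lowest
    block cost (winner-take-all), searching only along the scanline."""
    h, w = len(src), len(src[0])
    return [[min(range(d_min, d_max + 1),
                 key=lambda d: sad_cost(src, dst, x, y, d, half_win))
             for x in range(w)] for y in range(h)]


# Synthetic pair: the right view is the left view displaced by TRUE_D pixels,
# so that left(x, y) == right(x + TRUE_D, y) wherever both are defined.
W, H, TRUE_D = 16, 6, -2
left = [[x * x + 10 * y for x in range(W)] for y in range(H)]
right = [[left[y][x - TRUE_D] if 0 <= x - TRUE_D < W else 0
          for x in range(W)] for y in range(H)]

d_lr = disparity_wta(left, right, -4, 4)
```

Away from the image borders, where neither block is clamped or filled, the estimate recovers the displacement used to build the pair. Real images add the occlusion, photometric, and projective difficulties listed earlier, which this sketch does not handle.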
The nature of the disparity maps dLR(x,y) and dRL(x,y) will become more apparent by a consideration of FIGS. 3 and 4.
In order to provide ground-truth information to gauge the performance of both prior art methods of disparity estimation and the method to be presented herein according to the present invention, we have created a pair of synthetic stereo images, shown in FIGS. 3a and 3b, by using ray tracing from real images. The synthetic 3D scene consists of one flat ground-plane and three spheres located at different distances. Four real images are then mapped onto these four surfaces, the most apparent being the image of a baboon's face which is mapped onto the spherical surface in the foreground. In addition to producing the synthetic stereo pair of FIGS. 3a and 3b, the ray tracing technique was also employed to produce a middle view as shown in FIG. 4b, as well as a ground truth left to right disparity map as shown in FIG. 4a. The disparity map contains a respective displacement value d for each respective pixel in FIG. 3a (which represents the left stereo view) which, when applied to the position (x,y) of that pixel, gives the position (x+d,y) of its corresponding pixel in the right stereo view of FIG. 3b. That is, as will be apparent from equations 1 and 2 given previously, the intensity value of each pixel in the disparity map gives the displacement required to get from a pixel in the left view to the corresponding pixel in the right view. In this respect, while a disparity map can be conveniently displayed as an image, as is done in FIG. 4a, it can more properly be considered as simply a matrix of displacement values, the matrix having the same dimensions as each stereo image, such that it contains a single one-dimensional displacement value for each pixel in one of the stereo images.
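To make the matrix view of a disparity map concrete, the short Python sketch below applies a left to right map to pixel positions and uses it, following Eq. 1, to rebuild the left view from the right view. The tiny arrays, the fill value for positions falling outside the image, and the function names are all made up for illustration.

```python
def corresponding_position(d_map, x, y):
    """Position in the other view of the pixel at (x, y): (x + d, y)."""
    return (x + d_map[y][x], y)


def reconstruct_from_other_view(other, d_map, fill=None):
    """Rebuild a view by sampling the other view at (x + d(x, y), y), as in
    Eq. 1. Positions that fall outside the image receive `fill`."""
    h, w = len(other), len(other[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            xr = x + d_map[y][x]
            row.append(other[y][xr] if 0 <= xr < w else fill)
        out.append(row)
    return out


# Hypothetical 2x4 right view and left to right disparity matrix.
right = [[10, 20, 30, 40],
         [50, 60, 70, 80]]
d_lr = [[1, 1, 1, 0],
        [2, 0, 0, -1]]

left_hat = reconstruct_from_other_view(right, d_lr)
pos = corresponding_position(d_lr, 0, 1)
```

The disparity matrix has exactly one entry per pixel of the left view, and each entry is a horizontal displacement only; the row index is never changed.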
Furthermore, it should also be noted that between any pair of stereo images two disparity maps are usually generated, a first map containing the displacement values in a first direction to obtain the displacements from the left to the right image, and a second map containing displacement values representing displacements in the opposite direction to provide pixel mappings from the right to the left images. In theory the respective values between a particular matched pair of pixels in the left and right images in each of the left to right and right to left disparity maps should be consistent, as will be apparent from equations 1 and 2.
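The consistency between the two maps can be written out explicitly. By Eq. 1, the left pixel (x,y) matches the right pixel (x + dLR(x,y), y); by Eq. 2, that right pixel maps back to the left position x + dLR(x,y) + dRL(x + dLR(x,y), y), which should equal x. Hence dLR(x,y) + dRL(x + dLR(x,y), y) = 0 for a correctly matched pair. The Python below is a sketch of such a check; the tolerance parameter, the treatment of matches that fall outside the image as inconsistent, and the example values are assumptions.

```python
def consistency_mask(d_lr, d_rl, tol=0):
    """True where d_lr(x, y) + d_rl(x + d_lr(x, y), y) is within tol of zero.
    Matches that land outside the image are marked inconsistent."""
    h, w = len(d_lr), len(d_lr[0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            xr = x + d_lr[y][x]  # matched position in the right view
            ok = 0 <= xr < w and abs(d_lr[y][x] + d_rl[y][xr]) <= tol
            row.append(ok)
        mask.append(row)
    return mask


# Hypothetical one-row maps: the first two pixels agree in both directions,
# the last two do not.
d_lr = [[1, 1, 1, 0]]
d_rl = [[0, -1, -1, 2]]
mask = consistency_mask(d_lr, d_rl)
```

In practice the two maps disagree at occlusions and at mismatched pixels, so a mask of this kind is one way to flag unreliable disparities.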
In order to provide for a later comparison with the results of the present invention to be described, the disparity estimation results provided by two existing disparity estimation methods, being those of hierarchical correlation and pixel based dynamic programming will now be described. The results comprise a disparity estimation map together with a synthesised middle view using the matching information thus obtained for each algorithm, as respectively shown in FIGS. 5 and 6. More particularly, FIG. 5a shows the left to right disparity map generated by the hierarchical correlation algorithm, and FIG. 5b illustrates the synthesised middle view using the disparity information thus obtained. FIG. 6a illustrates the left to right disparity map obtained using the pixel based dynamic programming method, and FIG. 6b illustrates the synthesised middle view generated using the disparity information of FIG. 6a. In both cases it can be seen that problems exist in the region of the baboon's nose, in that incorrect correspondence estimation between respective pixels of the two stereo images which represent this feature has led to the anomalies in each disparity map, and hence the problems in the synthesised middle view images. The exact anomalies generated by the prior art algorithms when applied to the ground truth stereo image pair of FIG. 3 will become apparent by comparing FIGS. 5 and 6 respectively with the ground truth images of FIG. 4.
R. Szeliski. Stereo algorithms and representations for image-based rendering, in British Machine Vision Conference (BMVC'99), volume 2, pages 314-328, Nottingham, England, September 1999, contains a very good review of other disparity estimation methods particularly used for image based rendering purposes, and further experimental comparisons are given in R. Szeliski and R. Zabih. An experimental comparison of stereo algorithms, Vision Algorithms 99 Workshop, Kerkyra, Greece, September 1999. Compared with feature, pixel, and frequency-based methods, it seems that a block matching approach combined with a “winner-take-all” strategy can produce disparities of sufficient quality in real time (see Changming Sun, A Fast Stereo Matching Method, Digital Image Computing: Techniques and Applications, Massey University, Auckland, New Zealand, 10-12 Dec. 1997), which is crucial for many applications such as teleconference systems. However, in order to obtain a better quality of results, there still exist two major difficulties:
        1. Choosing an appropriate window size. The larger the window, the more robust the matching is against noise and the smoother the disparity maps are; however, some details will be lost and discontinuities may be smoothed, and vice versa. Essentially, the window size should capture the most important spatial scale of the images being dealt with. Various approaches have been attempted in the past to optimise the window size, with varying degrees of success; in particular see James J. Little, Accurate Early Detection of Discontinuities. Vision Interface 92, and T. Kanade and M. Okutomi. A Stereo Matching Algorithm with an Adaptive Window: Theory and Experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-16. pp. 920-932. 1994.
        2. Projective distortion: As mentioned previously, the perspective projection of the cameras generally makes the appearance of a 3D object differ between the two images of the stereo pair. Traditionally, within the disparity estimation art this issue was addressed by taking into account the slanted surface, which was tolerated by up-sampling the stereo images in advance (see P. A. Redert, C. J. Tsai, E. A. Hendriks, and A. K. Katsaggelos. Disparity estimation with modeling of occlusion and object orientation. Proceedings of the SPIE conference on Visual Communications and Image Processing (VCIP), volume 3309, pages 798-808, San Jose, Calif., USA, 1998) or by non-linear diffusion (see Szeliski, R. and Hinton, G. E. Solving random-dot stereograms using the heat equation. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, San Francisco. 1986). However, projective distortion generally changes the appearance of a 3D object in the two stereo images (e.g. the curvature of texture on the surface of a sphere) with the result that most block-based methods fail to match corresponding pixels and features between images.
Traditionally, existing correspondence estimation algorithms have addressed only one or the other of the above two issues, and not both at the same time.