The present invention relates to a method of and an apparatus for 3-dimensional structure estimation which is used for obtaining 3-dimensional information of an object from 2-dimensional image data of the object, and more particularly to those based on triangulation making use of multiple sets of 2-dimensional image data of an object taken from multiple viewing positions.
There is a 3-dimensional structure estimation technique called the stereo-method, which estimates the 3-dimensional structure of an object by triangulation from multiple sets of 2-dimensional image data taken from multiple viewing positions. A conventional example of the stereo-method is described in a paper entitled “A Multiple-Baseline Stereo” by Okutomi et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 353-363, Vol. 15, No. 4, April 1993.
First, the principle of the stereo-method is described with reference to the schematic diagram of FIG. 6.
Suppose that a first camera 10-1, having a lens with a focal distance F, is positioned on an X-Y plane perpendicular to the sheet of FIG. 6 so that the center of its lens is at coordinates (X1, 0) and its optical axis is perpendicular to the X-Y plane, and that a second camera 10-2, having a lens with the same focal distance F, is positioned parallel to the first camera 10-1 so that the center of its lens is at coordinates (X2, 0).
Defining the coordinates (X1, 0) and (X2, 0) as the viewing positions of the first camera 10-1 and the second camera 10-2, respectively, the distance B = X2 − X1 between the two viewing positions is hereafter called the baseline B of the first and the second cameras 10-1 and 10-2.
When a first and a second picture of an object 1 are taken by the first and the second cameras 10-1 and 10-2 having the baseline B, and a position P of the object 1 is projected at points p1 and p2 of the first and the second picture, that is, on the focal planes of the first and the second cameras 10-1 and 10-2, respectively, the disparity d between the points p1 and p2 is represented as follows:
$$d = x_2 - x_1 = BF/z, \tag{1}$$
where x1 and x2 are the x-components of the coordinates of the points p1 and p2 on x-y planes having their origins at the centers of the first and the second picture, respectively, and z is the depth, that is, the distance from the X-Y plane to the position P of the object 1.
Therefore, the 3-dimensional structure of the object 1 can be estimated from the disparity d if it is known which point p2 of the second picture corresponds to each point p1 of the first picture.
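As a minimal sketch of how the equation (1) is used in both directions, the following Python fragment computes a disparity from a depth and a depth from a disparity. The function names and the numerical values (baseline in metres, focal distance expressed in pixels) are assumptions chosen for illustration and do not come from the source.

```python
def disparity_from_depth(baseline, focal, depth):
    """Equation (1): d = B * F / z."""
    return baseline * focal / depth


def depth_from_disparity(baseline, focal, disparity):
    """Equation (1) solved for the depth: z = B * F / d."""
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return baseline * focal / disparity


# Illustrative values only (assumptions, not taken from the source).
B = 0.1      # baseline, metres
F = 800.0    # focal distance, expressed in pixels
d = 20.0     # measured disparity x2 - x1, pixels

z = depth_from_disparity(B, F, d)   # depth in metres
```

Because the disparity is inversely proportional to the depth, distant points produce small disparities, which is why the resolution of the disparity measurement limits the depth precision.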
In general, the stereo-method is performed according to an algorithm wherein the depth z at a point of interest p1 of the first picture is estimated by retrieving the point p2 of the second picture corresponding to the point of interest p1. By repeating this procedure for each point p1 of the first picture, the depth of each position P of the object 1 is estimated on the first picture taken by the first camera 10-1.
In many algorithms, correspondence is determined when an evaluation value, such as the brightness difference between the two points p1 and p2 concerned, or the sum of brightness differences between two small regions around the two points p1 and p2, becomes minimum within a retrieving range defined as follows. When the possible depth z to be obtained is between zmin and zmax, the disparity d should be between dmin = BF/zmax and dmax = BF/zmin from the equation (1).
Therefore, the corresponding point p2 should be retrieved in the range x1 + dmin ≤ x2 ≤ x1 + dmax.
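The retrieving range can be sketched as follows. The helper names, the restriction to integer pixel columns and the numerical values are assumptions made for illustration. The example also shows that doubling the baseline roughly doubles the number of candidate columns to be examined.

```python
import math


def disparity_search_range(baseline, focal, z_min, z_max):
    """Return (dmin, dmax) with dmin = B*F/zmax and dmax = B*F/zmin."""
    return baseline * focal / z_max, baseline * focal / z_min


def candidate_columns(x1, baseline, focal, z_min, z_max):
    """Integer columns x2 satisfying x1 + dmin <= x2 <= x1 + dmax."""
    d_min, d_max = disparity_search_range(baseline, focal, z_min, z_max)
    first = math.ceil(x1 + d_min)
    last = math.floor(x1 + d_max)
    return list(range(first, last + 1))


# Illustrative values only (assumptions): depths between 2 and 8 units.
short = candidate_columns(100, 0.5, 160.0, 2.0, 8.0)   # baseline 0.5
long_ = candidate_columns(100, 1.0, 160.0, 2.0, 8.0)   # baseline doubled
```

With these values the short baseline yields columns 110 to 140 and the doubled baseline yields columns 120 to 180.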
In some algorithms, points in the retrieving range whose evaluation value, brightness difference for example, is within a threshold value are selected as candidates for the corresponding point, and the candidate which gives the smoothest variation of the depth z is determined as the corresponding point. Further, when an obstacle 2 is known to be in front of the object 1, as illustrated in FIG. 7, correspondences retrieved in the range where the obstacle 2 should exist are rejected in many algorithms as physically impossible.
Returning to the equation (1), the disparity d is proportional to the baseline B for the same depth z, and the precision of the disparity d is limited by the picture resolution. Therefore, a larger disparity d gives a higher precision of the estimated depth z, and a longer baseline B is preferable for this purpose. However, a longer baseline B gives a wider retrieving range, as described above, increasing the possibility of a false correspondence.
Therefore, there is a tradeoff between the precision of the estimation and the frequency of false correspondences.
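The tradeoff can be made explicit by a short first-order calculation from the equation (1). The symbols δd and δz, denoting a small error of the disparity and the resulting error of the depth, are introduced here for illustration only and do not appear in the source.

```latex
z = \frac{BF}{d}
\;\Longrightarrow\;
\frac{\partial z}{\partial d} = -\frac{BF}{d^{2}} = -\frac{z^{2}}{BF},
\qquad
|\delta z| \approx \frac{z^{2}}{BF}\,|\delta d|

d_{\max} - d_{\min} = BF\left(\frac{1}{z_{\min}} - \frac{1}{z_{\max}}\right)
```

For a disparity error δd fixed by the picture resolution, the depth error at a given depth z is inversely proportional to the baseline B, whereas the width of the retrieving range is directly proportional to B.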
Techniques for dealing with this tradeoff can be classified into two methods. In one method, a coarse estimation is performed by retrieving correspondence between a pair of low resolution images, then a precise estimation is performed with a pair of high resolution images eliminating false correspondence inconsistent with the coarse estimation. Another approach is a method (hereafter called the multi-baseline stereo method) wherein multiple images of an object taken from multiple viewing positions having different baselines are used so that the evaluation value varies greatly according to whether there is correspondence or not.
In the above-mentioned paper by Okutomi et al., the latter approach, namely the multi-baseline stereo method, is applied.
Now, the multi-baseline stereo-method in the prior paper is described referring to a schematic diagram of FIG. 8.
In FIG. 8, n pictures of an object 1 are taken by first to n-th cameras 10-1 to 10-n, each having a lens with a focal distance F and positioned at one of the viewing positions (X1, 0) to (Xn, 0) on an X-Y plane so that its optical axis is perpendicular to the X-Y plane, n being a positive integer. Each of the baselines B1,2 to B1,n is that between the first camera 10-1 and each of the other cameras 10-2 to 10-n. A position P of the object 1 having a depth z is projected at points p1 to pn of the n pictures, x1 to xn being the distances in the X-direction of the points p1 to pn from the centers of the n pictures.
Here, n − 1 disparities d1,2 to d1,n between the n − 1 pairs of points, p1 and p2 to p1 and pn, are obtained as follows:

$$
\left.
\begin{aligned}
d_{1,2} &= x_2 - x_1 = B_{1,2}F/z \\
d_{1,3} &= x_3 - x_1 = B_{1,3}F/z \\
&\;\;\vdots \\
d_{1,n} &= x_n - x_1 = B_{1,n}F/z
\end{aligned}
\right\} \tag{2}
$$
Therefore, for the estimation of the depth z of a position P, correspondence between the n − 1 pairs of points represented by the above equations (2) can be checked, which makes it possible to improve the estimation precision by making use of long baselines while reducing false correspondences at the same time.
In the algorithm of the multi-baseline stereo method, a step similar to that of the two-camera algorithm described in connection with FIG. 6, namely retrieving the point corresponding to a point of interest p1 of the first picture, is performed for each of the other pictures taken by the second to the n-th cameras 10-2 to 10-n, and this procedure is repeated for each point of the first picture.
In the algorithm with two cameras, the retrieving range is defined concerning the disparity d. However, in the multi-baseline stereo-method of the prior paper, the retrieving range is defined with an inverse distance 1/z, namely a reciprocal of the depth z, and the corresponding point giving a minimum of an evaluation value is retrieved in each of the other pictures according to the equations (2) by varying the inverse distance from 1/zmax to 1/zmin.
As the evaluation value, the sum of the sums of squared-difference values between small regions of each pair of pictures is applied in the prior paper.
FIG. 9 is a schematic diagram illustrating the small regions 115-1 to 115-n of n pictures of a rectangular solid 3, taken with the first to the n-th cameras 10-1 to 10-n, which correspond to the upper-left front corner thereof. First, the sum of squared-difference values between the first small region 115-1 and each of the other small regions 115-2 to 115-n is calculated. Then, the value of the inverse distance 1/z which minimizes the total of the n − 1 sums thus calculated is retrieved between 1/zmax and 1/zmin. This procedure is performed for every point of the first picture taken by the first camera 10-1.
Thus, the multi-baseline stereo-method of the prior paper is performed.
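The procedure described above can be sketched for a single image row as follows. This is a simplified illustration under stated assumptions, not the implementation of the prior paper: the pictures are reduced to one-dimensional scanlines, the small regions are windows of 2*half+1 pixels, disparities are rounded to whole pixels, and all function names and test data are hypothetical.

```python
import random


def window_ssd(line_a, xa, line_b, xb, half):
    """Sum of squared differences between two windows of width 2*half+1."""
    total = 0.0
    for k in range(-half, half + 1):
        diff = line_a[xa + k] - line_b[xb + k]
        total += diff * diff
    return total


def sssd_inverse_depth(lines, baselines, focal, x1, inv_depths, half=2):
    """Return (1/z, cost) minimising the sum of the n-1 SSD values.

    lines[0] is the scanline of the first picture; lines[i] (i >= 1) belongs
    to the camera whose baseline to the first camera is baselines[i - 1].
    The caller must keep the window around x1 inside the first picture.
    """
    best_inv, best_cost = None, float("inf")
    for inv_z in inv_depths:
        cost = 0.0
        for line_i, b in zip(lines[1:], baselines):
            # Equations (2): x_i = x_1 + B_{1,i} * F / z, rounded to a pixel.
            xi = x1 + round(b * focal * inv_z)
            if xi - half < 0 or xi + half >= len(line_i):
                cost = float("inf")  # window leaves the picture
                break
            cost += window_ssd(lines[0], x1, line_i, xi, half)
        if cost < best_cost:
            best_inv, best_cost = inv_z, cost
    return best_inv, best_cost


# Synthetic data (assumption): a flat textured scene at inverse depth 0.02,
# so each picture is the first one shifted by its disparity.
rng = random.Random(0)
texture = [rng.random() for _ in range(200)]
focal, true_inv_z = 100.0, 0.02
baselines = [1.0, 2.0, 3.0]
lines = [texture]
for b in baselines:
    d = round(b * focal * true_inv_z)
    lines.append([texture[(x - d) % len(texture)] for x in range(len(texture))])

candidates = [k / 100.0 for k in range(1, 6)]   # 1/zmax = 0.01 ... 1/zmin = 0.05
best_inv, best_cost = sssd_inverse_depth(lines, baselines, focal, 50, candidates)
```

Searching over the inverse distance rather than over the disparity lets one candidate value index the corresponding point in every picture at once, whatever its baseline.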
However, when there is a large disparity, there may arise an extreme difference between a pair of small regions, such as the pair of the small regions 115-1 and 115-n of FIG. 9, although both representing the same corner. In such a case, the calculated value of the inverse distance 1/z may be shifted by the extreme difference, in the multi-baseline stereo-method of the prior paper.
In a Japanese patent application laid open as Provisional Publication No. 329481/'92, entitled “A Method of and an Apparatus for Obtaining 3-Dimensional Data”, there is disclosed a method of estimating 3-dimensional structure which can be applied even when there is a large disparity between a pair of stereo pictures.
In this prior art, the variation of a correlation value between two small regions is calculated while varying the disparity. When a clear singular point cannot be found in the correlation value, revision of the size and scope of the small regions or distortion of one of the small regions, for example, is performed according to the pattern of the variation curve of the correlation value.
In the examples heretofore described, a sum of squared-difference of pixel brightness or a correlation value between small regions is used as the evaluation value for discriminating corresponding points in the stereo pictures. Besides these values, there are known stereo-methods making use of the difference of edge lines or texture information as the evaluation value.
Problems in the prior art are as follows.
First, in methods which compare small regions of pictures taken by a pair of cameras, correspondence of the small regions may not be discriminated correctly because of the large difference of viewing angle when the baseline of the pair of cameras is large. In the method disclosed in the Japanese patent application Provisional Publication No. 329481/'92, revision of the size and scope of the small regions or distortion of one of the small regions is performed for dealing with this problem. However, the revision or the distortion requires a somewhat ad hoc technique, and it is very difficult to establish widely applicable rules for the revision or distortion. Therefore, it can be said that the conventional methods for discriminating correspondence by referring to the small regions impose a limit on the baseline.
Second, the difference of brightness caused by the variation of reflectivity according to the difference of viewing angle is not considered in the prior art.
When pictures of an object are taken by cameras from different viewing positions, the brightness of a point of the object generally differs in each picture owing to the difference of viewing angle, as illustrated in FIG. 10. In FIG. 10, the brightness of a point P of an object 1 illuminated by a light 7 becomes highest in the direction symmetric to the light 7 with respect to the normal line at the point P, and varies according to the viewing direction, that is, the angle to the viewing position. Therefore, when the corresponding points are discriminated simply by evaluating the sum of squared-difference of pixel brightness between small regions, the evaluation value is easily affected by the above variation of reflectivity and does not become sufficiently small even at the corresponding point, resulting in an increase of the estimation errors.
The effect of the reflectivity variation may be reduced by applying the correlation value, or the difference of edge lines or texture information as the evaluation value. However, these values should be calculated from the small regions, and so, are not free from the first problem which limits the baseline length, and accordingly, the estimation precision.
Therefore, a primary object of the present invention is to provide a method of and an apparatus for 3-dimensional structure estimation wherein a high estimation precision and a high estimation reliability are both realized at the same time.
In order to achieve the object, a method of 3-dimensional structure estimation of the invention for estimating a 3-dimensional structure of an object from image data of a plurality of pictures of the object each taken from each viewing position ranged on a straight line by a camera with an optical axis parallel to a direction perpendicular to the straight line has a step of performing, for each pixel of image data of a first of the plurality of pictures, steps of:
extracting corresponding small regions, having a size of at least one pixel, each from the image data of each of the plurality of pictures, a position of each of the corresponding small regions in corresponding each of the plurality of pictures being defined by a focal distance of the camera, a distance between a viewing position wherefrom the corresponding each of the plurality of pictures is taken and a viewing position wherefrom the first of the plurality of pictures is taken, a position of a concerning pixel of image data of the first of the plurality of pictures, and a variable representing a depth of a point of the object corresponding to the concerning pixel;
calculating a neighboring correspondence value for each of the corresponding small regions, the neighboring correspondence value representing correspondence among the corresponding small regions of neighboring certain of the plurality of pictures, viewing positions wherefrom the neighboring certain are taken being ranged within a predetermined distance from a viewing position wherefrom a picture including said each of the corresponding small regions is taken;
obtaining a sum of the neighboring correspondence value of all of the corresponding small regions; and
selecting an estimation value in a predetermined range of the variable which gives a singular value of the sum of the neighboring correspondence value, and outputting the estimation value as an estimation of the depth of the point corresponding to the concerning pixel.
Therefore, the first problem of the prior art described above, that the correspondence of the small regions may not be discriminated correctly because of their extreme difference due to a large difference of viewing angle, can be eliminated in the invention, making it possible to obtain still higher estimation precision by enlarging the baseline length.
Further, the neighboring correspondence value is calculated so as to represent a relative difference among the pixel values concerned, such as, for example, a variance of the pixel values in the corresponding small regions of the neighboring certain of the plurality of pictures.
Therefore, the second problem of the prior art, that the correspondence estimation is easily affected by the variation of reflectivity owing to the difference of viewing angles, can also be greatly reduced in the invention, resulting in still higher estimation reliability.
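The steps summarized above can be sketched for a single image row as follows. This is an illustration under several assumptions that the source does not fix: the neighboring correspondence value is taken to be the population variance of the pixel values among neighboring pictures, the "singular value" of the sum is taken to be its minimum, the depth variable is the inverse distance 1/z, disparities are rounded to whole pixels, and the brightness of each synthetic picture is scaled by a gain that drifts with the viewing position. All names and data are hypothetical.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def neighbour_cost(lines, positions, focal, x1, inv_z, radius, half):
    """Sum, over all pictures, of the variance among neighbouring pictures."""
    # Column of the corresponding small region in every picture for this 1/z.
    cols = [x1 + round((p - positions[0]) * focal * inv_z) for p in positions]
    total = 0.0
    for p_i in positions:
        # Pictures whose viewing position lies within `radius` of picture i.
        near = [j for j, p_j in enumerate(positions) if abs(p_j - p_i) <= radius]
        for k in range(-half, half + 1):
            total += variance([lines[j][cols[j] + k] for j in near])
    return total


def estimate_inverse_depth(lines, positions, focal, x1, inv_depths, radius, half=2):
    """Return (1/z, cost) giving the minimum summed neighbouring variance.

    The caller must keep every window inside its picture.
    """
    costs = [neighbour_cost(lines, positions, focal, x1, inv_z, radius, half)
             for inv_z in inv_depths]
    best = min(range(len(costs)), key=costs.__getitem__)
    return inv_depths[best], costs[best]


# Synthetic data (assumption): flat textured scene at inverse depth 0.02,
# with a brightness gain that grows by 5 % per camera.
rng = random.Random(0)
texture = [rng.random() for _ in range(200)]
positions = [0.0, 1.0, 2.0, 3.0]
focal, true_inv_z = 100.0, 0.02
lines = []
for i, p in enumerate(positions):
    d = round(p * focal * true_inv_z)
    gain = 1.0 + 0.05 * i
    lines.append([gain * texture[(x - d) % len(texture)] for x in range(len(texture))])

candidates = [k / 100.0 for k in range(1, 6)]
best_inv, best_cost = estimate_inverse_depth(
    lines, positions, focal, 50, candidates, radius=1.0)
```

In this sketch each picture is compared only with pictures taken from nearby viewing positions, so the small regions compared differ little in viewing angle and brightness, while the sum over all pictures still spans the full baseline between the first and the last viewing positions.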