1. Field of the Invention
This invention relates to a three-dimensional structure estimation apparatus which measures the depth distance of an object on an image and outputs a depth image. The apparatus is applicable to fields of computer vision in which the depth distance to an object on an image is estimated, including such fields as object surveillance, automatic operation and robot automation.
2. Description of the Related Art
In the field of computer vision, the stereo method is popularly utilized as a method of obtaining three-dimensional information from two-dimensional information. The stereo method is a useful technique for obtaining three-dimensional information from a pair of two-dimensional images. One such technique is disclosed, for example, in M. Okutomi and T. Kanade, “A multiple-baseline stereo”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 4, April 1993, pp. 353-363 (reference document 1). The technique of reference document 1 is devised so as to allow processing in a comparatively short calculation time compared with other stereo methods.
A representative construction of a three-dimensional structure estimation apparatus which employs a conventional stereo method is shown in FIG. 6.
Referring to FIG. 6, a pair of cameras 600 and 601 having the same visual field are placed in a spaced relationship from each other on one baseline 602. The cameras 600 and 601 have optical axes 603 and 604, respectively, which intersect with each other at one point as indicated by the thick solid lines in FIG. 6.
Meanwhile, a visual field range 605 of the camera 600 is indicated by broken lines. The angular aperture defined by the broken lines is defined as the visual field of the camera 600. Similarly, the angular aperture of a visual field range 606 of the camera 601 is the visual field of the camera 601. The three-dimensional structure estimation apparatus is based on the principle of triangulation: the distance to a point on the surface of an object is determined from the directions in which the point is observed from the positions of the paired cameras 600 and 601, within the region defined by the visual field ranges 605 and 606 of the cameras 600 and 601 positioned at the stereo positions.
Investigations into the stereo method continue at present, and another method is disclosed, for example, in A. Luo and H. Burkhardt, “An intensity-based cooperative bidirectional stereo matching with simultaneous detection of discontinuities and occlusions”, International Journal of Computer Vision, Vol. 15, 1995, pp. 171-188 (reference document 2).
In a basic stereo method, the coordinate positions on the images of the different cameras that correspond to the same location, such as a single point in three-dimensional space, are searched for based on suitable degrees of coincidence of features and pattern distributions of the images. It is then measured by what amounts the locations on the images corresponding to the same point in three-dimensional space are displaced from each other, and the depth distance of the point is calculated from the measured amounts and from the positions and directions of the cameras. The amount of displacement between the corresponding positions on the images is defined as the disparity.
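The calculation of a depth distance from a disparity can be illustrated, as a minimal sketch only, for the simplest rectified geometry (parallel optical axes and a horizontal baseline — an assumption for illustration, not a limitation of the apparatus described above); the function name and parameter values are hypothetical:

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth of a scene point from its disparity, assuming a rectified
    stereo pair (parallel optical axes, horizontal baseline)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    # Triangulation for the parallel-camera case: Z = f * B / d
    return focal_length_px * baseline_m / disparity_px

# A point displaced by 40 px between the images of a pair with a
# 0.12 m baseline and an 800 px focal length lies at:
print(depth_from_disparity(40.0, 800.0, 0.12))  # 2.4 (meters)
```

Note that the depth is inversely proportional to the disparity, so distant points, which produce small disparities, are the most sensitive to matching errors.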
Various conventional stereo methods are characterized by the quantity (brightness, edges, texture and so forth) used in searching for the locations corresponding to the same point in three-dimensional space, by the handling of a region which is behind an object and cannot be seen from one of the paired cameras, by the handling of an image in which very similar patterns appear periodically, and so forth.
Handling of a region which is behind an object and cannot be seen from one of the paired cameras is disclosed, for example, in D. Geiger, B. Ladendorf and A. Yuille, “Occlusions and binocular stereo”, International Journal of Computer Vision, Vol. 14, 1995, pp. 211-226 (reference document 3).
Meanwhile, hardware constructions used for stereo methods do not have many variations.
A first variation is to increase the number of cameras used from two, which is the standard number, to three or more. This technique is disclosed, for example, in S. B. Kang, J. Webb, C. Zitnick and T. Kanade, “An active multibaseline stereo system with real-time image acquisition”, Image Understanding Workshop, 1994, pp. 1325-1335 (reference document 4).
It is to be noted that a technique which uses such a construction as just described but proposes a different algorithm is disclosed, for example, in I. J. Cox, “A maximum likelihood n-camera stereo algorithm”, International Conference on Pattern Recognition, 1994, pp. 437-443 (reference document 5).
A second variation is to multiplex a plurality of images which are different in time or space using a plurality of reflecting mirrors, so as to allow application of a stereo method with only a single camera. This technique is disclosed, for example, in W. Teoh and X. D. Zhang, “An inexpensive stereoscopic vision system for robots”, Proc. Int. Conf. Robotics, 1984, pp. 186-189 (reference document 6).
Further, a technique wherein images from two positions are multiplexed and introduced into a single camera by reflecting mirrors is disclosed, for example, in A. Goshtasby and W. A. Gruver, “Design of a single-lens stereo camera system”, Pattern Recognition, Vol. 26, No. 6, 1993, pp. 923-937 (reference document 7).
A third variation is to utilize a camera on which a fisheye lens is mounted in order to construct a three-dimensional structure estimation apparatus having a wide visual field. This technique is disclosed, for example, in S. Shah and J. K. Aggarwal, “Depth estimation using stereo fish-eye lenses”, Proc. IEEE International Conference, 1994, pp. 740-744 (reference document 8).
In a stereo method, it is necessary that each point on the surface of an object appears similar on the plurality of images. Therefore, in conventional systems, two or more cameras of the same type, on which the same lenses are mounted, are arranged comparatively near to each other to prevent their output images from becoming much different from each other.
Consequently, the resultant images have an equal resolution. Further, since the directions of the lines of sight and the positions of the cameras are not much different from each other, from the point of view of processing an image picked up by a single camera, the difference between the images is comparatively small and the information included in the images is highly redundant. Accordingly, since an additionally provided camera contributes little information beyond that used by the stereo method, it can be considered that much of the information provided by the camera is wasted.
Of the various conventional three-dimensional structure estimation apparatus described above, the three-dimensional structure estimation apparatus shown in FIG. 6 has a problem in that, where each of the stereo cameras 600 and 601 which form a stereo pair has only a narrow visual field, it is difficult to measure an imaging object over a long depth distance range. The reason is that an imaging target can be imaged by the two cameras only in a common visual field region 607 in which the visual field ranges 605 and 606 of the cameras 600 and 601 overlap with each other and which is a comparatively small space (the space defined by thick broken lines in FIG. 6).
The problem just described is discussed in D. H. Ballard and C. M. Brown, “Principles of animate vision”, CVGIP: Image Understanding, Vol. 56, No. 1, July 1992, pp. 3-21 (reference document 9).
Further, the common visual field region 607 in which the visual field ranges 605 and 606 of the cameras 600 and 601 overlap with each other looks as if it covers a large distance range, from a point at a shortest depth distance 608 from the baseline 602, where the visual field ranges 605 and 606 first intersect, to another point at a longest depth distance 610, where they intersect farthest, as seen in FIG. 6. However, since an imaging target to be measured usually has a certain size, in order to estimate a three-dimensional structure over as wide a range as possible by a single imaging operation, it is most efficient for the object to be present at or around the point at the maximum width distance 609.
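The depths of these intersection points can be sketched for the symmetric two-dimensional case (an illustrative assumption, not part of the disclosed apparatus: cameras at x = ±baseline/2, optical axes tilted inward by a common vergence angle; the helper function below is hypothetical):

```python
import math

def verged_fov_depths(baseline_m, vergence_deg, half_aperture_deg):
    """Depths along the centerline at which the visual field cones of a
    symmetric, verged stereo pair intersect (2-D sketch).

    Returns (nearest depth, axis-crossing depth, farthest depth); the
    farthest depth is infinite when the outer field edges diverge."""
    v = math.radians(vergence_deg)
    a = math.radians(half_aperture_deg)
    half_b = baseline_m / 2.0
    near = half_b / math.tan(v + a)  # inner field edges cross here
    mid = half_b / math.tan(v)       # optical axes cross here
    far = half_b / math.tan(v - a) if v > a else math.inf
    return near, mid, far

# Baseline 0.3 m, 10-degree vergence, 5-degree half-aperture:
print(verged_fov_depths(0.3, 10.0, 5.0))
# nearest ~ 0.56 m, axis crossing ~ 0.85 m, farthest ~ 1.71 m
```

Increasing the vergence angle pulls all three depths, including the axis-crossing point corresponding to the maximum width distance, closer to the baseline, which is the adjustment discussed below.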
A possible solution to the problem just described is a stereo system wherein the relative angle between the cameras is adjusted to adjust the maximum width distance 609. Such variation of the relative angle can be realized by mechanically controlling the cameras, for example, using a pair of electrically controlled motors provided at base portions of the cameras. This, however, gives rise to the further problems that the three-dimensional structure estimation apparatus is mechanically complicated and that an error arises in the position of each camera.
Since camera position information is utilized in the calculation of the three-dimensional position of an object, if an error is included in a camera position, the accuracy of the measurement is degraded by that error.
On the other hand, where the paired stereo cameras 600 and 601 individually have wide visual fields, the three-dimensional structure estimation apparatus has a wide measurement range, but since the area of the object surface covered by each pixel on an image is large, the resolution is low and the accuracy of measurement of the depth distance is sacrificed.
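The loss of depth accuracy with a widened visual field can be quantified, for the rectified parallel-camera case, by the standard first-order relationship between disparity error and depth error (an illustrative sketch under that assumption, not a formula taken from the apparatus above):

```python
def depth_error(depth_m, focal_length_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty caused by a disparity (matching)
    error for a rectified parallel-camera pair: dZ ~ Z^2 * dd / (f * B)."""
    return depth_m ** 2 * disparity_error_px / (focal_length_px * baseline_m)

# Halving the focal length in pixels (widening the visual field on the
# same image sensor) doubles the depth error at every distance:
print(depth_error(2.4, 800.0, 0.12))  # ~ 0.06 m per pixel of disparity error
print(depth_error(2.4, 400.0, 0.12))  # ~ 0.12 m
```

This makes the tradeoff concrete: widening the field of view at a fixed pixel count lowers the focal length in pixel units, and the depth error grows in inverse proportion.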
Thus, a wide visual field on the one hand and a high resolution or a high degree of accuracy in measurement on the other hand stand in a tradeoff relationship, and the conventional apparatus do not satisfy both requirements.