Computer vision-based three-dimensional reconstruction refers to capturing images with a digital camera or a video camera and constructing an algorithm to estimate three-dimensional information of the captured scene or object, so as to represent the three-dimensional objective world. Its applications include robot navigation, autonomous or assisted driving of motor vehicles, virtual reality, digital media creation, computer animation, image-based rendering, cultural heritage conservation and the like.
Currently, Structure from Motion (SFM) is a commonly used three-dimensional reconstruction method, which estimates three-dimensional information of a scene or object from two or more images or from video. Existing technical means for realizing SFM three-dimensional reconstruction are feature point-based, sparse and two-step. That is, existing SFM three-dimensional reconstruction is accomplished in two steps: first, feature points having invariance to scale, affine transformation and the like, such as Harris feature points, Kanade-Lucas-Tomasi (KLT) feature points and Lowe's scale-invariant feature transform (SIFT) points, are detected in the images and matched; then, the three-dimensional information of the detected feature points and the pose (including a location and an angle) of the camera are estimated.
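The two-step structure described above can be sketched as follows. This is only an illustrative sketch, not any particular SFM system. It assumes that feature descriptors are already available as plain lists of numbers, and it uses a nearest-neighbor match with a distance ratio test for the first step. For the second step it assumes a rectified two-view setup with a known focal length and baseline, so that depth follows directly from disparity; a real SFM second step would also have to estimate the camera pose. All function names, parameter names and numeric values are hypothetical.

```python
import math


def match_features(desc_a, desc_b, ratio=0.8):
    """Step 1 (sketch): nearest-neighbor descriptor matching with a ratio test.

    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Distance from this descriptor to every descriptor of the other image.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches


def triangulate_rectified(pt_left, pt_right, focal_px, baseline_m, cx, cy):
    """Step 2 (sketch): 3D point from one match in a rectified image pair.

    Assumption: the camera geometry is known. A real SFM second step
    would estimate the camera pose as well.
    """
    disparity = pt_left[0] - pt_right[0]
    if disparity <= 0:
        raise ValueError("non-positive disparity")
    z = focal_px * baseline_m / disparity
    x = (pt_left[0] - cx) * z / focal_px
    y = (pt_left[1] - cy) * z / focal_px
    return (x, y, z)


# Hypothetical toy data: two feature points per image with 2-D descriptors.
kp_left = [(400.0, 240.0), (100.0, 50.0)]
kp_right = [(380.0, 240.0), (95.0, 50.0)]
desc_left = [[1.0, 0.0], [0.0, 1.0]]
desc_right = [[0.9, 0.1], [0.1, 0.9]]

matches = match_features(desc_left, desc_right)
points = [
    triangulate_rectified(kp_left[i], kp_right[j],
                          focal_px=500.0, baseline_m=0.1, cx=320.0, cy=240.0)
    for i, j in matches
]
```

Only the matched feature points receive three-dimensional coordinates, and the second step takes the output of the first step as given.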
Because an existing SFM three-dimensional reconstruction algorithm is accomplished in two steps, a truly optimal result cannot be achieved. The two-dimensional coordinates of the feature points detected in the images contain errors, so a globally optimal result cannot be obtained even if the three-dimensional information is subsequently reconstructed by an optimization algorithm. In addition, the matching accuracy of the feature points is generally low, which inevitably leads to three-dimensional reconstruction of low accuracy.
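The way an error in the two-dimensional coordinates carries over into the three-dimensional estimate can be illustrated with a small calculation. The sketch below assumes a rectified two-view geometry, in which depth equals focal length times baseline divided by disparity. The focal length, baseline, disparity and error magnitude are illustrative values chosen for the example and do not come from any measured system.

```python
def depth_from_disparity(disparity_px, focal_px=500.0, baseline_m=0.1):
    """Depth of a point in a rectified image pair (assumed geometry)."""
    return focal_px * baseline_m / disparity_px


true_disparity = 5.0   # pixels, hypothetical
pixel_error = 0.5      # localization error of the feature point, hypothetical

z_true = depth_from_disparity(true_disparity)
z_noisy = depth_from_disparity(true_disparity - pixel_error)

# Relative depth error caused by the half-pixel localization error alone.
relative_error = abs(z_noisy - z_true) / z_true
```

With these assumed values, a half-pixel error in the feature location changes the estimated depth by about 11 percent, and a later optimization of the three-dimensional points alone cannot remove an error that is already present in its input.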
The result of the three-dimensional reconstruction is also sparse: because three-dimensional information is estimated only for the extracted feature points, dense three-dimensional reconstruction, that is, estimation of the three-dimensional information of all pixel points, cannot be realized. For a 480×640 image of about 0.3 megapixels, only 200 to 300 feature points, or even fewer, can generally be detected while ensuring a certain correct matching ratio. These feature points are very sparse relative to the 0.3 megapixels of the image, and the three-dimensional information of most pixels is not directly estimated. Further, although the three-dimensional information of other points may subsequently be estimated by technical means such as an epipolar constraint estimated from the feature points, so as to realize dense or quasi-dense reconstruction, the three-dimensional estimation of those other points is degraded because the estimated three-dimensional information of the feature points and the estimated pose of the camera both contain certain errors.
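The degree of sparsity stated above can be checked with simple arithmetic. The sketch below uses only the figures given in the text, namely the 480×640 image size and the 200 to 300 feature points; the variable names are illustrative.

```python
width, height = 640, 480
total_pixels = width * height  # 307200, i.e. about 0.3 megapixels

# Percentage of pixels that receive a directly estimated 3D position.
coverage_percent = {
    n_features: 100.0 * n_features / total_pixels
    for n_features in (200, 300)
}

for n_features, percent in coverage_percent.items():
    print(f"{n_features} feature points cover {percent:.3f}% of the pixels")
```

Even with 300 matched feature points, fewer than 0.1 percent of the pixels have three-dimensional information estimated directly.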