Conventionally, a computer's rendering engines may be configured to provide automatic feature matching or feature extraction that recovers camera calibrations and a sparse structure of a scene from an unordered collection of two-dimensional images. The conventional rendering engine may use the camera calibrations, sparse scene structure, and the two-dimensional images to triangulate the location of each pixel in a three-dimensional space. The pixels are rendered in the triangulated locations to form a three-dimensional scene.
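By way of illustration only, the triangulation step described above may be sketched as follows. The sketch assumes that each calibrated camera has already been reduced to a camera center and a viewing-ray direction for the matched pixel, and it uses midpoint triangulation, which is one common technique and is not necessarily the one used by any particular conventional rendering engine. The function name `triangulate_midpoint` and its parameters are hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two viewing rays (illustrative sketch).

    Each ray is c + t * d, where c is a camera center and d is the
    direction through the matched pixel. The result is the midpoint of
    the shortest segment joining the two rays.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w = [p - q for p, q in zip(c1, c2)]      # vector from camera 2 to camera 1
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)

    denom = a * c - b * b                    # zero when the rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be triangulated")

    t1 = (b * e - c * d) / denom             # parameter of closest point on ray 1
    t2 = (a * e - b * d) / denom             # parameter of closest point on ray 2

    p1 = [ci + t1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + t2 * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


# Two cameras one unit apart, both observing the same scene point.
point = triangulate_midpoint(
    [0.0, 0.0, 0.0], [0.5, 0.0, 2.0],
    [1.0, 0.0, 0.0], [-0.5, 0.0, 2.0],
)
```

When the pixel match is unreliable, the two rays do not intersect and the midpoint is only an approximation, which is one way that matching errors may propagate into the rendered three-dimensional scene.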
However, the quality and detail of the generated three-dimensional scenes often suffer from various drawbacks. For instance, the conventional rendering engines may render textureless or non-Lambertian surfaces captured in the two-dimensional images as holes. The holes are covered by interpolating the depth of neighboring pixels. But the conventional rendering engine's interpolation may erroneously reproduce flat surfaces with straight lines as bumpy surfaces in the three-dimensional scene. The conventional rendering engine may also erroneously introduce jaggies in the three-dimensional scene because of occlusions, unreliable matching of the non-Lambertian surfaces, and similar factors.
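The hole-covering interpolation described above may be illustrated, under simplifying assumptions, by the following sketch. It assumes a single scanline of depth values in which holes are marked as `None`, and it fills each hole by linear interpolation between the nearest valid neighbors. The function name `fill_depth_holes` and the data layout are hypothetical and are not taken from any particular conventional rendering engine.

```python
def fill_depth_holes(row):
    """Fill None entries in a depth scanline by linear interpolation.

    Holes at either end of the row copy the nearest valid depth.
    """
    filled = list(row)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                              # first index of this hole
        while i < n and filled[i] is None:
            i += 1                             # i stops at the first valid pixel after the hole
        left = filled[start - 1] if start > 0 else None
        right = filled[i] if i < n else None
        if left is None and right is None:
            raise ValueError("scanline contains no valid depth")
        for k in range(start, i):
            if left is None:
                filled[k] = right              # hole touches the left border
            elif right is None:
                filled[k] = left               # hole touches the right border
            else:
                t = (k - start + 1) / (i - start + 1)
                filled[k] = left + t * (right - left)
    return filled


# A hole between a near surface (depth 2) and a far surface (depth 5).
scanline = fill_depth_holes([2.0, None, None, 5.0])
```

In this example the hole is filled with intermediate depths that ramp from the near surface to the far surface. Because such interpolation has no knowledge of the underlying planes or their boundaries, it may place pixels at depths that belong to neither surface, which is consistent with the distortions noted above.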
The quality and detail of the three-dimensional scenes generated by the conventional rendering engines degrade significantly for architectural scenes, urban scenes, or scenes with man-made objects having plentiful planar surfaces. Moreover, a three-dimensional scene reconstructed from a sparse collection of two-dimensional images is not navigable in a photorealistic manner, because the conventional computer vision algorithms executed by the conventional rendering engines rely on assumptions that are not suited to scenes containing man-made surfaces.