Automatic large-scale three-dimensional (3D) reconstruction of urban environments using data obtained from ground reconnaissance video or active sensors like LIDAR (Light Detection And Ranging, an optical remote sensing technology that measures properties of scattered light to find range and/or other information of a distant target) is a very active research topic at the intersection of computer vision and computer graphics. Such techniques are discussed, for example, in N. Cornelis et al., 3d Urban Scene Modeling Integrating Recognition And Reconstruction, International Journal of Computer Vision (2008); C. Früh et al., Data Processing Algorithms For Generating Textured 3d Building Facade Meshes From Laser Scans And Camera Images, International Journal of Computer Vision (2005); M. Pollefeys et al., Detailed Real-Time Urban 3d Reconstruction From Video, Intl. Journal of Computer Vision (2008) (“Pollefeys I”); L. Zebedin et al., Fusion Of Feature-And Area-Based Information For Urban Buildings Modeling From Aerial Imagery, European Conference on Computer Vision (2008); and J. Xiao et al., Image-Based Street-Side City Modeling, SIGGRAPH Asia (2009). The applications of these techniques are very broad, ranging from augmenting maps, as in Google Earth or Microsoft Bing Maps, to civil and military planning and entertainment.
One important aspect of most current research efforts is obtaining the computational efficiency that can enable the modeling of wide area urban environments, such as entire cities. The data sets resulting from data collection of such areas may be massive. Even a small town may require millions of frames of video just to capture the major streets. The reconstruction algorithms employed should be fast in order to finish processing in a reasonable amount of time. Additionally, the generated models should preferably be compact so that they can be efficiently stored, transmitted, and/or rendered.
There are typically two parts to a system that performs three dimensional reconstruction from video. The first part performs camera motion estimation from the video frames. This is commonly called Structure from Motion (SfM) or sparse reconstruction, because as a byproduct it determines the three dimensional positions of salient feature points of the environment from the video frames. The second part performs the so-called dense reconstruction, which obtains a dense scene geometry using the known camera positions and the video frames. As will be appreciated, “dense” three dimensional reconstruction attempts to reconstruct a three dimensional geometry using many points of an image, and is contrasted with sparse reconstruction, which only uses a few characteristic features of the image for three dimensional reconstruction. In some approaches, such as the approach described in Pollefeys I, the robustness of sparse estimation can be improved and its drift reduced through additional sensors, such as inertial navigation systems (INS) and global positioning system (GPS) sensors, which may also remove the ambiguity in scale inherent in SfM and provide an absolute coordinate system. The dense reconstruction subproblem involves performing stereo matching, depth map fusion, or other means of generating a three dimensional model, such as LIDAR, as discussed in C. Früh et al., Data Processing Algorithms For Generating Textured 3d Building Facade Meshes From Laser Scans And Camera Images, International Journal of Computer Vision (2005).
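The scale ambiguity mentioned above can be illustrated with a short sketch. It is not taken from any of the cited systems; it assumes an idealized pinhole camera with no rotation, and the names `project` and `scale` as well as the focal length and coordinates are hypothetical values chosen for illustration. Scaling the scene and the camera positions by the same factor leaves every image projection unchanged, which is why video alone cannot determine absolute scale and why sensors such as GPS or INS may be used to fix it.

```python
import math

def project(point, camera_center, focal_px=1000.0):
    """Pinhole projection of a 3D point for a camera with identity rotation.

    Assumption: the camera looks along +z and focal_px is an illustrative value.
    """
    dx = point[0] - camera_center[0]
    dy = point[1] - camera_center[1]
    dz = point[2] - camera_center[2]
    return (focal_px * dx / dz, focal_px * dy / dz)

def scale(vector, factor):
    """Multiply every coordinate of a 3D vector by the same factor."""
    return tuple(factor * c for c in vector)

point = (1.0, 2.0, 10.0)                       # one scene point
cameras = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]   # two camera centers
factor = 3.7                                   # arbitrary global scale

for center in cameras:
    u0, v0 = project(point, center)
    u1, v1 = project(scale(point, factor), scale(center, factor))
    # The projections agree, so the images cannot reveal the scale factor.
    assert math.isclose(u0, u1) and math.isclose(v0, v1)
```

A single externally measured distance, for example the distance between two camera positions reported by GPS, would be enough to fix the scale factor in this simplified setting.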
Three dimensional scanning technologies include contact, ground-based surveying, ground-based LIDAR, aerial LIDAR and radar, aerial surveying, aerial stereo, ground reconnaissance video surveying, and others.
Three-dimensional reconstruction systems may include manual systems and user-assisted systems, such as the commercial products PhotoModeler and Pictometry, as well as fully automatic research systems such as the hand-held modeling techniques described in Pollefeys I.
Three-dimensional reconstruction systems are typically targeted toward city modeling, and include ground-based LIDAR as discussed in C. Früh et al., Data Processing Algorithms For Generating Textured 3d Building Facade Meshes From Laser Scans And Camera Images, International Journal of Computer Vision (2005), ground and aerial LIDAR as discussed in C. Früh et al., An Automated Method For Large-Scale, Ground-Based City Model Acquisition, International Journal of Computer Vision, Vol. 60, No. 1, (October 2004), pp. 5-24, and ground-based stereo techniques, such as Urbanscape discussed in Pollefeys I and WikiVienna as described in A. Irschara et al., Towards Wiki-Based Dense City Modeling, Workshop on Virtual Representations and Modeling of Large-scale environments (2007).
Furthermore, range-image fusion techniques, such as Poisson blending and visibility-based techniques, have been described, for example in P. Merrell et al., Real-Time Visibility-Based Fusion Of Depth Maps, ICCV (2007).
In particular, a number of three dimensional reconstruction methods that address the dense reconstruction subproblem have been proposed. Multi-view stereo is a passive sensing technique that uses multiple photographs or video frames with known camera positions to measure depth. The depth of each pixel in the image is recovered and stored in a depth map. A taxonomy of stereo algorithms is given by D. Scharstein et al., A Taxonomy And Evaluation Of Dense Two-Frame Stereo Correspondence Algorithms, Intl. Journal of Computer Vision (2002). An overview of multi-view stereo for object centered scenes is given in S. Seitz et al., A Comparison And Evaluation Of Multi-View Stereo Reconstruction Algorithms, Computer Vision and Pattern Recognition (2006). Three dimensional reconstruction from video has been addressed in M. Pollefeys et al., Visual Modeling With A Handheld Camera, Intl. Journal of Computer Vision (2004), which used uncalibrated hand-held video as input and obtained reconstructions of hundreds of frames, but could not handle wide-area scenes. The system presented in Pollefeys I was designed to process wide-area scenes. The resulting three dimensional models are generally three dimensional surfaces represented as texture-mapped polygonal meshes. This representation may result in problems such as holes in homogeneous areas and windows, and slightly inaccurate geometry on facades that deviates from the true planar geometry, which may cause visually disturbing artifacts.
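As a minimal illustration of how a depth map relates to image measurements, the sketch below covers only the simplest special case: two rectified views separated by a known baseline, where depth follows from the standard relation depth = focal length × baseline / disparity. This is not the multi-view method of any cited work; the function name `disparity_to_depth` and the numeric values are assumptions made for illustration.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) into a depth map (meters).

    Assumes a rectified two-view setup. Pixels with no valid match
    (disparity <= 0) are stored as None, which corresponds to a hole
    in the depth map.
    """
    depth_map = []
    for row in disparity_px:
        depth_row = []
        for d in row:
            depth_row.append(focal_px * baseline_m / d if d > 0 else None)
        depth_map.append(depth_row)
    return depth_map

# Hypothetical 2x2 disparity map; 0.0 marks a pixel without a match.
disparity = [[10.0, 0.0], [20.0, 5.0]]
depth = disparity_to_depth(disparity, focal_px=1000.0, baseline_m=0.5)
print(depth)  # [[50.0, None], [25.0, 100.0]]
```

Pixels without a reliable match, which are common in homogeneous areas and on windows, remain empty in this sketch; that is one way the holes described above can arise.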
There are several recent approaches deploying simplified geometries, as described, for example, in N. Cornelis et al., Fast Compact City Modeling For Navigation Pre-Visualization, Computer Vision and Pattern Recognition (2006) (“Cornelis I”), N. Cornelis et al., 3d Urban Scene Modeling Integrating Recognition And Reconstruction, International Journal of Computer Vision (2008) (“Cornelis II”), Y. Furukawa et al., Manhattan-World Stereo, Proceedings IEEE CVPR (2009), and S. N. Sinha et al., Piecewise Planar Stereo For Image-Based Rendering, Proceedings IEEE ICCV (2009). Cornelis I uses a simple U-shaped ruled surface model to efficiently produce compact street models. To enhance the appearance of cars and pedestrians not modeled through the ruled surface model, Cornelis II extends the approach to detect and replace those items through explicit template models. Furukawa et al. use a very specific Manhattan-world model, where all planes must be orthogonal, and Sinha et al. use a general piecewise planar model. Non-planar surfaces are either reconstructed with a staircase appearance or are flattened to nearby planes.
Other depth map fusion techniques, such as those discussed in C. Zach et al., A Globally Optimal Algorithm For Robust Tv-L1 Range Image Integration, International Conference on Computer Vision (March 2007), use an occupancy grid for depth map fusion but require the entire occupancy grid to be held in memory, which limits the achievable model resolution.
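The memory limitation can be made concrete with a rough calculation. The sketch below is not drawn from Zach et al.; it assumes a dense, uniformly sampled grid, and the function name `dense_grid_bytes`, the scene extent, the resolution, and the four bytes per voxel are all hypothetical values chosen for illustration.

```python
def dense_grid_bytes(extent_m, voxels_per_meter, bytes_per_voxel=4):
    """Memory needed for a dense occupancy grid held entirely in memory.

    extent_m: (x, y, z) scene size in whole meters (illustrative values).
    voxels_per_meter: grid resolution along each axis.
    """
    nx, ny, nz = (side * voxels_per_meter for side in extent_m)
    return nx * ny * nz * bytes_per_voxel

# One city block, 100 m x 100 m x 20 m, at 5 cm voxels (20 voxels per meter).
block = dense_grid_bytes((100, 100, 20), voxels_per_meter=20)
print(block)  # 6400000000 bytes, about 6.4 GB

# Halving the voxel size multiplies the memory by eight.
finer = dense_grid_bytes((100, 100, 20), voxels_per_meter=40)
print(finer // block)  # 8
```

Because memory grows with the cube of the resolution, a grid that must reside in memory forces a trade-off between scene extent and voxel size.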
Several methods have been aimed directly at modeling buildings from street-side imagery. J. Xiao et al., Image-Based Street-Side City Modeling, SIGGRAPH Asia (2009) present an automatic method that learns the appearance of buildings, segments them in the images, and reconstructs them as flat rectilinear surfaces. The modeling framework of Xiao et al. only supports automatic reconstruction and does not attempt to geometrically represent other scene parts, such as vegetation, that may typically be present in many urban scenes.