In computer vision, a collection of images may be processed to simultaneously recover the camera poses and the structure of a scene, for example to obtain three-dimensional (3D) information about the scene for various applications. The problem of estimating scene structure (3D geometry) and camera motion (camera pose) from multiple images of a scene is referred to as structure from motion (SfM).
Most vision-based structure from motion systems are sequential: they start with a small reconstruction of a scene from two cameras, then incrementally add new cameras one at a time by pose estimation and new 3D points by triangulation. This is followed by multiple rounds of intermediate bundle adjustment (robust non-linear minimization of the measurement/re-projection errors) and removal of outliers (erroneous measurements) to minimize error propagation as the reconstruction grows.
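As an illustration of the quantity that bundle adjustment minimizes and that outlier removal thresholds, the following minimal sketch computes re-projection errors for a simple pinhole camera. It assumes an intrinsics tuple (fx, fy, cx, cy), a rotation R and translation t mapping world to camera coordinates, and a pixel threshold; these conventions and all function names are hypothetical choices for the example, not taken from any particular system.

```python
import math

def project(K, R, t, X):
    """Project 3D world point X to pixel coordinates (pinhole model, no distortion)."""
    # Camera-frame coordinates: Xc = R * X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    fx, fy, cx, cy = K  # assumed intrinsics layout
    # Perspective division, then scale and shift into pixels.
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)

def reprojection_error(K, R, t, X, observed):
    """Pixel distance between the projected point and its measured location."""
    u, v = project(K, R, t, X)
    return math.hypot(u - observed[0], v - observed[1])

def total_squared_error(K, cameras, points, observations):
    """Sum of squared re-projection errors over all measurements.

    cameras: list of (R, t); points: list of 3D points;
    observations: list of (camera_index, point_index, (u, v)).
    """
    return sum(
        reprojection_error(K, *cameras[c], points[p], uv) ** 2
        for c, p, uv in observations
    )

def reject_outliers(K, cameras, points, observations, threshold_px):
    """Keep only measurements whose re-projection error is within the threshold."""
    return [
        (c, p, uv)
        for c, p, uv in observations
        if reprojection_error(K, *cameras[c], points[p], uv) <= threshold_px
    ]
```

A bundle adjustment routine would adjust the entries of `cameras` and `points` to reduce `total_squared_error`, typically with a robust loss in place of the plain squared error used here.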
The sequential approach to structure from motion is computationally expensive for large image collections. It can also suffer from accumulated drift as errors compound, which makes a reconstructed scene appear incorrect; e.g., what is actually a square corner appears somewhat rounded and at an angle other than ninety degrees. What is desirable is to compute a direct initialization (estimates for camera poses and structure) in an efficient and robust manner, without any intermediate bundle adjustment, while still allowing for a final bundle adjustment over the complete structure and all the cameras.
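The compounding of drift can be illustrated with a toy two-dimensional sketch, in which poses are chained one after another around a square and every estimated turn is off by a small amount. The 2-degree per-turn error and the function name are hypothetical values chosen only for illustration; real systems accumulate error in full 3D pose and structure.

```python
import math

def chain_poses(turns_deg, step=1.0):
    """Chain relative motions: move `step` along the heading, then apply a turn."""
    x = y = 0.0
    heading = 0.0  # degrees
    path = [(x, y)]
    for turn in turns_deg:
        x += step * math.cos(math.radians(heading))
        y += step * math.sin(math.radians(heading))
        heading += turn
        path.append((x, y))
    return path, heading

# True motion: four sides of a square with exact 90-degree corners.
true_path, true_heading = chain_poses([90.0] * 4)

# Estimated motion: each corner is measured as 88 degrees (assumed error).
est_path, est_heading = chain_poses([88.0] * 4)

# Per-step errors add up: the heading error grows with every chained pose,
# and the estimated loop no longer closes on its starting point.
heading_error = true_heading - est_heading
closure_gap = math.hypot(
    est_path[-1][0] - true_path[-1][0],
    est_path[-1][1] - true_path[-1][1],
)
```

Because each pose is expressed relative to the previous estimate, no later step corrects the earlier errors unless a global adjustment is performed.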