Structure from motion (SfM) refers to the process of estimating the three-dimensional structure of a scene and the motion of the cameras from point correspondences in two-dimensional image sequences. Here, camera motion refers to both the internal camera parameters (namely focal length, radial distortion and principal point) and the external camera parameters (camera position and orientation).
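To make the role of the two parameter groups concrete, the following is a minimal sketch of how a 3D point could be mapped to pixel coordinates. It is an illustration under stated assumptions, not a method taken from the cited works: the function name `project`, the symbols `R`, `t`, `f`, `c` and `k1`, and the single-coefficient polynomial distortion model are all chosen here for the example.

```python
def project(X, R, t, f, c, k1=0.0):
    """Project a 3D world point X to pixel coordinates (hypothetical helper).

    External parameters: rotation R (3x3 nested list) and translation t.
    Internal parameters: focal length f, principal point c = (cx, cy),
    and one radial distortion coefficient k1 (assumed model).
    """
    # External parameters: move the point from the world frame to the camera frame.
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division onto the normalized image plane.
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]
    # Radial distortion: scale by a polynomial in the squared radius.
    scale = 1.0 + k1 * (x * x + y * y)
    # Internal parameters: apply focal length and shift by the principal point.
    return (f * scale * x + c[0], f * scale * y + c[1])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uv = project([0.5, -0.25, 2.0], identity, [0.0, 0.0, 0.0], f=800.0, c=(320.0, 240.0))
```

SfM has to recover all of these quantities for every camera, which is why the distinction between parameters that may be shared across images (internal) and parameters that change per image (external) matters in what follows.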
Individual features of the object are expected to be present in more than one image. For example, a feature at position x1 in the first image I1 may be detected at positions x2 and x3 in the second and third images I2 and I3, respectively. Such a tuple of corresponding features is called a correspondence. If the cameras and their positions in space are known, it is possible to reconstruct 3D points of the observed object directly. This can be achieved by intersecting the rays from the camera centers through the feature points of one particular correspondence. This technique is called triangulation. However, when the camera parameters are unknown, only the images themselves can be used.
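The ray intersection described above can be sketched for two cameras as follows. With noisy measurements the two rays rarely meet exactly, so this example returns the midpoint of the shortest segment joining them (the so-called midpoint method). This is one simple variant chosen for illustration, not necessarily the one used in any cited pipeline; the function and variable names are assumptions made for the example.

```python
def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Triangulate a 3D point from two rays (hypothetical helper).

    Each ray starts at a camera center c and has direction d, pointing
    through the feature position in that image. Returns the midpoint of
    the closest points on the two rays.
    """
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; the point is not determined")
    s = (b * e - c * d) / denom  # parameter of the closest point along ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two camera centers one unit apart, both observing the same scene point.
center1, center2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
scene_point = [0.5, 0.2, 2.0]
point = triangulate_midpoint(center1, sub(scene_point, center1),
                             center2, sub(scene_point, center2))
```

The sketch presupposes that the camera centers and ray directions are known, which is exactly the information that is unavailable when the camera parameters have not yet been estimated.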
The literature teaches several approaches for solving the problem of estimating the three-dimensional structure of a scene and the motion of the cameras from point correspondences in two-dimensional image sequences. Classical SfM pipelines process images in batch and handle the modeling process without making assumptions about the imaged scene or the acquisition rig (Farenzena, Fusiello, & Gherardi, 2009). U.S. Pat. No. 8,837,811 (2014) describes an SfM pipeline that estimates camera external parameters using a two-step approach. Existing pipelines either assume known internal parameters, or assume constant internal parameters, or rely on EXIF data combined with external information (camera CCD dimensions) (Crandall, Owens, Snavely, & Huttenlocher, 2011) (Wu, 2013). Although automatic internal camera calibration methods (autocalibration) are already known in the literature (Triggs, 1997) (Gherardi & Fusiello, 2010) (Toldo, Gherardi, Farenzena, & Fusiello, 2015), what is missing is a reliable system that estimates both internal and external parameters with a clustering procedure that favors the clustering of cameras with the same internal parameters.
Every SfM pipeline employs robust keypoint detectors and descriptors in the very first phase. Keypoints are distinctive and repeatable points that are extracted from each image. Typical detectors used in SfM pipelines are based on the Laplacian of Gaussian (Lindeberg, 1998) or the Difference of Gaussians (U.S. Pat. No. 6,711,293 B1, 2004). The neighborhood of every keypoint is coded into a descriptor, i.e. a numerical description of the properties of the image patch surrounding the keypoint. This allows the keypoints to be matched across different images. Several descriptors have been proposed in the literature (Tola, Lepetit, & Fua, 2010; U.S. Pat. No. 6,711,293 B1, 2004) (US Patent No. EP1850270 B1, 2010). In SfM it is important to use descriptors that are robust to both noise and photometric deformation. While time is not a critical issue during the keypoint extraction phase, it becomes important during descriptor matching, because the matching cost can grow quickly with the number of images (quadratically, if every image pair is compared). What is missing is a descriptor and a matching algorithm specifically tailored to Structure from Motion tasks, and a procedure to discover neighboring images using the least information possible.
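As an illustration of why matching dominates the running time, the sketch below matches two descriptor sets by brute force and enumerates the image pairs that an exhaustive strategy would have to compare. It is a generic baseline under stated assumptions, not the tailored descriptor or matching algorithm referred to above: the Euclidean distance, the nearest-to-second-nearest ratio test with a threshold of 0.8, and all function names are choices made for this example.

```python
import math
from itertools import combinations


def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute-force matching with a ratio test (hypothetical helper).

    For each descriptor in desc_a, find its two nearest descriptors in
    desc_b and accept the nearest one only if it is clearly closer than
    the second nearest. Returns a list of (index_a, index_b) pairs.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches


def image_pairs(n_images):
    """All unordered image pairs an exhaustive matcher would compare."""
    return list(combinations(range(n_images), 2))


# Toy 2D descriptors; real descriptors have many more dimensions.
desc_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
desc_b = [[1.0, 0.9], [0.1, 0.0], [3.0, 3.0]]
matches = match_descriptors(desc_a, desc_b)  # third descriptor is ambiguous and rejected
pairs = image_pairs(4)                       # 4 images give 6 pairs
```

Each pair costs a number of distance computations proportional to the product of the two keypoint counts, and the number of pairs is n(n-1)/2 for n images, which is the motivation for discovering neighboring images cheaply before matching.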
Once the camera internal and external parameters have been estimated, a dense point cloud and a surface can be extracted. The goal of Multi-view Stereo (MVS) is to extract a dense 3D surface or point cloud reconstruction from multiple images taken from known camera viewpoints. The camera internal and external parameters may come from an automatic SfM approach or from a pre-calibrated environment. This is a well-studied problem with many practical and industrial applications. Laser scanners yield very accurate and detailed 3D reconstructions. However, they are based on expensive hardware, are difficult to carry and are rather complex to set up, especially for large-scale outdoor reconstructions. In all these cases, MVS can be applied successfully. In Seitz, Curless, Diebel, Scharstein, & Szeliski (2006) several multi-view stereo algorithms are presented and a full taxonomy is drawn. In US Patent No. US20130201187 (2013) an image-based multi-view stereo process is applied to face generation. The method makes use of facial landmark detection in a multi-view stereo process. What is missing is a scalable MVS system strongly guided by the structure and visibility information extracted by the SfM pipeline.
Some MVS techniques use automatically or manually extracted silhouettes to reconstruct 3D objects that are completely visible in all images, or simply to discard the background information. The silhouettes can also be used to enhance the reconstruction process (Hernandez Esteban & Schmitt, 2004). In US Patent No. US20140219550 (2014) silhouettes are extracted automatically and are used to estimate the poses of articulated 3D objects. When using silhouettes, the major problem is that even small errors during the extraction process lead to large reconstruction errors. What is missing is an accurate general-purpose procedure for guided silhouette extraction.