Three-dimensional (3D) reconstruction of objects or scenes from two-dimensional camera images is a well-studied research area. In a typical approach, a number of corresponding features (e.g., points) in the scene are identified in each of several images, which in the most general case may have been captured by one or several cameras from camera positions and orientations that are not precisely known a priori. These correspondences between images place constraints on the relative positions of the cameras; with enough correspondences, the relative camera positions may be determined for the images. Once the relative camera positions are determined, the positions of the features in 3D space can be determined by triangulation.
A commonly used mathematical model for the camera is the perspective camera model. According to this model, the projection of a point in 3D space with coordinates X to an image point with coordinates x can be described with the camera equation:
$$\lambda \begin{pmatrix} x \\ 1 \end{pmatrix} = K R \, (I \mid -t) \begin{pmatrix} X \\ 1 \end{pmatrix}, \qquad (1)$$
where K is an upper triangular matrix containing the intrinsic camera parameters, R is an orthogonal matrix describing the orientation of the camera, t is the position of the camera, and λ is a scalar factor. Given a number of corresponding image points $x_{ij}$ in several images, the unknowns $K_j$, $R_j$, $t_j$ and $X_i$ can be estimated by solving a large system of equations. Here i is an index for the points in 3D space and j is an index for the images.
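As an illustration of equation (1), the following minimal Python sketch projects a single 3D point into an image. The helper names (`matmul`, `project`) and all numeric values are assumptions chosen for the example; they do not come from any particular system.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def project(K, R, t, X):
    """Apply equation (1): lambda * (x, 1)^T = K R (I | -t) (X, 1)^T.

    Returns the image point x (two coordinates) and the scalar lambda.
    """
    # (I | -t) (X, 1)^T is simply X - t, written here as a 3x1 column.
    shifted = [[X[i] - t[i]] for i in range(3)]
    h = matmul(matmul(K, R), shifted)  # homogeneous image point, 3x1
    lam = h[2][0]
    return (h[0][0] / lam, h[1][0] / lam), lam


# Illustrative values only (assumptions, not taken from the text).
K = [[800.0, 0.0, 320.0],   # upper triangular intrinsic matrix
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],       # identity orientation
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, -5.0]        # camera position
X = [1.0, 0.5, 0.0]         # point in 3D space

x, lam = project(K, R, t, X)
print(x, lam)
```

Dividing by the third homogeneous coordinate removes λ, which is why λ does not need to be known in advance in order to compute the image point.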
Those skilled in the art will appreciate that, using the perspective camera model discussed above, a reconstruction can be determined only up to an unknown projective transformation, absent detailed knowledge of the camera parameters and camera positions or motion. This is because the camera matrix $KR(I \mid -t)T^{-1}$ and the point $TX$ give the same image points as the camera matrix $KR(I \mid -t)$ and the point $X$ for an arbitrary invertible 4×4 matrix $T$ acting on homogeneous point coordinates, so there is no way of determining which one of these reconstructions is the correct one without further information about the camera positions or intrinsic camera parameters. If certain knowledge of the intrinsic camera parameters is available (e.g., their values, or that they are constant), the reconstruction may be determined up to an unknown scale, rotation, and translation. That the scale cannot be determined can be seen by observing that the camera matrix $KR(I \mid -st)$ and the point $\begin{pmatrix} sX \\ 1 \end{pmatrix}$ give the same image point as the camera matrix $KR(I \mid -t)$ and the 3D point $\begin{pmatrix} X \\ 1 \end{pmatrix}$ for an arbitrary nonzero scale s, because $KR(sX - st) = s\,KR(X - t)$ and the factor s is absorbed into λ. The unknown rotation and translation stem from the choice of coordinate system in the reconstruction.
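The scale ambiguity described above can be checked numerically. The sketch below, which assumes illustrative values for K, R, t, X and s and uses a hypothetical helper named `project`, projects a point with the original camera and then projects the scaled point with the scaled camera position, and compares the two image points.

```python
def project(K, R, t, X):
    """Image point of X under the camera K R (I | -t), as in equation (1)."""
    d = [X[i] - t[i] for i in range(3)]                       # (I | -t)(X, 1)^T
    r = [sum(R[i][j] * d[j] for j in range(3)) for i in range(3)]
    h = [sum(K[i][j] * r[j] for j in range(3)) for i in range(3)]
    return (h[0] / h[2], h[1] / h[2])                         # divide out lambda


# Illustrative values only (assumptions, not taken from the text).
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]     # 90 degrees about z
t = [0.2, -0.1, -4.0]
X = [1.0, 0.5, 1.0]
s = 3.0                                                       # arbitrary nonzero scale

x_original = project(K, R, t, X)
x_scaled = project(K, R, [s * c for c in t], [s * c for c in X])

# The two image points agree up to floating-point rounding.
difference = max(abs(a - b) for a, b in zip(x_original, x_scaled))
print(x_original, x_scaled, difference)
```

Because the images alone cannot distinguish the two configurations, any value of s is equally consistent with the observed image points.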
If the scale factor for a 3D reconstruction is unknown, it is impossible to make metric measurements in the 3D reconstruction, to add external 3D objects into the reconstruction (since the relative scale is unknown), to combine different 3D reconstructions into a single scene, or to generate a pair of stereo images with a particular distance between the corresponding cameras. This problem is solved today in various ways. One way is to include a reference object, such as a ruler or other device of known dimensions, in the scene. Since the dimensions of the reference object are known, the scale of the 3D reconstruction can be determined. Alternatively, if multiple cameras are used and the distance between them is known, then the scale factor can also be determined. Or, if a single camera is mounted on a robot that can estimate the camera motion between image captures, the scale factor can again be determined based on the known relative position between image captures. Yet another way to deal with the problem of combining 3D models, without actually solving the scaling problem, is to manually scale the 3D reconstruction until it looks acceptable to a human operator in relation to an external object or another 3D reconstruction. If metric measurements of the external object are known, then a low-precision estimate of the scale of the 3D reconstruction can be determined. Each of these approaches has serious limitations, however: in some circumstances it may be impossible to insert a reference object or to track the relative positions of the imaging cameras. Furthermore, manual processing may be undesirable or too inaccurate.
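The reference-object approach mentioned above can be sketched as follows. This is a minimal illustration only: the function names `scale_from_reference` and `apply_scale`, the reconstructed coordinates, and the 0.30 m ruler length are all assumptions made for the example.

```python
import math


def scale_from_reference(p_a, p_b, known_length):
    """Factor converting reconstruction units to metric units, given the two
    reconstructed endpoints of a reference object of known length."""
    reconstructed_length = math.dist(p_a, p_b)
    if reconstructed_length == 0.0:
        raise ValueError("reference endpoints coincide in the reconstruction")
    return known_length / reconstructed_length


def apply_scale(points, s):
    """Scale every 3D point uniformly about the origin."""
    return [tuple(s * c for c in p) for p in points]


# Hypothetical reconstruction in arbitrary units (assumed values).
ruler_start = (0.0, 0.0, 2.0)
ruler_end = (0.6, 0.0, 2.8)            # about 1.0 reconstruction unit long
known_ruler_length_m = 0.30            # assumed true length in metres

s = scale_from_reference(ruler_start, ruler_end, known_ruler_length_m)

points = [(1.0, 2.0, 4.0), (-2.0, 0.5, 6.0)]
camera_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

# Points and camera positions are scaled together, which leaves the
# image points unchanged while making distances metric.
metric_points = apply_scale(points, s)
metric_cameras = apply_scale(camera_positions, s)
print(s, metric_points, metric_cameras)
```

The same ratio applies when the known quantity is the distance between two cameras rather than a ruler: passing the two reconstructed camera positions and the known distance to `scale_from_reference` yields the scale factor in the same way.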