Estimating the trajectory of a camera is notably used in augmented reality applications which merge virtual objects in digital images of a real scene. The main difficulty with augmented reality systems that use a single camera is how to estimate as accurately as possible the 3D registration between the real scene (or real environment) and the 3D virtual information to obtain a realistic merging. This 3D registration entails determining at any instant the pose of the camera, that is to say its position and its orientation in relation to a fixed reference frame on the scene.
The estimation of the pose of a camera relative to a 3D scene is a very active research topic.
Most of the existing methods, notably for tracking 3D objects, consider only a known part of the scene, in this case the 3D-modelled part of an object of interest. Among these methods, a distinction can be made between those which are model-based (“model-based tracking”) and those which are based on learning.
The model-based methods consist in calculating the 6 pose parameters of the camera by minimizing, for each of the images picked up by the camera, the distance between the edges of the projected 3D model and the edges detected in the image. One example of such a method is described by Vincent Lepetit and Pascal Fua in the publication “Monocular model-based 3d tracking of rigid objects: A survey”, in FTCGV, 2005. The main limitation of these methods is that they work only if the object is always visible in the sequence of images. To obtain an accurate pose, it is also necessary for the object of interest to take up a good portion of the image, or, to put it another way, to be “close” to the camera. Furthermore, the movements of the camera must be small to be able to ensure the 3D tracking.
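The minimization described above can be illustrated by the cost function it evaluates. The following is only a minimal sketch under simplifying assumptions: it compares projected 3D model points with 2D image points rather than edges, it uses an ideal pinhole camera whose intrinsic values (`focal`, `cx`, `cy`) are hypothetical, and it parameterizes the 6 pose parameters as an axis-angle rotation followed by a translation. All function names are illustrative and do not come from the cited publication.

```python
import math

def rotation_from_axis_angle(rvec):
    """Convert an axis-angle vector (3 values) into a 3x3 rotation matrix."""
    angle = math.sqrt(sum(v * v for v in rvec))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (v / angle for v in rvec)
    c, s = math.cos(angle), math.sin(angle)
    w = 1.0 - c
    return [
        [c + kx * kx * w, kx * ky * w - kz * s, kx * kz * w + ky * s],
        [ky * kx * w + kz * s, c + ky * ky * w, ky * kz * w - kx * s],
        [kz * kx * w - ky * s, kz * ky * w + kx * s, c + kz * kz * w],
    ]

def project(pose, point, focal=500.0, cx=320.0, cy=240.0):
    """Project a 3D point with a 6-parameter pose (rx, ry, rz, tx, ty, tz)."""
    R = rotation_from_axis_angle(pose[:3])
    tx, ty, tz = pose[3:]
    X, Y, Z = point
    xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + tx
    yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + ty
    zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + tz
    return (focal * xc / zc + cx, focal * yc / zc + cy)

def reprojection_cost(pose, model_points, observations):
    """Sum of squared image distances between projected and observed points."""
    cost = 0.0
    for point, (u_obs, v_obs) in zip(model_points, observations):
        u, v = project(pose, point)
        cost += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return cost

# Hypothetical model: the 8 corners of a unit cube centred on the origin.
model_points = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]
true_pose = (0.1, -0.2, 0.05, 0.3, -0.1, 5.0)
observations = [project(true_pose, p) for p in model_points]

cost_at_true_pose = reprojection_cost(true_pose, model_points, observations)
perturbed_pose = (0.1, -0.2, 0.05, 0.35, -0.1, 5.0)
cost_at_perturbed_pose = reprojection_cost(perturbed_pose, model_points, observations)
```

A tracker of this type would pass such a cost to a nonlinear least-squares solver for each image, starting from the pose of the previous image. That starting point is one reason why the movements of the camera must remain small.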
The learning-based methods require a preliminary so-called learning phase, which consists in learning the photometric aspect (that is to say the appearance) of the object. This phase consists in enriching the 3D model of the object with texture descriptors extracted from the images. Two types of learning are possible:
Coded markers of known positions are placed around the object so as to estimate the pose of the camera for a few points of view. A coded marker (also called coded target) is an optical marker of known size that can easily be detected in the image and identified by its code. For each of these points of view, points of interest are extracted from the image and characterized by the surrounding texture, and then are associated directly with the 3D points that correspond to them on the object by a single projection from the viewpoint of the camera, the latter being known for each of these points of view by virtue of the coded targets. An example is presented by Juri Platonov, Hauke Heibel, Peter Meier and Bert Grollmann in the publication “A mobile markerless AR system for maintenance and repair”, in ISMAR, 2006.
A cloud of 3D points is estimated by matching 2D points of a video sequence and by using a technique of reconstruction by SfM, the acronym SfM standing for “Structure from Motion”. Then, this cloud of 3D points is realigned offline and semi-automatically on the 3D model of the object to obtain 3D points belonging to the model, enriched by descriptors extracted from the images. An example of this method is described by P. Lothe, S. Bourgeois, F. Dekeyser, E. Royer and M. Dhome in the publication “Towards geographical referencing of monocular slam reconstruction using 3d city models: Application to real-time accurate vision-based localization”, in CVPR, 2009.
Once this learning phase has been carried out, the online calculation of poses is performed by associating the 2D points extracted from the current image with the 3D points of the object by using a criterion of likelihood of the descriptors.
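The association step can be sketched as a nearest-neighbour search in descriptor space. The sketch below rests on assumptions that the text does not specify: descriptors are plain numeric vectors, the likelihood criterion is replaced by a Euclidean distance combined with a ratio test between the best and second-best candidates, and the threshold `max_ratio` is a hypothetical value.

```python
import math

def match_descriptors(image_descs, model_descs, max_ratio=0.8):
    """Return (image_index, model_index) pairs that pass the ratio test."""
    matches = []
    for i, desc in enumerate(image_descs):
        # Distance from this image descriptor to every model descriptor.
        dists = sorted((math.dist(desc, m), j) for j, m in enumerate(model_descs))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Keep the match only if it is clearly better than the runner-up.
        if best < max_ratio * second:
            matches.append((i, j))
    return matches

# Hypothetical 3-dimensional descriptors attached to three 3D model points.
model_descs = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Descriptors extracted from the current image; the second one is ambiguous.
image_descs = [(0.9, 0.1, 0.0), (0.5, 0.5, 0.0), (0.0, 0.1, 0.95)]

matches = match_descriptors(image_descs, model_descs)
```

Each retained pair links a 2D image point to a 3D model point, and the set of pairs is what a pose solver would then consume. The ambiguous descriptor is discarded because it lies equally close to two model descriptors.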
The two main limitations of these methods are that, on the one hand, they require a preliminary learning phase and, on the other hand, they are very sensitive to the changes of photometric appearance of the object between the learning phase and the pose calculation phase (worn objects, variations of the lighting conditions). Furthermore, these methods work only on strongly textured objects.
Overall, the main limitation of these methods that consider only the known part of the object is that they work only if the object is always visible in the sequence of images. If the object is totally occluded or if it disappears from the field of view of the camera, these methods can no longer calculate the pose of the camera.
These methods are also subject to “jittering” (tremors in augmented reality due to instabilities of the poses calculated from one image to the next). In addition, to obtain an accurate pose estimation, it is necessary for the object of interest to take up a lot of space in the image. In practice, the information concerning the environment is not taken into account in estimating the pose of the camera.
Other methods consider a camera moving in a totally unknown environment. The methods of SfM type or of SLAM (“Simultaneous Localization And Mapping”) type estimate the movement of a camera without any a priori knowledge of the geometry of the scene observed. Offline methods, then online methods, have been proposed. They are very stable because they use the whole of the observed scene to locate the camera. They consist in incrementally estimating the trajectory of the camera and the geometry of the scene. For this, these algorithms make use of the multi-view relationships (a view being an image) to estimate the movement of the camera, possibly with a 3D reconstruction of the scene (in the form of a sparse cloud of 3D primitives: points, straight line segments, etc.). An additional optimization step, which consists in simultaneously refining the poses of the camera and the reconstructed 3D scene, is generally performed. The latter step is called bundle adjustment. The main drawback with the algorithms of SLAM type is that they are subject to accumulations of errors and therefore to a drift in the trajectory over time. Their use in applications which demand great 3D registration accuracy at all times (for example, augmented reality) can therefore not be considered in their original form. Also, in the monocular case, the reconstruction is performed on an arbitrary scale: the real scale can be known only by adding further information concerning the metric of the scene. The reconstruction is also performed in an arbitrary reference frame which is not linked to an object of the scene.
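The arbitrary scale of a monocular reconstruction can be checked numerically. The sketch below assumes an ideal pinhole camera with identity rotation and a hypothetical focal length. It shows that multiplying both the 3D points and the camera centres by the same factor leaves every image projection unchanged, so the images alone cannot reveal the real scale.

```python
def project(point, cam_center, focal=500.0):
    """Pinhole projection for a camera with identity rotation at cam_center."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)

def scale_all(vectors, s):
    """Multiply every 3D vector by the same scale factor s."""
    return [tuple(s * v for v in vec) for vec in vectors]

# Hypothetical scene points and two camera centres along a short trajectory.
points = [(0.2, -0.1, 4.0), (-0.5, 0.3, 6.0), (1.0, 0.8, 5.0)]
cameras = [(0.0, 0.0, 0.0), (0.4, 0.0, 0.1)]

scale = 3.7
scaled_points = scale_all(points, scale)
scaled_cameras = scale_all(cameras, scale)

# Largest difference between the original and the scaled projections.
max_difference = max(
    abs(a - b)
    for cam, scaled_cam in zip(cameras, scaled_cameras)
    for pt, scaled_pt in zip(points, scaled_points)
    for a, b in zip(project(pt, cam), project(scaled_pt, scaled_cam))
)
```

Here `max_difference` is zero up to floating-point rounding for any non-zero `scale`, which is why metric information from outside the images, such as a known 3D model, is needed to fix the scale.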
Finally, more recently, some methods try to successively combine these two approaches. Methods that use, successively, a model-based approach then SfM techniques have been proposed to estimate the pose of the moving camera in a partially known environment. Bleser et al., in “Online camera pose estimation in partially known and dynamic scenes”, in ISMAR, 2006, make use of the geometrical constraints of the model to initialize the reference frame and the scale of the reconstruction of the SLAM algorithm. The location of the camera is then calculated by a “conventional” method of SLAM type which no longer takes account of the 3D model.
The accuracy during initialization is not guaranteed since it is done on a single view, and, in addition, the method remains subject to accumulations of numerical errors and to a drift of the scale factor. As previously specified, location based on SLAM or SfM type methods does not allow for an accurate location in the medium and long term, notably because of drift.
The method described by V. Gay-Bellile, P. Lothe, S. Bourgeois, E. Royer and S. Naudet-Collette in “Augmented Reality in Large Environments: Application to Aided Navigation in Urban Context”, in ISMAR, 2010, combines a SLAM technique and a technique of relocation using prior learning. It therefore makes it possible to calculate the pose of the camera by means of SLAM when the object is no longer visible, and it avoids the drift by virtue of the relocation. However, this method requires a preliminary learning phase of the type used by the learning-based methods.
The latter two methods successively use the constraints of the model then those of the environment.
Similarly, a method that successively uses the constraints of the environment then those of the model has been proposed by Lothe et al. in “Real-Time Vehicle Global Localisation with a Single Camera in Dense Urban Areas: Exploitation of Coarse 3D City Models”, in CVPR, 2010. In this case, a first reconstruction of the environment is performed, then, in a second stage, a process based on a method of rigid Iterative Closest Point (ICP) type is used to realign the reconstruction on the model. It consists in realigning, when possible (that is to say when the model provides sufficient geometrical constraints), the trajectory of the camera by using only the information of the model. The major drawback with this method is that, in order to conserve the multi-view constraints in the step of 3D registration on the model, it applies a similar transformation to all of the cameras included in the process, which is a big assumption to make in practice. The same drawback as with the model-based methods applies: lack of accuracy and robustness when the object of interest is observed little or not at all. Also, since this method is performed in two successive steps, it is not optimal and does not ensure an accurate real-time location at each instant: the correction by virtue of the model is made in an a posteriori step, so the corrected pose for the current image is supplied with a time delay, making the method unsuited to applications such as augmented reality.
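The principle of a rigid ICP realignment can be illustrated as follows. This is only a minimal sketch, reduced to 2D points for brevity and not taken from the cited publication: at each iteration, every reconstructed point is paired with its closest model point, then a single rotation and translation minimizing the paired distances is computed in closed form and applied to the whole reconstruction. The point sets and function names are hypothetical.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    a = b = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py, qx, qy = px - sx, py - sy, qx - dx, qy - dy
        a += px * qx + py * qy  # weight of cos(theta)
        b += px * qy - py * qx  # weight of sin(theta)
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def apply_rigid_2d(points, theta, tx, ty):
    """Apply the same rotation and translation to every point."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(reconstruction, model, iterations=10):
    """Iteratively realign the reconstruction on the model."""
    current = list(reconstruction)
    for _ in range(iterations):
        # Pair each reconstructed point with its closest model point.
        matched = [min(model, key=lambda m: math.dist(p, m)) for p in current]
        theta, tx, ty = fit_rigid_2d(current, matched)
        current = apply_rigid_2d(current, theta, tx, ty)
    return current

# Hypothetical model outline and a slightly misaligned reconstruction of it.
model = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 5.0)]
reconstruction = apply_rigid_2d(model, 0.05, 0.2, -0.1)

aligned = icp_2d(reconstruction, model)
residual = max(math.dist(p, m) for p, m in zip(aligned, model))
```

In this sketch one global transformation is applied to every point at each iteration, and the misalignment is small enough for the closest-point pairing to be correct from the first iteration. With a larger initial error, ICP may converge to a wrong alignment.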
Consequently, there remains to this day a need for a method for locating the camera and for the 3D reconstruction of the static environment in which the camera is moving, that simultaneously satisfies all the abovementioned requirements, in terms of accuracy, robustness, stability, and does so in real time.