Augmented reality (AR) is a field of computer research concerned with combining real-world and computer-generated data, in which computer graphics objects are blended into real footage in real time. The majority of augmented reality image capturing systems operate with predetermined information about the environment of a user (i.e., in the form of a map). The user is allowed to interact with the environment based on the predetermined information. If the map provided is comprehensive, registration can be performed directly from the map, which is a common method used in camera-based augmented reality tracking. Unfortunately, creating a comprehensive map is difficult and time-consuming. Such a map is often created manually by trained technicians, and is generally not sufficiently accurate unless it is optimised by a minimisation method, which is again computationally expensive.
Parallel tracking and mapping (PTAM) is an algorithm, used particularly in handheld devices such as a camera, to perform real-time tracking in scenes without the need for a prior map. A user may first place such a camera above a workspace to be tracked and press a key to select an initial keyframe for map initialisation. Typically, about one thousand (1000) natural features are extracted from the initial keyframe and tracked across subsequent frames. The user may then smoothly translate the camera to a slightly offset position and make a second key-press to provide a second keyframe. A known five-point-pose algorithm may then be used to estimate the relative camera pose and triangulate the initial map using the selected keyframes and tracked feature correspondences.
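The triangulation step above can be sketched in isolation. The sketch below is a minimal illustration (not PTAM's implementation) that assumes the two camera poses have already been estimated, e.g. by the five-point-pose algorithm, and recovers a 3D point from its two image projections by the standard direct linear transform (DLT); all numeric values are hypothetical.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    # Direct linear transform: stack the projection constraints from both
    # views and take the null vector of A as the homogeneous 3D point.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two hypothetical camera poses separated by a translational baseline.
K = np.eye(3)                                        # identity intrinsics for simplicity
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])                   # a scene point 4 units from camera 1
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
X_est = triangulate_point(P1, P2, x1, x2)            # recovers X_true
```

If the second camera had no translation relative to the first (pure rotation), the two constraint sets would be redundant and no depth could be recovered, which is why a stereo baseline between the two keyframes matters.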
One disadvantage of the five-point-pose algorithm is the requirement for human interaction during map initialisation. Some users do not understand the stereo baseline required for triangulation and attempt to initialise a camera or the like using pure rotation. In addition, the five-point-pose algorithm requires long, uninterrupted feature tracks. Any unintentional camera rotation or drastic camera motion may cause feature matching to fail, leaving few tracked features for map initialisation. Another method of performing real-time tracking in scenes assumes a user is initially viewing a planar scene. As the user moves a camera after selecting an initial keyframe, homography hypotheses between a current frame and the initial keyframe are generated at each frame from matched features. Each homography hypothesis is then decomposed into two or more possible three-dimensional (3D) camera poses. A second keyframe is selected based on a condition number. The condition number is the ratio of the maximum to the minimum eigenvalue of the information matrix JᵀJ, where J is the Jacobian matrix of partial derivatives of each point's projection with respect to the eight (8) degrees of freedom (DOF) of the decomposition. Such a method is also not optimal, since the condition number only gives an indication of the scale of the errors with respect to the parameters of the decomposition and does not relate directly to the accuracy of the 3D map points.
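The condition-number test described above can be illustrated with a short sketch. The Jacobian here is random placeholder data standing in for the true matrix of projection derivatives; only the ratio computation reflects the method described.

```python
import numpy as np

# Hypothetical Jacobian: two rows (x and y residual derivatives) per tracked
# feature, eight columns for the 8-DOF parameters of the decomposition.
rng = np.random.default_rng(0)
n_features = 50
J = rng.standard_normal((2 * n_features, 8))

JTJ = J.T @ J                        # information matrix
eigvals = np.linalg.eigvalsh(JTJ)    # eigenvalues in ascending order
condition_number = eigvals[-1] / eigvals[0]
```

A large condition number indicates that some direction in parameter space is poorly constrained by the current matches; as noted above, however, it does not bound the accuracy of the resulting 3D map points.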
Another method of performing real-time tracking in scenes is a model-based method, based on the Geometric Robust Information Criterion (GRIC). In such a model-based method, a GRIC score is computed based on feature correspondences between an initial keyframe and a current frame. For each frame, a score is computed for each of two models (i.e., epipolar and homography). The homography model best describes the correspondences for stereo images with a small baseline. The epipolar model takes scene geometry into account but requires a larger baseline. A second keyframe is selected when the GRIC score of the epipolar model is lower than the GRIC score of the homography model. However, such model-based methods require long, continuous, uninterrupted feature tracks and the computation of re-projection errors for each tracked feature under both the homography and epipolar models, which can be computationally expensive.
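A GRIC score of the general form used by such model-based methods can be sketched as follows. This follows one common formulation (Torr's), with the usual dimension and parameter counts for the homography and epipolar (fundamental matrix) models; the constants and the residuals are assumptions for illustration, not the specific method described above.

```python
import numpy as np

def gric(residuals, sigma, n, model):
    # Torr's GRIC, one common formulation (constants here are assumptions):
    #   GRIC = sum_i rho(e_i^2) + lambda1 * d * n + lambda2 * k
    #   rho(e^2) = min(e^2 / sigma^2, lambda3 * (r - d))
    r = 4.0                                     # dimension of a correspondence (two 2D points)
    d = 2.0 if model == "homography" else 3.0   # dimension of the model manifold
    k = 8.0 if model == "homography" else 7.0   # number of model parameters
    lam1, lam2, lam3 = np.log(r), np.log(r * n), 2.0
    rho = np.minimum(residuals ** 2 / sigma ** 2, lam3 * (r - d))
    return rho.sum() + lam1 * d * n + lam2 * k

# Hypothetical residuals for each model on n tracked correspondences.
rng = np.random.default_rng(1)
n = 300
score_h = gric(rng.normal(0.0, 1.0, n), 1.0, n, "homography")
score_e = gric(rng.normal(0.0, 0.8, n), 1.0, n, "epipolar")
```

A second keyframe would then be selected once the epipolar score drops below the homography score, as described above; note that evaluating both models requires a residual for every tracked feature at every frame, which is the cost the text points out.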
Other methods of performing real-time tracking in scenes make an implicit assumption that a sufficiently accurate initial 3D map can be created when either the temporal distance between two keyframes or the track length of tracked features exceeds a fixed threshold. Such assumptions are often incorrect, since the distance of the features from the camera affects the required distance between keyframes.
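The dependence on feature distance can be made concrete with a standard first-order stereo error model: for a rectified pair with focal length f (in pixels) and baseline b, depth is z = f·b/disparity, so a one-pixel disparity error produces a depth error of roughly z²/(f·b). The values below are hypothetical.

```python
def depth_error_per_pixel(z, f, b):
    # First-order depth error for a one-pixel disparity error:
    #   z = f * b / disparity  =>  dz ~= z**2 / (f * b)
    return z ** 2 / (f * b)

f, b = 500.0, 0.05   # hypothetical focal length (pixels) and 5 cm baseline
errors = {z: depth_error_per_pixel(z, f, b) for z in (1.0, 5.0, 20.0)}
# errors: {1.0: 0.04, 5.0: 1.0, 20.0: 16.0} (metres)
```

The same image-space matching error thus produces a 400-times larger depth error at 20 m than at 1 m, so the baseline, and hence the required keyframe separation, must grow with scene depth rather than being a fixed threshold.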