Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Three-dimensional (3D) position tracking is one of the fundamental techniques required to align the virtual and real worlds in Augmented Reality applications, but is also applicable to many other useful applications. Visual SLAM (Simultaneous Localization and Mapping) is the dominant technology for reconstructing the camera trajectory and the environment, especially in the absence of any prior knowledge of the environment. A visual SLAM system typically consists of a front-end and a back-end. Visual odometry is the front-end that estimates the per-frame camera pose, which is essential in Augmented Reality applications to seamlessly align virtual objects with the real world. In the back-end, camera poses are normally refined through a global optimization. Some methods also reconstruct a model of the environment, which can be further used for many other purposes, such as 3D mapping, physically-based simulation, and animation. The majority of existing visual SLAM techniques can be grouped into two categories, frame-to-frame and frame-to-model approaches, based on how the back-end map information is maintained and utilized for front-end pose estimation.
Frame-to-frame approaches are typically keyframe-based and rely on pose-pose constraints between a frame and a keyframe for pose estimation (e.g., DVO and σ-DVO). Specifically, a set of keyframes is identified during visual odometry. For each frame associated with a keyframe, a relative camera pose is computed with respect to this keyframe. When loop closure is detected, the current keyframe can be associated with previous keyframes to create more pose-pose constraints. Considering all the frames, a pose graph can be constructed that represents the pose-pose constraints across frames. A pose-graph optimization can then be performed to refine the camera pose of each frame. However, frame-to-frame approaches do not maintain a global representation of the scene (e.g., a point cloud) and suffer from accumulated camera drift.
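By way of illustration only, the following minimal sketch shows how pose-pose constraints and a loop-closure residual might be expressed. It is not taken from DVO, σ-DVO, or any other cited system. It assumes planar poses (x, y, theta) instead of full 3D camera poses, treats every frame as a keyframe, and uses hypothetical names (compose, inverse, pose_pose_residual, chain) and a hypothetical odometry bias. The residual it computes is the kind of quantity a pose-graph optimization would seek to minimize; the optimization itself is not shown.

```python
import math


def wrap(theta):
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def compose(a, b):
    """Compose two planar poses (x, y, theta): apply b in the frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, wrap(at + bt))


def inverse(p):
    """Inverse of a planar pose, so that compose(p, inverse(p)) is the identity."""
    x, y, t = p
    c, s = math.cos(t), math.sin(t)
    return (-(c * x + s * y), s * x - c * y, wrap(-t))


def relative(a, b):
    """Pose of b expressed in the frame of a."""
    return compose(inverse(a), b)


def pose_pose_residual(pose_i, pose_j, measured_ij):
    """Mismatch between a measured relative pose and the one implied by the estimates.

    The result is the identity pose (0, 0, 0) when the estimates agree with the measurement.
    """
    return compose(inverse(measured_ij), relative(pose_i, pose_j))


def chain(odometry, start=(0.0, 0.0, 0.0)):
    """Accumulate frame-to-frame odometry steps into absolute poses."""
    poses = [start]
    for step in odometry:
        poses.append(compose(poses[-1], step))
    return poses


# Ground truth: four steps around a unit square, ending where the camera started.
true_step = (1.0, 0.0, math.pi / 2)
# Hypothetical biased odometry: each step slightly too long and over-rotated.
noisy_step = (1.02, 0.0, math.pi / 2 + 0.01)

true_poses = chain([true_step] * 4)
drifted_poses = chain([noisy_step] * 4)

# Loop closure: frame 4 is recognized as revisiting frame 0 (identity relative pose).
loop_closure = (0.0, 0.0, 0.0)

residual_true = pose_pose_residual(true_poses[0], true_poses[4], loop_closure)
residual_drift = pose_pose_residual(drifted_poses[0], drifted_poses[4], loop_closure)
```

In this sketch, residual_true is numerically zero, while residual_drift is not. The nonzero value is the accumulated drift that the loop-closure constraint exposes and that a back-end optimization could distribute across the poses in the graph.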
In frame-to-model approaches, such as PTAM, KinectFusion, and ElasticFusion, a global map of the scene is usually maintained and updated. Popular map representations include point clouds, surfel clouds, and volumetric fields. These approaches can provide accurate models of the environment. However, camera pose estimation is typically performed in a frame-to-model fashion, where only the pose-point constraint between the global map and the current frame observation is considered. Normally, the back-end optimization refines only the keyframe poses and the global point cloud using pose-point constraints, which means that the other frames between keyframes are never optimized. Some frame-to-model approaches do not perform any form of global optimization at all. Consequently, the accuracy of the final camera trajectory is limited. For the approaches that only maintain sparse point clouds, generating accurate dense meshes is very challenging. Moreover, approaches using dense map representations, especially volumetric representations, typically suffer from low spatial resolution and rely heavily on GPU acceleration.
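By way of illustration only, a pose-point constraint can be sketched as a reprojection residual between a map point and its observation in the current frame. This is one common form of such a constraint and is an assumption here; it is not the specific formulation used by PTAM, KinectFusion, or ElasticFusion. The pinhole intrinsics (fx, fy, cx, cy), the restriction of the camera rotation to a yaw about the y axis, and all function names are hypothetical.

```python
import math


def world_to_camera(point_w, yaw, t):
    """Express a world point in the camera frame.

    The camera pose is camera-to-world: rotation R_y(yaw) and translation t,
    so the camera-frame point is R^T (p_w - t).
    """
    dx, dy, dz = (point_w[i] - t[i] for i in range(3))
    c, s = math.cos(yaw), math.sin(yaw)
    # R_y(yaw) = [[c, 0, s], [0, 1, 0], [-s, 0, c]]; its transpose is applied below.
    return (c * dx - s * dz, dy, s * dx + c * dz)


def project(point_c, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_c
    return (fx * x / z + cx, fy * y / z + cy)


def pose_point_residual(point_w, observed_px, yaw, t, intrinsics):
    """Observed pixel minus the pixel predicted from the map point and the pose."""
    u, v = project(world_to_camera(point_w, yaw, t), *intrinsics)
    return (observed_px[0] - u, observed_px[1] - v)


intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
map_point = (0.0, 0.0, 2.0)                # a point 2 m in front of the origin
observation = (320.0, 240.0)               # pixel where the point was observed

# A pose estimate consistent with the observation gives a zero residual.
residual_good = pose_point_residual(map_point, observation, 0.0, (0.0, 0.0, 0.0), intrinsics)

# A pose estimate shifted 0.1 m along x predicts a different pixel.
residual_bad = pose_point_residual(map_point, observation, 0.0, (0.1, 0.0, 0.0), intrinsics)
```

In this sketch the residual depends on a single frame pose and a single map point. Constraints of this form tie each optimized frame to the map but, taken alone, impose no direct constraint between the poses of different frames.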
RGBD dense visual SLAM approaches have shown their advantages in robustness and accuracy in recent years. However, there are still several challenges, such as sensor noise and other inconsistencies in RGBD measurements across multiple frames, that could jeopardize the accuracy of both the camera trajectory and the scene reconstruction. It would be beneficial to provide a dense visual SLAM method that properly accounts for sensor noise and other inconsistencies in sensor measurements. It would be further advantageous if the method utilized both pose-pose and pose-point constraints in an efficient manner for back-end optimization.