Position and orientation (pose) estimation describes the task of calibrating or aligning a camera viewpoint with respect to an environment, which may be known or unknown. Image-based pose estimation methods are useful for estimating a six degrees of freedom (6DOF) pose, that is, three degrees of freedom for position and three for orientation. Image-based pose estimation traditionally requires some reconstruction or 3D model of the scene. For example, SLAM (simultaneous localization and mapping) or SFM (structure from motion) systems can reconstruct three-dimensional (3D) points from incoming image sequences captured by a camera, and these points are used to build a 3D map of a scene (i.e., a SLAM map) in real time. From the reconstructed map, it is possible to localize a camera's 6DOF pose in a current image frame.
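As an illustration of what a 6DOF pose encodes, the following minimal sketch applies a pose to a 3D map point and projects the result into an image with a pinhole camera model. The Euler-angle convention (yaw, pitch, roll), the function names, and the intrinsic values (`focal`, `cx`, `cy`) are assumptions chosen for the example and are not taken from any particular SLAM or SFM system.

```python
import math


def rotation_from_euler(yaw, pitch, roll):
    """Return R = Rz(yaw) * Ry(pitch) * Rx(roll); angles are in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def world_to_camera(point, pose):
    """Apply a 6DOF pose (yaw, pitch, roll, tx, ty, tz): X_cam = R * X_world + t."""
    R = rotation_from_euler(*pose[:3])
    t = pose[3:]
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]


def project(point, pose, focal=500.0, cx=320.0, cy=240.0):
    """Project a 3D world point to pixel coordinates (assumed pinhole intrinsics)."""
    x, y, z = world_to_camera(point, pose)
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z + cx, focal * y / z + cy)


# Hypothetical map point two units in front of a camera at the identity pose.
identity_pose = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
print(project((1.0, 0.0, 2.0), identity_pose))
```

Localization can be viewed as the inverse problem: given map points and their observed pixel locations, find the six pose parameters for which the projections best match the observations.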
Accurate 6DOF self-localization with respect to the user's environment is beneficial for correct and visually pleasing results in Augmented Reality (AR) applications. Due to the interactive nature of AR applications, localization time has a direct impact on the user experience, because it determines how long the user must wait before interaction with the AR application may start. Thus, it is desirable to localize a mobile device quickly, despite the limited processing power found in mobile devices, while maintaining the accuracy in the 6DOF pose that the desired application requires.
However, 6DOF pose initialization may be difficult to achieve in certain scenarios. For example, in outdoor environments, capturing sufficient camera baseline to initialize the SLAM algorithms is challenging. Additionally, SLAM may provide relative poses in an arbitrary referential with unknown scale, which may not be sufficient for AR applications such as navigation or labeling of landmarks. Existing methods to align the local referential of a SLAM map with the global referential of a 3D map with metric scale have required the user to wait until the SLAM system has acquired a sufficient number of images to initialize the 3D map. The waiting required for initialization is not ideal for real-time interactive AR applications. Furthermore, certain AR systems require the user to perform specific technical movements of the camera to acquire a series of images before the SLAM map can be accurately initialized to start tracking the camera pose.
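The unknown-scale problem noted above can be illustrated with a minimal sketch: if a scene and the camera position are both scaled by the same factor, the resulting image is unchanged, so images alone cannot fix the metric scale. The camera model (identity rotation, assumed pinhole intrinsics), the function names, and the idea of recovering scale from one known metric distance between two map points are assumptions made for the example, not a description of any existing alignment method.

```python
import math


def to_camera(point_world, camera_position):
    """Express a world point relative to a camera with identity rotation."""
    return tuple(p - c for p, c in zip(point_world, camera_position))


def project(point_cam, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point (assumed intrinsics)."""
    x, y, z = point_cam
    return (focal * x / z + cx, focal * y / z + cy)


def scaled(vec, s):
    """Scale a 3D vector by the factor s."""
    return tuple(s * v for v in vec)


def metric_scale(slam_a, slam_b, metric_distance):
    """Scale factor mapping SLAM units to metric units, given one known distance."""
    return metric_distance / math.dist(slam_a, slam_b)


# Hypothetical scene points and camera position in arbitrary SLAM units.
points = [(0.5, -0.2, 4.0), (-1.0, 0.3, 6.0)]
camera = (0.1, 0.0, 0.5)

for p in points:
    original = project(to_camera(p, camera))
    rescaled = project(to_camera(scaled(p, 3.7), scaled(camera, 3.7)))
    # Both projections agree up to floating-point rounding.
    print(original, rescaled)
```

Because every choice of scale factor yields the same pixels, some external metric reference, such as a map with metric scale, is needed before SLAM poses can be used for navigation or landmark labeling.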
Additionally, methods to align a captured image frame with a 2.5D or 3D map may be limited by the relatively poor accuracy of mobile sensors in estimating the camera pose. For example, an approach relying strictly on a Global Positioning System (GPS) to estimate the actual position and viewing direction of a user may be insufficient and may leave AR content floating around in the actual user view. Therefore, improved methods are desirable.