In robotic mapping, simultaneous localization and mapping (SLAM) is the computational problem of constructing or updating a map and/or a model of an unknown environment while simultaneously keeping track of an agent's location within the environment. While SLAM relates to the building of a map (mapping) and the use of the map (localization), a process associated with localization and a process associated with mapping need not actually be performed simultaneously for a system to perform SLAM. For example, the two procedures can be performed in a multiplexed fashion, i.e., interleaved in time.
In some applications, e.g., in urban or indoor environments, GPS or another position estimation system is not available, practical, or accurate enough for SLAM. To that end, some systems, additionally or alternatively to the usage of specialized position estimation systems, rely on other line-of-sight sensors, such as a camera, using a class of techniques named visual SLAM. Visual SLAM (VSLAM) uses visual sensor data or images as input to build a model of the environment, e.g., a point cloud representing the environment. For example, VSLAM uses line-of-sight sensors to acquire images of the surrounding environment and registers multiple such images into a consistent coordinate system, e.g., a global coordinate system, to form a model describing both the geometry and the appearance of the surrounding environment.
VSLAM estimates the six degrees-of-freedom (DOF) pose (location and orientation) of the sensor inside that coordinate system using images captured by the sensor. To that end, VSLAM relies on the ability to find correspondences of the same physical region observed in different images. However, VSLAM suffers from the large-baseline matching problem, i.e., a region observed from two widely separated viewpoints is frequently missed during the matching process, because the appearance of the same region can change significantly from one viewpoint to another.
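For illustration only, the role of a six-DOF pose in registering sensor measurements into a global coordinate system can be sketched as follows. This is a minimal sketch and not the method of any system described herein: the function names `rotation_from_euler` and `sensor_to_global`, the pose layout (x, y, z, roll, pitch, yaw), and the roll-pitch-yaw rotation convention are all assumptions introduced for this example.

```python
import math

def rotation_from_euler(roll, pitch, yaw):
    """Return the 3x3 rotation matrix Rz(yaw) * Ry(pitch) * Rx(roll).

    Angles are in radians. The Euler-angle convention is an assumption
    chosen for this sketch; other conventions are equally valid.
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def sensor_to_global(point, pose):
    """Map a 3D point from the sensor frame into the global frame.

    pose is a hypothetical 6-tuple (x, y, z, roll, pitch, yaw):
    three DOF of location plus three DOF of orientation.
    """
    x, y, z, roll, pitch, yaw = pose
    rotation = rotation_from_euler(roll, pitch, yaw)
    translation = (x, y, z)
    # Rotate the point into the global orientation, then translate it.
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

# Sensor located at (1, 2, 0) and rotated 90 degrees about the vertical axis.
pose = (1.0, 2.0, 0.0, 0.0, 0.0, math.pi / 2)
global_point = sensor_to_global((1.0, 0.0, 0.0), pose)
```

In this sketch, a point one unit ahead of the sensor maps to approximately (1, 3, 0) in the global frame. Points observed from different poses can be merged into one model only if each pose is estimated accurately, which is why reliable correspondences between images matter.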
Some methods address this problem by combining VSLAM techniques with separate pose estimation methods. For example, the method described in U.S. Pat. No. 7,162,338 uses motion sensors to estimate the pose of the robot carrying the camera. The usage of the motion sensors, although useful, is not always desirable.
Another method continuously tracks the pose of the sensor by taking multiple images while ensuring small pose variation between consecutive images, see, e.g., U.S. 20140126769. However, this method is expensive in both computation and memory, and can require the sensor to follow a laborious and complicated trajectory within an environment in order to construct its 3D model.
Accordingly, there is a need for VSLAM suitable for constructing a 3D model of the scene with a reduced number of images used for tracking the pose of the sensor. Given the same number of images, such VSLAM should achieve higher accuracy of 3D reconstruction and pose estimation, as well as a larger number of reconstructed 3D points.