Systems and methods that track the pose of a camera while simultaneously reconstructing the 3D structure of a scene are widely used in augmented reality (AR) visualization, robotic navigation, scene modeling, and computer vision applications. Such a process is commonly referred to as simultaneous localization and mapping; (SLAM). Real-time SLAM systems can use conventional cameras that acquire a two-dimensional (2D) image, depth cameras that acquire a three-dimensional (3D) point cloud (a set of 3D points), or red, green, blue and depth (RGB-D) cameras that acquire both a 2D image and a 3D point cloud, such as Kinect®. Tracking refers to a process that uses a predicted motion of a camera for sequentially estimating the pose of the camera, while relocalization refers to a process that uses some feature-based global registration for recovering from tracking failures.
SLAM systems using a 2D camera are generally successful for textured scenes, but are likely to fail for textureless regions. Systems using a depth camera rely on geometric variations in the scene, such as curved surfaces and depth boundaries with the help of iterative-closest point (ICP) methods. However. ICP-based systems often fail when the geometric variations are small, such as in planar scenes. Systems using an RGB-D camera can exploit both texture and geometric features, but they still require distinctive textures.
Many methods do not clearly address the difficulty in reconstructing 3D models that are larger than a single room. To extend those methods to larger scenes, better memory management techniques are required. However, memory limitation is not the only challenge. Typically, room-scale scenes have many objects that have both texture and geometric features. To extend to larger scenes, one needs to track the camera pose in regions, such as corridors, with limited texture and insufficient geometric variations.
Camera Tracking
Systems that use 3D sensors to acquire 3D point clouds reduce the tracking problem to a registration problem given some 3D correspondences. The ICP method locates point-to-point or point-to-plane correspondences iteratively, starting from an initial pose estimate given by camera motion prediction. ICP has been widely used for line-scan 3D sensors in mobile robotics, also known as scan matching, as well as for depth cameras and 3D sensors producing full 3D point clouds. U.S. 20120194516 uses point-to-plane correspondences with the ICP method for pose tracking of the Kinect® camera. That representation of a map is a set of voxels. Each voxel represents a truncated signed distance function for the distance to a closest surface point. That method does not extract planes from 3D point clouds; instead, the point-to-plane correspondences are established by determining the normal of a 3D point using a local neighborhood. Such ICP-based methods require scenes to have sufficient geometric variations for accurate registration.
Another method extracts features from RGB images and performs descriptor-based point matching to determine point-to-point correspondences and estimate the camera pose, which is then refined with the ICP method. That method uses texture (RGB) and geometric (depth) features in the scene. However, handling textureless regions and regions with repetitive textures using only point features is still problematic.
SLAM Using Planes
Plane features have been used in several SLAM systems. To determine the camera pose, at least three planes whose normals span 3  are required. Thus, using only planes causes many degeneracy issues especially when the field of view (FOV) or range of the sensor is small such as in Kinect®. A combination of a large FOV line-scan 3D sensor and a small field-of-view (FOV) depth camera can avoid the degeneracy with an additional system cost.
The method described in the related Application uses a point-plane SLAM, which uses both points and planes to avoid the failure modes that are common in methods using one of these primitives. That system does not use any camera motion prediction. Instead, that system performs relocalization for all the frames by locating point and plane correspondences globally. As a result, that system can only process about three frames per second and encounters failures with some repetitive textures due to descriptor-based point matching.
The method described in the related Application also presents registering 3D data in different coordinate systems using both point-to-point and plane-to-plane correspondences.