Visual mapping systems rely on spatial features (also referred to as "visual features") detected in imagery captured by a mobile device, as well as inertial information, to determine the current position and orientation of the mobile device in a three-dimensional (3D) space. Typically, the position and orientation are determined in the context of a defined coordinate frame to facilitate functionality that requires synchronization to a known, fixed reference frame, such as virtual reality (VR) functionality, augmented reality (AR) functionality, or gaming or other device-enabled interactions between multiple mobile devices. Simultaneous localization and mapping (SLAM) techniques enable a mobile device to map a previously unmapped area while concurrently learning its position and orientation within that area. Thus, when the mobile device returns to the same area, it may readily determine its current position and orientation within that area through detection of previously-observed spatial features, a process known as "localization."

However, when the mobile device enters an area for the first time, it lacks these previously-detected localization cues. In conventional visual mapping systems, the mobile device must "learn" the area through implementation of the visual mapping process, a process that takes considerable time and resources. To avoid the delay involved in performing the visual mapping process for a previously-unmapped area, conventional visual mapping systems instead may revert to detecting the orientation or position of the mobile device based on non-visual orientation input, such as global positioning system (GPS) information or location mapping via inertial sensor feedback. However, these non-visual mapping solutions can be unreliable (e.g., due to poor GPS reception indoors or in areas surrounded by tall obstructions), imprecise, and prone to error due to sensor and measurement drift.
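The localization step described above, re-recognizing an area by matching currently-observed spatial features against features stored in a previously-built map, can be sketched at a very high level as follows. This is a minimal toy illustration only: the descriptor format, the distance threshold, and the crude position estimate (a mean of matched map points) are all invented for exposition, whereas practical SLAM systems use robust feature descriptors (e.g., ORB) and geometric pose solvers with outlier rejection.

```python
import math

# Toy sketch of feature-based localization (illustrative only).
# A previously-built map stores, per feature, a descriptor vector and a
# 3D position. Localization matches the current frame's descriptors
# against the map and, if enough previously-observed features are found,
# derives a (very rough) position hint; otherwise the device would have
# to fall back to mapping the area from scratch or to non-visual input.

def descriptor_distance(a, b):
    """Euclidean distance between two feature descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_features(observed, mapped, max_dist=0.5):
    """Greedily match each observed descriptor to its nearest map feature."""
    matches = []
    for obs_id, obs_desc in observed.items():
        best_id, best_d = None, max_dist
        for map_id, (map_desc, _pos) in mapped.items():
            d = descriptor_distance(obs_desc, map_desc)
            if d < best_d:
                best_id, best_d = map_id, d
        if best_id is not None:
            matches.append((obs_id, best_id))
    return matches

def localize(observed, mapped, min_matches=3):
    """Return a toy position estimate (mean of matched map points) when
    enough previously-observed features are re-detected, else None."""
    matches = match_features(observed, mapped)
    if len(matches) < min_matches:
        return None  # too few known features: area is effectively unmapped
    pts = [mapped[map_id][1] for _, map_id in matches]
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))

# Previously-built map: feature id -> (descriptor, 3D position).
world_map = {
    "f1": ((0.1, 0.9), (0.0, 0.0, 0.0)),
    "f2": ((0.8, 0.2), (1.0, 0.0, 0.0)),
    "f3": ((0.5, 0.5), (0.0, 1.0, 0.0)),
}
# Current camera frame re-observes the same features with descriptor noise.
frame = {"o1": (0.12, 0.88), "o2": (0.79, 0.21), "o3": (0.52, 0.48)}

print(localize(frame, world_map))
```

In an unmapped area, `localize` would return `None` for every frame, which corresponds to the situation where a conventional system either performs the full visual mapping process or falls back to GPS or inertial dead reckoning, with the reliability and drift problems noted above.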