Three-dimensional pose estimation, or localization in a three-dimensional scene, of a sensor has many useful applications, such in identifying a component or replacement part of an assembled product, or in augmented reality scenarios. Localization may be performed using global positioning systems (GPS) (e.g., mostly limited to outdoor use), or using beacon-based systems with sensors mounted in the three-dimensional scene (e.g., leading to installation and maintenance costs). GPS and beacon-based systems may not provide a viewing direction with the location, and suffer from positioning inaccuracies, typically making the resulting location information not accurate enough to be useful for many applications.
Localization may also be performed using simultaneous localization and mapping (SLAM). SLAM was developed for robotic mapping where a scene is mapped by the robot while keeping track of the location of the robot within the scene. SLAM is a time intensive process and is challenged by dynamic scenes (e.g., scenes containing humans or other non-static objects, and/or scenes with strong appearance changes due to dynamic elements, illumination, weather, or seasonal variability), which often render SLAM mapping algorithms inoperable. SLAM methods also balance location accuracy with computationally complexity, often sacrificing accuracy to allow the algorithms to run on mobile devices with low latency.
Other traditional pose estimation techniques are based on machine vision and are performed using measurements from a single-view (e.g., depth images recorded with a 2.5D sensing device), based on the principles of stereo vision, structured light sensors, or time-of-flight sensors. Recorded measurements are processed by machine vision solutions, such as using machine learning algorithms, to deduce a pose within the three-dimensional scene. Neural networks, such as convolutional neural networks (CNNs), may be used to handle many dimensions of data. Traditional pose estimation techniques are limited by the computationally expensive nature of processing depth data. To avoid prohibitive delays in a user-operated system, systems are restricted to processing single depth measurements.