A variety of applications, both civilian and military, rely on navigation systems, such as personal location and route planning assistance, autonomous robot navigation, unknown environment map building, naval and aeronautic tactical fighting systems, land surveys, etc. Unfortunately, many existing navigation systems do not function very well under certain circumstances. For example, GPS (Global Positioning System) is widely used in many of the aforementioned applications. In certain circumstances, however, GPS cannot work reliably if the satellite signals upon which it is based are blocked or unavailable, as often occurs in indoor environments, in forests, and in urban areas. Even when it works well, GPS can only provide the location of the user, which is usually not sufficient to assist the user during navigation. For example, when a group of warfighters is performing a military task in an unknown environment, in addition to a warfighter needing to know where each of the other warfighters is located, it would be desirable to see what each of the other warfighters is seeing to foster better cooperation and coordination.
A vision-based navigation system can meet these challenges. Specifically, a vision-based navigation system does not require expensive equipment and can independently estimate 3D position and orientation (pose) accurately by using image streams captured from one, two or more inexpensive video cameras. Vision-based navigation systems can be integrated with GPS and inertial measurement unit (IMU) systems to robustly recover both the location and the 3D gaze or orientation of a user under a wide range of environments and situations. In addition, detailed information and imagery of an environment can be recorded in real time. The imagery can be shared and analyzed to assist the user in reporting what is seen.
A variety of efforts have been made to build a navigation system using vision approaches in the past few decades. In most approaches using computer vision techniques, a set of stationary feature points in the scene is tracked over a sequence of images. The position and orientation change of a camera is determined using the image locations of the tracked feature points. The motion estimation can be done using monocular, binocular (stereo) or multi-camera configurations. In stereo or multi-camera configurations, the 3D location of scene points can be estimated from the binocular disparity of the feature points. The estimated 3D point locations may then be used to solve for the motion of the camera by a 3D/2D motion estimation. In a monocular configuration, both the relative motion of the camera and the 3D locations are estimated simultaneously. The latter technique has problems with stability; therefore, visual odometry systems based on stereo have been favored over monocular-based visual odometry systems.
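As an illustration of how binocular disparity yields a 3D point location, the following is a minimal sketch for a rectified stereo pair under the standard pinhole model. The function name, the parameter names (focal_px, baseline_m, cx, cy) and the numeric values are assumptions chosen for illustration and are not taken from any particular system described here.

```python
def triangulate_rectified(u_left, v, u_right, focal_px, baseline_m, cx, cy):
    """Return (X, Y, Z) in the left-camera frame for one rectified stereo match.

    Assumes both images are rectified, so the matched feature lies on the
    same row v in the left and right images (hypothetical setup).
    """
    disparity = u_left - u_right  # horizontal shift in pixels
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity  # depth is inversely proportional to disparity
    x = (u_left - cx) * z / focal_px       # back-project through the left camera
    y = (v - cy) * z / focal_px
    return x, y, z


# Illustrative values only: 500-pixel focal length, 10 cm baseline.
point = triangulate_rectified(370.0, 240.0, 360.0,
                              focal_px=500.0, baseline_m=0.1, cx=320.0, cy=240.0)
```

Because depth is inversely proportional to disparity, a fixed sub-pixel error in the feature position produces a larger depth error for distant points, which is one reason inaccurate feature positions matter in the 3D/2D motion estimation that follows.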
A visual odometry system can often drift over time due to errors associated with stereo calibration or image quantization, poor-quality images, inaccurate feature positions, outliers, and instability of motion estimation using noisy correspondences. Most existing stereo-based visual odometry systems compute the pose between each pair of image frames separately, which is referred to as the frame-by-frame approach. Compared to traditional frame-by-frame approaches, experiments have shown up to a 27.7% reduction in navigation error when multi-frame tracking is performed. However, prior art multi-frame tracking approaches lack a metric to stop tracking when the tracked feature points become insufficient for pose estimation in terms of either quantity or spatial distribution.
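To make the notion of a stopping metric concrete, the following is a hypothetical sketch of a check that tests tracked feature points for both quantity and spatial distribution. It is not the metric of any existing system; the function name, the grid-based coverage test and all thresholds are assumptions chosen for illustration.

```python
def tracking_is_sufficient(points, width, height, min_count=20, grid=4, min_cells=6):
    """Return True if tracked points look adequate for pose estimation.

    points: list of (u, v) pixel coordinates assumed to lie inside the image.
    Quantity is checked against min_count; spatial distribution is checked by
    counting how many cells of a grid x grid partition contain a point.
    All thresholds are hypothetical.
    """
    if len(points) < min_count:
        return False  # too few features remain
    occupied = set()
    for u, v in points:
        col = min(int(u * grid / width), grid - 1)   # clamp points on the right edge
        row = min(int(v * grid / height), grid - 1)  # clamp points on the bottom edge
        occupied.add((row, col))
    return len(occupied) >= min_cells  # features must be spread over the image


# 30 points clustered in one corner versus 32 points spread over the image.
clustered = [(10.0 + i, 10.0) for i in range(30)]
spread = [(u, v) for u in (80, 240, 400, 560) for v in (60, 180, 300, 420)] * 2
```

With these inputs, the clustered set fails the check despite having enough points, which is the situation in which continuing to track would likely yield a poorly constrained pose.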
Most of the stereo-based visual odometry systems in the prior art estimate pose from established 3D/2D feature correspondences. Since the 3D coordinates of each feature point are reconstructed using stereo-based triangulation, error introduced during 3D reconstruction needs to be minimized. However, in some stereo-based visual odometry systems, no stereo geometric constraints are utilized during stereo matching to reduce the search region, and a large number of false stereo matches result.
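The effect of a stereo geometric constraint can be sketched as follows. For a rectified stereo pair, the epipolar constraint requires a true match to lie on nearly the same image row in both views and to have a positive, bounded disparity. The function name, the tolerance and the disparity limit are assumptions chosen for illustration, not values from any system described here.

```python
def filter_by_epipolar_constraint(candidates, row_tol=1.0, max_disparity=64.0):
    """Keep only candidate matches consistent with rectified stereo geometry.

    candidates: list of ((u_left, v_left), (u_right, v_right)) pixel pairs.
    row_tol and max_disparity are hypothetical thresholds in pixels.
    """
    kept = []
    for (u_l, v_l), (u_r, v_r) in candidates:
        same_row = abs(v_l - v_r) <= row_tol  # epipolar lines are image rows
        disparity = u_l - u_r                 # must be positive and bounded
        if same_row and 0.0 < disparity <= max_disparity:
            kept.append(((u_l, v_l), (u_r, v_r)))
    return kept


candidates = [
    ((100.0, 50.0), (90.0, 50.4)),   # consistent: same row, disparity 10
    ((100.0, 50.0), (95.0, 70.0)),   # rejected: rows differ by 20 pixels
    ((100.0, 50.0), (120.0, 50.0)),  # rejected: negative disparity
    ((300.0, 50.0), (100.0, 50.0)),  # rejected: disparity exceeds the limit
]
matches = filter_by_epipolar_constraint(candidates)
```

In practice the same constraint is usually applied before matching, by restricting the search for each left-image feature to a short segment of the corresponding row in the right image, which reduces both computation and the number of false matches.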
Scenes to be monitored frequently contain moving objects such as walking persons, moving vehicles, waving trees, etc. If features on moving objects are selected during pose estimation, these features can negatively affect the accuracy of the resulting pose unless they are detected and discarded as outliers. In the opposite scenario, accuracy may degrade due to a lack of features in the scene. For instance, the field-of-view of the cameras may be occupied by a textureless surface on which feature detection is largely inhibited.
Accordingly, what would be desirable, but has not yet been provided, is a stereo-based visual odometry method for obtaining a pose of an object from a sequence of images which overcomes many of the problems described above in the prior art.