1. Field of the Invention
This invention relates generally to image processing, and more specifically, to generating multi-view feature descriptors of scenes from a video stream for subsequent mapping and determining a location within the map after an appearance variation.
2. Description of Related Art
Computer vision is a giant step in computer intelligence that provides a myriad of new capabilities. For example, the ability of a computer to determine its current location while attached to a robot or other mobile vehicle allows the robot to autonomously interact with its environment. To update location during motion, some computers use odometry techniques to measure how far and in which direction the robot has traveled from a known location. However, such measurements are valid only through uninterrupted travel, and they drift significantly over time. A ‘kidnapped’ robot is moved from one position to another without any information about its new location. Because the robot is unable to reorient itself without any odometry information, it can no longer provide accurate localization. Thus, some computers use image processing techniques to recognize the new location from training data and thereby estimate position.
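The odometry drift described above can be illustrated with a minimal dead-reckoning sketch. It assumes a planar robot whose pose is integrated from per-step distance and turn measurements. The function name `dead_reckon` and the constant per-step `heading_bias` error are hypothetical, chosen only for illustration, and are not taken from any particular system.

```python
import math

def dead_reckon(start, steps, heading_bias=0.0):
    """Integrate (distance, turn) odometry steps from a known start pose.

    start: (x, y, heading in radians)
    steps: iterable of (distance, turn in radians)
    heading_bias: hypothetical constant heading error added at every step
    """
    x, y, theta = start
    for distance, turn in steps:
        theta += turn + heading_bias      # heading error accumulates
        x += distance * math.cos(theta)   # position inherits the error
        y += distance * math.sin(theta)
    return x, y, theta

def position_error(n_steps, bias):
    """Distance between the ideal and the biased position estimates."""
    steps = [(1.0, 0.0)] * n_steps        # drive straight, one unit per step
    ideal = dead_reckon((0.0, 0.0, 0.0), steps)
    biased = dead_reckon((0.0, 0.0, 0.0), steps, heading_bias=bias)
    return math.hypot(biased[0] - ideal[0], biased[1] - ideal[1])

bias = math.radians(0.5)                  # assumed 0.5 degree error per step
short_error = position_error(10, bias)
long_error = position_error(100, bias)
```

Under these assumptions the position error after 100 steps is far larger than after 10 steps, because each small heading error is carried into every later position update. Nothing in the integration can recover the pose if the robot is moved without measurement, which is the kidnapped-robot case.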
Problematically, conventional image processing techniques use a single view of a scene to gather training data. More particularly, these conventional systems use two-dimensional (2-D) images of a three-dimensional (3-D) scene during training to gather information for pattern matching during recognition. But the 3-D scene has different appearances in 2-D images depending on various factors such as the viewpoint from which the image is captured, illumination, occlusion, and the like. Consequently, a conventional system with training data of a scene with one appearance has difficulty in recognizing the same scene through an appearance variation. Even systems that allow some variability are limited to small baseline changes and will thus fail in response to wide baseline changes. Generally, small baseline changes are slight variations such as an offset of a few degrees or a slightly different scale, whereas wide baseline changes, in the extreme, can be a 180-degree variation or a doubling in size.
Unfortunately, conventional image processing techniques cannot support applications such as Simultaneous Localization and Mapping (SLAM) without accurate position information. A robot performs SLAM to build a map of unknown surroundings while determining its location within the surroundings. If position data is not available, the robot can no longer perform position-dependent interactive or autonomous actions such as navigation. Additionally, the robot cannot continue building a unified map of the surroundings.
Therefore, what is needed is a robust image processing system that uses multi-view feature descriptors for recognition in applications such as SLAM. Furthermore, the system should use video data already available during SLAM operations to generate the feature descriptors with sparse data.