1. Technical Field
The present invention is directed to a method of matching image features with reference features, comprising the steps of providing a current image captured by a capturing device, providing reference features, detecting at least one feature in the current image in a feature detection process, and matching the detected feature with at least one of the reference features.
2. Background Information
Many tasks in the processing of images taken by a camera, such as in augmented reality applications and computer vision, require finding points or features in multiple images of the same object or scene that correspond to the same physical 3D surface. A common approach, e.g. as in SIFT disclosed in David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vision 60, 2 (November 2004), 91-110, 2004 (“Lowe”), is to first detect features in an image with a method that has a high repeatability. This means that the probability is high that the part of an image corresponding to the same physical 3D surface is chosen as a feature for different viewpoints, different rotations and different illumination settings. Features are usually extracted in scale space, i.e. at different scales. Therefore, each feature has a repeatable scale in addition to its two-dimensional position. In addition, a repeatable orientation (rotation) is computed from the intensities of the pixels in a region around the feature, e.g. as the dominant direction of intensity gradients.
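As an illustration of the orientation step, the following is a minimal sketch of estimating the dominant direction of intensity gradients in a region around a feature. It is not the procedure of Lowe: the function name dominant_orientation, the central-difference gradients, and the choice of 36 histogram bins are assumptions made for this example only.

```python
import math

def dominant_orientation(patch, num_bins=36):
    """Return the dominant gradient direction of a grayscale patch, in radians.

    patch is a list of rows of intensities (hypothetical input format).
    The result is the center of the strongest orientation-histogram bin.
    """
    rows, cols = len(patch), len(patch[0])
    histogram = [0.0] * num_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the intensity gradient.
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            bin_index = int(angle / (2.0 * math.pi) * num_bins) % num_bins
            # Each pixel votes for its direction, weighted by gradient strength.
            histogram[bin_index] += magnitude
    best_bin = max(range(num_bins), key=histogram.__getitem__)
    return (best_bin + 0.5) * 2.0 * math.pi / num_bins

# A patch whose intensity increases from top to bottom has gradients along +y.
vertical_ramp = [[float(y) for _ in range(8)] for y in range(8)]
orientation = dominant_orientation(vertical_ramp)
```

Because the orientation is derived from the pixel intensities themselves, rotating the image rotates the estimated direction by the same amount, which is what makes it usable for normalizing the descriptor.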
Finally, to enable comparison and matching of features, a feature descriptor is needed. Common approaches use the computed scale and orientation of the feature to transform the coordinates of the feature descriptor, which provides invariance to rotation and scale. The descriptor is for instance an n-dimensional real-numbered vector, which is usually constructed by concatenating histograms of functions of local image intensities, such as gradients as disclosed in Lowe.
Given a current feature, detected in and described from a current intensity image, an important task is to find a feature that corresponds to the same physical surface in a set of provided features that will be referred to as reference features. A naïve approach would find the nearest neighbor of the current feature's descriptor by means of exhaustive search and choose the corresponding reference feature as the match. More advanced approaches employ spatial data structures in the descriptor domain to speed up matching. Unfortunately, there is no known method for exact nearest neighbor search in high-dimensional spaces that is significantly faster than exhaustive search. That is why common approaches use approximate nearest neighbor search instead, e.g. enabled by space partitioning data structures such as kd-trees as disclosed in Lowe.
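The naïve approach can be sketched as follows. This is an illustrative example only: the function name match_exhaustive, the use of Euclidean distance, and the two-dimensional toy descriptors are assumptions, since real descriptors typically have many more dimensions.

```python
import math

def match_exhaustive(current_descriptor, reference_descriptors):
    """Return (index, distance) of the reference descriptor nearest to the current one."""
    best_index, best_distance = -1, math.inf
    for index, reference in enumerate(reference_descriptors):
        # Every reference descriptor is compared, so the cost grows
        # linearly with the number of reference features.
        distance = math.dist(current_descriptor, reference)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance

# Hypothetical toy descriptors.
reference_descriptors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
match_index, match_distance = match_exhaustive([0.9, 1.2], reference_descriptors)
```

The loop visits every reference descriptor once per current feature, which is why the matching time increases with the size of the reference set.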
Limitations of the Standard Approaches:
With an increasing number of reference features, the time to match a single current feature goes up, making real-time processing impossible at some point. Also, the distinctiveness of feature descriptors decreases with the overall number of reference features. While the first problem can be addressed with optimized data structures enabling fast approximate nearest neighbor search up to a certain extent, the second problem cannot be solved without incorporating any further information.
Already Proposed Solutions:
Gerhard Reitmayr and Tom W. Drummond. Initialisation for Visual Tracking in Urban Environments. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR '07). IEEE Computer Society, Washington, D.C., USA, 1-9, 2007 (“Reitmayr”) describe an outdoor Augmented Reality system that relies on visual tracking. To initialize the visual tracking, i.e. to find the position and orientation of the camera with respect to the world without any knowledge from a prior frame, they use GPS to gain a coarse position of the device. Given this position, they try to initialize visual tracking with a constrained camera position at a number of position samples around the rough GPS measure until initialization succeeds.
Gerhard Schall, Daniel Wagner, Gerhard Reitmayr, Elise Taichmann, Manfred Wieser, Dieter Schmalstieg, and Bernhard Hofmann-Wellenhof. Global pose estimation using multi-sensor fusion for outdoor Augmented Reality. In Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality, 2009 (“Schall”) combine a differential GPS/IMU hardware module with barometric height measurements in a Kalman filter to improve the accuracy of the user's 3D position estimate. This filtered inertial tracking is again combined with a drift-free visual panorama tracker that allows for online learning of natural features. The method does not use any (offline learned) reference features.
Different approaches exist that are based on a set of geo-referenced local image features acting as reference features. The assumption of these approaches is that if the position of the capturing device is approximately known, e.g. by GPS, only those reference features are possibly visible that are located in the vicinity of the capturing device. Some examples of this class of approaches are described in the following.
A. Kumar, J.-P. Tardif, R. Anati, and K. Daniilidis. Experiments on visual loop closing using vocabulary trees. In Computer Vision and Pattern Recognition (CVPR) Workshops, June 2008. Anchorage, Ak. (“Kumar”) use GPS positioning to narrow down the search area to the vicinity of the capturing device and use a set of pre-built vocabulary trees to find the best matching image in this search area.
Similarly, D. Chen, G. Baatz, K. Koeser, S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk. City-scale landmark identification on mobile devices. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2011 (“Chen”) use priors on the device position by GPS to improve feature matching. They discuss using one small vocabulary tree for every spatial region, as disclosed in Kumar, and compare this with an approach using one global vocabulary tree and incorporating the GPS position as a prior in the feature match scoring process. They come to the conclusion that the second approach would provide better results.
Clemens Arth, Daniel Wagner, Manfred Klopschitz, Arnold Irschara, and Dieter Schmalstieg. Wide area localization on mobile phones. In Proceedings of the 2009 8th IEEE International Symposium on Mixed and Augmented Reality (ISMAR '09). IEEE Computer Society, Washington, D.C., USA, 73-82, 2009 (“Arth”) use potentially visible sets (PVS) and thereby not only consider spatial vicinity of features but also visibility constraints. Even though they do not use GPS in their indoor experiments, they explain how GPS could be used in outdoor applications to determine the rough position of the capturing device which could then be used to retrieve the potentially visible sets of reference features for this position.
Gabriele Bleser and Didier Stricker. Advanced tracking through efficient image processing and visual-inertial sensor fusion. Computers & Graphics, Vol. 33, Pages 59-72, Elsevier, New York, 2/2009 present a visual-inertial tracking method that applies inertial sensors to measure the relative movement of the camera from the prior frame to the current frame. This knowledge is used to predict the camera position and thereby define a 2D search space in the image for features that are tracked from frame to frame. As they use measurements of relative camera transformations only, their technique is not suited for the initialization of camera pose tracking.
While Schall does not use any reference feature or any model of the environment, the method described in Reitmayr does, but does not limit the search space of reference features based on the measured GPS position. However, Kumar, Chen, and Arth do so. Based on the position of the capturing device, they decide which reference features might be visible and which are most likely not. This decision is taken on a per-image level, meaning the same subset of reference features is considered a possible match for all current features in a current camera image.
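The per-image narrowing shared by these approaches can be sketched as follows. This is a simplified illustration, not the method of any cited work: the function name features_in_vicinity, the dictionary layout of a reference feature, the local metric coordinate frame, and the fixed radius are all assumptions made for this example.

```python
import math

def features_in_vicinity(device_position, reference_features, radius_m):
    """Keep reference features whose position lies within radius_m of the device."""
    return [
        feature for feature in reference_features
        if math.dist(device_position, feature["position"]) <= radius_m
    ]

# Hypothetical geo-referenced features, positions in a local metric frame (meters).
reference_features = [
    {"id": "window_a", "position": (10.0, 5.0)},
    {"id": "window_b", "position": (12.0, 5.0)},
    {"id": "far_sign", "position": (900.0, 40.0)},
]

# The subset is computed once per image from the coarse device position and
# is then used as the candidate set for every current feature in that image.
candidates = features_in_vicinity((0.0, 0.0), reference_features, radius_m=100.0)
```

In this toy example the distant feature is excluded, while both nearby features remain candidates for every current feature in the image.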
Modern handheld devices provide much more sensor data than the device position alone. For instance, digital compasses measure the orientation of the device with respect to north, and inertial sensors provide the device orientation with respect to gravity. Also, the 2D position of a feature in the camera image and its depth, if available, contain useful information in some applications.
As set out above, known prior art approaches narrow the search space to features in the vicinity of the capturing device and thereby avoid matches with far away features. However, particularly in urban environments, similar looking features, which tend to cause mismatches, are often located close to each other and may both be visible at the same time. Examples include the windows of a building façade, which usually all look very similar. Many similar looking windows are therefore likely to be visible in the camera image at the same time. This makes any global approach, i.e. one narrowing the search space on a per-image level as set out above, unsuitable to prevent such mismatches.
It is an object of the present invention to provide a method of matching image features with reference features which is capable of improving real-time processing while maintaining the distinctiveness of feature descriptors even with an increasing overall number of reference features.