1. Technical Field
The invention is related to a method of matching image features with reference features, comprising the steps of providing a current image captured by a capturing device, providing reference features, wherein each of the reference features comprises at least one reference feature descriptor, determining current features in the current image and associating with each of the current features at least one respective current feature descriptor, and matching the current features with the reference features by determining a respective similarity measure between each respective current feature descriptor and each respective reference feature descriptor. The invention is also concerned with an integrated circuit for matching of image features with reference features.
2. Background Information
Standard Approaches, Limitations and Existing Solutions:
Many tasks in processing of images taken by a camera, such as in augmented reality applications and computer vision, require finding points or features in multiple images of the same object or scene that correspond to the same physical 3D surface. For example, in augmented reality, the main problem is to determine the position and orientation of the camera with respect to the world (camera pose).
The standard approach to initialization of optical tracking (i.e. when no knowledge from a previous frame is available) can be divided into three main building blocks: feature detection, feature description and feature matching (see FIG. 1). As the skilled person will understand, the absence of knowledge from a previous frame does not mean that knowledge from non-optical sensors, such as GPS or a compass, is excluded. Feature detection is also referred to as feature extraction.
At first, feature detection is performed for identifying features in an image by means of a method that has a high repeatability. In other words, the probability is high that the method will choose the part in an image corresponding to the same physical 3D surface as a feature for different viewpoints, different rotations and/or illumination settings (e.g. local feature descriptors such as SIFT [1], shape descriptors [18] or other approaches known to the skilled person). Features are usually extracted in scale space, i.e. at different scales. Therefore, each feature has a repeatable scale in addition to its two-dimensional position. In addition, a repeatable orientation (rotation) is computed from the intensities of the pixels in a region around the feature, e.g. as the dominant direction of intensity gradients.
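By way of illustration only, the computation of a repeatable orientation as the dominant direction of intensity gradients may be sketched as follows. The function name dominant_orientation, the use of central differences, and the choice of 36 histogram bins are assumptions made for this sketch and are not taken from any of the cited references.

```python
import math


def dominant_orientation(patch, num_bins=36):
    """Return the dominant gradient direction of a patch in degrees, in [0, 360).

    `patch` is a list of rows of pixel intensities around the feature.
    Angles follow image coordinates (x to the right, y downwards).
    The bin count is an illustrative assumption.
    """
    hist = [0.0] * num_bins
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central-difference intensity gradient at (x, y).
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            # Each pixel votes for its direction, weighted by gradient magnitude.
            hist[int(angle / 360.0 * num_bins) % num_bins] += magnitude
    best_bin = max(range(num_bins), key=hist.__getitem__)
    # Report the center of the strongest bin.
    return (best_bin + 0.5) * 360.0 / num_bins
```

The result is quantized to the bin width (10 degrees with the assumed 36 bins); practical detectors typically refine the peak further, which is omitted here.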
Next, a feature descriptor is determined in order to enable the comparison and matching of features. Common approaches use the computed scale and orientation of the feature to transform the coordinates of the feature descriptor, which provides invariance to rotation and scale. For instance, the descriptor may be an n-dimensional real-numbered vector, which is constructed by concatenating histograms of functions of local image intensities, such as gradients (as in Lowe, David G. “Distinctive Image Features from Scale-Invariant Keypoints.” International Journal of Computer Vision 60.2 (2004): 91-110. (“Lowe”)).
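A strongly simplified sketch of such a descriptor is given below for illustration. It concatenates magnitude-weighted gradient orientation histograms computed over a grid of cells and normalizes the result. The function name describe_patch, the 2x2 grid and the 8 bins are assumptions of this sketch. Unlike the approach of Lowe, the sketch only rotates the gradient angles by the feature orientation and does not rotate or rescale the sampling grid itself.

```python
import math


def describe_patch(patch, orientation_deg=0.0, grid=2, bins=8):
    """Build a normalized descriptor from a square patch of pixel intensities.

    The descriptor concatenates grid*grid orientation histograms with
    `bins` bins each. Grid size and bin count are illustrative assumptions.
    """
    size = len(patch)
    cell = (size - 2) / grid  # interior pixels per cell along one axis
    descriptor = [0.0] * (grid * grid * bins)
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Express the gradient angle relative to the feature orientation,
            # which provides (approximate) invariance to rotation.
            angle = (math.degrees(math.atan2(gy, gx)) - orientation_deg) % 360.0
            b = int(angle / 360.0 * bins) % bins
            cy = min(int((y - 1) / cell), grid - 1)
            cx = min(int((x - 1) / cell), grid - 1)
            descriptor[(cy * grid + cx) * bins + b] += magnitude
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0.0 else descriptor
```

With the assumed defaults the descriptor is a 32-dimensional real-numbered vector of unit length whenever the patch contains any gradient.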
Finally, an important task is the feature matching. Given a current feature detected in and described from a current intensity image, the goal is to find a feature that corresponds to the same physical 3D surface in a set of provided features that will be referred to as reference features. The simplest approach to feature matching is to find the nearest neighbor of the current feature's descriptor by means of exhaustive search and choose the corresponding reference feature as match. More advanced approaches employ spatial data structures in the descriptor domain to speed up matching. Unfortunately, there is no known method for nearest neighbor search in high-dimensional spaces that is significantly faster than exhaustive search. That is why common approaches use approximate nearest neighbor search instead, e.g. enabled by space partitioning data structures such as kd-trees (see, Lowe).
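For illustration, the exhaustive nearest neighbor search described above may be sketched as follows. The function name nearest_reference and the choice of the Euclidean distance are assumptions of this sketch; other distance measures may equally be used.

```python
import math


def nearest_reference(current_descriptor, reference_descriptors):
    """Exhaustively search for the reference descriptor closest to the current one.

    Returns (index, distance) using the Euclidean distance, or
    (-1, inf) if no reference descriptors are provided.
    """
    best_index, best_sq_dist = -1, math.inf
    for index, reference in enumerate(reference_descriptors):
        # Squared distance suffices for comparison; the root is taken once at the end.
        sq_dist = sum((a - b) ** 2 for a, b in zip(current_descriptor, reference))
        if sq_dist < best_sq_dist:
            best_index, best_sq_dist = index, sq_dist
    return best_index, math.sqrt(best_sq_dist)
```

The cost of one query grows linearly with both the number of reference features and the descriptor dimension, which is the reason approximate search structures are commonly preferred for large reference sets.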
FIG. 1 (in connection with FIG. 2) shows a flow chart of a standard method to match a set of current features with a set of reference features. In step S11, a current image CI taken with a capturing device is provided. The next step S12 then detects and describes features in the current image CI (optional: already selective extraction according to estimated model-feature-positions), where every resulting current feature c has a feature descriptor d(c) and a 2D position in the camera image CI. Possible methods that could be used for feature detection and description are explained in more detail below referring to exemplary implementations. A set of reference features r, each with a descriptor d(r) and a (partial) position and/or orientation in a global coordinate system, is provided in step S13. The reference features can be extracted from reference images or 3D models or other information about the object. Please note that the position and/or orientation in a global coordinate system is optional in case of visual search and classification tasks. In step S14, the current features c from step S12 and the reference features r from step S13 are matched. For example, for every current feature, the reference feature is searched that has the closest descriptor to the descriptor of the current feature with respect to a certain distance measure. According to step S15, an application uses the feature matches, e.g. in order to estimate the position and orientation of the capturing device very accurately in an augmented reality application that integrates spatially registered virtual 3D objects into the camera image.
Limitations of the Standard Approaches:
Flexibility is important in order to initialize tracking successfully in different environments. The features described in Lowe, for example, work very well in textured environments. In environments with little texture or in cases where the texture changes (e.g. the appearance of a car finish changes strongly depending on its environment and the camera position), features as in Lowe have major difficulties. Features as described in Bosch, A, Andrew Zisserman, and X Munoz. “Representing shape with a spatial pyramid kernel” Image Processing 5 (2007): 401-408 (“Bosch”) work better in non-textured environments. Therefore, feature detection and feature description algorithms are frequently adapted and changed in order to better suit a specific task.
With a growing number of reference features, the time to match a single current feature increases, making real-time processing impossible at some point due to limitations of the hardware. Also, the distinctiveness of feature descriptors decreases with a growing number of reference features, which in turn limits the matching quality and significantly affects the robustness.
Already Proposed Solutions:
Different approaches exist that are based on a set of geo-referenced local image features acting as reference features. The assumption of these approaches is that if the position of the capturing device is approximately known, only those reference features are possibly visible that are located in the vicinity of the capturing device. In other words, the methods aim to reduce the number of potential matches among the reference features. For example, methods were proposed that use sensor data, e.g. GPS positioning, to narrow down the search area and a set of pre-built vocabulary trees for every spatial region to find the best matching image in this search region (see, Kumar, Ankita et al. “Experiments on visual loop closing using vocabulary trees.” 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops 0 (2008): 1-8 (“Kumar”); Chen, David M et al. “City-scale landmark identification on mobile devices”. 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2011) (“Chen”)).
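The general idea of restricting the reference features to those in the vicinity of the capturing device may be illustrated by the following sketch. It is not the method of Kumar or Chen, which rely on vocabulary trees. The function names, the representation of a reference feature as a dictionary with hypothetical "lat" and "lon" entries, and the use of the haversine great-circle distance with a fixed radius are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def features_in_vicinity(reference_features, device_lat, device_lon, radius_m):
    """Keep only reference features within radius_m of the approximate device position.

    Each reference feature is assumed (hypothetically) to be a dict with
    'lat' and 'lon' entries in degrees.
    """
    return [
        feature
        for feature in reference_features
        if haversine_m(device_lat, device_lon, feature["lat"], feature["lon"]) <= radius_m
    ]
```

Only the features returned by such a pre-selection would then be passed to descriptor matching, which reduces the number of potential matches. The radius should be chosen with the expected inaccuracy of the position sensor in mind.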
In addition, better results were achieved when using a single global vocabulary tree and incorporating the GPS position as a prior in the feature match scoring process (see, Chen). The method of Reitmayr, G. and T. W. Drummond. “Initialisation for Visual Tracking in Urban Environments.” 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (2007): 1-9 (“Reitmayr”) uses GPS to gain a coarse position of the device for the initialization of a visual tracking in an outdoor augmented reality system. Given this position, initialization with a constrained camera position at a number of position samples around the rough GPS measurement is performed until initialization succeeds. In another approach, the combination of a differential GPS/IMU hardware module with barometric height measurements in a Kalman filter is used in order to improve the accuracy of the device's 3D position estimate (see, Schall, Gerhard et al. “Global pose estimation using multi-sensor fusion for outdoor Augmented Reality.” 2009 8th IEEE International Symposium on Mixed and Augmented Reality (2009): 153-162 (“Schall”)). The method of Arth, Clemens et al. “Wide area localization on mobile phones” 2009 8th IEEE International Symposium on Mixed and Augmented Reality (2009): 73-82 (“Arth”) uses potentially visible sets (PVS) and thereby considers not only the spatial vicinity of features but also visibility constraints. Coarse positioning with GPS is mentioned for retrieval of PVS in an outdoor application.
The visual inertial tracking method of Bleser, Gabriele, and Didier Stricker. “Advanced tracking through efficient image processing and visual-inertial sensor fusion.” Computers & Graphics 33.1 (2009): 59-72 (“Bleser”) applies inertial sensors to measure the relative movement of the camera from the prior frame to the current frame. This knowledge is used for predicting the position and defining a 2D search space in the image space for features that are tracked from frame to frame. Since the technique uses measurements of relative camera transformations only, it is not suited for the initialization of camera pose tracking or visual search tasks.
None of the above-mentioned methods suggests increasing speed and performance by accelerating the vision algorithms in hardware.
Typical approaches considering hardware acceleration of vision algorithms optimize feature detection and feature description, whereas the feature matching remains implemented in software. For example, in Yao, Lifan et al. “An architecture of optimised SIFT feature detection for an FPGA implementation of an image matcher.” 2009 International Conference on Field-Programmable Technology (2009): 30-37 (“Yao”) the matching stays in software, while the feature detection and description are optimized on hardware: “It can be seen from Section II that the optimised SIFT algorithm for an image matcher consists of five stages: 1) Gaussian pyramid construction. 2) DoG space construction and feature identification. 3) Gradient and orientation histogram generation. 4) Feature descriptor generation. 5) Image matching. Considering the nature of Xilinx FPGA embedded system, a top level system partition which is similar to Bonato, Vanderlei, Eduardo Marques, and George A Constantinides. “A Parallel Hardware Architecture for Scale and Rotation Invariant Feature Detection” IEEE Transactions on Circuits and Systems for Video Technology 18.12 (2008): 1703-1712 (“Bonato”) has been adopted for the FPGA implementation. More specifically, the first three stages are implemented as a hardware core named as SIFT feature detection module, whereas the last two stages are considered to be implemented as a software module named as SIFT feature generation and image matching module using Xilinx MicroBlaze software processor”. The same applies to Zhang, Jing et al. “Overview of approaches for accelerating scale invariant feature detection algorithm.” 2011 International Conference on Electric Information and Control Engineering (2011): 585-589 (“Zhang”). Please note that Bonato talks about the problem of high matching processing time (referred to as association) but does not propose to build a specific hardware block to solve this. Instead, they propose to accelerate their software solution by running it on faster general purpose processors.
Another example of accelerating the image processing by means of hardware acceleration is disclosed in Smith, Ross, Wayne Piekarski, and Grant Wigley. “Hand Tracking For Low Powered Mobile AR User Interfaces” Proceedings of the Sixth Australasian conference on User interface Volume 40 (1999): 7-16 (“Smith”). The authors discuss that not all of the many vision tracking algorithms may be easily implemented on an FPGA. They limit themselves explicitly to techniques that do not require complex floating point calculations in an effort to minimize the area used on the FPGA. They chose to accelerate image processing: “There are many different vision tracking algorithms but not all of them can be easily implemented on an FPGA. We have avoided using techniques that require complex floating point calculations in an effort to minimize the area used on the FPGA. We found that segmentation could be performed using very few gates.”
Therefore, it would be beneficial to provide a method which enables a higher performance and higher algorithmic flexibility at reduced processing and power requirements while performing visual computing tasks. Particularly, the method should not only enable a faster matching process but also improve matching quality by taking advantage of additional hints.