The use of computer vision in autonomous robotics has been studied for decades. Recently, applications such as autonomous vehicle navigation [5], 3D localization and mapping [17, 6, 3] and object recognition [16] have gained popularity, likely due to the increase in available processing power, new algorithms with real-time performance and advances in high-quality, low-cost digital cameras. These factors contribute to the ability of autonomous robots to perform complex, real-time tasks using visual sensors.
Such applications are often based on a local feature (also called “interest point”) matching algorithm, which finds feature correspondences between two images. In recent years, research on algorithms that use local, invariant features has grown considerably (for a survey see [23, 19]). These features are usually invariant to image scale and rotation, and robust to changes in illumination, noise and minor changes in viewpoint. In addition, they are usually distinctive and easy to match against a large database of local features.
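To make the matching step concrete, the correspondence search described above is commonly implemented as a nearest-neighbour search over descriptor vectors, with an ambiguity check that compares the best and second-best candidates (Lowe's ratio test). The sketch below is illustrative, not any particular paper's implementation; the descriptor arrays and the 0.8 ratio threshold are conventional example values.

```python
import numpy as np

def match_features(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    A match (i, j) is kept only if the nearest neighbour is clearly
    closer than the second nearest (Lowe's ratio test), which rejects
    ambiguous correspondences when matching against a large database
    of local features.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # Euclidean distances
        j, k = np.argsort(dists)[:2]                # two nearest neighbours
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches
```

In practice the descriptors would come from a detector/descriptor pipeline such as SIFT or SURF, and the linear scan would be replaced by an approximate nearest-neighbour index for large databases.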
Image matching using local features has been in use for almost three decades. The term “interest point” was first introduced by Moravec in 1979 [20], who later proposed the use of a corner detector for stereo matching [21]. The Moravec detector was improved by Harris and Stephens [10], who used it for efficient motion tracking and 3D structure from motion recovery [9]. The Harris corner detector has since been used widely for many other image matching tasks.
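For reference, the Harris response can be sketched directly from its definition R = det(M) − k·trace(M)², where M is the 2×2 structure tensor of image gradients summed over a local window. The window size and k = 0.04 below are conventional choices, and the box-filter window is a simplification of the Gaussian weighting used in the original formulation.

```python
import numpy as np

def harris_response(img, k=0.04, win=3):
    """Per-pixel Harris corner response R = det(M) - k * trace(M)^2,
    where M sums the gradient products over a win x win neighbourhood.
    Large positive R indicates a corner; negative R indicates an edge."""
    img = img.astype(float)
    iy, ix = np.gradient(img)                  # image gradients
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box(a):
        # Sum over the local window (box filter via shifted additions).
        pad = win // 2
        p = np.pad(a, pad)
        out = np.zeros_like(a)
        for dy in range(win):
            for dx in range(win):
                out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out

    sxx, syy, sxy = box(ixx), box(iyy), box(ixy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace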
Although extensively used, the Harris corner detector is claimed to be very sensitive to changes in image scale, so it does not provide a good basis for matching images of different sizes. There are many works that deal with representations that are stable under scale change, dating back to 1983 when Crowley and Parker [4] developed a representation that identified peaks and ridges in scale space and linked these into a tree structure which could be matched between images of different scales. More recently, Lindeberg conducted a comprehensive study of this problem [14] and suggested a systematic approach to feature detection with automatic scale selection [15].
A decade ago, Lowe [16] introduced the Scale Invariant Feature Transform (SIFT), which had a significant impact on the popularity of local features. SIFT descriptors are invariant to a substantial range of affine distortion, changes in a 3D viewpoint, noise and illumination differences. Robust matching is possible between different views of an object or a scene, in the presence of clutter and occlusion. Since when SIFT was published, several new algorithms inspired by SIFT have emerged, including PCA-SIFT [12], GLOH [18] and SURF [1].
SURF (Speeded-Up Robust Feature) [1], which is incorporated herein by reference in its entirety, is a state of the art algorithm for local invariant feature matching—a scale and rotation invariant interest point detector and descriptor. SURF is composed of three major steps, similar to SIFT, but uses faster feature detection/extraction algorithms. SURF is known to be faster to compute than SIFT, while allowing for comparable results.
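A key source of SURF's speed is its use of box filters evaluated on an integral image, so that the sum over any rectangular region costs four array look-ups regardless of filter size. The minimal sketch below shows only this building block (the function names are illustrative), not the full detector.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, padded with a zero row and
    column so that box_sum needs no boundary checks."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four integral-image look-ups.
    The cost is constant in the box size, which is the basis of SURF's
    fast box-filter approximation of Gaussian derivative filters."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

With the integral image precomputed once per frame, SURF can evaluate its Hessian-based detector at any scale without resampling the image, which is why increasing the filter size carries essentially no extra cost.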