The task of finding correspondences between two images of the same scene or object is part of many computer vision applications. Camera calibration, 3D reconstruction (i.e. obtaining a 3D image from a series of 2D images which are not stereoscopically linked), image registration, and object recognition are just a few. The search for discrete image correspondences can be divided into three main steps. First, ‘interest points’ are selected at distinctive locations in the image. The most valuable property of an interest point detector is its repeatability, i.e. whether it reliably finds the same interest points under different viewing conditions. Next, the neighbourhood of every interest point is represented by a descriptor. This descriptor has to be distinctive and at the same time robust to noise, detection errors and geometric and photometric deformations. Finally, the descriptors are matched between different images. The matching is often based on a distance between the vectors, e.g. the Mahalanobis or Euclidean distance.
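As an illustration of the third step, the following minimal sketch matches descriptors by nearest neighbour under the Euclidean distance. The function names `euclidean` and `match_descriptors` and the toy three-dimensional descriptors are assumptions made for this example only. It uses plain Python and an exhaustive search, which is not how a practical system would be implemented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(desc_a, desc_b):
    """For each descriptor in desc_a, find its nearest neighbour in desc_b.

    Returns a list of (index_a, index_b, distance) tuples.
    Exhaustive search: cost grows with len(desc_a) * len(desc_b) * dimension.
    """
    matches = []
    for i, da in enumerate(desc_a):
        j, dist = min(
            ((j, euclidean(da, db)) for j, db in enumerate(desc_b)),
            key=lambda pair: pair[1],
        )
        matches.append((i, j, dist))
    return matches

# Toy descriptors (hypothetical values; real descriptors are much longer).
image_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_b = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]
matches = match_descriptors(image_a, image_b)
```

Because the cost of each distance computation grows with the descriptor dimension, the length of the descriptor directly affects the speed of this step.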
A wide variety of detectors and descriptors have already been proposed in the literature (e.g. [1-6]). Also, detailed comparisons and evaluations on benchmarking datasets have been performed [7-9].
The most widely used interest point detector is probably the Harris corner detector [10], proposed in 1988 and based on the eigenvalues of the second-moment matrix. However, Harris corners are not scale invariant. In [1], Lindeberg introduced the concept of automatic scale selection, which allows interest points to be detected in an image, each with its own characteristic scale. He experimented with both the determinant of the Hessian matrix and the Laplacian (which corresponds to the trace of the Hessian matrix) to detect blob-like structures. Mikolajczyk and Schmid refined this method, creating robust and scale-invariant feature detectors with high repeatability, which they named Harris-Laplace and Hessian-Laplace [11]. They used a (scale-adapted) Harris measure or the determinant of the Hessian matrix to select the location, and the Laplacian to select the scale. Focusing on speed, Lowe [12] proposed approximating the Laplacian of Gaussian (LoG) by a Difference of Gaussians (DoG) filter. Several other scale-invariant interest point detectors have been proposed. Examples are the salient region detector proposed by Kadir and Brady [13], which maximises the entropy within the region, and the edge-based region detector proposed by Jurie et al. [14]. These seem less amenable to acceleration, though. Several affine-invariant feature detectors that can cope with wider viewpoint changes have also been proposed.
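To make the two blob measures concrete, the sketch below evaluates the determinant and the trace (Laplacian) of the Hessian at a single pixel using finite differences. It is a simplified illustration under stated assumptions: a single scale, no Gaussian smoothing and no scale normalisation, all of which an actual detector would need. The helper names `gaussian_blob`, `hessian_at` and `blob_measures` are hypothetical.

```python
import math

def gaussian_blob(size, sigma):
    """Synthetic test image: a bright Gaussian blob centred in a size x size grid."""
    c = size // 2
    return [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def hessian_at(img, y, x):
    """Finite-difference second derivatives (Lxx, Lyy, Lxy) at interior pixel (y, x)."""
    lxx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
    lyy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
    lxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
           - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
    return lxx, lyy, lxy

def blob_measures(img, y, x):
    """Return (determinant of Hessian, trace of Hessian) at pixel (y, x)."""
    lxx, lyy, lxy = hessian_at(img, y, x)
    det = lxx * lyy - lxy ** 2
    trace = lxx + lyy  # the Laplacian
    return det, trace

img = gaussian_blob(size=9, sigma=2.0)
det_centre, trace_centre = blob_measures(img, 4, 4)   # blob centre
det_corner, trace_corner = blob_measures(img, 1, 1)   # off-centre pixel
```

At the centre of a bright blob both second derivatives are negative, so the determinant is positive and the trace is negative, whereas the off-centre pixel yields a smaller determinant.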
An even larger variety of feature descriptors has been proposed, such as Gaussian derivatives [16], moment invariants [17], complex features [18, 19], steerable filters [20], phase-based local features [21], and descriptors representing the distribution of smaller-scale features within the interest point neighbourhood. The latter, introduced by Lowe [2], have been shown to outperform the others [7]. This can be explained by the fact that they capture a substantial amount of information about the spatial intensity patterns, while at the same time being robust to small deformations or localisation errors. The descriptor in [2], called SIFT for short, computes a histogram of local oriented gradients around the interest point and stores the bins in a 128-dimensional vector (8 orientation bins for each of 4×4 location bins).
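The 4×4×8 layout can be sketched as follows. This is a deliberately simplified illustration and not the descriptor of [2]: it assumes a fixed 16×16 patch and omits the Gaussian weighting, interpolation between bins, orientation normalisation and value clipping that a full implementation would include. The function name `sift_like_descriptor` is hypothetical.

```python
import math

def sift_like_descriptor(patch):
    """Build a 128-element gradient-orientation histogram from a 16x16 patch.

    Layout: 4x4 location cells (each 4x4 pixels), 8 orientation bins per cell.
    """
    n = 16
    hist = [0.0] * (4 * 4 * 8)
    for y in range(n):
        for x in range(n):
            # Differences between neighbouring pixels, clamped at the patch border.
            dx = patch[y][min(x + 1, n - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, n - 1)][x] - patch[max(y - 1, 0)][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            o_bin = int(angle / (2 * math.pi) * 8) % 8   # 8 orientation bins
            cell = (y // 4) * 4 + (x // 4)               # 4x4 location bins
            hist[cell * 8 + o_bin] += magnitude
    # Normalise to unit length so that uniform contrast changes cancel out.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

# Horizontal intensity ramp: every gradient points along +x (orientation bin 0).
ramp = [[float(x) for x in range(16)] for _ in range(16)]
desc = sift_like_descriptor(ramp)
```

For the ramp image, all gradient energy falls into orientation bin 0 of each of the 16 location cells, so only 16 of the 128 entries are non-zero.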
Various refinements of this basic scheme have been proposed. Ke and Sukthankar [22] applied PCA to the gradient image. The resulting PCA-SIFT yields a 36-dimensional descriptor which is fast to match, but it proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk et al. [8], and its slower feature computation reduces the benefit of fast matching. In the same paper [8], the authors proposed a variant of SIFT, called GLOH, which proved to be even more distinctive with the same number of dimensions. However, GLOH is computationally more expensive, as it again uses PCA for data compression. The SIFT descriptor still seems the most appealing descriptor for practical use, and hence is also the most widely used nowadays. It is distinctive and relatively fast, which is crucial for on-line applications. Recently, Se et al. [4] implemented SIFT on a Field Programmable Gate Array (FPGA) and improved its speed by an order of magnitude. However, the high dimensionality of the descriptor is a drawback of SIFT at the matching step.
For on-line applications, each one of the three steps (detection, description, matching) has to be fast. Lowe proposed a best-bin-first alternative [2] in order to speed up the matching step, but this results in lower accuracy.