Finding point correspondences among images of the same object is important for image retrieval, object recognition, scene identification, and 3D shape estimation. Points of interest in an image for the purpose of image retrieval, object recognition and the like are called key points. The key points have to be selected and processed such that they are invariant to image scale and rotation and provide robust matching across a substantial range of distortions, change in 3D viewpoint, noise and change in illumination. Further, in order to be well suited for tasks such as image retrieval and object recognition, the key points have to be distinctive in the sense that a single feature can be correctly matched with high probability against a large database of features from many images.
After the points of interest, or key points, are detected and located, they are described using various descriptors. Then, the individual features corresponding to the key points and represented by the descriptors are matched to a database of features from known objects. Therefore, a correspondence searching system can be separated into three modules: interest point detector, image point descriptor, and correspondence locator. Among these three modules, the descriptor's construction complexity and dimensionality have a direct and significant impact on the performance of the system as a whole (e.g., constructing the SIFT descriptor takes about ¾ of the total feature extraction time). The discussion that follows focuses on a method for developing a descriptor vector of a key point neighborhood.
Several image point descriptors have been proposed in the literature. Scale-invariant feature transform (SIFT) is one type of algorithm used in computer vision for detecting and describing local features in images. Speeded-up robust features (SURF) is another type of algorithm used for detecting and describing local features in images. Applications of SIFT and SURF include object recognition and 3D reconstruction. The literature also includes comparisons and evaluations of these image point descriptors. According to these comparisons, SIFT and SURF provide similar distinctiveness, while SURF is faster and SIFT has fewer damaging artifacts for wide-baseline image matching. For SIFT, distinctiveness of descriptors is measured by summing the eigenvalues of the descriptors. The sum corresponds to the amount of variance captured by different descriptors and, therefore, to their distinctiveness.
FIG. 1 shows a flowchart of a method for constructing a SIFT descriptor.
This flowchart summarizes the SIFT feature computation. The method begins at 1000. At 1001, an input image is received.
At 1002, the input image is gradually Gaussian-blurred to construct a Gaussian pyramid. Gaussian blurring generally involves convolving the original image $I(x, y)$ with the Gaussian blur function $G(x, y, k_i\sigma)$ at scale $k_i\sigma$ such that the Gaussian-blurred function $L(x, y, k_i\sigma)$ is defined as $L(x, y, k_i\sigma) = G(x, y, k_i\sigma) * I(x, y)$. Here, $k_i\sigma$ denotes the standard deviation of the Gaussian function that is used for blurring the image. As $k_i$ is varied, the standard deviation $k_i\sigma$ varies and a gradual blurring is obtained. The standard deviation of the first blur function is denoted by $\sigma$, and the $k_i$ are multipliers that change the standard deviation. When the initial image $I$ is incrementally convolved with Gaussians $G$ to produce the blurred images, the blurred images $L$ are separated by a constant multiplicative factor in the scale space.
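The blurring at 1002 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method of any particular SIFT implementation: the function names are hypothetical, the multipliers are taken as $k_i = 2^{i/s}$ for $s$ intervals per octave (one common way to obtain a fixed number of images per doubling of the standard deviation), the kernel is truncated at three standard deviations, and image borders are clamped. Images are nested lists indexed as `image[y][x]`.

```python
import math

def octave_scales(sigma, intervals):
    """Standard deviations k_i * sigma for one octave.

    Assumes k_i = 2 ** (i / intervals), so the last scale is 2 * sigma.
    """
    return [sigma * 2.0 ** (i / intervals) for i in range(intervals + 1)]

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled, normalized 1D Gaussian (radius defaults to 3 sigma, an assumption)."""
    if radius is None:
        radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma):
    """Compute L = G * I with a separable convolution; borders are clamped."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(v, size):
        return min(max(v, 0), size - 1)

    # Horizontal pass.
    rows = [[sum(kernel[j + radius] * image[y][clamp(x + j, width)]
                 for j in range(-radius, radius + 1))
             for x in range(width)]
            for y in range(height)]
    # Vertical pass on the horizontally blurred rows.
    return [[sum(kernel[j + radius] * rows[clamp(y + j, height)][x]
                 for j in range(-radius, radius + 1))
             for x in range(width)]
            for y in range(height)]
```

Calling `blur(image, s)` for each `s` in `octave_scales(sigma, intervals)` would yield the Gaussian-blurred images of one octave under these assumptions.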
At 1003, a difference of Gaussian (DoG) pyramid is constructed by computing the difference of any two consecutive Gaussian-blurred images in the Gaussian pyramid. Thus, in the DoG space, $D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_{i-1}\sigma)$. A DoG image $D(x, y, \sigma)$ is the difference between the Gaussian-blurred images at scales $k_i\sigma$ and $k_{i-1}\sigma$. The scale of $D(x, y, \sigma)$ lies somewhere between $k_i\sigma$ and $k_{i-1}\sigma$, and as the number of Gaussian-blurred images increases and the approximation provided for the Gaussian pyramid approaches a continuous space, the two scales converge to one scale. The convolved images $L$ are grouped by octave, where an octave corresponds to a doubling of the value of the standard deviation, $\sigma$. Moreover, the values of the multipliers $k_i$ are selected such that a fixed number of convolved images $L$ are obtained per octave. Then, the DoG images $D$ are obtained from adjacent Gaussian-blurred images $L$ per octave. After each octave, the Gaussian image is down-sampled by a factor of 2 and then the process is repeated.
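As a rough sketch of 1003, the following Python subtracts consecutive blurred images of one octave and down-samples by keeping every second pixel. The function names are hypothetical, the input is assumed to be a list of equally sized images (nested lists indexed as `image[y][x]`) ordered by increasing scale, and plain decimation is an assumed choice of down-sampling.

```python
def difference_of_gaussians(blurred):
    """Return the DoG images of one octave.

    blurred[i] is the image blurred at the i-th scale; the i-th result
    is blurred[i + 1] - blurred[i], computed pixel by pixel.
    """
    return [[[hi - lo for hi, lo in zip(row_hi, row_lo)]
             for row_hi, row_lo in zip(upper, lower)]
            for lower, upper in zip(blurred, blurred[1:])]

def downsample(image):
    """Halve the resolution by keeping every second row and column."""
    return [row[::2] for row in image[::2]]
```

An octave with n blurred images yields n - 1 DoG images, which is why extrema detection needs at least three DoG images per octave.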
At 1004, local maxima and local minima in the DoG space are found and the locations of these maxima and minima are used as key-point locations in the DoG space. Finding the local maxima and minima is achieved by comparing each pixel in the DoG images D to its eight neighbors at the same scale and to the nine neighboring pixels in each of the neighboring scales on the two sides, for a total of 26 pixels (9×2+8=26). If the pixel value is a maximum or a minimum among all 26 compared pixels, then it is selected as a key point. After this stage, the key points may be further processed such that their location is identified more accurately, and some of the key points, such as the low-contrast key points and edge key points, may be discarded.
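The 26-neighbor comparison at 1004 can be illustrated as follows. This is a sketch only: the function names are hypothetical, `dog` is assumed to be a list of equally sized DoG images of one octave indexed as `dog[s][y][x]`, a strict comparison is assumed so that flat regions produce no key points, and the later refinement and rejection of low-contrast and edge key points is omitted.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly above or below all 26 neighbors."""
    value = dog[s][y][x]
    neighbors = [dog[s + ds][y + dy][x + dx]
                 for ds in (-1, 0, 1)
                 for dy in (-1, 0, 1)
                 for dx in (-1, 0, 1)
                 if (ds, dy, dx) != (0, 0, 0)]
    return value > max(neighbors) or value < min(neighbors)

def find_extrema(dog):
    """Scan interior pixels of interior scales and collect (s, y, x) candidates."""
    keypoints = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_extremum(dog, s, y, x):
                    keypoints.append((s, y, x))
    return keypoints
```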
At 1005, each key point is assigned one or more orientations, or directions, based on the directions of the local image gradient. By assigning a consistent orientation to each key point based on local image properties, the key point descriptor can be represented relative to this orientation and therefore achieve invariance to image rotation. The magnitude and direction calculations are performed for every pixel in the neighboring region around the key point in the Gaussian-blurred image $L$ and at the key-point scale. The magnitude of the gradient for a key point located at $(x, y)$ is shown as $m(x, y)$ and the orientation or direction of the gradient for the key point at $(x, y)$ is shown as $\theta(x, y)$. The scale of the key point is used to select the Gaussian smoothed image, $L$, with the closest scale to the scale of the key point, so that all computations are performed in a scale-invariant manner. For each image sample, $L(x, y)$, at this scale, the gradient magnitude, $m(x, y)$, and orientation, $\theta(x, y)$, are computed using pixel differences according to: $m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$. The direction $\theta(x, y)$ is calculated as $\theta(x, y) = \arctan\left[(L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y))\right]$. Here, $L(x, y)$ is a sample of the Gaussian-blurred image $L(x, y, \sigma)$, at scale $\sigma$ which is also the scale of the key point.
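The pixel-difference formulas above translate directly into code. The sketch below uses a hypothetical function name and indexes the blurred image as `L[y][x]`. One deliberate substitution is made: `math.atan2` is used in place of a plain arctangent of the ratio, because it handles a zero horizontal difference and resolves the full 360 degree range, which the orientation histogram requires.

```python
import math

def gradient(L, x, y):
    """Gradient magnitude and orientation (degrees in [0, 360)) at pixel (x, y).

    L is a Gaussian-blurred image indexed as L[y][x]; (x, y) must not lie
    on the image border because central differences are used.
    """
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.sqrt(dx * dx + dy * dy)
    # atan2 replaces arctan(dy / dx): no division by zero, full quadrant info.
    theta = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, theta
```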
In practice, the gradients are calculated consistently either for the plane in the Gaussian pyramid that lies above, at a higher scale than, the plane of the key point in the DoG space or for a plane of the Gaussian pyramid that lies below, at a lower scale than, the key point. Either way, for each key point, the gradients are all calculated at the same scale in a rectangular area surrounding the key point. Moreover, the frequency of an image signal is reflected in the scale of the Gaussian-blurred image. Yet, SIFT simply uses gradient values at all pixels in the rectangular area. A rectangular block is defined around the key point; sub-blocks are defined within the block; samples are defined within the sub-blocks; and this structure remains the same for all key points even when the scales of the key points are different. Therefore, while the frequency of an image signal changes with successive application of Gaussian smoothing filters in the same octave, the key points identified at different scales are sampled with the same number of samples irrespective of the change in the frequency of the image signal, which is represented by the scale.
At 1006, the distribution of the Gaussian-weighted gradients is computed for each block, where each block is 2 sub-blocks by 2 sub-blocks for a total of 4 sub-blocks (in practice, SIFT has to use 4 sub-blocks by 4 sub-blocks for a total of 16 sub-blocks to achieve the desired distinctiveness). To compute the distribution of the Gaussian-weighted gradients, an orientation histogram with several bins is formed with each bin covering a part of the area around the key point. The orientation histogram may have 36 bins each covering 10 degrees of the 360 degree range of orientations. Alternatively, the histogram may have 8 bins each covering 45 degrees of the 360 degree range.
Each sample added to the histogram is weighted by its gradient magnitude within a Gaussian-weighted circular window with a standard deviation that is 1.5 times the scale of the key point. Peaks in the orientation histogram correspond to dominant directions of local gradients. The highest peak in the histogram is detected and then any other local peak that is within a certain percentage, such as 80%, of the highest peak is used to also create a key point with that orientation. Therefore, for locations with multiple peaks of similar magnitude, there will be multiple key points created at the same location and scale but different orientations.
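The histogram construction and the 80% peak rule can be sketched as below. The function names and the sample format are hypothetical: each sample is assumed to be a tuple `(dx, dy, magnitude, theta)`, where `(dx, dy)` is the pixel offset from the key point and `theta` is in degrees. For simplicity the sketch reports the center of each qualifying bin as the orientation; refining the peak position by interpolation is not shown.

```python
import math

def orientation_histogram(samples, key_scale, bins=36):
    """Accumulate Gaussian-weighted gradient magnitudes into orientation bins.

    The Gaussian window has a standard deviation of 1.5 times the key point scale.
    """
    sigma = 1.5 * key_scale
    width = 360.0 / bins
    hist = [0.0] * bins
    for dx, dy, magnitude, theta in samples:
        weight = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
        index = int((theta % 360.0) // width) % bins
        hist[index] += weight * magnitude
    return hist

def dominant_orientations(hist, ratio=0.8):
    """Return bin-center angles of local peaks within `ratio` of the highest peak."""
    bins = len(hist)
    width = 360.0 / bins
    peak = max(hist)
    if peak <= 0.0:
        return []
    orientations = []
    for i, value in enumerate(hist):
        left = hist[i - 1]             # wraps around: the histogram is circular
        right = hist[(i + 1) % bins]
        if value >= ratio * peak and value > left and value > right:
            orientations.append((i + 0.5) * width)
    return orientations
```

Each angle returned by `dominant_orientations` would correspond to a separate key point at the same location and scale.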
At 1007, the histograms from the sub-blocks are concatenated to obtain a feature descriptor vector for the key point. If the gradients in 8-bin histograms from 16 sub-blocks are used, a 128 dimensional feature descriptor vector results. At 1008, the method ends.
In one example, the feature descriptor is computed as a set of orientation histograms on (4×4) blocks in the neighborhood of the key point. Histograms contain 8 bins each, and each descriptor contains a 4×4=16 array of 8-bin histograms around the key point. This leads to a SIFT feature vector with (4×4)×8=128 elements. This vector is normalized to enhance invariance to changes in illumination.
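The arithmetic (4×4)×8=128 can be made concrete with a short sketch. It is a simplified illustration, not a complete SIFT descriptor: the function name is hypothetical, the inputs are assumed to be precomputed gradient magnitudes and orientations (in degrees, already expressed relative to the key point orientation) over a square window indexed as `[y][x]`, the Gaussian window and any interpolation between bins are left out, and normalization is assumed to mean scaling the vector to unit Euclidean length.

```python
import math

def build_descriptor(magnitudes, thetas, grid=4, cell=4, bins=8):
    """Concatenate per-sub-block orientation histograms into one vector.

    The window is (grid * cell) pixels on each side; the result has
    grid * grid * bins elements (128 for the defaults).
    """
    width = 360.0 / bins
    vector = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0.0] * bins
            for y in range(gy * cell, (gy + 1) * cell):
                for x in range(gx * cell, (gx + 1) * cell):
                    index = int((thetas[y][x] % 360.0) // width) % bins
                    hist[index] += magnitudes[y][x]
            vector.extend(hist)
    # Assumed normalization: unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0.0 else vector
```

With `grid=2` and an 8×8 window the same function produces the (2×2)×8=32 element variant.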
The dimension of the descriptor, i.e. 128, in SIFT is high. However, descriptors with lower dimensions have not performed as well across the range of matching tasks. Longer descriptors continue to perform better but not by much and there is an additional danger of increased sensitivity to distortion and occlusion.
FIG. 2 shows a schematic depiction of constructing a SIFT descriptor.
The steps of the flowchart of FIG. 1 are shown schematically in FIG. 2. For example, the blurring of the image to construct a Gaussian pyramid (1002) and the differencing (1003) are shown in the top left corner, proceeding to computing key points by locating the local maxima and minima (1004) in the top right corner. The calculation of the gradient vectors (1005) is shown in the bottom left corner. The computation of the gradient distribution (1006) in histograms is shown in the bottom right corner. Finally, the feature descriptor vector that is a concatenation (1007) of the histograms is also shown in the bottom right corner.
In FIG. 2, the key point 200 is located at a center of the rectangular block 202 that surrounds the key point 200.
The gradients that are pre-computed for each level of the pyramid are shown as small arrows at each sample location 206 at the bottom left (1005). As shown, 4×4 regions of samples 206 form a sub-block 204 and 2×2 regions of sub-blocks form the block 202. The block 202 is also called a descriptor window. The Gaussian weighting function is shown with the circle 220 and is used to assign a weight to the magnitude of each sample point 206. The weight in the circular window 220 falls off smoothly. The purpose of the Gaussian window 220 is to avoid sudden changes in the descriptor with small changes in position of the window and to give less emphasis to gradients that are far from the center of the descriptor. A 2×2=4 array of orientation histograms is obtained from the 2×2 sub-blocks, with 8 orientation bins in each histogram, resulting in a (2×2)×8=32 dimensional feature descriptor vector. However, other studies have shown that using a 4×4 array of histograms with 8 orientations in each histogram (8-bin histograms), resulting in a (4×4)×8=128 dimensional feature descriptor vector for each key point, yields a better result.
The feature descriptor vector may be subsequently further modified to achieve invariance to other variables such as illumination.