1. Technical Field
The present invention relates to a method of providing a descriptor for at least one feature of an image and to a method of matching features of two or more images. Moreover, the invention relates to a computer program product comprising software code sections for implementing the method according to the invention.
2. Background Information
Many applications in the field of computer vision require finding corresponding points or other features in two or more images of the same scene or object under varying viewpoints, possibly with changes in illumination and capturing hardware used. The features can be points, or a set of points (lines, segments, regions in the image or simply a group of pixels). Example applications include narrow and wide-baseline stereo matching, camera pose estimation, image retrieval, object recognition, and visual search.
For example, Augmented Reality Systems permit the superposition of computer-generated virtual information with visual impressions of a real environment. To this end, the visual impressions of the real world, for example captured by a camera in one or more images, are mixed with virtual information, e.g., by means of a display device which displays the respective image augmented with the virtual information to a user. Spatial registration of virtual information and the real world requires the computation of the camera pose (position and orientation) that is usually based on feature correspondences.
A common way to obtain such correspondences, as described for example in David G. Lowe: “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 60, 2 (2004), pp. 91-110, is to first extract features or interest points (e.g. at edges, corners or local extrema) that have a high repeatability from the individual images. That is, the probability that the same sets of pixels corresponding to the same physical entities are extracted in different images is high. The second step is then to create a descriptor for each feature, based on the intensities of its neighborhood pixels, that enables the comparison and therefore matching of features. The two main requirements for a good descriptor are distinctiveness, i.e. different feature points result in different descriptors, and invariance to
1) changes in viewing direction, rotation and scale,
2) changes in illumination,
3) image noise.
This is to ensure that the same feature in different images will be described in a similar way with respect to a similarity measure. To address the invariance against rotation, a spatial normalization transforms the pixels of the local neighborhood around a feature point to a normalized coordinate system prior to the construction of the descriptor.
It is critical to the invariance that this normalization is reproducible. More advanced methods exist, but in the simplest case the normalization only consists of an in-plane rotation according to the feature orientation. The orientation is usually defined based on the pixel intensities in the neighborhood of a feature point, e.g. as the direction of the largest gradient. Ideally the pixels in the normalized neighborhood of a feature are identical for different images taken with varying viewing direction, rotation and scale. In practice, they are at least very similar, cf. FIG. 2.
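As an illustration of the orientation definition mentioned above, the following minimal sketch returns the direction of the largest gradient within a small neighborhood. The function name, the list-of-rows input format and the use of central differences are assumptions made for this example only; they are not taken from any cited method.

```python
import math

def dominant_orientation(patch):
    """Return the angle (in radians) of the largest-magnitude gradient.

    patch: list of rows of intensity values (hypothetical input format).
    Gradients are estimated with central differences on interior pixels;
    on ties, the first pixel in scan order wins.
    """
    best_mag, best_angle = -1.0, 0.0
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            mag = math.hypot(gx, gy)
            if mag > best_mag:
                best_mag, best_angle = mag, math.atan2(gy, gx)
    return best_angle
```

For a patch whose intensity increases only along the x-axis the function returns an angle of zero, and for a patch whose intensity increases only along the y-axis it returns an angle of a quarter turn. Practical detectors typically use more robust estimates, for example a histogram of gradient directions.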
In FIG. 2, there is shown an exemplary feature point in different scenes 21 and 22. In the first column showing the scenes 21 and 22, the same feature point under two different orientations is shown as feature point F21 in scene 21 and feature point F22 in scene 22. In a next step the orientation is defined based on the pixel intensities in the neighborhood of the respective feature point F21 and F22, in the present example as the direction of the largest gradient (depicted by the white line within the respective rectangle). Then, a spatial normalization transforms the pixels of the local neighborhood around feature points F21 and F22 (in the present case, the pixels within the rectangle) to a normalized coordinate system (depictions 31 and 32 in the second column) prior to the construction of the descriptors d1 and d2 (third column), respectively. As a result, alignment of the descriptors d1, d2 to the largest gradient results in a very similar normalized neighborhood (as shown in depictions 31 and 32) and, therefore, similar descriptors d1 and d2. This property is common among local feature descriptors and referred to as invariance to rotation. Invariance to scale is usually handled by constructing an image pyramid containing the image at different scales and performing the above on every scale level. Other approaches store the scale with every feature descriptor.
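The image pyramid mentioned above may be sketched as follows. This is a simplified illustration only: each level halves the resolution by averaging 2x2 pixel blocks, and the function names and the min_size parameter are hypothetical. Practical implementations usually smooth the image with a Gaussian filter before subsampling.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * y][2 * x] + image[2 * y][2 * x + 1]
              + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)]
            for y in range(h)]

def build_pyramid(image, min_size=4):
    """Return a list of images, from full resolution down to min_size."""
    levels = [image]
    # Keep halving while the next level would still be at least min_size wide and high.
    while min(len(levels[-1]), len(levels[-1][0])) >= 2 * min_size:
        levels.append(downsample(levels[-1]))
    return levels
```

Feature extraction, orientation assignment and descriptor creation would then be carried out on every level of the returned list.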
A variety of local feature descriptors exist; a good overview and comparison is given in Krystian Mikolajczyk and Cordelia Schmid, “A performance evaluation of local descriptors”, IEEE Transactions on Pattern Analysis & Machine Intelligence, 27, 10 (2005), pp. 1615-1630. Most of them are based on the creation of histograms of either intensity values of the normalized local neighborhood pixels or of functions of them, such as gradients. The final descriptor is expressed as an n-dimensional vector (as shown in FIG. 2 on the right) and can be compared to other descriptors using a similarity measure such as the Euclidean distance.
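By way of a minimal example, a histogram-based descriptor and its comparison by Euclidean distance could look as follows. The bin count, the intensity range and the function names are assumptions chosen for illustration; the sketch does not reproduce any particular descriptor from the cited literature.

```python
import math

def intensity_histogram(pixels, bins=8, max_value=256):
    """Build a normalized n-bin histogram (n = bins) of intensity values."""
    hist = [0.0] * bins
    for p in pixels:
        idx = min(int(p * bins / max_value), bins - 1)
        hist[idx] += 1.0
    total = sum(hist) or 1.0  # avoid division by zero for empty input
    return [h / total for h in hist]

def euclidean_distance(d1, d2):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))
```

Two neighborhoods containing the same intensities in a different spatial arrangement yield a distance of zero with this simple histogram, which is one reason practical descriptors typically combine several histograms computed over spatial sub-regions.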
In FIG. 3, there is shown a standard approach for creating a feature descriptor. In step S1, an image is captured by a capturing device, e.g. a camera, or loaded from a storage medium. In step S2, feature points are extracted from the image and stored in a 2-dimensional description (parameters u, v). In step S3, an orientation assignment is performed as described above with respect to FIG. 2, to add to the parameters u, v an orientation angle a. Thereafter, a neighborhood normalization step S4 is performed, as described above with respect to FIG. 2 to gain normalized neighborhood pixel intensities i[ ]. In the final step S5, a feature descriptor in the form of a descriptor vector d[ ] is created for the respective extracted feature as a function of the normalized neighborhood pixel intensities i[ ]. Approaches exist that may assign multiple orientation angles to a feature in step S3 and consequently carry out the steps S4 and S5 for each orientation resulting in one descriptor per assigned orientation.
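The steps S3 to S5 may be sketched as follows, using the parameter names u, v, a, i[ ] and d[ ] from the description of FIG. 3. All function names and the Feature class are hypothetical. The sketch makes several simplifying assumptions: the orientation is taken from the gradient at the feature pixel itself, the neighborhood is sampled with nearest-neighbor lookup, and the descriptor is merely the length-normalized intensity vector. The rotate90 helper is included only to demonstrate the behavior under rotation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Feature:
    u: int                                  # S2: column of the feature point
    v: int                                  # S2: row of the feature point
    a: float = 0.0                          # S3: orientation angle in radians
    i: list = field(default_factory=list)   # S4: normalized neighborhood intensities
    d: list = field(default_factory=list)   # S5: descriptor vector

def assign_orientation(image, f):
    # S3 (simplified assumption): gradient direction at the feature pixel.
    gx = image[f.v][f.u + 1] - image[f.v][f.u - 1]
    gy = image[f.v + 1][f.u] - image[f.v - 1][f.u]
    f.a = math.atan2(gy, gx)

def normalize_neighborhood(image, f, radius=2):
    # S4: sample a square grid rotated by the angle a around (u, v).
    c, s = math.cos(f.a), math.sin(f.a)
    f.i = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x = round(f.u + dx * c - dy * s)
            y = round(f.v + dx * s + dy * c)
            f.i.append(image[y][x])

def create_descriptor(f):
    # S5 (placeholder): intensity vector scaled to unit length.
    norm = math.sqrt(sum(p * p for p in f.i)) or 1.0
    f.d = [p / norm for p in f.i]

def describe(image, u, v):
    f = Feature(u, v)
    assign_orientation(image, f)
    normalize_neighborhood(image, f)
    create_descriptor(f)
    return f

def rotate90(image):
    # Rotate a square image by a quarter turn: pixel (x, y) moves to (n-1-y, x).
    n = len(image)
    return [[image[n - 1 - x][y] for x in range(n)] for y in range(n)]
```

Describing the same feature in an image and in its copy rotated by a quarter turn yields orientation angles that differ by a quarter turn and, as a consequence of the normalization in step S4, the same descriptor vector.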
A major limitation of the standard approaches as described above is that while invariance to rotation is clearly an important characteristic of local feature descriptors in many applications, it may however lead to mismatches when images contain multiple congruent or near-congruent features, as for instance the four corners of a symmetric window or individual dartboard sections.
In an example, as shown in FIG. 1, a real object 3, which is in the present example a building having a window 4, is captured by a mobile device 1 having a camera on the rear side (not shown). For instance, the mobile device 1 may be a mobile phone having a camera and an optical lens on the rear side for capturing an image of the window 4. On the display 6 of the mobile device 1, the window 4 is depicted as shown. An image processing method extracts features from the displayed image, for example the features F1 to F4 representing the four corners of the window that can be considered as prominent features of the window, and creates a feature descriptor for each of the features F1 to F4. Due to invariance to rotation, an ideal local feature descriptor would describe these features F1 to F4 in exactly the same way, making them indistinguishable, as schematically illustrated in the left column of FIG. 1 by the extracted features F1 to F4 depicted in a normalized coordinate system.
In a real world setting with camera noise and aliasing, the descriptors will not be identical but very similar and therefore virtually indistinguishable. Consequently, the probability of mismatches is very high for such scenes, which may result in a complete failure of any system relying upon such local feature descriptors.
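The ambiguity can be reproduced with a small synthetic example. The sketch below assumes a noise-free image of a bright square window on a dark background and uses a stand-in for rotation normalization: of the four quarter-turn rotations of a patch, the smallest one in tuple ordering is taken as the canonical form. The image size, the corner coordinates and the assignment of the labels F1 to F4 to particular corners are hypothetical.

```python
def make_window_image(size=16, x0=4, y0=4, x1=11, y1=11):
    """Bright window (255) on a dark background (0)."""
    return [[255 if x0 <= x <= x1 and y0 <= y <= y1 else 0
             for x in range(size)]
            for y in range(size)]

def patch(image, u, v, r=2):
    """Upright (2r+1) x (2r+1) neighborhood around the pixel (u, v)."""
    return tuple(tuple(image[v + dy][u + dx] for dx in range(-r, r + 1))
                 for dy in range(-r, r + 1))

def rot90(p):
    """Rotate a square patch by a quarter turn."""
    n = len(p)
    return tuple(tuple(p[n - 1 - x][y] for x in range(n)) for y in range(n))

def rotation_normalized(p):
    """Stand-in for orientation normalization: canonical quarter-turn rotation."""
    candidates = [p]
    for _ in range(3):
        candidates.append(rot90(candidates[-1]))
    return min(candidates)

image = make_window_image()
corners = {"F1": (4, 4), "F2": (11, 4), "F3": (11, 11), "F4": (4, 11)}
raw = {name: patch(image, u, v) for name, (u, v) in corners.items()}
normalized = {name: rotation_normalized(p) for name, p in raw.items()}
```

In this example the four upright patches in raw are pairwise different, whereas all four entries of normalized are identical, so that any descriptor computed from the normalized neighborhoods cannot distinguish the four corners.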
A variety of approaches exist that assume all camera images to be taken in an upright orientation and therefore do not need to deal with the orientation. Here congruent or near-congruent features in different orientations can easily be distinguished from each other, but the field of possible applications is very limited since the camera orientation is heavily constrained.
Therefore, it would be beneficial to have a method of providing a descriptor for at least one feature of an image, wherein the descriptor is provided in a way that the probability of mismatches due to congruent or near-congruent features in different orientations on a static object or scene in a feature matching process may be reduced without constraining the orientation or movement of the capturing device or without needing prior knowledge on the orientation or movement of the capturing device.