1. Technical Field
The invention is related to a method of providing a set of feature descriptors configured to be used in matching at least one feature of an object in an image of a camera, and a corresponding computer program product for performing the method.
2. Background Information
Such a method may be used, among other applications, in a method of determining the position and orientation of a camera with respect to an object. A common approach to determining the position and orientation of a camera with respect to an object with a known geometry and visual appearance uses 2D-3D correspondences gained by means of local feature descriptors, such as SIFT, described in D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal on Computer Vision, 60(2):91-110, 2004. In an offline step, one or more views of the object are used as reference images. Given these images, local features are detected and then described, resulting in a set of reference feature descriptors with known 3D positions. For a live camera image, the same procedure is performed to gain current feature descriptors with 2D image coordinates. A similarity measure, such as the reciprocal of the Euclidean distance of the descriptors, can be used to determine the similarity of two features. Matching the current feature descriptors with the set of reference descriptors results in 2D-3D correspondences between the current camera image and the reference object. The camera pose with respect to the object is then determined based on these correspondences and can be used in Augmented Reality applications to overlay virtual 3D content registered with the real object. Note that, analogously, the position and orientation of the object can be determined with respect to the camera coordinate system.
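The matching step described above can be illustrated with a minimal sketch. It assumes descriptors are plain lists of floats and uses exhaustive nearest-neighbor search; the function names `similarity` and `match_2d_3d` and the data layout are hypothetical and chosen only for illustration. The subsequent pose estimation from the resulting correspondences is not shown.

```python
import math

def similarity(a, b):
    """Reciprocal of the Euclidean distance between two descriptors."""
    d = math.dist(a, b)
    # Identical descriptors have zero distance; treat them as maximally similar.
    return math.inf if d == 0.0 else 1.0 / d

def match_2d_3d(current, references):
    """Pair each current feature with its most similar reference feature.

    current:    list of (descriptor, (u, v)) from the live camera image
    references: list of (descriptor, (x, y, z)) from the offline step
    Returns a list of ((u, v), (x, y, z)) correspondences.
    """
    correspondences = []
    for desc, point_2d in current:
        # Exhaustive search over all reference descriptors (illustrative only).
        best_desc, best_point_3d = max(
            references, key=lambda ref: similarity(desc, ref[0])
        )
        correspondences.append((point_2d, best_point_3d))
    return correspondences
```

In practice, a threshold on the similarity or a ratio test would typically be added to reject ambiguous matches; that is omitted here for brevity.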
Commonly, both feature detectors and feature description methods need to be invariant to changes in the viewpoint up to a certain extent. Affine-invariant feature detectors that estimate an affine transformation to normalize the neighborhood of a feature exist, as described in K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine region detectors. Int. Journal Computer Vision, 65:43-72, 2005, but they are currently too expensive for real-time applications on mobile devices. Instead, usually only a uniform scale factor and an in-plane rotation are estimated, resulting in true invariance to these two transformations only. The feature description methods then use the determined scale and orientation of a feature to normalize the support region before computing the descriptor. Invariance to out-of-plane rotations, however, is usually fairly limited and is the responsibility of the description method itself.
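The normalization of the support region by scale and in-plane rotation can be sketched as follows. This is a simplified illustration, assuming a grayscale image stored as a list of rows, nearest-neighbor sampling, and clamping at the image border; the function name `normalized_patch` and its parameters are hypothetical.

```python
import math

def normalized_patch(image, x, y, scale, angle, size=9):
    """Sample a size x size patch around (x, y), compensating for the
    feature's scale and in-plane orientation (angle in radians)."""
    h, w = len(image), len(image[0])
    c, s = math.cos(angle), math.sin(angle)
    half = (size - 1) / 2.0
    patch = []
    for v in range(size):
        row = []
        for u in range(size):
            # Patch coordinates centred on zero, scaled to image pixels.
            px, py = (u - half) * scale, (v - half) * scale
            # Rotate by the feature orientation and translate to the keypoint.
            ix = round(x + c * px - s * py)
            iy = round(y + s * px + c * py)
            # Clamp to the image border instead of reading out of bounds.
            ix = min(max(ix, 0), w - 1)
            iy = min(max(iy, 0), h - 1)
            row.append(image[iy][ix])
        patch.append(row)
    return patch
```

A descriptor computed on such a patch is unaffected by the estimated scale and orientation, but an out-of-plane rotation still changes the patch content, which is the limitation noted above.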
If auxiliary information is available, this can be used to compensate for out-of-plane rotations. Provided with the depth of the camera pixels, the 3D normal vector of a feature can be determined to create a viewpoint-invariant patch of the feature, as described in C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3d model matching with viewpoint-invariant patches (VIP). In Proc. IEEE CVPR, 2008. For horizontal surfaces, the gravity vector measured with inertial sensors enables the rectification of the camera image prior to feature description, as described in D. Kurz and S. Benhimane. Gravity-Aware Handheld Augmented Reality. In Proc. IEEE/ACM ISMAR, 2011.
If such data is not available, rendering techniques, such as image warping, can be employed to create a multitude of synthetic views, i.e. images, of a feature. For descriptors providing a low invariance to viewpoint variations or in-plane rotations but enabling very fast descriptor matching, such synthetic views are used to create different descriptors for different viewpoints and/or rotations to support larger variations, as described in S. Taylor, E. Rosten, and T. Drummond. Robust feature matching in 2.3 ms. In IEEE CVPR Workshop on Feature Detectors and Descriptors, 2009; and M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a local binary descriptor very fast. IEEE Trans. Pattern Anal. Mach. Intell., 34:1281-1298, 2012.
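The creation of synthetic views by image warping can be sketched as follows. This is a rough illustration and not the procedure of the cited works: it assumes a planar patch, approximates an out-of-plane tilt by an affine foreshortening along one axis, and uses nearest-neighbor sampling. The names `view_warp`, `warp_image`, and `synthetic_views` are hypothetical, and tilt angles are assumed to be below 90 degrees so that the warp stays invertible.

```python
import math

def view_warp(tilt, rotation):
    """2x2 affine approximation of a viewpoint change: rotate in-plane,
    then foreshorten along x by cos(tilt). Angles are in radians."""
    c, s = math.cos(rotation), math.sin(rotation)
    f = math.cos(tilt)
    return [[f * c, -f * s], [s, c]]

def warp_image(image, A):
    """Warp a grayscale image about its centre with the 2x2 matrix A,
    using inverse mapping and nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det, A[0][0] / det]]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            # Look up the source pixel that maps onto (x, y).
            sx = round(cx + inv[0][0] * dx + inv[0][1] * dy)
            sy = round(cy + inv[1][0] * dx + inv[1][1] * dy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out

def synthetic_views(image, tilts, rotations):
    """One warped image per (tilt, rotation) combination."""
    return [warp_image(image, view_warp(t, r))
            for t in tilts for r in rotations]
```

Describing every feature in every such view yields one descriptor per feature and view, so the number of reference descriptors grows with the number of synthesized views.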
However, with an increasing number of reference feature descriptors, the time to match a single current feature descriptor increases, making real-time processing impossible at some point. Additionally, the amount of reference data, which potentially needs to be transferred via mobile networks, increases, which results in longer loading times.
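The growth in matching effort and reference data can be made concrete with a small back-of-the-envelope sketch. It assumes exhaustive matching, in which every current descriptor is compared against every reference descriptor, and one descriptor per feature and synthetic view; the function name `reference_set_cost` and all parameters are hypothetical and serve only as an illustration.

```python
def reference_set_cost(num_features, num_views, descriptor_bytes, num_current):
    """Estimate reference data size and per-frame comparisons, assuming
    one descriptor per (feature, view) pair and exhaustive matching."""
    num_reference = num_features * num_views
    return {
        "reference_descriptors": num_reference,
        "bytes": num_reference * descriptor_bytes,
        "comparisons_per_frame": num_reference * num_current,
    }
```

Under these assumptions, both the data size and the number of comparisons per camera frame scale linearly with the number of synthetic views per feature.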
In addition to invariance to spatial transformations resulting from a varying viewpoint, it is also crucial that feature descriptors (and feature classifiers) provide invariance to changes in illumination, noise and other non-spatial transformations. Approaches exist that employ learning to find ideal feature descriptor layouts within a defined design space, as described in M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):43-57, 2011, based on a ground truth dataset containing corresponding image patches of features under greatly varying pose and illumination conditions. Analogously, classifiers can be provided in the training phase with warped patches that additionally contain synthetic noise, blur or similar effects. Because the training stage is provided with different appearances of a feature, classifiers in general provide good invariance to the transformations that were synthesized during training. However, the probabilities that need to be stored for feature classifiers require a lot of memory, which makes them infeasible for a large number of features, in particular on memory-limited mobile devices.
Using different synthetic views, i.e. images, of an object to simulate different appearances has been shown to provide good invariance to out-of-plane rotations. However, the existing methods making use of this result in a large amount of descriptor data, making them almost infeasible on mobile devices.
It would therefore be beneficial to provide a method of providing a set of feature descriptors which is capable of being used in methods of matching features of an object in an image of a camera applied on devices with reduced memory capacities.