1. Technical Field
The present invention relates to a method of determining reference features for use in an optical object initialization tracking process and to an object initialization tracking method making use of reference features, for example extracted from a reference image. Moreover, the present invention relates to a computer program product comprising software code sections for implementing the method according to the invention.
2. Background Information
Augmented reality systems permit the superposition of computer-generated virtual information with visual impressions of a real environment. To this end, the visual impressions of the real world are mixed with virtual information, e.g. by means of a semi-transmissive data display worn on the head of a user. The blending-in of virtual information or objects can be effected in a context-dependent manner, i.e. matched to and derived from the respective environment viewed. In principle, any type of data, such as text, images, etc., can be used as virtual information. The real environment is detected, e.g., with the aid of a camera carried on the head of the user.
When the person using an augmented reality system turns his or her head, tracking of the virtual objects with respect to the changing field of view is necessary. The real environment may be a complex apparatus, and the object detected can be a significant member of the apparatus. During a so-called tracking operation, a real object detected during an object initialization process may serve as a reference for computing the position at which the virtual information is to be displayed or blended-in in an image taken up by the camera. Because the user may change his or her position and orientation, the real object has to be tracked continuously so that the virtual information is displayed at the correct position in the display device even when the position and/or orientation of the user has changed. The effect achieved thereby is that the information, irrespective of the position and/or orientation of the user, is displayed in the display device in a context-correct manner with respect to reality. An augmented reality system in this regard is an example of the utilization of such so-called markerless tracking systems.
Standard Tracking Initialization Approach:
When doing markerless tracking of a certain target given one or multiple reference images of that target, the standard tracking initialization framework can be described using the following steps. In this regard, FIG. 1 shows a flow diagram of an exemplary process in which the numbers of the following steps are denoted in parentheses.
Once a set of digital images (one or more images) is acquired:
    (1) Features are extracted from a set of these “reference” digital images and stored. These features are commonly referred to as “reference features” and may be denoted with ri, where i is in {1,2, . . . , NR} and NR is the number of reference features extracted. The features can be points, a set of points (lines, segments, regions in the image or simply a group of pixels), etc.
    (2) Descriptors (or classifiers) may be computed for every reference feature extracted and stored. These descriptors may be called “reference descriptors”.
Then, when having the real target facing the camera that captures live or so-called “current” images:
    (3) For every current image captured, features are extracted. These features may be called “current features”.
    (4) Descriptors (or classifiers) may be computed for every current feature extracted and stored. These descriptors may be referred to as “current descriptors” and may be denoted with cj, where j is in {1,2, . . . , NC} and NC is the number of current features extracted.
    (5) The current features are matched with the reference features using the reference and current descriptors: if the descriptors are relatively close in terms of a certain similarity measure, they are matched. For example, if every descriptor is written as a vector of numbers, when comparing two descriptors, one can use the Euclidean distance between the two corresponding vectors as the similarity measure. A match is denoted as mk={rk,ck}, where k is in {1,2, . . . , NM} and NM is the number of matched features.
    (6) Given the model of the target, an outlier rejection algorithm is performed. The outlier rejection algorithm may generally be based on a robust pose estimation (explained below).
    (7) Using the correct matches, the “current” pose of the camera is computed.
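The matching of Step 5 can be illustrated with a minimal sketch, assuming that every descriptor is a plain vector of numbers and that each current descriptor is matched to its nearest reference descriptor. The function names, the threshold max_dist and the sample descriptors are hypothetical and serve only as illustration; they are not taken from any particular tracking system.

```python
import math

def euclidean(d1, d2):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match_descriptors(ref_desc, cur_desc, max_dist):
    """Match each current descriptor to its nearest reference descriptor.

    Returns a list of (i, j, dist): reference index i, current index j and
    their distance. A pair is kept only if dist <= max_dist (an assumed,
    hypothetical closeness threshold).
    """
    matches = []
    for j, c in enumerate(cur_desc):
        best_i, best_d = None, float("inf")
        for i, r in enumerate(ref_desc):
            d = euclidean(r, c)
            if d < best_d:
                best_i, best_d = i, d
        if best_i is not None and best_d <= max_dist:
            matches.append((best_i, j, best_d))
    return matches

# Hypothetical three-dimensional descriptors, for illustration only.
ref_desc = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cur_desc = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [5.0, 5.0, 5.0]]
matches = match_descriptors(ref_desc, cur_desc, max_dist=0.5)
```

The exhaustive comparison in this sketch costs on the order of NR times NC distance evaluations, which indicates why a very high number of reference features slows the matching down.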
Most approaches for feature-based tracking initialization perform a robust estimation in order to remove incorrect matches. This step is called outlier rejection (see Step 6 above). It is needed because, whatever descriptor or classifier is used, there is no way to avoid outliers, i.e. features that are matched incorrectly. Robust estimation allows the outliers to be discarded from the pose estimation.
A standard approach is disclosed in: M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Communications of the ACM 24: 381-395, June 1981. The standard approach is based on an algorithm that performs the following two steps iteratively: a) the algorithm randomly picks a sample of the minimum number of features (also called the Sample Set) needed to compute the parameters of a certain transformation model. This transformation can generally be described using a matrix; e.g. one can use 4 points in case the pose is computed via a homography matrix estimation, one can use 5 points in case the pose is computed via an essential matrix estimation, etc.; and b) it estimates the transformation parameters and counts the number of matches (also called the Consensus Set) that verify them. To decide whether a match mk={rk,ck} verifies the transformation parameters one can, for example, transform the reference feature rk from the reference image into the current image with the estimated transformation parameters and compute the distance between the current feature ck and the transformed reference feature. A match is considered to verify the transformation parameter set when this distance is smaller than a certain threshold Tm.
The algorithm performs a number NI of iterations and searches for the best transformation parameter set, i.e. the one verified by the highest number of matches (the highest cardinality of the Consensus Set). If the number of matches corresponding to the best parameter set exceeds a certain threshold Nm, the matches in the Consensus Set verifying the parameter set are considered as inliers (correct matches) and the other matches are considered as outliers (incorrect matches). The condition that the number of matches corresponding to the best parameter set exceeds Nm is generally used to validate the success of the tracking initialization process. Only in the case of a successful tracking initialization process can one determine whether a match is an inlier or an outlier.
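The iterative procedure described above can be sketched as follows. To keep the example short and free of matrix libraries, the sketch swaps in a simplified transformation model: a 2D similarity transform c = a·r + b over complex numbers, which needs a Sample Set of only 2 matches instead of the 4 needed for a homography. The function names, the fixed random seed and the sample data are assumptions made for illustration; n_i, t_m and n_m stand for NI, Tm and Nm.

```python
import random

def estimate_similarity(sample):
    """Fit c = a * r + b (a, b complex) from exactly two matches (r, c)."""
    (r1, c1), (r2, c2) = sample
    if r1 == r2:
        return None  # degenerate Sample Set, no unique solution
    a = (c1 - c2) / (r1 - r2)
    b = c1 - a * r1
    return a, b

def ransac(matches, n_i, t_m, n_m, seed=0):
    """Return (params, inlier_indices), or (None, []) if initialization fails."""
    rng = random.Random(seed)
    best_params, best_consensus = None, []
    for _ in range(n_i):
        # a) pick a random Sample Set of minimum size
        params = estimate_similarity(rng.sample(matches, 2))
        if params is None:
            continue
        a, b = params
        # b) Consensus Set: matches whose transformed reference feature
        #    lies closer than t_m to the matched current feature
        consensus = [k for k, (r, c) in enumerate(matches)
                     if abs(a * r + b - c) < t_m]
        if len(consensus) > len(best_consensus):
            best_params, best_consensus = params, consensus
    # Initialization succeeds only if the best Consensus Set exceeds n_m
    if len(best_consensus) > n_m:
        return best_params, best_consensus
    return None, []

# Hypothetical data: six correct matches under a known transform
# plus two incorrect matches (outliers).
a_true, b_true = 2j, 1 + 1j
refs = [0j, 1 + 0j, 1j, 1 + 1j, 2 + 0j, 3 + 1j]
matches = [(r, a_true * r + b_true) for r in refs]
matches += [(2 + 2j, 50 + 50j), (4 + 0j, -30j)]
params, inliers = ransac(matches, n_i=50, t_m=0.5, n_m=4)
```

In this sketch the returned indices identify the inliers; all remaining matches are treated as outliers, and a return value of (None, []) corresponds to a failed tracking initialization.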
Limitations of the Standard Approaches:
Both the standard framework (performing Steps 1 to 7 as explained above with respect to FIG. 1) and the algorithm taking place in Step 6 and performing the outlier rejection generally give good results. However, it may happen that the reference images and the current images are acquired a) using different cameras (different sensors and image qualities); b) under different conditions of the target (object dirty or slightly modified); c) under different lighting conditions (the object is brighter or darker in the images); and d) under very different viewpoints, etc.
This results in a very weak matching process (Step 5), since the descriptors of the features used cannot be discriminative in such conditions. In fact, differences in the environment, in the object to be tracked or in the relative position affect both the feature extraction and the feature description.
Also, it is common that the reference images are the result of an acquisition performed under very good or optimal conditions. Alternatively, instead of real captures of the object to be tracked, screenshots of a rendering of the virtual version of the object are used as reference images. It is also common to use point clouds or geometries extracted from the real object (or scene) by various means (for example, laser scanners coupled or not with a camera, 3D cameras or Time-of-Flight cameras) as reference features. Therefore, in general, many more details can be seen in the reference images than in the live captures, i.e. the current images, and there are usually many more reference features than current features. This often has the following consequences:
    a) The number of reference features is very high. This makes the matching process (Step 5) inefficient and too slow for real-time or mobile applications.
    b) Only a small ratio of the reference and current features are in common.
    c) Only a small ratio of the common features have close descriptors.
Consequently, the outlier rejection algorithm (Step 6) does not work or also becomes very slow because of the high number of outliers: in hard cases, it either fails or needs a very high number NI of iterations in order to select, by random sampling, one correct set of inliers. Also, when the threshold Tm used to consider a match as an inlier is too high, the algorithm may pick the wrong set of inliers.
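The growth of NI with the outlier ratio can be made concrete with the commonly used estimate of the number of random samples needed to draw, with a given confidence, at least one Sample Set consisting only of inliers. This estimate comes from the general analysis of random sample consensus and is not stated in the text above; the function name and the default confidence of 0.99 are assumptions made for illustration.

```python
import math

def required_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Estimate how many random Sample Sets must be drawn so that, with the
    given confidence, at least one of them contains only inliers.

    Uses N = log(1 - confidence) / log(1 - inlier_ratio ** sample_size).
    """
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_good = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# Sample Set of 4 matches, as for a homography estimation.
n_half = required_iterations(0.5, 4)   # half of the matches are correct
n_tenth = required_iterations(0.1, 4)  # only one match in ten is correct
```

Under these assumptions, lowering the inlier ratio from one half to one tenth raises the estimate from a few dozen iterations to tens of thousands, which illustrates why a high number of outliers makes Step 6 too slow for real-time use.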
Already Proposed Solutions:
One approach for improving the matching process is described in M. Grabner, H. Grabner, and H. Bischof, “Learning features for tracking”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minn., USA, June 2007, where the authors learn feature classifiers and compute weights depending on the temporal appearances and matches. They update the feature descriptors over time. Their method is based on online feature ranking based on measures using the distributions of object and background pixels. The feature ranking mechanism is embedded in a tracking system that adaptively selects the top-ranked discriminative features for tracking. The top-ranked features are the ones that best discriminate between object and background classes.
Another approach for improving the outlier rejection algorithm is as follows: In order to improve the result of the standard outlier rejection algorithm, it is possible to either rank or weigh the Consensus Set based on the matching strength or to give prior probabilities to the Sample Set (like in O. Chum and J. Matas, “Matching with PROSAC—progressive sample consensus”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, Calif., USA, June 2005) also based on the matching strength. The matching strength generally used is based on how good the similarity measure between the descriptors of two matched features is.
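The idea of favoring strong matches when drawing the Sample Set can be sketched as follows. This is a deliberately simplified illustration and not the sampling schedule of the cited PROSAC paper: it merely sorts the matches by descriptor distance and draws each sample from a pool of top-ranked matches that grows by one match per iteration. The function name, the growth rule and the sample data are assumptions.

```python
import random

def progressive_samples(matches, sample_size, n_iter, seed=0):
    """Yield Sample Sets drawn from a growing pool of top-ranked matches.

    matches: list of (match_id, descriptor_distance); a smaller distance
    means a stronger match. The pool starts with the sample_size strongest
    matches and grows by one match per iteration (assumed growth rule).
    """
    rng = random.Random(seed)
    ranked = sorted(matches, key=lambda m: m[1])
    for it in range(n_iter):
        pool_size = min(len(ranked), sample_size + it)
        yield rng.sample(ranked[:pool_size], sample_size)

# Hypothetical matches with their descriptor distances.
matches = [("m0", 0.90), ("m1", 0.10), ("m2", 0.40), ("m3", 0.05), ("m4", 0.70)]
samples = list(progressive_samples(matches, sample_size=2, n_iter=4))
```

Because early samples come only from the strongest matches, a correct Sample Set tends to be found sooner when the matching strength is a reliable indicator of correctness; when it is not, the ordering brings no benefit.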
It would therefore be beneficial to provide a method of determining reference features for use in an optical object initialization tracking process, and an object initialization tracking method making use of reference features, which are capable of reducing at least some of the above-mentioned limitations of standard approaches.