1. Technical Field
The invention is related to a method for determining correspondences between a first and a second image, for example for use in an optical tracking and initialization process, such as an optical keyframe supported tracking and initialization process. Moreover, the present invention relates to a method for determining the pose of a camera using such method, and to a computer program product comprising software code sections for implementing the method.
2. Background Information
Keyframe-based 3D tracking is often required in computer vision applications such as Augmented Reality (AR) applications. In this kind of tracking system, the camera position and orientation are estimated from 2D-3D correspondences supported through so-called keyframes to allow automatic initialization and re-initialization in case tracking is lost. These 2D-3D correspondences are often established using CAD models, as described in: Juri Platonov, Hauke Heibel, Peter Meier and Bert Grollmann, “A mobile markerless AR system for maintenance and repair”, In: Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality.
Keyframes are frames with pre-extracted feature descriptors, and a reliable set of 2D-3D correspondences can therefore be registered into a common coordinate system. By matching extracted feature descriptors of a current camera image (a current image of a real environment taken by a camera) with the available 3D points' feature descriptors of a keyframe, 2D-3D correspondences in the current image can be established and a rough camera pose be estimated. Searching the closest keyframe to the estimated pose and backprojecting the stored 3D points into the current image increases the number of correspondences if the projected points are comparable to the stored 2D appearances in the keyframe. By performing 2D-3D pose estimation, a more accurate camera pose can now be computed to initialize tracking algorithms like KLT or POSIT (as disclosed in: “Pose from Orthography and Scaling with Iterations”—DeMenthon & Davis, 1995).
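The backprojection of stored 3D points into the current image mentioned above may, for example, be sketched as follows. This is only a minimal illustration assuming an ideal pinhole camera without lens distortion; the function name project_points and the parameters K (intrinsic matrix), R (rotation) and t (translation) are hypothetical names chosen for this sketch and are not taken from the cited references.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project N x 3 world points into the image with a pinhole model x ~ K (R X + t).

    Assumes an ideal pinhole camera (no lens distortion) and non-zero depth.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    cam = points_3d @ R.T + t           # world -> camera coordinates
    in_front = cam[:, 2] > 0            # only points in front of the camera are visible
    pix = cam @ K.T                     # apply the intrinsic parameters
    pix = pix[:, :2] / pix[:, 2:3]      # perspective division
    return pix, in_front

# Example with an assumed focal length of 500 px and principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels, visible = project_points([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]],
                                 K, np.eye(3), np.zeros(3))
```

The projected pixel positions can then be compared with the stored 2D appearances of the keyframe to decide whether a further correspondence is accepted.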
Recent publications like G. Klein and D. Murray: “Parallel Tracking and Mapping for Small AR Workspaces”, in: Proceeding of the International Symposium on Mixed and Augmented Reality, 2007, have shown the advantage of keyframe based (re)initialization methods. Klein compares a downscaled version of the current image with downscaled keyframes and chooses the image with the best intensity-based similarity as the closest keyframe. Frames from the tracking stage are added as keyframes into the system if many new feature points can be found and the baseline to all the other keyframes is large enough.
When doing 3D markerless tracking a standard approach can be described using the following steps. In this regard, FIG. 4 shows a flow diagram of an exemplary process for keyframe generation:
In Steps 1 and 2, once a set of digital images (one or more images) is acquired, features are extracted from a set of these “reference” digital images and stored. The features can be points, a set of points (lines, segments, regions in the image or simply a group of pixels), etc.
In Step 3, descriptors (or classifiers) may be computed for every extracted feature and stored. These descriptors may be called “reference” descriptors.
In Step 4, the extracted 2D reference features are registered against 3D points using manual, semi-automatic or fully automatic registration methods, for example online reconstruction methods like SLAM, or simply a known CAD model.
In Step 5, the extracted 2D features and assigned 3D points are stored with the digital image in a structure. This structure is called a keyframe.
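The keyframe structure resulting from Steps 1 to 5 may, for example, be represented as sketched below. The class name Keyframe and its field names are hypothetical and chosen only for illustration; the row-wise array layout is an assumption of this sketch.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Keyframe:
    image: np.ndarray        # reference digital image (H x W or H x W x C)
    points_2d: np.ndarray    # N x 2 locations of the extracted reference features
    descriptors: np.ndarray  # N x D reference descriptors, one row per feature
    points_3d: np.ndarray    # N x 3 registered 3D points, one row per feature

    def __post_init__(self):
        # Every 2D feature needs exactly one descriptor and one assigned 3D point.
        n = len(self.points_2d)
        if len(self.descriptors) != n or len(self.points_3d) != n:
            raise ValueError("points_2d, descriptors and points_3d must have equal length")
```

Keeping the 2D features, descriptors and 3D points aligned by row index means that a descriptor match against row i directly yields the associated 3D point.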
According to FIG. 5, a standard keyframe supported method for initializing and tracking is described which comprises the following steps:
In Step 10, one or more current images are captured by a camera, the pose of which shall be determined or estimated.
In Step 11, for every current image captured, features of the same types used in the keyframes are extracted. These features may be called “current features”.
In Step 12, descriptors (or classifiers) may be computed for every current feature extracted and stored. These descriptors may be referenced as “current descriptors”.
In Step 13, the current features are matched with the reference features using the reference and current descriptors. If the descriptors are close in terms of a certain similarity measure, they are matched. For example, the dot product or Euclidean distance of vector representations can be used as the similarity measure.
In Step 14, given a model of the target, an outlier rejection algorithm is performed. The outlier rejection algorithm may be generally based on a robust pose estimation like RANSAC or PROSAC.
In Step 15, the keyframe providing the highest number of verified matches is selected to be the closest keyframe.
In Step 16, using the 2D coordinates from the current frame and the 3D coordinates indirectly known through the 2D-(2D-3D) matching, an initial camera pose can be computed using, e.g., common linear pose estimation algorithms like DLT, refined by classical non-linear optimization methods (Gauss-Newton, Levenberg-Marquardt).
In Step 17, to improve this computed first guess of the camera pose, not yet matched 3D points from the keyframe may be projected into the current camera image using the computed pose.
In Step 18, the descriptors of all projected points (local patch around the point) are compared against the local descriptors of known 2D points in the current frame (current image). Again, based on a similarity measure, these points are handled as matched or not. Commonly, a local 2D tracking like KLT is performed to deal with small displacements.
In Step 19, using all new and previously known 2D-3D matches, the pose estimation step is performed again to compute a more precise and reliable pose of the camera (Refined Pose RP).
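The descriptor matching of Step 13 and the keyframe selection of Step 15 may, for example, be sketched as follows. This is a simplified illustration only: it uses the Euclidean distance with an assumed threshold max_dist, it omits the outlier rejection of Step 14 (so it counts raw rather than verified matches), and the function names match_descriptors and closest_keyframe are hypothetical.

```python
import numpy as np

def match_descriptors(current, reference, max_dist=0.5):
    """Match each current descriptor to its nearest reference descriptor.

    Returns (current_index, reference_index) pairs whose Euclidean distance
    does not exceed max_dist (an assumed, application-dependent threshold).
    """
    current = np.asarray(current, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise distances, shape (num_current, num_reference)
    dist = np.linalg.norm(current[:, None, :] - reference[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    best = dist[np.arange(len(current)), nearest]
    return [(i, int(j)) for i, (j, d) in enumerate(zip(nearest, best)) if d <= max_dist]

def closest_keyframe(current, keyframe_descriptors, max_dist=0.5):
    """Select the keyframe with the most matches (no outlier rejection in this sketch)."""
    counts = [len(match_descriptors(current, ref, max_dist))
              for ref in keyframe_descriptors]
    return int(np.argmax(counts)), counts

current = [[0.0, 0.0], [1.0, 1.0]]
keyframes = [[[0.0, 0.1], [5.0, 5.0]],   # descriptors of a first keyframe
             [[1.0, 1.0], [0.0, 0.0]]]   # descriptors of a second keyframe
best_index, match_counts = closest_keyframe(current, keyframes)
```

In a complete system, the match lists would first be filtered by a robust pose estimation such as RANSAC or PROSAC, and only the surviving matches would be counted.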
A limitation of the standard approaches is that the 3D points projected into the current camera image often get rejected because the displacement is often too large for common 2D tracking algorithms like KLT, which is only able to deal with pure translations. Because small rotations can be approximated as translations, very small rotations can be handled, but the algorithm will fail in case of a larger rotation. Also, the descriptors generally handle in-plane rotation, scale and, in the best case, affine transformations, but do not handle perspective distortions. This makes descriptor-based approaches vulnerable when such distortions are present in the image.
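Why small rotations can be treated approximately as translations while larger rotations cannot may be illustrated by the following sketch. It computes how far the corner of a square patch moves when the patch is rotated in the image plane about its centre, which is the part of the motion that a purely translational model cannot explain. The function name max_rotation_residual and the patch half-size of 7 pixels are assumptions made for this illustration only.

```python
import numpy as np

def max_rotation_residual(half_size, angle_rad):
    """Displacement (in pixels) of a patch corner under in-plane rotation about the patch centre."""
    corner = np.array([half_size, half_size], dtype=float)
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[c, -s], [s, c]])
    return float(np.linalg.norm(rotation @ corner - corner))

small = max_rotation_residual(7, np.deg2rad(2.0))    # stays well below one pixel
large = max_rotation_residual(7, np.deg2rad(30.0))   # several pixels
```

For the small angle the residual remains in the sub-pixel range, whereas for the larger angle it amounts to several pixels, which is consistent with the failure of a translation-only tracker described above.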
An approach for improving the matching process is described in: Vincent Lepetit, Luca Vacchetti, Daniel Thalmann and Pascal Fua: “Fully Automated and Stable Registration for Augmented Reality Applications”, Proc. of the Second IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR2003), where the authors locally approximate the object surface around the interest points by planes in order to synthetically re-render all patches using the coarsely estimated camera pose. All re-rendered patches are used to create a keyframe which is closer to the current frame and therefore allows increasing the total number of matches. To speed up the computation, all transformations are approximated as homographies extracted from the projection matrices given the intrinsic parameters of the camera used.
This approach has the disadvantage that this approximation can only be done by knowing the camera model and an initial guess of the pose, which makes the approach unusable when the camera parameters are unknown and/or when, e.g., the camera intrinsic parameters are to be estimated on-line.
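How a homography can be obtained for a locally planar surface from known poses and intrinsic parameters may be sketched with the standard plane-induced homography. This is a textbook relation and not necessarily the exact formulation used by Lepetit et al. The sketch assumes that the plane is given in the coordinates of the first camera as n·X = d, that the second camera is related to the first by X2 = R X1 + t, and that K1 and K2 are the intrinsic matrices; all names are hypothetical.

```python
import numpy as np

def plane_homography(K1, K2, R, t, n, d):
    """Homography mapping view-1 pixels to view-2 pixels for points on the plane n.X = d.

    The plane is expressed in view-1 camera coordinates and X2 = R X1 + t.
    For points on the plane, n.X1 / d = 1, hence X2 = (R + t n^T / d) X1.
    """
    H = K2 @ (R + np.outer(t, n) / d) @ np.linalg.inv(K1)
    return H / H[2, 2]   # fix the arbitrary scale (assumes H[2, 2] != 0)

# Assumed example values: same intrinsics for both views, small rotation about the y axis
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.2, 0.0, 0.1])
H = plane_homography(K, K, R, t, n=np.array([0.0, 0.0, 1.0]), d=2.0)
```

The sketch makes the dependence explicit: the homography requires both the intrinsic matrices and an estimate of R and t.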
Therefore, it would be beneficial to provide a method for determining correspondences between a first and a second image which is independent of the camera model used and not strongly dependent on the initial guess of the pose.