1. Technical Field
The present invention relates to a method of determining a position and orientation of a device, wherein the position and orientation is determined based on multiple degrees of freedom and the device is associated with a capturing device for capturing at least one image. The invention also relates to a computer program product comprising software code sections for performing the method.
2. Background Information
Localization of a moving capturing device, such as a camera, or of a moving device equipped with a capturing device, with respect to objects in a coordinate system attached to the objects (defining a reference origin and a reference orientation of the axes of the objects) is an important task in the field of computer vision. Many different approaches have been proposed, which use different system setups, use different sources of input data and/or process the input data in different ways.
One large class comprises vision-based localization, in which the data captured by one or multiple cameras (such as, but not limited to, visual light cameras, infrared cameras, time-of-flight cameras, depth camera systems, scanning systems or any other system providing some kind of image of the objects to be localized to) are analyzed and used for alignment with representations of the objects that are either already known or learned during runtime. The proposed method in this application according to the invention, as set out below, can be applied to any of the previously mentioned capturing devices.
The representations of the objects might be, but are not limited to, markers, blueprints, images, templates, textured and non-textured polygonal models, textured or non-textured CAD models, feature point maps, point clouds, or line, segment, silhouette or depth representations. The image of the system can be analyzed in different ways to extract information, such as but not limited to intensities, gradients, edges, lines, segments, corners, descriptive features or any other kind of features, primitives, histograms, polarities or orientations. The proposed method in this application according to the invention, as set out below, can be applied to any of the previously mentioned object representations and can use any of the previously mentioned extracted image information.
Other approaches make use of data from sensors attached to the camera, such as but not limited to compass, GPS, inertial sensor, accelerometer, gyroscope, magnetometer, odometer, mechanical sensors like rotary encoder, or results from tracking systems such as measuring arms or laser tracker. These sensors either provide measurements directly with respect to the coordinate system of the objects the camera needs to be localized to or are integrated in calibrated systems which provide this data after some processing of the raw sensor data and potentially additional information of the system. The proposed method in this application according to the invention, as set out below, can be implemented to use any of the previously mentioned sensors.
Another class of approaches for localization of a moving camera or device is outside-in systems, which determine the pose of the device from outside. These systems can be rigidly registered to the objects' coordinate system or can themselves be dynamically localized with respect to the objects' coordinate system. The moving camera or device to be localized might, but does not have to, be equipped with active or passive tools or markers, such as but not limited to visible light or infrared markers or laser reflectors, which are recognized by the corresponding outside-in system and used for localization.
A broad field of further approaches is combining the different systems, sensors and approaches in a procedural or integrated way.
Within vision-based localization, edge-based approaches use representations of the objects which result in a set of features, such as but not limited to edges, lines, gradients, segments, borders, silhouettes, contours, edgelets, orientations and/or polarities, in an image of the object captured with the camera system to be localized.
Edge-based approaches have the advantage of being robust to illumination changes and light conditions; they work on poorly textured objects and are usually robust to partial occlusion. The object information needed for edge-based localization can be extracted manually, semi-automatically or fully automatically from different sources of representations of the objects, such as but not limited to textured or non-textured polygon or CAD models or scenes, blueprints, images, feature maps, point clouds, models of lines, segments or silhouettes, depth representations or scans.
A standard approach of these systems works by matching correspondences in the image for the known representation of the objects and performing an optimization, such as but not limited to a least squares minimization, on these correspondences to estimate the position and the orientation of the camera. This matching and optimization procedure is generally embedded into an iterative framework, which performs the matching based on an initial pose; this pose is updated during optimization, and the updated pose is used as the initial pose in the next iteration of the optimization. After a certain number of iterations, the pose estimation can converge to the true pose to be found.
The known representation of the objects is projected into the camera image based on known camera intrinsic parameters and a first rough localization, which can be provided by, but is not limited to, the last frame in a frame-to-frame tracking system, see for example FIG. 1. The example of FIG. 1 shows a projection of a 3D line model (digital representation R of object OB) based on a rough initial camera localization, in which one orientation (here the gravity) is reliably provided. This pose could be computed directly from given GPS, compass and accelerometer sensor data of the capturing device. While GPS and compass are not reliable, the gravity is provided with sufficient accuracy for the final pose and thus does not need to be optimized.
Based on a given camera pose C of form

$$C = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$

where R is a 3×3 rotation matrix and t is a 3×1 translation vector, a homogeneous 3D point x of form $x = (x, y, z, 1)^T$ is projected into the image to point $(u, v)^T$ with function

$$\begin{pmatrix} u \\ v \end{pmatrix} = \operatorname{proj}(Cx)$$

(referenced as equation 1 in the later text), where function proj(.) models the projection from camera to image coordinates based on known camera intrinsic parameters.
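As an illustration of equation 1, the following minimal sketch projects a homogeneous 3D point with a 4×4 pose C. The text does not specify the form of proj(.); a pinhole model without skew or lens distortion, the intrinsic matrix `K`, and the helper names `proj` and `project_point` are assumptions made only for this example.

```python
import numpy as np

def proj(x_cam, K):
    """Pinhole projection of a homogeneous camera-space point to pixel coordinates.

    Assumes no skew and no lens distortion (an assumption of this sketch).
    """
    X, Y, Z = x_cam[:3]
    u = K[0, 0] * X / Z + K[0, 2]
    v = K[1, 1] * Y / Z + K[1, 2]
    return np.array([u, v])

def project_point(C, x, K):
    """Equation 1: (u, v)^T = proj(C x) for a 4x4 pose C and homogeneous point x."""
    return proj(C @ x, K)

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Pose with identity rotation R and translation t = (0, 0, 2).
C = np.eye(4)
C[:3, 3] = [0.0, 0.0, 2.0]

x = np.array([0.1, -0.2, 0.0, 1.0])  # homogeneous 3D point (x, y, z, 1)^T
uv = project_point(C, x, K)
print(uv)
```

With these values the point lies 2 units in front of the camera, so the projection reduces to scaling the x and y coordinates by 500/2 and adding the principal point.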
Correspondences of the projected representation of the objects in the image are searched by sampling the resulting projected representation (such as, but not limited to, edges, lines, borders or silhouettes) into tracking nodes, edgelets or sample points and, for each of them, searching within some search range in their neighborhood (such as, but not limited to, a search along their normal) for gradient maxima, see FIG. 2. While some approaches keep the nearest gradient maximum as the correspondence pixel for the projected point (e.g., see T. Drummond, R. Cipolla. Real-time tracking of complex structures with on-line camera calibration. British Machine Vision Conference, 1999), others take the biggest gradient maximum (e.g., see A. I. Comport, E. Marchand, M. Pressigout, F. Chaumette. Real-Time Markerless Tracking for Augmented reality: The Virtual Visual Servoing Framework. Transactions on Visualization and Computer Graphics, 2006; hereinafter referred to as “Comport”). To be robust against motion blur, some approaches search for intensity ramps instead of intensity steps (e.g., see G. Klein, D. Murray. Improving the Agility of Keyframe-Based SLAM. European Conference on Computer Vision, 2008; hereinafter referred to as “Klein”). In Tamaazousti (i.e., M. Tamaazousti, V. Gay-Bellile, S. N. Collette, S. Bourgeois, M. Dhome. Real-Time Accurate Localization in a Partially Known Environment: Application to Augmented Reality on textureless 3D Objects. TrakMark, 2011; referred to hereinafter as “Tamaazousti”), the nearest gradient maximum with an almost similar orientation is kept as the correspondence for each point of the projection into the registered images (keyframes) of a bundle adjustment system. The projection into the registered images uses a computed camera pose obtained from a visual tracking algorithm that requires a set of consecutive images with small inter-image displacements. The bundle adjustment provides 6 degrees of freedom estimations for each of these images.
In Tamaazousti the full 6 degrees of freedom pose is assumed to be of high confidence, which allows additional checks like the view dependent orientation check. Additionally, the approach presented in Tamaazousti requires a set of registered images obtained from the tracking of consecutive images with small inter-image displacement.
The distance $d_j$ between each projected point $(u, v)_j$ and its found correspondence in the image is a single measurement to be optimized.
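The correspondence search along the normal and the resulting distance measurement can be sketched as follows. This is a simplified illustration, not the procedure of any cited work: the function name `search_along_normal`, the `min_gradient` threshold, the nearest-neighbour sampling of the intensity profile and the half-pixel edge positions are all assumptions of this sketch. The `mode` argument switches between keeping the nearest and the strongest gradient maximum.

```python
import numpy as np

def search_along_normal(image, point, normal, search_range,
                        mode="nearest", min_gradient=10.0):
    """Return the signed distance (in pixels) from `point` along `normal`
    to the selected gradient maximum, or None if no maximum is found."""
    point = np.asarray(point, dtype=float)      # (u, v) in pixels
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    offsets = np.arange(-search_range, search_range + 1)
    h, w = image.shape

    # Sample the intensity profile along the normal (nearest-neighbour lookup).
    profile = np.empty(len(offsets))
    for k, s in enumerate(offsets):
        u, v = np.rint(point + s * normal).astype(int)
        profile[k] = image[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)]

    # Gradient magnitude between neighbouring samples, located at half offsets.
    g = np.abs(np.diff(profile))
    pos = offsets[:-1] + 0.5

    # Local maxima of the gradient magnitude above the threshold.
    padded = np.concatenate(([0.0], g, [0.0]))
    is_max = (g > min_gradient) & (g >= padded[:-2]) & (g > padded[2:])
    if not np.any(is_max):
        return None

    cand_pos, cand_g = pos[is_max], g[is_max]
    if mode == "nearest":
        best = np.argmin(np.abs(cand_pos))
    elif mode == "strongest":
        best = np.argmax(cand_g)
    else:
        raise ValueError("mode must be 'nearest' or 'strongest'")
    return float(cand_pos[best])

# Synthetic image with a weak vertical edge near the sample point and a
# stronger one further away.
img = np.zeros((20, 20))
img[:, 8:14] = 50.0
img[:, 14:] = 250.0

d_nearest = search_along_normal(img, (10, 10), (1, 0), 6, mode="nearest")
d_strongest = search_along_normal(img, (10, 10), (1, 0), 6, mode="strongest")
print(d_nearest, d_strongest)
```

The two modes return different correspondences for the same sample point, which shows why the choice of selection rule matters when several gradient maxima fall inside the search range.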
Based on the pose C of the current iteration used for projection of the representation of the objects into the image an update transformation T is computed, such that the updated camera pose C′=TC minimizes the distance d between the set of m reprojected points of the representation of the objects and their matched image correspondences.
This transformation update T is parameterized by the six-vector a corresponding to the exponential map parameterization of the Lie group SE(3):
$$T(a) = \operatorname{expm}\left(\sum_{i=1}^{6} a_i A_i\right)$$

(referenced as equation 2 in the later text) with expm(.) being the exponential map, $a = [a_1\ a_2\ a_3\ a_4\ a_5\ a_6]$, $a_1$ to $a_3$ representing the rotation and $a_4$ to $a_6$ representing the translation of T. The corresponding generator matrices $A_i$ of the group can be chosen as the following matrices (referenced as equations 3 in the later text):

$$A_1 = \begin{bmatrix} [e_1]_\times & 0 \\ 0 & 0 \end{bmatrix},\quad A_2 = \begin{bmatrix} [e_2]_\times & 0 \\ 0 & 0 \end{bmatrix},\quad A_3 = \begin{bmatrix} [e_3]_\times & 0 \\ 0 & 0 \end{bmatrix},$$

$$A_4 = \begin{bmatrix} 0 & e_1 \\ 0 & 0 \end{bmatrix},\quad A_5 = \begin{bmatrix} 0 & e_2 \\ 0 & 0 \end{bmatrix},\quad A_6 = \begin{bmatrix} 0 & e_3 \\ 0 & 0 \end{bmatrix}$$

with $e_1$, $e_2$, $e_3$ being the canonical basis for $\mathbb{R}^3$ and $[\cdot]_\times$ being a skew symmetric matrix as

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix}_\times = \begin{bmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{bmatrix}$$

The partial differentiation of T(a) around the origin (a=0) as needed for minimization is

$$\frac{\partial}{\partial a_i} T(a) = A_i$$

The Jacobian matrix J is obtained by the differentiation of the projection of points into the image (see equation 1) with respect to a

$$J_{j,i} = \frac{\partial d_j}{\partial a_i}$$

where the Jacobian J is of dimension m×6. To find the parameters of the transformation update a the following equation is solved

$$Ja = d$$

where d is the m-dimensional vector of single distance measurements $d_j$. For standard least-squares optimization the solution in a given iteration takes the following form

$$a = (J^T J)^{-1} J^T d$$

To be robust against outliers a robust M-estimator can be used to solve for the transformation update (e.g., See Comport; and C. Wiedemann, M. Ulrich, C. Steger. Recognition and Tracking of 3D Objects. Deutsche Arbeitsgemeinschaft für Mustererkennung, 2008; and L. Vacchetti, V. Lepetit, P. Fua. Combining edge and texture information for real-time accurate 3D camera tracking. International Symposium on Augmented and Mixed Reality, 2004; referred to hereinafter as “Vacchetti”).
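The update step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of any cited work: the helper names (`skew`, `generators`, `expm_series`, `T`, `solve_update`) are hypothetical, the matrix exponential is approximated by a truncated power series (adequate for the small updates of one iteration), and the optional `weights` argument stands in for per-measurement weights that an M-estimator would supply. The Jacobian here is random test data, not one derived from an actual projection.

```python
import numpy as np

def skew(w):
    """Skew symmetric matrix [w]_x, so that skew(w) @ p equals np.cross(w, p)."""
    x, y, z = w
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def generators():
    """Stack the six 4x4 generator matrices A_1..A_6 (equations 3)."""
    A = np.zeros((6, 4, 4))
    E = np.eye(3)
    for i in range(3):
        A[i, :3, :3] = skew(E[i])   # rotation generators A_1..A_3
        A[i + 3, :3, 3] = E[i]      # translation generators A_4..A_6
    return A

def expm_series(M, terms=20):
    """Matrix exponential by truncated power series (assumes a small-norm M)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def T(a):
    """Transformation update T(a) = expm(sum_i a_i A_i) (equation 2)."""
    a = np.asarray(a, dtype=float)
    return expm_series(np.tensordot(a, generators(), axes=1))

def solve_update(J, d, weights=None):
    """Least-squares solution of J a = d, optionally with per-row weights."""
    J = np.asarray(J, dtype=float)
    d = np.asarray(d, dtype=float)
    if weights is not None:
        w = np.sqrt(np.asarray(weights, dtype=float))
        J, d = J * w[:, None], d * w
    a, *_ = np.linalg.lstsq(J, d, rcond=None)
    return a

# Synthetic example: m = 8 measurements generated from a known update vector.
rng = np.random.default_rng(0)
J = rng.normal(size=(8, 6))
a_true = np.array([0.01, -0.02, 0.03, 0.1, -0.2, 0.05])
d = J @ a_true

a = solve_update(J, d)   # equals (J^T J)^{-1} J^T d when J has full column rank
C = np.eye(4)            # current pose
C_new = T(a) @ C         # updated pose C' = T C
print(a)
```

`np.linalg.lstsq` is used instead of forming $(J^T J)^{-1}$ explicitly because it returns the same solution for a full-rank J while being numerically better behaved.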
In hybrid systems, see FIG. 7A, that use additional sensor data, such as GPS, accelerometer, gyroscope and magnetometer data, the sensor data might be used for initialization of a vision-based approach (e.g., see J. Karlekar, S. Z. Zhou, W. Lu, Z. C. Loh, Y. Nakayama, D. Hii. Positioning, Tracking and Mapping for Outdoor Augmentation, International Symposium on Augmented and Mixed Reality, 2010; referred to hereinafter as “Karlekar”), or combined with the results from the optical tracking by sensor fusion using, e.g., an extended Kalman filter (e.g., see G. Reitmayr, T. W. Drummond. Going out: Robust Model-based Tracking for Outdoor Augmented Reality, International Symposium on Augmented and Mixed Reality, 2006; referred to hereinafter as “Reitmayr”). The sensor fusion based on Kalman filters requires an estimation of statistics, such as covariance matrices, of the sensors. As set out below, the present invention does not require such estimations and is based on a completely different approach.
Proposed solutions to the limitations of the standard approaches:
Whether the pose optimization based on correspondences between the known representation of the objects and their matched representation in the image will successfully converge to a correct camera pose depends strongly on the initial pose used as the starting point for the localization, the pose estimation approach used and the correctness of the correspondences. False correspondences can result from, but are not limited to, noise in the image, occlusion of the object in the image, an undersized search range or a false choice of correspondence due to multiple reasonable matching candidates within the description space used for comparison. The probability of the latter increases with the search range in the image in which a correspondence needs to be searched. This limits the offset between the initial camera pose used as the starting point for the localization and the correct camera pose to be found for which a correct localization can be performed.
Different approaches try to overcome a small search range by increasing the correctness of correspondences by allowing multiple hypotheses for correspondences and adapting the pose estimation such that it will choose the best correspondence during optimization of the 6 degrees of freedom pose; e.g., See Vacchetti and H. Wuest, D. Stricker. Tracking of industrial objects by using CAD models, Journal of Virtual Reality and Broadcasting, Volume 4, 2007. Other approaches try to improve the description of the gradient maxima to increase the reliability of the matching process by e.g. using the polarity of the gradient (e.g., See Klein).
In summary, existing approaches for vision-based localization are not robust when the localization is performed with respect to a complex object within a complex scene, i.e. they generally fail when localizing a camera attached to a mobile device in an outdoor environment. For instance, and more practically, state-of-the-art methods do not solve the problem of localizing a camera with respect to a building façade with a known model, the appearance of which has partly changed (e.g. due to open/closed windows/doors, different painting of part of it, or changes in the structure of nearby trees over the seasons), since the registration based on visual data fails as the alignment algorithm falls into incorrect local minima.
It would therefore be beneficial to provide a more robust method of determining a position and orientation of a device based on multiple degrees of freedom, with the device being associated with a capturing device for capturing at least one image, which is capable of avoiding the aforementioned disadvantages.