The present invention is related to a method of determining a similarity transformation between first and second coordinates of 3D features, which comprises providing a first plurality of 3D features having first coordinates in a first coordinate system which is associated with a first geometrical model of a first real object, with the first plurality of 3D features describing physical 3D features of the first real object, providing a second coordinate system, and providing image information associated with a plurality of images captured by at least one camera.
Such method may be used, for example, in relation with Vision based Simultaneous Localization and Mapping (SLAM), such as disclosed in Davison, Andrew J., et al., “MonoSLAM: Real-time single camera SLAM,” Pattern Analysis and Machine Intelligence, IEEE Transactions on 29.6 (2007): 1052-1067, which is a well-known technology for creating a geometrical model of a real environment using one or more cameras without requiring any pre-knowledge of the environment. The geometrical model that has at least depth information is also referred to as a 3D map of the real environment. The creation of the geometrical model of the environment is also called the reconstruction of the real environment. The created (or typically called reconstructed) geometrical model could be represented by a plurality of 3D features, such as point features and edge features. The 3D features describe physical 3D features of the real environment. A real environment may also be called a real object, or may be understood to comprise one or more real objects.
The geometrical model (or 3D map) of the real environment can be created using triangulation of 2D observations shared in a plurality of images captured by one or more cameras. The triangulation is a common method used in 3D reconstruction based on camera images, see Hartley, Richard, and Andrew Zisserman, “Multiple view geometry in computer vision,” Vol. 2, Cambridge, 2000.
A pose of a camera describes a spatial relationship or a rigid transformation between the camera at a particular position and a reference coordinate system. The reference coordinate system may be associated with a real object or with the camera at another position. The spatial relationship or the rigid transformation describes at least one translation, or at least one rotation, or their combination in 3D space.
The reconstructed geometrical model can be used for determining a pose (i.e. position and/or orientation) of a camera. By matching extracted 2D features of a current camera image with 3D features existing in the geometrical model a plurality of 2D-3D correspondences can be established. Then, the camera position and orientation in a coordinate system of the geometrical model can be computed based on the correspondences. Camera pose estimation is also known as tracking a camera.
Vision based SLAM facilitates many applications, such as vision based navigation of a robot system or a vehicle. Particularly, it is a promising technology that could support Augmented Reality (AR) systems or applications (see Azuma, Ronald, et al., “Recent advances in augmented reality,” Computer Graphics and Applications, IEEE 21.6 (2001): 34-47) in an unknown real environment.
A common problem of various SLAM systems is that a reconstructed geometrical model of a real environment is up to a scale as an undetermined factor. In this case, the SLAM systems may assign a random scale. Therefore, reconstructed 3D features have scaled coordinates in a coordinate system associated with the geometrical model compared to true coordinates as they are in the real world. Further, camera positions computed based on the recovered scaled geometrical models are also up to the scale, see Strasdat, Hauke, J. M. M. Montiel, and Andrew J. Davison,” Scale drift-aware large scale monocular SLAM,” Proceedings of Robotics: Science and Systems (RSS), Vol. 2, No. 3, 2010.
The undetermined scale factor introduces challenges to determine true camera movements in, for example, vision based navigation of a robot system or a vehicle, and to correctly overlay virtual visual information to the real environment in an image of a camera in AR applications.
Particularly, in a situation in which multiple geometrical models of multiple real objects are created using the same vision based SLAM system for tracking the multiple real objects simultaneously, like in Castle, Robert, Georg Klein, and David W. Murray, “Video-rate localization in multiple maps for wearable augmented reality,” Wearable Computers, 2008, ISWC 2008, 12 IEEE International Symposium on, IEEE, 2008, the problem of undetermined scale factors is quite significant. Typically, random scale values are applied to each of the multiple geometrical models. If the SLAM system switches between the geometrical models, the scale may change and, therefore, the user experience in computer vision applications like Augmented Reality is seriously affected.
A correct scale factor may be applied in defining true sizes of geometrical models of real environments or real objects and true camera poses as they are in the real world.
Various methods have been proposed for determining correct scale factors that could define true sizes of reconstructed geometrical models of real environments as they are in the real world.
For example, Davison et al. in Davison, Andrew J., et al., “MonoSLAM: Real-time single camera SLAM,” Pattern Analysis and Machine Intelligence, IEEE Transactions on 29.6 (2007): 1052-1067, propose to introduce calibration objects with known geometrical dimension for determining correct scale factors for SLAM systems.
Lemaire at al. in Lemaire, Thomas, et al., “Vision-based slam: Stereo and monocular approaches,” International Journal of Computer Vision 74.3 (2007): 343-364 propose to use a stereo camera system to solve the problem of determining scale factors in SLAM systems. However, using a stereo camera is only a partial remedy, since the displacement between the two cameras has to be significant in relation to the distance to the environment in order to reliably compute depth of the environment.
Lieberknecht et al. in Lieberknecht, Sebastian, et al., “RGB-D camera-based parallel tracking and meshing,” Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, IEEE, 2011, integrate depth information into monocular vision based SLAM to allow correctly scaled geometrical model reconstruction by employing a RGB-D camera that provides depth information related to image pixels. It is possible to determine a scale factor from known depth information. However, a RGB-D camera device is not commonly available in a hand-held device, e.g. mobile phone, PDA, compared to a normal RGB camera.
Klein et al. in Klein, Georg, and David Murray, “Parallel tracking and mapping for small AR work-spaces,” Mixed and Augmented Reality, 2007, ISMAR 2007, 6th IEEE and ACM International Symposium on, IEEE, 2007, solve the problem of scale estimation by manually defining a baseline between a camera capturing two images needed for 3D triangulation.
Sensor fusion with an Inertial Measurement Unit (IMU) could also be used to estimate the scale, as disclosed in Mitzi, Gabriel, et al. “Fusion of IMU and vision for absolute scale estimation in monocular SLAM.” Journal of intelligent & robotic systems 61.1-4 (2011): 287-299. One problem with this approach is the inaccuracy of the sensor values and therefore expensive (i.e. calculation intensive) techniques like “Kalman Filtering” or “Bundle Adjustment” are needed to solve the problem.