The present disclosure is related to a method and system for determining spatial coordinates of a 3D reconstruction of at least part of a first real object at absolute spatial scale.
Computer vision methods that involve analysis of images are often used, for example, in navigation, object recognition, 3D reconstruction, and Augmented Reality applications, to name a few. The images may be captured by a single camera or by different cameras. Detection of image features (such as corners and edges) and image feature extraction are common steps in various computer vision methods or algorithms, such as image-based recognition, image-based tracking, image-based reconstruction, image-based classification, and image warping. For example, vision-based Simultaneous Localization and Mapping (SLAM) is a well-known computer vision method using one or more cameras for reconstructing a real environment and tracking the one or more cameras. Given at least two images captured by one or more cameras, a typical SLAM method comprises feature detection, description, matching, triangulation and (global) map refinement.
It is a commonly known problem that approaches for determining the structure of a real object from a set of images captured by a monocular capture apparatus yield a reconstruction of the spatial (or geometrical) structure that is only up to scale. This means the reconstruction uses spatial units for which the scaling factor to absolute spatial units, such as the unit meter, is unknown. In many applications, it is desirable to obtain a reconstruction in absolute units, also referred to as “at absolute scale”. This often requires knowledge of at least one distance at absolute scale, for example between parts of the real object or between positions of the camera relative to the real object at the time when the respective images for reconstruction were taken.
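As an illustration of why a single known distance can suffice, the following minimal Python sketch derives a scale factor from one reference distance and applies it to an up-to-scale reconstruction. The point coordinates, the 0.5 meter reference distance, and the function names are hypothetical and chosen only for this example; they are not taken from any of the cited references.

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def scale_factor(p_rec, q_rec, known_distance_m):
    """Factor mapping reconstruction units to meters, from one known distance."""
    d_rec = distance(p_rec, q_rec)
    if d_rec == 0.0:
        raise ValueError("reference points coincide in the reconstruction")
    return known_distance_m / d_rec

def to_absolute(points, s):
    """Apply a uniform scale factor to every reconstructed 3D point."""
    return [tuple(s * c for c in p) for p in points]

# Hypothetical up-to-scale reconstruction (arbitrary units).
reconstruction = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]

# Assumption: the first two points are known to be 0.5 meters apart in reality.
s = scale_factor(reconstruction[0], reconstruction[1], known_distance_m=0.5)
absolute = to_absolute(reconstruction, s)
```

Because the unknown scale is a single factor shared by all coordinates, one reference distance fixes it for the whole reconstruction; in practice the accuracy of the result depends on how accurately that reference distance is known and reconstructed.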
Thus, a common problem of various SLAM and Structure from Motion (SfM) systems is that a reconstructed geometrical model of a real environment is determined only up to an unknown scale factor. If the real object is unknown and the poses of the cameras that took the images for reconstruction are also unknown, then it is impossible to determine the absolute spatial scale of the scene. For example, based on two images of a car as shown in FIG. 2a (one taken from the front, I(W1), and one from the right, I(W2)), it is impossible to tell whether it is a real full-size car or a small realistic miniature car. Consequently, it is also impossible to tell whether the cameras that took the two images are many meters apart from one another (as is the case for a full-size car) or only a few centimeters apart (as is the case for a miniature car). However, if additional information on the absolute spatial scale of either the camera poses (e.g. the two cameras are 2.34 meters apart) or parts of the object (e.g. the car's headlights are 3.45 centimeters apart) is known, the reconstruction can be performed at absolute scale.
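The ambiguity can be reproduced numerically. The minimal Python sketch below assumes an idealized pinhole camera without rotation or lens distortion, a hypothetical focal length of 800 pixels, and made-up scene points; it scales both the scene and the camera positions by the same factor and compares the resulting image coordinates.

```python
def project(point, cam_pos, focal=800.0):
    """Pinhole projection for a camera at cam_pos looking along +Z, no rotation."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    return (focal * x / z, focal * y / z)

def scale_all(vectors, s):
    """Scale every 3D vector in the list by the factor s."""
    return [tuple(s * v for v in vec) for vec in vectors]

# Hypothetical "full-size" scene points and two camera positions, in meters.
scene = [(-0.9, 0.0, 5.0), (0.9, 0.0, 5.0), (0.0, 0.7, 6.0)]
cameras = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

# "Miniature" version: scene and camera baseline both shrunk by the same factor.
mini_scene = scale_all(scene, 0.05)
mini_cameras = scale_all(cameras, 0.05)

full = [[project(p, c) for p in scene] for c in cameras]
mini = [[project(p, c) for p in mini_scene] for c in mini_cameras]

# Largest difference in pixel coordinates between the two configurations.
max_diff = max(
    abs(a - b)
    for cam_f, cam_m in zip(full, mini)
    for pf, pm in zip(cam_f, cam_m)
    for a, b in zip(pf, pm)
)
```

Up to floating-point rounding, `max_diff` is zero: the two configurations produce the same images, so the images alone cannot distinguish between them. Camera rotations are left out of the sketch because a uniform scaling of the scene does not change them.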
In a case where the absolute spatial scale of a scene cannot be determined, the SLAM system may assign a random scale, for example by determining initial keyframes from pixel disparity measurements in image space and assuming some generic real-world distance for the baseline between the two corresponding camera poses. The reconstructed 3D features therefore have coordinates in a coordinate system associated with the geometrical model, which has an unknown scale factor relative to absolute coordinates as they are in the real world, e.g. in millimeters, centimeters, meters, or inches. Further, camera positions computed based on the recovered geometrical models are also only up to scale, see reference [4].
The undetermined scale factor makes it challenging to determine true camera movements at absolute scale in, for example, vision-based navigation of a robot system or a vehicle, and to correctly overlay virtual visual information on the real environment in a camera image in Augmented Reality applications. As an example, a vision-based navigation application may be able to determine the shape of the camera motion (e.g. that the camera is moving on a circular path), but it cannot determine translational parts (e.g. distances or positions) at absolute scale, e.g. whether the radius of the circle is 1 meter or 10 meters. As another example, consider an Augmented Reality application that superimposes a virtual piece of furniture, spatially registered, on a live video feed of the environment. If camera tracking is performed in a coordinate system with a random (i.e. arbitrary) scale, then the superimposed virtual furniture will also have an arbitrary scale. A virtual 2-meter-high cupboard could look three times as high as a 1-meter-high table, or it could look half as high as that table, depending on the arbitrary scale that was chosen during reconstruction. Obviously, this is not desirable. Instead, a virtual 2-meter-high cupboard should appear twice as high as a 1-meter-high real table next to it. The real and the virtual objects in the camera image augmented by superimposition should be consistent in terms of scale. To enable this, the (correct) absolute scale of the geometrical model of the real environment needs to be known.
Also, in a situation in which multiple geometrical models of multiple real objects have been separately created using the same vision-based SLAM system for tracking the multiple real objects simultaneously, as in reference [8], the problem of undetermined scale factors is quite significant. Typically, random scale values are applied to each of the multiple geometrical models. If the SLAM system switches between the geometrical models, the scale may change and, therefore, the user experience in computer vision applications such as Augmented Reality is seriously affected.
Various methods have been proposed for determining correct scale factors that could define true sizes of reconstructed geometrical models of real environments as they are in the real world.
For example, Davison et al. in reference [1] propose to introduce calibration objects with known absolute spatial dimensions into the scene for determining absolute scale in SLAM systems. This requires changing the appearance of the scene, because the same camera is used both to capture the calibration objects and to capture the scene to be reconstructed in SLAM. In addition, the user has to have the calibration objects available.
Lemaire et al. in reference [5] propose to use a stereo camera system (i.e. two cameras displaced from each other and having overlapping camera frustums) to solve the problem of determining absolute scale in SLAM systems. However, using a stereo camera is only a partial remedy, since the displacement between the two cameras has to be significant in relation to the distance to the environment or object in order to reliably compute the depth of the environment. In addition, the displacement between the two cameras needs to be known at absolute scale, i.e. in units such as millimeters, centimeters, meters, or inches.
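The dependence on the baseline can be illustrated with the standard relation for a rectified stereo pair, depth = focal length × baseline / disparity. The Python sketch below is only an illustration under assumed values (a focal length of 800 pixels, an object at 10 meters, and a disparity measurement error of half a pixel); it is not the method of reference [5].

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_noise_px=0.5):
    """Depth overestimate caused by measuring the disparity too small by the given noise."""
    true_disparity = focal_px * baseline_m / depth_m
    noisy_disparity = true_disparity - disparity_noise_px
    if noisy_disparity <= 0.0:
        return float("inf")  # the point appears to be infinitely far away
    return depth_from_disparity(focal_px, baseline_m, noisy_disparity) - depth_m

# Assumed values: 800 px focal length, object at 10 m, three candidate baselines.
focal_px = 800.0
depth_m = 10.0
baselines_m = [0.01, 0.1, 1.0]
errors_m = [depth_error(focal_px, b, depth_m) for b in baselines_m]
```

With these assumed numbers, the same half-pixel error shifts the estimated depth by many meters for the 1 centimeter baseline but only by a few centimeters for the 1 meter baseline, which is why a baseline that is small relative to the object distance gives unreliable depth.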
Approaches for estimating absolute scale using multi-camera set-ups with non-overlapping camera frustums are also disclosed in reference [14]. However, the displacement between the two cameras has to be significant in relation to the distance to the environment or object in order to reliably compute the depth of the environment.
Lieberknecht et al. in reference [6] integrate depth information into monocular vision-based SLAM to allow correctly scaled geometrical model reconstruction by employing an RGB-D camera that provides absolute depth information related to image pixels. It is possible to determine absolute scale from known depth information at absolute scale. However, compared to a normal monocular RGB camera, an RGB-D camera device is not commonly available in a hand-held device, e.g. a mobile phone, tablet computer, or PDA. Moreover, active stereo-based depth cameras, which are based on projecting infrared light into the scene, do not work reliably if there is significant infrared environment light, as is the case in outdoor environments during daylight.
Klein et al. in reference [7] solve the problem of scale estimation by manually defining a baseline (i.e. the distance at absolute scale) between the two positions of a camera while it captured the two images needed for 3D triangulation, which is used to reconstruct the environment.
Sensor fusion with an Inertial Measurement Unit (IMU) could also be used to estimate the absolute scale, as disclosed in reference [9]. One problem with this approach is the inaccuracy of the sensor values, which results in inaccurate scale estimates. Expensive (i.e. computationally intensive) techniques like “Kalman Filtering” or “Bundle Adjustment” are used to address the problem, but usually the accuracy of the IMUs integrated in off-the-shelf devices, such as mobile phones, is not sufficient to estimate absolute scale accurately.
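To illustrate how sensor inaccuracy propagates into a scale estimate, the following Python sketch double-integrates accelerometer samples for a simulated motion of 1 meter, once without error and once with a constant bias. The motion profile, the 100 Hz sample rate, and the bias of 0.05 m/s² are assumptions made for this example; the sketch uses plain numerical integration and does not reproduce the fusion approach of reference [9].

```python
def integrate_position(accels, dt):
    """Double-integrate acceleration samples (simple Euler steps) to a displacement."""
    velocity = 0.0
    position = 0.0
    for a in accels:
        velocity += a * dt
        position += velocity * dt
    return position

dt = 0.01  # assumed 100 Hz sample rate
# Assumed motion: accelerate at 1 m/s^2 for 1 s, then decelerate for 1 s (about 1 m total).
true_accels = [1.0] * 100 + [-1.0] * 100

bias = 0.05  # hypothetical constant accelerometer bias in m/s^2
measured_accels = [a + bias for a in true_accels]

true_displacement = integrate_position(true_accels, dt)
measured_displacement = integrate_position(measured_accels, dt)

# A scale factor derived from the IMU displacement inherits this relative error.
relative_scale_error = measured_displacement / true_displacement - 1.0
```

In this simulation, the small bias alone produces a scale error of roughly ten percent after only two seconds, and because the displacement error grows with the square of the integration time, longer motions make the error larger still.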
Therefore, it would be desirable to provide a method and system for determining spatial coordinates of a 3D reconstruction of at least part of a first real object at absolute spatial scale, which are capable of reconstructing real objects at absolute scale or of determining a scale factor that maps coordinates of a reconstruction at an arbitrary scale to absolute scale.