The present disclosure is related to a method of tracking a mobile device comprising at least one camera in a real environment, and to a method of generating a geometrical model of at least part of a real environment using image information from at least one camera of a mobile device comprising receiving image information associated with at least one image captured by the at least one camera.
Camera pose estimation and/or digital reconstruction of a real environment is a common and challenging task in many applications or fields, such as robotic navigation, 3D object reconstruction, and augmented reality visualization. As an example, it is known that systems and applications, such as augmented reality (AR) systems and applications, can enhance information about a real environment by overlaying computer-generated virtual information on a view of the real environment. The virtual information can be any type of visually perceivable data such as objects, texts, drawings, videos, or a combination thereof. The view of the real environment can be perceived as visual impressions by a user's eyes and/or be acquired as one or more images captured by a camera held by a user or attached to a device held by a user.
The task of camera pose estimation is to compute a spatial relationship or a transformation between a camera and a reference object (or environment). The task of camera motion estimation is to compute a spatial relationship or a transformation between a camera at one position and the camera at another position. Camera motion is also referred to as a camera pose, in that it describes the pose of a camera at one position relative to the same camera at another position. Camera pose or motion estimation is also known as tracking a camera. The spatial relationship or transformation describes a translation, a rotation, or a combination of both in 3D space.
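The relationship between camera pose and camera motion can be illustrated with a minimal sketch. It assumes a convention not stated in the text, namely that a pose is a pair (R, t) mapping world coordinates to camera coordinates as x_cam = R * x_world + t. The helper names (`relative_motion`, `transform`, `rot_y`) are hypothetical and chosen only for illustration.

```python
import math

def mat_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_y(angle_rad):
    """Rotation about the y axis by the given angle."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def transform(pose, point):
    """Apply a pose (R, t) to a 3D point: R * point + t."""
    r, t = pose
    rp = mat_vec(r, point)
    return [rp[i] + t[i] for i in range(3)]

def relative_motion(pose1, pose2):
    """Motion mapping camera-1 coordinates to camera-2 coordinates.

    Assumed convention: each pose maps world to camera coordinates.
    """
    r1, t1 = pose1
    r2, t2 = pose2
    r_rel = mat_mul(r2, transpose(r1))      # R2 * R1^T
    rt1 = mat_vec(r_rel, t1)
    t_rel = [t2[i] - rt1[i] for i in range(3)]  # t2 - R_rel * t1
    return r_rel, t_rel

# Two example poses of the same camera relative to the environment.
pose_a = (rot_y(0.1), [0.0, 0.0, 1.0])
pose_b = (rot_y(0.4), [0.3, 0.0, 1.2])
motion = relative_motion(pose_a, pose_b)
```

Under this assumed convention, applying `motion` to a point expressed in the first camera's coordinates yields the same point expressed in the second camera's coordinates, which is the rotation-plus-translation relationship described above.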
Vision based methods are known as robust and popular methods for computing a camera pose or motion. The vision based methods compute a pose (or motion) of a camera relative to an environment based on one or more images of the environment captured by the camera. Such vision based methods rely on the captured images and require detectable visual features in the images.
Computer Vision (CV) based Simultaneous Localization and Mapping (SLAM) is a well-known technology for determining the position and/or orientation of a camera relative to a real environment and creating a geometrical model of the real environment without requiring any pre-knowledge of the environment. The creation of the geometrical model of the real environment is also called the reconstruction of the environment. Vision based SLAM could facilitate many applications, such as navigation of a robot system or a mobile system. Particularly, it is a promising technology that would support mobile Augmented Reality (AR) in an unknown real environment.
Most SLAM systems have to be initialized in order to obtain an initial part of the environment model. The initialization has to be done with a distinct movement of the camera between acquiring two images of the real environment. The distinct movement requires that the two images are captured from two distinct camera locations with a sufficient displacement compared to the distance to the environment. Note that rotation-only camera motion produces a degenerate result. This is one of the major limitations of using a SLAM device in AR, particularly in hand-held or mobile AR, where it is not user-friendly to require a user to move the device in a certain way in order to make the system work. Rotation-only camera movement is a natural motion for users looking around in a real environment and occurs often in AR applications, yet it may produce a degenerate result for monocular SLAM.
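The requirement of a sufficient displacement compared to the distance to the environment can be made concrete with a small sketch that measures the parallax angle at a scene point between two camera centers. This is only an illustration under stated assumptions: the function names and the 1 degree threshold are hypothetical and are not taken from any particular SLAM system.

```python
import math

def parallax_angle_deg(center1, center2, point):
    """Angle (degrees) between the viewing rays from two camera centers
    to the same 3D point. Zero parallax means no depth can be triangulated."""
    ray1 = [point[i] - center1[i] for i in range(3)]
    ray2 = [point[i] - center2[i] for i in range(3)]
    dot = sum(a * b for a, b in zip(ray1, ray2))
    norm1 = math.sqrt(sum(a * a for a in ray1))
    norm2 = math.sqrt(sum(b * b for b in ray2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_angle))

def is_displacement_sufficient(center1, center2, point, min_angle_deg=1.0):
    """Hypothetical initialization test; the threshold is an assumption."""
    return parallax_angle_deg(center1, center2, point) >= min_angle_deg

scene_point = [0.0, 0.0, 5.0]          # a point 5 units in front of the camera
origin = [0.0, 0.0, 0.0]

# Rotation-only motion: the camera center does not move.
rotation_only_ok = is_displacement_sufficient(origin, origin, scene_point)

# Translation of 0.5 units sideways, i.e. 10% of the scene distance.
translated_ok = is_displacement_sufficient(origin, [0.5, 0.0, 0.0], scene_point)
```

With rotation-only motion the camera center stays fixed, the two viewing rays coincide, and the parallax is zero regardless of how far the camera rotates. That is why such motion carries no depth information for initialization.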
Furthermore, a single camera does not measure metric scale. Another limitation of using monocular SLAM systems in AR is that the recovered camera poses and the geometrical model of the environment are determined only up to an undetermined scale factor. The undetermined scale factor makes it challenging to correctly overlay virtual visual information onto the real environment in an image of the camera.
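The scale ambiguity can be demonstrated numerically. The following sketch assumes an ideal pinhole camera with no rotation and a hypothetical focal length of 500 pixels; the `project` helper is illustrative only. It shows that scaling both the scene and the camera translation by the same factor leaves the image unchanged.

```python
def project(point_world, translation, focal_px=500.0):
    """Project a 3D point with an ideal pinhole camera (identity rotation).

    The camera-frame point is point_world + translation; the focal length
    is an assumed value for illustration.
    """
    x = point_world[0] + translation[0]
    y = point_world[1] + translation[1]
    z = point_world[2] + translation[2]
    return (focal_px * x / z, focal_px * y / z)

point = [0.4, -0.1, 3.0]
translation = [0.2, 0.0, 0.5]
scale = 3.0

# Same scene and camera motion, but everything is three times larger.
scaled_point = [scale * c for c in point]
scaled_translation = [scale * c for c in translation]

pixel_original = project(point, translation)
pixel_scaled = project(scaled_point, scaled_translation)
```

Both projections land on the same pixel (up to floating-point rounding), so images alone cannot distinguish a small scene viewed from nearby from a proportionally larger scene viewed from farther away.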
Nowadays, geometrical models of many cities or buildings are available from 3D reconstruction or from their blueprints. However, most of these models are not up to date because city construction develops and changes frequently. In particular, parking lots usually have no geometrical models, or no up-to-date models, as the parked cars change from time to time.
Various monocular vision based SLAM systems have been developed for AR applications and particularly for mobile hand-held AR applications. Common challenges and limitations for their use include initialization of the SLAM systems and determination of metric scale factors. The initialization of the SLAM systems requires a distinct movement of the camera for acquiring two images of a real environment such that the two images are captured from two distinct camera locations with a sufficient displacement compared to the distance to the environment. The quality of the camera pose estimation and of any generated geometrical model depends directly on the initialization.
Achieving a distinct movement of the camera for a qualified SLAM initialization is especially challenging in hand-held AR applications, where users who hold the camera may not be aware of the importance of the camera movement and may even have difficulty performing the distinct movement. Therefore, it is desirable to simplify the initialization or even make it invisible to the users.
Furthermore, a single camera does not measure metric scale. The camera pose and the reconstructed environmental model from monocular vision based SLAM are determined only up to an undetermined scale factor. A correct scale factor defines the true camera pose and the size of the reconstructed environmental model as they are in the real world.
The first well-known monocular vision based SLAM system was developed by Davison et al. It requires the camera to have sufficient displacement between acquiring images for each newly observed part of a real environment. To determine correct metric scale factors, they introduce an additional calibration object with known geometrical dimensions.
Lemaire et al. propose to use a stereo camera system to solve the problem of requiring camera movements and determining scale factors. However, using a stereo camera is only a partial remedy, since the displacement between the two cameras has to be significant in relation to the distance to the environment in order to reliably compute the depth of the environment. Thus, a hand-held stereo system would be unable to completely solve the problem, and requiring the user to provide additional distinct movement may still be indispensable.
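Why the stereo baseline must be significant relative to the scene distance can be seen from the standard rectified pinhole stereo model, in which depth is Z = f * B / d for focal length f (pixels), baseline B, and disparity d (pixels). The sketch below assumes that model; the function names and numeric values are hypothetical and are not taken from the cited work.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point under the rectified stereo model: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dZ ~= Z^2 / (f * B) * dd.

    The error grows with the square of the depth and shrinks as the
    baseline grows.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

focal = 500.0                     # assumed focal length in pixels

# A point at 5 m seen with a 10 cm baseline and with a 5 cm baseline.
error_wide = depth_error(focal, 0.10, 5.0)
error_narrow = depth_error(focal, 0.05, 5.0)
```

Under these assumptions, halving the baseline doubles the depth uncertainty at a given distance, which is consistent with the observation that a compact hand-held stereo rig is only a partial remedy.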
Lieberknecht et al. integrate depth information into monocular vision based SLAM to allow correctly scaled camera pose estimation by employing an RGB-D camera that provides depth information related to image pixels. It is possible to determine a scale factor from known depth information. However, an RGB-D camera is less commonly available in a hand-held device, e.g. a mobile phone or PDA, than a normal RGB camera. Further, common low-cost RGB-D cameras that would be candidates for integration into hand-held devices are typically based on infrared projection, such as the Kinect system from Microsoft or the Xtion Pro from Asus. These systems are off-the-shelf, low-cost commodity consumer devices.
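One way a scale factor could be determined from known depth information is sketched below. This is not the method of the cited work; it is a minimal illustration that assumes per-feature depths from an up-to-scale reconstruction are available alongside metric depths from a depth sensor, and that takes the median of their ratios for robustness. The function name `estimate_scale` and the sample values are hypothetical.

```python
from statistics import median

def estimate_scale(reconstructed_depths, metric_depths):
    """Estimate the factor that maps up-to-scale depths to metric depths.

    Pairs with a non-positive value are skipped, since depth sensors
    commonly report missing measurements as zero (an assumption here).
    """
    ratios = [
        metric / reconstructed
        for reconstructed, metric in zip(reconstructed_depths, metric_depths)
        if reconstructed > 0 and metric > 0
    ]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)

# Depths of the same features in arbitrary reconstruction units and in meters.
reconstructed = [1.0, 2.0, 4.0, 0.5, 3.0]
measured = [2.5, 5.0, 10.0, 0.0, 30.0]   # one missing value, one outlier

scale = estimate_scale(reconstructed, measured)
```

Multiplying the reconstructed translations and model coordinates by the estimated factor would then express them in metric units, subject to the quality of the depth measurements.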
U.S. Pat. Nos. 8,150,142 B2 and 7,433,024 B2 describe in detail possible implementations of an RGB-D sensor. However, these systems have problems when used outdoors in the daytime due to sunlight.
Gauglitz et al. develop a camera pose estimation and environment model generation system that can work for both general camera motion and rotation-only camera motion. For rotation-only motion, their method creates a panoramic map of a real environment instead of a 3D geometrical model of the real environment.