Augmented reality (AR) is a field of computer research that deals with the combination of real-world and computer-generated data, where computer graphic objects are blended into real footage in real time. The majority of augmented reality capturing systems operate with predetermined information about the environment of a user. The predetermined information is typically some form of map. The user is allowed to interact with the environment based on the map. If the map provided is comprehensive, registration can be performed directly and accurately from the map.
Performing registration using a map is a common method used in camera-based augmented reality tracking. One conventional method of creating a comprehensive map is to use fiducial markers densely distributed in user environments during initialization. Unfortunately, creating the map is difficult and time-consuming. Such a map is often created manually by trained technicians.
Methods of tracking hand-held devices, such as a camera, in an unknown environment in real time are known. Tracking and mapping are typically separated and run in parallel as separate threads on multi-core processors (e.g. in a smartphone or desktop computer). The tracking thread deals with the task of robustly tracking erratic hand-held motion in real time using natural image features. The mapping thread, operating at a lower rate than the tracking thread, produces a three-dimensional (3D) map of point features from a subset of previously observed images called “keyframes”. The map may be refined using bundle adjustment.
One disadvantage of the known methods of tracking hand-held devices is that the methods expand the map very slowly. The tracking methods try to expand the map only when a current frame is added as a new keyframe. Typically, to become a keyframe, the current frame needs to satisfy conditions including: tracking quality is good (e.g. the ratio between the number of observed map points and the number of potentially visible map points exceeds a predefined threshold); the time since the last keyframe was added exceeds some predefined threshold, such as two-thirds of a second or twenty (20) frames; and the camera that captured the frame is at least a minimum distance away from the nearest camera location associated with a keyframe already in the map. The conditions attempt to provide a suitable baseline for triangulating new map points while also avoiding the addition of redundant keyframes by ensuring some distance between the keyframes. However, such conditions can limit the ability of the above methods to explore the environment. Further, adding new map points becomes difficult once an initial set of keyframes is captured, since the camera location is likely to be close to at least one of the keyframes already in the map, although the current frame may see a significantly different area of the scene due to rotation (i.e. panning). The difficulty of adding new map points prevents fast and reliable exploration of the environment because newly discovered areas remain unmapped.
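The keyframe conditions described above can be illustrated with a minimal sketch. The `Frame` type, the function name and all threshold values below are hypothetical and chosen only for illustration; they are not taken from any particular tracking system. The example shows why a camera that only pans fails the distance condition even when the other conditions are met.

```python
import math
from dataclasses import dataclass


@dataclass
class Frame:
    observed_points: int         # map points actually found in this frame
    visible_points: int          # map points predicted to be in view
    frames_since_keyframe: int   # frames elapsed since the last keyframe
    position: tuple              # camera location (x, y, z)


def is_keyframe_candidate(frame, keyframe_positions,
                          min_quality=0.6, min_frames=20, min_distance=0.1):
    """Return True only if all three conditions hold (thresholds are illustrative)."""
    if frame.visible_points == 0:
        return False
    # Condition 1: tracking quality (observed / potentially visible map points).
    quality_ok = frame.observed_points / frame.visible_points > min_quality
    # Condition 2: enough frames have passed since the last keyframe.
    time_ok = frame.frames_since_keyframe > min_frames
    # Condition 3: minimum distance from the nearest existing keyframe camera.
    nearest = min((math.dist(frame.position, p) for p in keyframe_positions),
                  default=math.inf)
    return quality_ok and time_ok and nearest >= min_distance


keyframes = [(0.0, 0.0, 0.0)]
panned = Frame(90, 100, 30, (0.0, 0.0, 0.0))     # pure rotation, no translation
moved = Frame(90, 100, 30, (0.5, 0.0, 0.0))      # translated by 0.5 units
print(is_keyframe_candidate(panned, keyframes))  # False
print(is_keyframe_candidate(moved, keyframes))   # True
```

The panned frame is rejected solely because its camera location coincides with an existing keyframe, regardless of how much new scene content it observes.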
In the known methods of tracking hand-held devices, after a new keyframe is added, an existing keyframe is selected to pair with the new keyframe to expand the map. New map points are created from the matching point correspondences by triangulation. The known methods use the closest keyframe for the pairing, which limits the possible stereo baseline separation. The closest keyframe does not necessarily have the largest overlap of viewing area.
Requiring a minimum distance between keyframes means that simply rotating the camera will not produce a new keyframe. One method adds a new keyframe based on the camera viewing direction. For each keyframe having an associated camera location less than a minimum distance away from the current camera location, the viewing direction of the current frame is compared with that of the keyframe. If the angle between the viewing directions is larger than a predefined threshold, the current frame is also added as a new keyframe.
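The viewing-direction test described above may be sketched as follows. The function names and threshold values are hypothetical. The description does not state whether the angle must exceed the threshold for every nearby keyframe or for only one of them; this sketch assumes every nearby keyframe, so that a view already covered by some nearby keyframe is not added again.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def is_new_view(position, direction, keyframes,
                min_distance=0.1, min_angle_deg=30.0):
    """keyframes is a list of (camera_position, viewing_direction) pairs.

    Only keyframes closer than min_distance are compared. If there are none,
    the ordinary distance condition applies instead and this test returns False.
    """
    nearby = [kf_dir for kf_pos, kf_dir in keyframes
              if math.dist(position, kf_pos) < min_distance]
    # Assumption: the view must differ from every nearby keyframe.
    return bool(nearby) and all(
        angle_between_deg(direction, d) > min_angle_deg for d in nearby)


keyframes = [((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))]
print(is_new_view((0.0, 0.0, 0.0), (1.0, 0.0, 1.0), keyframes))  # True (45 degrees)
print(is_new_view((0.0, 0.0, 0.0), (0.1, 0.0, 1.0), keyframes))  # False (about 6 degrees)
```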
To ensure valid triangulation and maximize the number of new map points, one known method first determines the closest point of intersection of the camera viewing vectors. The distance between the point of intersection and the camera locations is then compared to scene depths for the keyframes. The difference between expected point depth and actual depth is used as a quality measure. A small difference suggests that the camera is looking at a similar area of the scene, and therefore the keyframe with the lowest difference is used for pairing. An alternative method selects the keyframe that has the highest number of matching correspondences with the current frame.
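The depth-based selection described above may be sketched as follows. This is one possible reading rather than a definitive implementation: the closest point of intersection is taken as the midpoint of the shortest segment joining the two viewing lines, the "expected" depth is taken as the distance from the keyframe camera to that point, and the "actual" depth is taken as a single mean scene depth stored per keyframe. All names are hypothetical, and rejecting intersections that lie behind either camera is an added assumption.

```python
import math


def closest_point_between_rays(p1, d1, p2, d2, eps=1e-9):
    """Midpoint of the shortest segment joining lines p1 + s*d1 and p2 + t*d2.

    Returns None if the lines are near-parallel or if the closest point lies
    behind either camera (assumption: such pairs are not usable).
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b            # zero when the directions are parallel
    if denom < eps:
        return None
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    if s <= 0 or t <= 0:
        return None
    q1 = [p + s * k for p, k in zip(p1, d1)]
    q2 = [p + t * k for p, k in zip(p2, d2)]
    return tuple((x + y) / 2 for x, y in zip(q1, q2))


def depth_difference(cur_pos, cur_dir, kf_pos, kf_dir, kf_depth):
    """Quality measure: |expected depth - stored scene depth| for one keyframe."""
    point = closest_point_between_rays(cur_pos, cur_dir, kf_pos, kf_dir)
    if point is None:
        return math.inf
    return abs(math.dist(point, kf_pos) - kf_depth)


def select_pairing_keyframe(cur_pos, cur_dir, keyframes):
    """keyframes is a list of (position, viewing_direction, mean_scene_depth).

    Returns the index of the keyframe with the lowest depth difference,
    or None if no keyframe yields a valid intersection.
    """
    scores = [depth_difference(cur_pos, cur_dir, *kf) for kf in keyframes]
    best = min(range(len(scores)), key=scores.__getitem__, default=None)
    return None if best is None or math.isinf(scores[best]) else best


keyframes = [
    ((2.0, 0.0, 0.0), (-1.0, 0.0, 1.0), 2.8),  # farther away, depth consistent
    ((1.0, 0.0, 0.0), (-1.0, 0.0, 1.0), 5.0),  # nearer, depth inconsistent
]
print(select_pairing_keyframe((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), keyframes))  # 0
```

In this example the nearer keyframe is passed over in favor of the farther one, because the point at which its viewing vector meets that of the current frame lies much closer than the scene depth recorded for it.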