One challenge in enabling Augmented Reality (AR) on mobile phones or other mobile platforms is the problem of detecting and tracking objects in real-time. Object detection for AR applications has very demanding requirements: it must deliver full six degrees of freedom, give absolute measurements with respect to a given coordinate system, be very robust, and run in real-time. Of interest are methods to compute camera pose using computer vision (CV) based approaches, which rely on first detecting and, subsequently, tracking objects within the camera view. In one aspect, the detection operation includes detecting a set of features contained within a digital image. A feature may refer to a region in the digital image that differs in properties, such as brightness or color, compared to areas surrounding that region. In one aspect, a feature is a region of a digital image in which some properties are constant or vary within a prescribed range of values.
The detected features are then compared to known features contained in a feature database in order to determine whether a real-world object is present in the image. Thus, an important element in the operation of a vision-based AR system is the composition of the feature database. In some systems, the feature database is built pre-runtime by taking multiple sample images of known target objects from a variety of known viewpoints. Features are then extracted from these sample images and added to the feature database.
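As a minimal sketch of the comparison step described above, the following Python example matches detected feature descriptors against a small feature database and votes on which known object, if any, is present. Everything named here is an assumption made for illustration: the function `detect_object`, the representation of a descriptor as a tuple of floats, the Euclidean distance metric, and the `max_distance` and `min_votes` thresholds are not taken from any particular system.

```python
import math
from collections import Counter


def detect_object(detected, database, max_distance=0.5, min_votes=2):
    """Return the label of the known object best supported by the detected
    features, or None if no object gathers enough matching features.

    detected: list of descriptors (tuples of floats) from the current image.
    database: list of (label, descriptor) pairs extracted pre-runtime from
              sample images of known target objects (hypothetical layout).
    """
    votes = Counter()
    for descriptor in detected:
        # Find the nearest database feature by Euclidean distance.
        distance, label = min(
            (math.dist(descriptor, known), label) for label, known in database
        )
        # Count the match only if the nearest feature is close enough.
        if distance <= max_distance:
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


# Toy database: two features of a "mug" seen from different viewpoints,
# and one feature of a "book".
database = [("mug", (0.0, 0.0)), ("mug", (1.0, 0.0)), ("book", (5.0, 5.0))]
detected = [(0.1, 0.0), (0.9, 0.1), (9.0, 9.0)]
result = detect_object(detected, database)
```

Practical systems typically use high-dimensional descriptors and approximate nearest-neighbor search rather than the exhaustive scan shown here; the sketch only illustrates why the composition of the database determines which objects can be recognized.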
Recently, augmented reality systems have turned to model-based tracking algorithms or Simultaneous Localization And Mapping (SLAM) algorithms that are based on color or grayscale image data captured by a camera. SLAM algorithms reconstruct three-dimensional (3D) points from incoming image sequences captured by a camera and are used to build a 3D map of a scene (i.e., a SLAM map) in real-time. From the reconstructed map, it is possible to localize the camera's six degrees of freedom (6 DoF) pose in a current image frame.
In some systems, SLAM maps of a target object are generated pre-runtime and at close distance to the object. At runtime, the generated SLAM maps of the object are used to estimate the 6 DoF pose of the camera, relative to the object, from incoming video frames.
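To make the notion of a 6 DoF pose concrete, the sketch below represents a pose as three rotation angles plus a three-component translation and uses it to project reconstructed 3D map points into the image. The mean reprojection error it computes is one common measure of how well a candidate pose explains the current frame. This is an illustration under stated assumptions, not the method of any specific SLAM system: the pinhole camera model, the Z-Y-X Euler angle convention, the `focal` and `center` intrinsics, and all function names are hypothetical.

```python
import math


def rotation_matrix(roll, pitch, yaw):
    """Rotation R = Rz(yaw) * Ry(pitch) * Rx(roll), angles in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def project(point, pose, focal=500.0, center=(320.0, 240.0)):
    """Project a 3D map point into pixel coordinates for a 6 DoF pose.

    pose = (roll, pitch, yaw, tx, ty, tz): three rotational and three
    translational degrees of freedom. Returns None for points that are
    not in front of the camera.
    """
    roll, pitch, yaw, tx, ty, tz = pose
    R = rotation_matrix(roll, pitch, yaw)
    # Map coordinates to camera coordinates: X_cam = R * X_map + t
    x, y, z = (
        sum(R[row][col] * point[col] for col in range(3)) + t
        for row, t in enumerate((tx, ty, tz))
    )
    if z <= 0:
        return None
    return (focal * x / z + center[0], focal * y / z + center[1])


def reprojection_error(map_points, observations, pose):
    """Mean pixel distance between observed features and projected map points."""
    errors = []
    for point, observed in zip(map_points, observations):
        predicted = project(point, pose)
        if predicted is not None:
            errors.append(math.dist(predicted, observed))
    return sum(errors) / len(errors) if errors else float("inf")


map_points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
identity_pose = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
observations = [project(p, identity_pose) for p in map_points]
```

Localizing the camera then amounts to searching for the pose with the lowest reprojection error against the features observed in the current frame; the pose that generated the observations yields an error of zero, while a shifted pose yields a larger one.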
In existing methods, tracking performance depends upon the appearance of the object and its size in the camera view. If the target object is small, partially occluded, or lacks distinctive visual features, then the estimated camera pose loses accuracy and can also exhibit significant tracking jitter. In more extreme circumstances, very distant objects and objects that lie outside of the current field of view cannot be tracked at all, so any virtual augmentations registered with the target will also be lost.