It is sometimes desired to allow a person to combine virtual reality with a real world surface. For example, a user may point the camera of a computer, smartphone, or tablet at a wall, and the wall will appear on the display of the camera device. In addition, a virtual object will appear in the image and appear as if it is part of the real world environment, such as, for example, a basketball hoop appearing to be affixed to the wall. The virtual content may include labels, 3D models, shading, illumination, and the like. This is referred to as Augmented Reality (“AR”). In order for the view of the real world and the virtual scene to align properly (i.e., to be properly registered), the pose (i.e., 3D position and orientation) and other properties of the real and virtual cameras must be the same.
Estimating the pose of an object relative to the real world is a task of an AR system. Many different AR tracking methods and systems are available in the current art, including mechanical, magnetic, ultrasonic, inertial, and vision-based, as well as hybrid methods and systems, which combine the advantages of two or more technologies. The availability of powerful processors and fast frame-grabbers has made vision-based tracking methods desirable for various purposes due to their accuracy, flexibility, and ease of use.
When an application is triggered with the camera pointed at a certain surface in order to see augmented information on that surface, it is possible to calculate the relative camera pose between the pose of the camera when the application was triggered and its poses in consecutive camera frames. When the information presented is 3D content, it is useful to register the camera frames (which are 2D by nature) correctly. Small errors in the 2D registration may be reflected in large misalignments of the 3D content.
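The relative camera pose mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pose is given as a rotation matrix R and a translation vector t mapping world coordinates to camera coordinates (x_cam = R x_world + t). The function names and this convention are hypothetical and are not taken from any particular AR system.

```python
import math

def transpose(A):
    """Transpose of a 3x3 matrix stored as nested lists."""
    return [[A[j][i] for j in range(3)] for i in range(3)]

def mat_mul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_z(theta):
    """Rotation by theta radians about the z axis (used only for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def to_camera(R, t, x):
    """Apply a pose (assumed convention): x_out = R x + t."""
    Rx = mat_vec(R, x)
    return [Rx[i] + t[i] for i in range(3)]

def relative_pose(R_trig, t_trig, R_cur, t_cur):
    """Pose that maps trigger-frame camera coordinates to current-frame ones.

    Derived from x_cur = R_cur x_w + t_cur and x_w = R_trig^T (x_trig - t_trig).
    """
    R_rel = mat_mul(R_cur, transpose(R_trig))
    Rt = mat_vec(R_rel, t_trig)
    t_rel = [t_cur[i] - Rt[i] for i in range(3)]
    return R_rel, t_rel
```

If the camera has not moved since the application was triggered, the relative pose under this convention is the identity rotation with zero translation.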
The registration process used for Augmented Reality on planar surfaces is known as planar tracking or homography tracking. In the past, planar tracking or homography tracking has been done in contexts such as aligning different patches taken from space satellites. In Augmented Reality, the goal in many cases is displaying 3D content registered in real time to a real world surface or environment. One prior art approach tries to identify strong local features in the image (such as corners) and track those local features as the camera moves in order to register the image. With a sizable number of local features on the real world surface, it is possible to track the plane reliably and in real time. The local features approach can only work on surfaces that are well textured, which limits the usability of the application.
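As a minimal sketch of what homography tracking estimates, the following shows how a 3x3 homography maps a point on the plane from one frame to another. The matrices H_SHIFT and H_PERSP are hypothetical values chosen only for illustration. Estimating the homography from image data is a separate step that is not shown here.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists, row-major)."""
    x, y = point
    # Lift to homogeneous coordinates (x, y, 1) and multiply by H.
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the third coordinate to return to pixel coordinates.
    return (u / w, v / w)

# Hypothetical example: a pure image translation by (5, -2) pixels.
H_SHIFT = [[1.0, 0.0, 5.0],
           [0.0, 1.0, -2.0],
           [0.0, 0.0, 1.0]]

# Hypothetical example: a small perspective term in the bottom row, as
# arises when the camera is tilted with respect to the plane.
H_PERSP = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.001, 0.0, 1.0]]
```

The division by the third homogeneous coordinate is what distinguishes a general homography from an affine map. With H_PERSP, points farther along the x axis are scaled by a larger amount.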
Another approach (sometimes called the direct approach) tries to use all the pixels in the image and match between frames. The methods using the direct approach tend to be computationally intensive and are typically unable to deal with significant illumination changes. In addition, the approach has been limited in the number of degrees of freedom (DOF) that are available.
Six degrees of freedom (6DOF) registration means that the relation between the camera and the planar surface on which information is being augmented covers practically the full range of motions one can expect, in particular: moving the camera up and down, left and right, forward and backward, and tilting it both in rotation and at skewed angles with respect to the surface being imaged. The same applies the other way around, meaning moving the surface with respect to the camera. 2DOF registration accommodates only a limited set of motions, in particular up and down and left and right. Different degrees of freedom can be defined in between these two, but only 6DOF supports the full set of motions that can be done in reality.
Fiducial-based vision-based tracking is popular in AR applications due to the simplicity and robustness that such tracking offers. In the prior art, fiducials are physical objects of predefined shape (and possibly size), and are usually integrated with an identification mechanism for uniquely recognizing individual fiducials. Fiducials are placed in a scene and the camera position is calculated according to their locations in the images.
Another approach is called Natural-Feature Tracking (NFT). NFT methods rely on certain features found in the real world. However, the natural features that can be used should have some easily identified and somewhat unique characteristics. Thus, NFT methods limit tracking to highly textured objects or environments in which prominent scene features can be robustly and quickly located in each frame. NFT methods usually exhibit increased computational complexity compared with fiducial-based methods, as well as reduced accuracy, since little is assumed about the environment to be tracked. NFT methods are less obtrusive and can provide more natural experiences. Nevertheless, such methods are difficult to use for creating natural user-interfaces.
Planar shapes have also been used for tracking in the prior art. Ruiz et al. (hereinafter referred to as Ruiz 2006) (Alberto Ruiz, Pedro E. López de Teruel and Lorenzo Fernández, “Robust Homography Estimation from Planar Contours Based on Convexity”, European Conference on Computer Vision, pp. 107-120, 2006.) proposed a projective approach for estimating the 3D pose of shape contours. An invariant-based frame construction is used for extracting projective invariant features from an imaged contour. The features are used for constructing a linear system of equations in homogeneous coordinates that yields the camera pose. Although theoretically general, the construction proposed in Ruiz 2006 limits the scope of usable shapes by several assumptions on shape concavities, and limits the use of the method in AR applications. In addition, only sparse features are used in Ruiz 2006 for pose estimation, with no error minimization step for increasing the accuracy of the pose estimated.
Iterative optimization has been shown to be useful for tracking, as well as for refining given pose estimates. Fitzgibbon (hereinafter referred to as Fitzgibbon 2001) (Andrew W. Fitzgibbon, “Robust registration of 2D and 3D point sets”, In Proc. British Machine Vision Conference, volume II, pp. 411-420, 2001) proposed a 2D registration method for point sets based on the Levenberg-Marquardt nonlinear optimizer. As pointed out in Fitzgibbon 2001, direct nonlinear optimization on point sets can be easily extended to incorporate a robust estimator, such as a Huber kernel, which leads to more robust tracking. Such a method can also account for curves as sets of points, although the method makes no use of the connectivity information offered by such curves.
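To illustrate what a robust estimator such as a Huber kernel contributes, the following minimal sketch shows the standard Huber loss and the corresponding reweighting factor. It is not the implementation of Fitzgibbon 2001. The function names and the threshold parameter delta are assumptions made for illustration, and the nonlinear optimizer itself is not shown.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def huber_weight(residual, delta=1.0):
    """Weight applied to a residual in iteratively reweighted least squares.

    Inliers (|r| <= delta) keep full weight; outliers are scaled down
    in proportion to their size.
    """
    a = abs(residual)
    return 1.0 if a <= delta else delta / a

def robust_cost(residuals, delta=1.0):
    """Total cost over a set of point-to-model residuals."""
    return sum(huber_loss(r, delta) for r in residuals)
```

Under a plain squared-error cost, a residual of 3.0 would contribute 4.5, whereas with delta = 1.0 the Huber loss gives 2.5, so a single badly matched point has less influence on the estimated registration.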
A shape footprint, originally proposed by Lamdan et al. (hereinafter referred to as Lamdan 1988) (Lamdan, Y., Schwartz, J. T., and Wolfson, H. J., “Object Recognition by Affine Invariant Matching”, Computer Vision and Pattern Recognition, pp. 335-344, 1988.), is a construction that can be used for calculating a signature for a shape. Shape footprints have been proposed for the recognition of flat and rigid objects undergoing affine transformations.