Solutions for feature-based 3D tracking have been proposed for augmented reality applications. The majority of those approaches run in a known environment and require fast re-initialization; i.e., providing the system with the initial pose parameters of the camera. Real-time estimation of the camera's initial pose relative to an object, however, remains an open problem. The difficulty stems from the need for fast and robust detection of known objects in the scene given their 3D models, or a set of 2D images, or both.
In computer vision applications, local image features have proven to be very successful for object detection. Those methods usually reduce the problem to a wide-baseline stereo matching problem, where local feature regions associated with key or interest points are first extracted independently in each image and then characterized by photometric descriptors for matching.
Local descriptors can be computed efficiently. A redundant combination of such descriptors provides robustness to partial occlusion and cluttered backgrounds. Ideally, the descriptors are distinctive and invariant to viewpoint and illumination variations. Many different descriptors have been proposed. The Scale Invariant Feature Transform (SIFT) descriptor introduced by Lowe has been shown to be one of the most efficient descriptors with good performance in image retrieval tasks.
In the computer vision arts, a number of feature detectors are known to be invariant to scale and affine transformations. For example, K. Mikolajczyk et al. have compared the performance of several state-of-the-art affine region detectors using a set of test images under varying imaging conditions (K. Mikolajczyk et al., A Comparison of Affine Region Detectors, IJCV (2004)). The comparison shows that there does not exist one detector that systematically outperforms the other detectors for all scene types. The detectors are rather complementary; some are more adapted to structured scenes and others to textures.
Given an arbitrary scene or target object, however, it is unclear which type of feature is more appropriate for matching, and whether performance depends on viewing direction.
The problem of 3D object detection for pose estimation has been addressed by a number of approaches. Some methods are based on statistical classification techniques, such as Principal Component Analysis (PCA), to compare the test image with a set of calibrated training images. Other techniques are based on matching of local image features. Those approaches have been shown to be more robust to viewpoint and illumination changes. They are furthermore more robust against partial occlusion and cluttered backgrounds.
While some approaches use simple 2D features such as corners or edges, more sophisticated approaches rely on local feature descriptors that are insensitive to viewpoint and illumination changes. Usually, geometric constraints are used as verification criteria for the estimated pose.
An affine-invariant image descriptor for 3D object recognition has been described (K. Mikolajczyk & C. Schmid, An Affine Invariant Interest Point Detector, 42 2002 ECCV at 128-142). Photometrically and geometrically consistent matches are selected in a pose estimation procedure based on the RANSAC algorithm introduced by Fischler and Bolles in 1981. Although that method achieves good results for 3D object detection, it is too slow for real-time applications.
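The hypothesize-and-verify loop at the core of RANSAC can be sketched as follows. This is a minimal illustration and not the cited method: it assumes a 2D similarity transform (rotation, scale and translation, written as q = a*p + b over complex numbers) in place of a full 3D pose, and the names fit_similarity and ransac_similarity, the threshold, the iteration count and the synthetic data are all hypothetical.

```python
import random

def fit_similarity(p1, q1, p2, q2):
    """Fit q = a*p + b (a, b complex) from two correspondences."""
    if p1 == p2:
        return None  # degenerate sample: both model points coincide
    a = (q1 - q2) / (p1 - p2)
    b = q1 - a * p1
    return a, b

def ransac_similarity(matches, threshold=1.0, iterations=200, seed=0):
    """matches: list of (p, q) complex pairs. Returns (model, inlier indices)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # Hypothesize: draw a minimal sample of two correspondences.
        (p1, q1), (p2, q2) = rng.sample(matches, 2)
        model = fit_similarity(p1, q1, p2, q2)
        if model is None:
            continue
        a, b = model
        # Verify: count correspondences whose transfer error is below threshold.
        inliers = [i for i, (p, q) in enumerate(matches)
                   if abs(a * p + b - q) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Synthetic data (assumed for illustration): 12 correct matches, 3 false ones.
true_a, true_b = complex(0, 2), complex(5, -3)
points = [complex(x, y) for x in range(4) for y in range(3)]
matches = [(p, true_a * p + true_b) for p in points]
matches += [(complex(10, 10), complex(-40, 7)),
            (complex(-7, 3), complex(60, 60)),
            (complex(2, 9), complex(-25, -30))]

model, inliers = ransac_similarity(matches)
print(len(inliers), "inliers out of", len(matches))
```

Every iteration re-scores all correspondences, so the run time grows with both the number of matches and the number of hypotheses needed, which is one reason such procedures can be slow when the fraction of correct matches is low.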
Other rotation-invariant local descriptors have been proposed. For example, D. Lowe, Object Recognition from Local Scale Invariant Features, 1999 ICCV at 1150-1157, proposes a method for extracting distinctive scale- and orientation-invariant key points. The distinctiveness is achieved by using a high-dimensional vector representing the image gradients in the local neighborhood for each key point.
Among the prior art descriptors, the SIFT descriptor has been shown to be particularly robust. SIFT descriptors, however, are high-dimensional (128-dimensional) and computationally expensive to match.
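The matching cost can be made concrete with a brute-force nearest-neighbor search, sketched below under stated assumptions: the descriptors are random stand-ins rather than real SIFT output, and match_brute_force and squared_distance are hypothetical helper names. Only the dimension of 128 comes from the text above.

```python
import random

DIM = 128  # SIFT descriptor length stated in the text

def squared_distance(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def match_brute_force(query, model):
    """For each query descriptor, return the index of the nearest model descriptor."""
    return [min(range(len(model)), key=lambda j: squared_distance(q, model[j]))
            for q in query]

rng = random.Random(1)
# Stand-in model descriptors (random values, assumed for illustration).
model = [[rng.random() for _ in range(DIM)] for _ in range(50)]
# Queries are slightly perturbed copies of three model descriptors.
picked = [3, 17, 42]
query = [[x + rng.gauss(0, 0.01) for x in model[j]] for j in picked]

matches = match_brute_force(query, model)
# Number of per-component comparisons performed by the exhaustive search.
cost = len(query) * len(model) * DIM
print(matches, cost)
```

The work scales with the number of query descriptors times the number of model descriptors times the descriptor length, so a shorter descriptor reduces the cost proportionally.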
F. Rothganger, S. Lazebnik, C. Schmid, & J. Ponce, 3D Object Modeling and Recognition Using Affine-Invariant Patches and Multiview Spatial Constraints, 2003 Proc. of Conference on Computer Vision and Pattern Recognition, propose a 3D object modeling and recognition approach for affine viewing conditions. The objects are assumed to have locally planar surfaces. A set of local surface patches from the target object is reconstructed using several images of the object from different viewpoints. The 3D object is then represented with the set of respective local affine-invariant descriptors and the spatial relationships between the corresponding affine regions. For matching, variants of RANSAC-based pose estimation algorithms are used, minimizing the back-projection error in the test image.
That method has two drawbacks. First, it is not computationally efficient. Second, representing every surface patch with a single descriptor may yield poor matching performance where there are wide viewpoint changes in the test image.
Other approaches use dimension reduction techniques, such as PCA, in order to project high-dimensional samples onto a low-dimensional feature space. PCA has been applied to the SIFT feature descriptor. It has been shown that PCA is well-suited to representing key point patches, and that the resulting representation is more distinctive and therefore improves the matching performance. It is furthermore more robust and more compact than the 128-dimensional SIFT descriptor.
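The projection step can be sketched as follows. This is a minimal illustration of PCA in general, not the specific procedure applied to SIFT in the prior art: it assumes small synthetic 6-dimensional samples instead of real descriptors, finds the leading eigenvectors of the sample covariance by power iteration with deflation, and uses hypothetical helper names throughout.

```python
import random

def mean_center(samples):
    """Return the per-dimension mean and the mean-subtracted samples."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    return mean, [[s[i] - mean[i] for i in range(d)] for s in samples]

def covariance(centered):
    """Sample covariance matrix of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_components(cov, k, iterations=500, seed=0):
    """Leading k eigenvectors of cov via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    basis = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        basis.append(v)
        # Deflate: remove the component just found before the next search.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return basis

def project(vector, mean, basis):
    """Coordinates of a vector in the low-dimensional PCA space."""
    return [sum((x - m) * b for x, m, b in zip(vector, mean, axis))
            for axis in basis]

# Synthetic samples (assumed): large variance on axis 0, moderate on axis 1,
# and near-zero noise on the remaining four axes.
rng = random.Random(2)
samples = [[5 * rng.gauss(0, 1), rng.gauss(0, 1)] +
           [rng.gauss(0, 0.01) for _ in range(4)] for _ in range(200)]

mean, centered = mean_center(samples)
basis = top_components(covariance(centered), k=2)
reduced = project(samples[0], mean, basis)
print(len(samples[0]), "->", len(reduced))
```

Once the basis has been learned offline, reducing a new descriptor costs one dot product per retained component, and subsequent matching operates on the shorter vectors.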
V. Lepetit, J. Pilet, & P. Fua, Point Matching as a Classification Problem for Fast and Robust Object Pose Estimation, 2004 Proceedings of Conference on Computer Vision and Pattern Recognition, treat wide-baseline matching of key points as a classification problem, where each class corresponds to the set of all possible views of such a point. With compact descriptors, the detection of key features can be performed at run time without loss of matching performance.
Conventional methods for object detection and recognition are based on local, viewpoint-invariant features. For example, some conventional feature detection methods rely on local features that are extracted independently from both model and test images. The extracted features are then characterized by invariant descriptors. Feature detection methods may also use 2D-2D or 2D-3D matching of the descriptors, or 2D-3D pose estimation.
Several problems exist with those conventional approaches. They are limited by the repeatability of the feature extraction and the discriminative power of the feature descriptors. Large variability in scale or viewpoint can lower the probability that a feature is re-extracted in the test image. Further, occlusion reduces the number of visible model features. Extensive clutter can cause false matches.
Each of those problems may result in only a small number of correct matches. With so few correct matches, reliable matching is typically very difficult, even with robust techniques such as RANSAC.
There is therefore presently a need to provide a method and system for fast object detection and pose estimation using 3D and appearance models. The technique should be capable of increased speed without sacrificing accuracy. To the inventors' knowledge, there is currently no such technique available.