The present invention concerns machine vision and, in particular, a correlation-based method and apparatus for processing a group of images of a scene to concurrently and iteratively obtain estimates of ego-motion and scene structure.
In machine vision applications, the motion of a camera rig as it moves through an environment (ego-motion) provides useful information for tasks such as navigation and self-localization within a map. Similarly, recovering the structure of the environment may be useful for tasks such as obstacle avoidance, terrain-based velocity control, and three-dimensional (3D) map construction. In general, the problems of estimating ego-motion and structure from a sequence of two-dimensional (2D) images are mutually dependent. Accurate prior knowledge of ego-motion allows structure to be computed by triangulation from corresponding image points. This is the principle behind standard parallel-axis stereo algorithms, where the baseline is known accurately from calibration. In this case, knowledge of the epipolar geometry provides an efficient mechanism for determining scene structure by searching for corresponding points.
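By way of illustration only, triangulation in the parallel-axis stereo case may be sketched as follows. The sketch assumes a rectified pair with a focal length expressed in pixels, image coordinates measured from the principal point, and a calibrated baseline; the function and parameter names are hypothetical and are not taken from any particular system.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (in metres) of a point seen by a rectified parallel-axis stereo pair.

    Assumes the standard relation Z = f * B / d, where d is the horizontal
    offset between the corresponding points in the left and right images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px


def triangulate(focal_px, baseline_m, x_left_px, x_right_px, y_px):
    """Return (X, Y, Z) of the point in the left-camera frame.

    Because the axes are parallel, corresponding points share the same image
    row, so the search for a match is confined to that row (the epipolar line).
    """
    z = depth_from_disparity(focal_px, baseline_m, x_left_px - x_right_px)
    return (x_left_px * z / focal_px, y_px * z / focal_px, z)


if __name__ == "__main__":
    # Hypothetical numbers: 400-pixel focal length, 0.5 m baseline.
    print(triangulate(400.0, 0.5, 50.0, 40.0, 20.0))
```

Because depth is inversely proportional to disparity, an error in the assumed baseline scales every recovered depth by the same factor, which is why the baseline must be known accurately from calibration.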
If, on the other hand, prior information is available regarding the structure of the scene, then the ego-motion can be computed directly. Essentially, the space of all possible poses of the camera is searched for the pose for which the perspective projection of the environment onto the image plane most closely matches each image in the sequence of images. The ego-motion is the path from one pose to the next.
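The pose search described above may be sketched, under simplifying assumptions, as an exhaustive comparison of candidate poses. In this sketch the known structure is a short list of 3D points, each pose is restricted to a translation (rotation is omitted for brevity), and the match is scored by squared reprojection error against observed point positions rather than against full image intensities. All names are hypothetical.

```python
def project(point, translation, focal_px):
    """Perspective projection of a 3D point for a camera displaced by `translation`."""
    x, y, z = (p - t for p, t in zip(point, translation))
    return (focal_px * x / z, focal_px * y / z)


def reprojection_error(points, observed, translation, focal_px):
    """Sum of squared distances between predicted and observed image points."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points, observed):
        u, v = project(point, translation, focal_px)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


def best_pose(points, observed, candidates, focal_px):
    """Return the candidate translation whose projection best matches the image."""
    return min(
        candidates,
        key=lambda t: reprojection_error(points, observed, t, focal_px),
    )


if __name__ == "__main__":
    structure = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 1.0, 12.0)]
    true_pose = (0.5, 0.0, 1.0)
    image_points = [project(p, true_pose, 100.0) for p in structure]
    grid = [(x * 0.5, 0.0, z * 0.5) for x in range(-2, 3) for z in range(0, 4)]
    print(best_pose(structure, image_points, grid, 100.0))
```

Applying the same search to each image in the sequence yields one pose per image, and the ego-motion is then the change from each pose to the next.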
It is more difficult to obtain accurate estimates of both ego-motion and structure by analyzing a sequence of 2D images in which neither is known. Generally, algorithms that attempt to perform this function fall into two classes: (i) those that use the epipolar constraint and assume that the motion field is available, and (ii) those that utilize the "positive depth" constraint, which are also known as "direct" algorithms. A correlation-based approach is described in an article by M. Irani et al. entitled "Robust multi-sensor image alignment," Proceedings of the Sixth International Conference on Computer Vision (ICCV'98), pp. 959-965, January 1998. This system uses correlation to align images obtained from multiple modalities (e.g., from visible and IR cameras). The described method, however, uses a 2D motion model. In addition, Beaudet masks are employed to estimate first and second derivatives of the correlation surface.
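For readers unfamiliar with the term, a correlation surface may be illustrated by the following minimal sketch, which scores a small patch of a reference image against a second image at each integer shift within a search window. The sketch uses normalized cross-correlation and is not the procedure of the cited article; the function names, patch size, and search range are assumptions chosen for illustration.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length lists of samples."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    # A flat patch has no contrast to correlate; score it as zero.
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def correlation_surface(ref, tgt, row, col, half, max_shift):
    """Map each integer shift (dy, dx) to the correlation score of a patch.

    The patch is the (2*half+1)-square window of `ref` centred on (row, col);
    it is compared with the window of `tgt` centred on (row+dy, col+dx).
    The caller must keep every window inside both images.
    """
    def patch(img, r, c):
        return [img[r + i][c + j]
                for i in range(-half, half + 1)
                for j in range(-half, half + 1)]

    ref_patch = patch(ref, row, col)
    return {
        (dy, dx): ncc(ref_patch, patch(tgt, row + dy, col + dx))
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    }


def peak_shift(surface):
    """Shift at which the correlation surface attains its maximum."""
    return max(surface, key=surface.get)
```

The location of the peak gives the displacement of the patch to the nearest pixel, while the shape of the surface around the peak indicates how well constrained that displacement is, which is why derivatives of the surface are of interest.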
Another system is described in an article by K. J. Hanna et al. entitled "Combining Stereo and Motion Analysis for Direct Estimation of Scene Structure," Proceedings of the International Conference on Computer Vision, pp. 357-365, 1993. The system described in this paper relies on the image-brightness constraint and stereo image processing techniques to align images and determine scene structure.