1. Technical Field
The invention is related to incremental motion estimation techniques, and more particularly to an incremental motion estimation system and process for estimating the camera pose parameters associated with each image of a long image sequence using a local bundle adjustment approach.
2. Background Art
Estimating the motion and structure of an object from long image sequences depicting the object has been a topic of interest in the computer vision and computer graphics fields for some time. For example, determining the motion and structure of objects depicted in a video of a scene containing the objects is of particular interest. The estimates are used for a variety of applications. For instance, estimates of the structure of an object depicted in consecutive images can be used to generate a 3D model of the object. Estimating the motion of an object in an image sequence is useful in background/foreground segmentation and video compression, as well as many other applications. A key part of estimating motion and structure from a series of images involves ascertaining the camera pose parameters associated with each image in the sequence. The optimal way to recover the camera pose parameters from long image sequences is through the use of a global bundle adjustment process.
Essentially, global bundle adjustment techniques attempt to simultaneously adjust the camera pose parameters associated with all the images such that the 3D locations predicted for a point depicted in multiple images coincide. This typically involves the minimization of re-projection errors. However, bundle adjustment does not give a direct solution; rather, it is a refining process that requires a good starting point. This starting point is often obtained using conventional incremental approaches. In general, incremental approaches attempt to estimate the camera pose parameters of an image in a sequence using the parameters computed for preceding images.
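The re-projection error mentioned above can be illustrated with a minimal sketch. It assumes a simple pinhole camera with a single focal length and principal point; the function names (project, reprojection_error), the synthetic points, and the pose values are hypothetical and chosen only for illustration. A bundle adjustment procedure would adjust the poses and 3D points to minimize this quantity; the sketch only evaluates it.

```python
def project(f, cx, cy, R, t, X):
    """Project a 3D world point X into an image (assumed pinhole model).

    R is a 3x3 rotation (nested lists), t a 3-vector, f the focal length,
    and (cx, cy) the principal point. All are illustrative assumptions.
    """
    # World coordinates -> camera coordinates: Xc = R * X + t
    xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Perspective division -> pixel coordinates
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)


def reprojection_error(f, cx, cy, poses, points, observations):
    """Sum of squared pixel residuals over all (image, point) observations.

    observations maps (image index, point index) to an observed 2D point.
    """
    total = 0.0
    for (i, j), (u, v) in observations.items():
        R, t = poses[i]
        u_pred, v_pred = project(f, cx, cy, R, t, points[j])
        total += (u_pred - u) ** 2 + (v_pred - v) ** 2
    return total


# Synthetic data: two cameras related by a small sideways translation.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f, cx, cy = 500.0, 320.0, 240.0
points = [[0.0, 0.0, 5.0], [1.0, -0.5, 6.0], [-1.0, 0.5, 4.0]]
poses = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [0.2, 0.0, 0.0])]

# Noise-free observations generated from the true poses and points.
observations = {
    (i, j): project(f, cx, cy, R, t, X)
    for i, (R, t) in enumerate(poses)
    for j, X in enumerate(points)
}

err_true = reprojection_error(f, cx, cy, poses, points, observations)

# Perturbing the second camera's translation increases the error.
perturbed = [poses[0], (IDENTITY, [0.25, 0.0, 0.0])]
err_perturbed = reprojection_error(f, cx, cy, perturbed, points, observations)
```

With noise-free observations the error at the true parameters is zero, and any perturbation of a pose raises it, which is why a refining process of this kind needs a starting point already close to the minimum.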
A good incremental motion estimation process is also very important for applications other than just supplying a good starting point for the global bundle adjustment. For example, in many time-critical applications, such as visual navigation, there simply is not enough time to compute camera pose parameters using global bundle adjustment techniques, which are relatively slow processes. In such cases, the estimation of the camera pose parameters can be achieved using faster incremental approaches, albeit with potentially less accuracy. Additionally, when an image sequence is dominated by short feature tracks (i.e., overlap between successive images is small), the global optimization techniques degenerate into several weakly correlated local processes. In such cases, the aforementioned incremental methods will produce similar results with less processing. Still further, in some computer graphics applications, local consistency is more important than global consistency. For example, due to errors in calibration and feature detection, a global 3D model may not be good enough to render photorealistic images. Approaches such as “view-dependent geometry”, which rely on a local 3D model, may be preferable. Incremental methods can be employed as part of the process of generating these local 3D models.
There are two main categories of incremental techniques. The first is based on Kalman filtering. Because of the nonlinearity between motion-structure and image features, an extended Kalman filter is typically used. The final result then depends on the order in which the image features are supplied, and the error variance of the estimated motion and structure is usually larger than that obtained with bundle adjustment.
The second category is referred to as subsequence concatenation. The present system and process fall into this category. One example of a conventional subsequence concatenation approach is the “threading” operation proposed by Avidan and Shashua in reference [1]. This operation connects two consecutive fundamental matrices using the tri-focal tensor. The threading operation is applied to a sliding window of triplets of images, and the camera matrix of the third view is computed from at least 6 point matches across the three views and the fundamental matrix between the first two views. Because algebraic distances are used, the estimated motion is not statistically optimal. Fitzgibbon and Zisserman [2] also proposed using subsequences of triplets of images. The difference is that bundle adjustment is conducted for each triplet to estimate the tri-focal tensor, and successive triplets are stitched together into a whole sequence. A final bundle adjustment can be conducted to improve the result if necessary. Two successive triplets can share zero, one or two images, and the stitching quality depends on the number of common point matches across six, five or four images, respectively. The number of common point matches over a subsequence decreases as the length of the subsequence increases. This means that the stitching quality is lower when the number of overlapping images is smaller. Furthermore, with two overlapping images, there will be two inconsistent camera motion estimates between the two images, and it is necessary to employ an additional nonlinear minimization procedure to maximize the camera consistency. It is also noted that both of the subsequence concatenation processes described above rely on point matches across three or more views. Point matches between two views, although more common, are ignored.
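The observation that common point matches become scarcer as a subsequence grows longer can be illustrated with a small sketch. It assumes each feature track is summarized by the first and last frame in which the feature is matched, and that the feature is visible in every frame in between; the helper name tracks_spanning and the track list are hypothetical, made-up data rather than results from any real sequence.

```python
def tracks_spanning(tracks, first, last):
    """Count feature tracks that are visible in every frame from first to last.

    Each track is a (start_frame, end_frame) pair, inclusive, and is assumed
    to be matched in all frames between its start and end.
    """
    return sum(1 for start, end in tracks if start <= first and end >= last)


# Hypothetical feature tracks over a six-frame sequence (frames 0..5).
tracks = [
    (0, 1), (0, 2), (1, 2), (1, 3), (0, 5),
    (2, 3), (2, 5), (3, 4), (4, 5), (1, 4),
]

# Number of matches common to windows of 2, 3, 4 and 5 frames starting at frame 1.
counts = [tracks_spanning(tracks, 1, 1 + length - 1) for length in (2, 3, 4, 5)]
```

In this toy data the count can only stay the same or shrink as the window lengthens, since a track spanning the longer window necessarily spans the shorter one. The two-frame window has the most matches, which is the pool of two-view matches that the triplet-based approaches described above leave unused.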
It is noted that in this background section and in the remainder of the specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.