Tracking image key points across frames is a well studied problem due its usefulness in computer and robotic vision applications such as optical flow [1], [4], object tracking [23], and 3D reconstruction [24]. The interest in feature tracking dates back to [1], where Lucas and Kanade first propose the well known KLT tracker for computing optical flow between spatially and temporally close frames.
The original KLT method assumes a translation model and iteratively estimates the displacement vector using image alignment techniques.
In simultaneous, cameras with wide field-of-view or small size optics became increasingly popular due to their benefits in many computer vision and robotic systems: fish-eye lenses provide a wide field-of-view that proved to be beneficial for tasks like egomotion estimation [25] and visual place recognition [26]; boroscopes are employed in medical endoscopy and industrial inspection for visualizing small cavities with difficult or limited access [27]. However, the projection in these cameras with wide-angle lens presents strong radial distortion (RD) that causes a displacement of the pixel positions along radial direction that increases with the distance to the center of the image, and it is typically described by nonlinear terms that are function of the image radius.
Image alignment techniques applied in a matching or tracking context rely on the assumption of a motion model that determines the type and amount of deformation that is tolerated by the tracker. Several motion models have been used in the literature but none takes into account the RD effect arising in camera devices equipped with the above-mentioned non-conventional optics. In practical terms, the inability of the currently used motion models to accommodate RD translates in more frequent template updates [7], which will introduce localization drifts [5] and, most importantly, affect the tracking accuracy and repeatability [17,7].
Despite of these facts, the KLT tracker has been applied in the past to images with significant RD [28], [29]. Some works directly apply the KLT method directly over RD images and, therefore, violate the underlying assumptions of the KLT tracker, which were done for perspective images. Other solutions used in the literature either discard the image boundaries [29], where the distortion effect is more pronounced, or correct the distortion in a pre-processing step before applying the KLT. Although the later approach is quite straightforward, the distortion rectification requires the interpolation of the image signal, which can be computationally expensive, and, even more important, unreliable since the synthetically corrected images contain artificially interpolated pixel intensities [18], [19].
This invention concerns the extension of the motion models and image alignment techniques, which were originally developed to the perspective, to the case of images with radial distortion. Such an extension is not trivial because, among other reasons, the warp for aligning the image patches depends both on local motion parameters and on global distortion parameters. Thus, all the parameters are interdependent and their estimation cannot be carried separately for each region as it usually happens for the conventional KLT approaches. This raises issues in terms of computational complexity, memory management, and real-time requirements that are overcome by a careful design of the warp models, that enables estimating the parameters using Inverse Compositional Alignment [2,3], and by applying the Schur Complement Method to efficiently compute the parameters' updates at every iteration of the minimization process. The experiments show that our method, henceforth referred as RD-KLT, dramatically improves the accuracy and repeatability of tracking while adding a computational overhead of less than 1% with respect to the equivalent KLT for perspective. In addition, the described invention also enables to calibrate radial distortion and changes in zoom by simply tracking low-level image features in two or more frames. Such automatic calibration from moving points is new and of great practical interest in application in domains like surveillance and medical endoscopy.
1) Related Work
The interest on feature tracking using image alignment techniques dates back to the 80s, when Lucas and Kanade [1] formulated the tracking using a brightness constancy assumption. Such approach as been extensively studied in the past three decades, with several improvements being proposed by computer vision and robotic community [2], [3], [6], [30], [5], [31], [32]. Most of such improvements focused on reducing computational complexity [2], [3], [33], improve tracking in wide baseline situations [6], [30], [5], or manipulating the motion models for increasing robustness against illumination changes [18]. Our invention concerns the extension of these works, which assume geometric correct perspectives, to the case of cameras with radial distortion.
Tracking based in image alignment has been studied in the context of panoramic or omnidirectional cameras in [34], [32], [35]. Mei et al. propose in [34], [32] a region-tracking algorithm for generic central cameras where the warping is formulated on the sphere in order to deal with the non-uniform sampling of catadioptric images. The approach is specific to the tracking of plane surfaces and requires the camera to be calibrated. In [35], Salazar et al. use the warping function proposed in [34,32] to perform homography-based tracking in uncalibrated images. This is accomplished by simply adding the camera intrinsics to the vector of unknown parameters to be estimated. The work of Salazar et al. is still specific to the tracking of large plane surfaces, it involves computationally expensive minimization that precludes real-time performance, and it requires tracking across three or more frames to recover the camera parameters.
Tamaki et al. propose in [36] an image alignment approach to calibrate the camera radial distortion. The method registers a distortion-free planar pattern with a distorted view of this pattern, and uses non-linear optimization to estimate the plane homography under perspective, the radial distortion, and the linear spatial changes in illumination. Like our method, the algorithm just requires two views for computing the warping parameters, but the need of a planar pattern and the requirement of having a distortion-free view of this pattern limits usability.
2) Matching Salient Points in Perspective Images Through the Registration of Local Regions Around Those Points
1. Forward Additive Alignment
Matching salient points in images with a close viewpoint can be formulated as a registration problem whose goal is to perform a non-linear optimization of the sum-of-squared differences between a template region T around the salient point and an incoming images I. The goal is to compute
                    ε        =                              ∑                          x              ∈              𝒩                                ⁢                                    [                                                I                  ⁡                                      (                                          w                      ⁡                                              (                                                  x                          ;                          p                                                )                                                              )                                                  -                                  T                  ⁡                                      (                    x                    )                                                              ]                        2                                              (                  Equation          ⁢                                          ⁢          1                )            where p denotes the components of the image warping function w, and N denotes the integration region of a salient points. Lucas and Kanade [1] proposed to optimize equation (1) by assuming that a current p motion vector is known and iteratively solve for p increments on the warp parameters, with equation (1) begin approximated by
                    ε        =                              ∑                          x              ∈              𝒩                                ⁢                                    [                                                I                  ⁡                                      (                                          w                      ⁡                                              (                                                  x                          ;                                                      p                            +                                                          δ                              ⁢                                                                                                                          ⁢                              p                                                                                                      )                                                              )                                                  -                                  T                  ⁡                                      (                    x                    )                                                              ]                        2                                              (                  Equation          ⁢                                          ⁢          2                )            Differentiating ε with respect to δp, and after some algebraic manipulations, a closed form solution for δp can be obtained:
                                                        δ              ⁢                                                          ⁢              p                        =                                          ℋ                                  -                  1                                            ⁢                                                ∑                                      x                    ∈                    𝒩                                                  ⁢                                                                            [                                                                        ∇                          I                                                ⁢                                                                              ∂                                                          w                              ⁡                                                              (                                                                  x                                  ;                                  p                                                                )                                                                                                                                          ∂                            p                                                                                              ]                                        T                                    ⁢                                      (                                                                  T                        ⁡                                                  (                          x                          )                                                                    -                                              I                        ⁡                                                  (                                                      w                            ⁡                                                          (                                                              x                                ;                                p                                                            )                                                                                )                                                                                      )                                                                                ,                                          ⁢          with                ⁢                                  ⁢                  ℋ          =                                    ∑                              x                ∈                𝒩                                      ⁢                                                            [                                                            ∇                      I                                        ⁢                                                                  ∂                                                  w                          ⁡                                                      (                                                          x                              ;                              p                                                        )                                                                                                                      ∂                        p                                                                              ]                                T                            ⁡                              [                                                      ∇                    I                                    ⁢                                                            ∂                                              w                        ⁡                                                  (                                                      x                            ;                            p                                                    )                                                                                                            ∂                      p                                                                      ]                                                                        (                  Equation          ⁢                                          ⁢          3                )            being a first order approximation of the Hessian matrix, and the parameter vector being additively updated by pi+1←pi+δp at each iteration i. This method is also known as forward additive KLT [2,3] and it requires to re-compute H at each iteration due its dependence with the gradients of the incoming image ∇I.
2. Inverse Compositional Alignment
For efficiently solving equation 1, Baker and Matthews [2,3] proposed an inverse compositional method that starts by switching the roles of T and I, which results in the following cost function
                    ε        =                              ∑                          x              ∈              𝒩                                ⁢                                    [                                                I                  ⁡                                      (                                          w                      ⁡                                              (                                                  x                          ;                          p                                                )                                                              )                                                  -                                  T                  ⁡                                      (                                          w                      ⁡                                              (                                                  x                          ;                                                      δ                            ⁢                                                                                                                  ⁢                            p                                                                          )                                                              )                                                              ]                        2                                              (                  Equation          ⁢                                          ⁢          4                )            
In this case the motion update increments δp are computed as:
                                                        δ              ⁢                                                          ⁢              p                        =                                          ℋ                                  -                  1                                            ⁢                                                ∑                                      x                    ∈                    𝒩                                                  ⁢                                                                            [                                                                        ∇                          T                                                ⁢                                                                              ∂                                                          w                              ⁡                                                              (                                                                  x                                  ;                                  0                                                                )                                                                                                                                          ∂                            p                                                                                              ]                                        T                                    ⁢                                      (                                                                  I                        ⁡                                                  (                                                      w                            ⁡                                                          (                                                              x                                ;                                p                                                            )                                                                                )                                                                    -                                              T                        ⁡                                                  (                          x                          )                                                                                      )                                                                                ,                                          ⁢          with                ⁢                                  ⁢                              ℋ            =                                          ∑                                  x                  ∈                  𝒩                                            ⁢                                                                    [                                                                  ∇                        T                                            ⁢                                                                        ∂                                                      w                            ⁡                                                          (                                                              x                                ;                                0                                                            )                                                                                                                                ∂                          p                                                                                      ]                                    T                                ⁡                                  [                                                            ∇                      T                                        ⁢                                                                  ∂                                                  w                          ⁡                                                      (                                                          x                              ;                              0                                                        )                                                                                                                      ∂                        p                                                                              ]                                                              ,                                    (                  Equation          ⁢                                          ⁢          5                )            and w(x; 0) being the identity warp. H is computed using the template gradients ∇T and, therefore, it is constant during the registration procedure, leading to a significant computational improvement when compared with the forward additive KLT. Finally, the warp parameters are updated as follows:w(x; pi+1)←w(x; pi) ∘w−1(x; δp)   (Equation 6)where ∘ denotes the composition of functions. Although the update rule of the inverse compositional alignment is computationally more costly than a simple additive rule, Baker and Matthews [2,3] show that the overall computational complexity of the inverse formulation is significantly lower than that of the forward additive KLT.
3. Motion Models w for Conventional Perspective Images
The motion model (or image warping function) w used for feature tracking determines the degree of image deformation tolerated during the registration process. The original contribution of Lucas and Kanade [1,4] assumes that the neighborhood N around a salient image point u moves uniformly and, therefore, the authors model the image motion using a simple translation model. However, the deformation that it tolerates is not sufficient when the tracked image region is large, or the video sequence undergoes considerable changes in scale, rotation and viewpoint. In these situations, the affine motion model [2,3,5,6] is typically adopted w=(u; p)=(|+A)u+t where the parameter vector is p=(α1, . . . , α4, tx, ty)T, | is a 2×2 identity matrix, and
  A  =            (                                                  a              1                                                          a              2                                                                          a              3                                                          a              4                                          )        .  Although in the examples of this document we will consider the affine motion model, the same reasoning can be applied to other types of motion models such as the homography [2,3] or the models that deal with illuminations changes [2,3,5]. In this document, we describe how these motion models, that were originally developed for perspective imagery, can be extended to take into account the image distortion effect arising in cameras equipped with unconventional optics that introduce non-linear radial distortion.
4. Template Update for Long-Term Feature Tracking
For long-term feature tracking, the template update is a critical step to keep plausible tracks. An inherent problem to the template update step is the localization error introduced whenever the template is updated [7]. High-order motion models tend to be more flexible in terms of the deformation tolerated during the registration process, with the templates being updated less frequently. This minimizes the drift in the feature localization introduced whenever a new template is captured [3,7]. Since our main goal is to perform feature (position) tracking rather than template (object/appearance) tracking, the window around a feature position is captured whenever the squared error of equation 1 falls above some threshold, as detailed in [5].
5. Pyramidal Tracking for Initializing the Iterative Minimization
Despite of the warp complexity, the registration process may fail to converge when the initialization of the warp parameters p0 is not close enough to the current motion parameters, i.e. p0 is not in the convergence region C where the first order approximation of equation 2 is valid [3]. This affect can be attenuated by performing tracking using a pyramidal image representation [6], where several image resolutions are built by down-sampling the original image signal by factors of 2. A L-levels pyramidal tracking algorithm proceeds from the coarse to the finest pyramid level, with the coarsest feature position being given by xL=2−Lx. The registration proceed at each pyramid level, with the result begin propagated to next level as xL−1=2xL [5,6]. Since the integration region N is kept constant across scales, the pyramidal framework greatly improves the probability of p0 belonging to C, which by consequence increases the tracking success. Typical values for L range from 2 to 5 image levels.
3) Modeling the Projection in Cameras with Radial Distortion
We assume that the effect of the distortion introduced by the camera lens can be described using the so-called first order division model [9,10,19,22], with the amount of distortion in pixel units being quantified by a single parameter ξ that will be henceforth referred as the image distortion parameter. Let x=(x,y)T and u=(u,v)T be corresponding points in distorted and undistorted images expressed with respect to a reference frame with origin in the distortion center O. The function fξ is a vector function that maps points from the distorted image I to its undistorted counterpart Iu according to:u=fξ(x)=(1+ξxTx)−x   (Equation 7)
The function is bijective and the inverse mapping from I to Iu is given by [9, 10]:x=fξ−1(u)=2(1+√{square root over (1−4ξuTu))}−u   (Equation 8)
Given that the radius of the distorted image point x is r=√{square root over (xTx)}, the corresponding undistorted radius is ru=(1ξr2)−1r.
Also relevant for this invention is the relation between the image distortion parameter ξ, which quantifies the effect of distortion in pixel units, and the lens distortion parameter η, that quantifies the distortion in metric units. Following the formulation used in [10], that places the division distortion model before the camera intrinsics, it comes that the relation between distorted image point x and undistorted image points um expressed in metric units is um=fη(K−1x) with K being the matrix of intrinsic parameters
  K  =      (                            af                          sf                                      o            x                                                0                                                    a                              -                1                                      ⁢            f                                                o            y                                                0                          0                          1                      )  where α is the aspect ratio, s the skew, f is the focal length, and o=(ox, oy) is the principal point that coincides with the distortion center [8]. Assuming now that the camera is skewless and has unitary aspect ratio, which is a perfectly plausible assumption for modern CCD cameras, and that the image points x are expressed in a reference frame with origin in o (ox=oy=0), it comes that the relation between x and um can be re-written as:um=f−1(1+f−2ηxTx)−1x 
Taking into account that undistorted points in pixel and metric units are related by um=f−1u, it comes from equation 7 and the result of the above equation that the image distortion parameter and the lens distortion parameter are related by the camera focal length:
                    ξ        =                  η                      f            2                                              (                  Equation          ⁢                                          ⁢          9                )            
This invention is applicable to cameras with lens distortion that are assumed to be skewless and with unitary aspect ratio, that the radial distortion is well described by the division model, and that the center of distortion O, in the absence of further information, is well approximated by the image center [8]. These assumptions are valid for most cameras and lenses in use, including medical endoscopes and cameras equipped with fish-eye lenses, for which all the relations derived above hold.