Image alignment, or image registration, is an important aspect of many products and systems, including consumer cameras and video cameras, industrial machine vision, medical imaging, surveillance systems, and others. Image alignment is the process of transforming the image data of a current, or target, image and the image data of a reference image into one coordinate system. Image alignment is necessary in order to be able to compare or integrate the data obtained from different measurements. Motion estimation is used as part of the image alignment process.
Pixels are organized into blocks, which are the basic unit of analysis within an image. Motion estimation is a way of describing the difference between two images in terms of where each block of the former image has moved. Each block in a current image is associated with an area in a reference image, that it is well-correlated, using a “motion vector.” The motion vectors may relate to the whole image, referred to as global motion estimation, or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel.
FIG. 1 illustrates motion estimation between two images captured by a camera. An image 8 is of the same scene as an image 2, but the camera has been panned to the right. The effect of panning the camera to the right is that objects in the image 8 are shifted to the left relative to the same objects captured in the image 2. For example, a mountain 7 and a house 9 in the image 8 are shifted to the left relative to a mountain 4 and a house 6 in the image 2.
If these images are aligned by estimating the motion between them, there are numerous possible applications. One such application is to stitch the images together as part of a panoramic image, which can be further extended spatially by continuing the process with more images as the camera pans further right. Another application is to reduce noise by averaging together pixel observations that correspond to the same location in both images. These exemplary applications are simply two among many.
One disadvantage of estimating motion between full-resolution images is that the quantity of pixels to be processed can be prohibitively complex. For this reason, it may be desirable to reduce the size, and therefore the resolution, of the images prior to estimating their motion. Processing fewer pixels leads to a more manageable computational load.
Using smaller images reduces the computational load, but accuracy of the motion estimation (ME) method becomes much more important. For example, suppose that the images are down sampled by a factor of eight in each dimension. If the accuracy of ME is ½-pixel (measured at the resolution at which processing occurs), then processing at the reduced resolution yields accuracy of just ½×8=4 pixels when returned to full resolution. Such accuracy may be unacceptable. Thus, although an ME algorithm with ½-pixel accuracy may provide satisfactory alignment results when applied at full resolution, the same algorithm applied to reduced-resolution images may not be accurate enough. To achieve ½-pixel accuracy measured at full resolution, an ME estimation algorithm must have 1/16-pixel accuracy when applied to images down sampled by eight.
The motion vectors can be measured as localized motion vectors, each motion vector associated with matching blocks between the two images. A motion field is a compilation of all localized motion vectors between the two images. For the two example images, the image 2 and the image 8, shown in FIG. 1, there are no independently moving objects in the scene. As such, the ideal motion field relating the image 8 to the image 2 should be a uniform motion field. FIG. 2 illustrates an ideal motion field relating the image 8 to the image 2 of FIG. 1. Each arrow represents the motion field of a corresponding block in the two images. Each arrow is facing right, which corresponds to the camera panning right between capturing the image 2 and then capturing the image 8. However, in practice, the measured motion field is not ideal. FIG. 3 illustrates an exemplary measured motion field relating the image 8 to the image 2. Incorrect motion vectors (MV) such as those shown in FIG. 3 can occur due to a variety of reasons, such as insufficient image detail or image noise. In general, the motion field of FIG. 3 is observed and measured, but it is desired to estimate the ideal global motion as it appears in FIG. 2. When errors can be as common and large as those shown in FIG. 3, a satisfactory solution requires some sort of robust model-fitting. The model must be robust in the sense that data points with big error, referred to as “outliers”, do not adversely affect the output. Most robust model-fitting methods either explicitly or implicitly try to identify the outliers, and compute a final output based only on the reliable data. Identifying outliers as part of the model-fitting process can be computationally demanding, usually requiring some sort of iterative procedure.
The methods for finding motion vectors can be categorized into pixel based methods (“direct”) and feature based methods (“indirect”). One such “direct” method is phase correlation, which uses a fast frequency-domain approach to estimate the relative translative offset between two similar images or blocks.
Using phase correlation, local motion analysis is performed on each block in an image. Phase correlation estimates motion by considering a window that surrounds the target block. FIG. 4 illustrates an example of applying phase correlation to local blocks of an image pair. The image pair includes a current image and a reference image. Each block has a size B×B. Each block is surrounded by a window of size N×N, where N≧B. In the example of FIG. 4, N=2B. Phase correlation considers an N×N window in both the current image and the reference image, where the windows may be co-located or, in the more general case, an offset may be present for the block in the reference image due to a motion predictor. FIG. 4 shows an example for co-located blocks and their corresponding windows.
Phase correlation consists of the following procedure: First, perform point-wise multiplication of the N×N window in the current frame with window function w[x,y]. For the window function w[x,y], a 2-D separable extension of the Hamming window can be used, whose 1-D definition is:
                              w          ⁡                      [            x            ]                          =                  0.53836          -                      0.46164            ⁢                          cos              ⁡                              (                                                      2                    ⁢                    π                    ⁢                                                                                  ⁢                    x                                                        N                    -                    1                                                  )                                                                        (        1        )            The 2-D window function is w[x,y]=w[x]w[y]. Alternatively, windows other than a Hamming window can be used. Apply the Fast Fourier Transform (FFT) to the result, which yields the complex values G[m,n]. Second, perform point-wise multiplication of the N×N window in the reference frame with window function w[x,y]. Apply the FFT to the result, which yields the complex values F[m,n]. Third, compute:
                              S          ⁡                      [                          m              ,              n                        ]                          =                                            F              ⁡                              [                                  m                  ,                  n                                ]                                      ⁢            G            *                          [                              m                ,                n                            ]                                                                                      F                ⁡                                  [                                      m                    ,                    n                                    ]                                            ⁢              G              *                              [                                  m                  ,                  n                                ]                                                                                    (        2        )            where “*” denotes the complex conjugate, and “|” represents the magnitude of the complex argument. Fourth, compute the inverse FFT (IFFT) of S[m,n] to yield s[x,y], the phase correlation surface. Fifth, identify the K biggest peaks from the phase correlation surface. The indices of these peaks correspond to possible motions that are present between the N×N window in the current frame and the N×N window in the reference frame.
The phase correlation surface s[x,y] is a matrix of values. Each position in the matrix corresponds to a different motion (or shift), and the value at that position is the correlation corresponding to that particular motion. The step of identifying the K biggest peaks from the phase correlation surface finds the position (or motions, or shifts) that have the biggest value (or correlations). A peak position represents the integer component of the motion, because x and y only take integer values.
The peaks in the phase correlation surface represent possible motion vectors. Larger peaks indicate higher correlation, and the largest peak often indicates the dominant motion vector in the N×N window. If the motion vector aligns with the integer pixel grid, then the phase correlation peak is concentrated at a single position. However, if the motion vector is sub-pixel in nature instead, then the phase correlation peak energy is distributed across multiple positions. To determine the precise sub-pixel motion then requires further analysis.
There are a number of conventional methods to determine sub-pel precision from the phase correlation surface. These sub-pel methods can generally be categorized as 1-D methods or 2-D methods. A 1-D method operates on each spatial dimension independently. In other words, the vertical and horizontal sub-pel components of motion are determined separately.
FIG. 5 illustrates notation that is used by the different sub-pel methods. A correlation peak of the phase correlation surface is located at position (xk, yk), with correlation value C0,0. The coefficients Cu,v are taken from the neighborhood of the peak in the correlation surface as Cu,v=s[xk+u, yk+v]. The 2-D sub-pel methods may use some or all of the correlation values in FIG. 5. Some 2-D sub-pel methods use even larger windows, for example using correlation value C−3,−3. Note that due to properties of the FFT, evaluation of indices to the phase correlation surface s[x,y] is performed modulo N.
The 1-D sub-pel methods consider horizontal and vertical sub-pel components independently, and use the correlation values shown in FIG. 6. For example, the horizontal sub-pel component is determined using the correlation values C−1,0, C0,0, and C1,0, and the vertical sub-pel component is determined using the correlation values C0,−1, C0,0, and C0,1.
It is shown by H. Foroosh et al. in “Extension of Phase Correlation to Subpixel Registration” that the phase correlation surface in the presence of translational motion is very well approximated by a sinc function. Derivations in Foroosh et al. lead to a relatively simple 1-D sub-pel method that operates in each spatial direction independently. The method is applied to the neighborhood near the phase correlation peak.
In “Television Motion Measurement for DATV and Other Applications” by G. A. Thomas, a 1-D quadratic function is fit to the three points in the neighborhood of the phase correlation peak (either the horizontal or vertical values shown in FIG. 6). In “Practical Approach to the Registration of Multiple Frames of Video Images” by I. E. Abdou, a 1-D Gaussian function is fit in a similar fashion. Results for the methods of Thomas and Abdou are marginal, because as shown in Foroosh et al., the phase correlation surface is neither quadratic nor Gaussian, and as such the methods are limited because of inappropriate fitting functions. Also, in many cases the 1-D sub-pel methods do not provide as complete of a fit in the peak neighborhood as is possible when using the 2-D sub-pel methods.
In “A Study of Sub-Pixel Motion Estimation Using Phase Correlation” by V. Argyriou et al., the following modified sinc function is considered:
                              h          ⁡                      (            x            )                          =                              exp            ⁡                          (                              -                                  x                  2                                            )                                ⁢                                                    sin                ⁡                                  (                                      π                    ⁢                                                                                  ⁢                    x                                    )                                                            π                ⁢                                                                  ⁢                x                                      .                                              (        3        )            It is then determined the three parameters A, B, and C that best fit the function A·h(B[x−C]) to the observed phase correlation surface, in a least squares sense. Determining such a fit is complicated, since there is no closed-form solution. It thus requires numerical solution, which can be computationally demanding. Note that this method is also a 1-D sub-pel method, so that it shares the limitation mentioned previously for 1-D sub-pel methods compared to 2-D sub-pel methods.
In “High-Accuracy Subpixel Image Registration Based on Phase-Only Correlation” by Takita et al., it is proposed to fit a 2-D Gaussian function to the phase correlation surface. First, a frequency-domain Gaussian pre-filter (applied to S[m,n]) is used to smooth the phase correlation surface. Second, least squares is used to fit the 7×7 neighborhood of the correlation peak to a Gaussian function. The large window size combined with a least-squares optimization for the complicated Gaussian function can lead to an overly complex algorithm.
Finally, in “Robust Motion Estimation for Video Sequences Based on Phase-Only Correlation” by L. H. Chien et al., it is proposed to fit the following 2-D function to the phase correlation surface near the correlation peak:
                              h          ⁡                      (                          x              ,              y                        )                          =                              α                          N              2                                ⁢                                                                      sin                  ⁡                                      [                                          π                      ⁡                                              (                                                  x                          +                                                      Δ                            ⁢                                                                                                                  ⁢                            x                                                                          )                                                              ]                                                  ⁢                                  sin                  ⁡                                      [                                          π                      ⁡                                              (                                                  y                          +                                                      Δ                            ⁢                                                                                                                  ⁢                            y                                                                          )                                                              ]                                                                                                sin                  ⁡                                      [                                                                  π                        N                                            ⁢                                              (                                                  x                          +                                                      Δ                            ⁢                                                                                                                  ⁢                            x                                                                          )                                                              ]                                                  ⁢                                  sin                  ⁡                                      [                                                                  π                        N                                            ⁢                                              (                                                  y                          +                                                      Δ                            ⁢                                                                                                                  ⁢                            y                                                                          )                                                              ]                                                                        .                                              (        4        )            An unspecified frequency-domain pre-filter is used to smooth the phase correlation surface. An unspecified size for the fitting window is also used, although it appears from a figure that the window size may be 7×7. The complicated nature of the equation h(x,y) leads to a computationally demanding least-squares solution for the Δx and Δy, and for α which must be estimated as part of the solution.