Matching patches between two images, or between regions of the images, is also referred to as computing a nearest neighbor field and is a common technique used for image processing and computer graphics applications. A patch of an image may be a single pixel of the image, or may be a larger region of the image that includes a grid of multiple pixels. One technique for determining matching patches between two images is to exhaustively search for the best matching patch in one of the images for every patch in the other image. Although this technique is a simple algorithm, it is computationally expensive and time-consuming.
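The exhaustive search described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes grayscale images stored as lists of rows, square patches, and a sum-of-squared-differences cost, none of which is specified in the text, and the names `patch_ssd` and `exhaustive_nnf` are hypothetical.

```python
def patch_ssd(img_a, ya, xa, img_b, yb, xb, size):
    """Sum of squared differences between two size x size patches.

    The SSD cost is an assumption; the text does not name a patch distance.
    """
    total = 0.0
    for dy in range(size):
        for dx in range(size):
            diff = img_a[ya + dy][xa + dx] - img_b[yb + dy][xb + dx]
            total += diff * diff
    return total


def exhaustive_nnf(img_a, img_b, size=3):
    """Brute-force nearest neighbor field from img_a to img_b.

    For every patch in img_a (indexed by its top-left corner), every patch
    in img_b is tested, and the offset (dy, dx) to the lowest-cost patch is
    stored. Ties keep the first patch found in scan order.
    """
    rows_a, cols_a = len(img_a) - size + 1, len(img_a[0]) - size + 1
    rows_b, cols_b = len(img_b) - size + 1, len(img_b[0]) - size + 1
    nnf = []
    for ya in range(rows_a):
        row = []
        for xa in range(cols_a):
            best_cost, best_offset = float("inf"), (0, 0)
            for yb in range(rows_b):
                for xb in range(cols_b):
                    cost = patch_ssd(img_a, ya, xa, img_b, yb, xb, size)
                    if cost < best_cost:
                        best_cost, best_offset = cost, (yb - ya, xb - xa)
            row.append(best_offset)
        nnf.append(row)
    return nnf
```

The four nested loops make the cost visible: the number of patch comparisons grows with the product of the patch counts of the two images, which is why the approach is impractical at typical image resolutions.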
There are other more efficient algorithms that can be utilized to speed up the matching process, such as by utilizing a spatial constraint that adjacent patches in one image tend to have the same spatial relationship with the matching patches in the other image. However, these algorithms are directed to reconstructing one image from the other and often produce patch matches that are spatially incoherent, with the resulting nearest neighbor fields being based on reconstruction errors. Conventional techniques and algorithms to compute nearest neighbor fields between images do not enforce spatial coherency of the matching patches, and may not match a patch in one image to the respective, same patch in another image. For example, a white color patch in one image may be matched to any number of white color patches in another image without maintaining the spatial coherency of the actual corresponding patches in the two images.
Optical flow is the problem of inferring the apparent motion between images, and conventional algorithms for optical flow are utilized to compute a motion field, such as for optical flow registration, which is useful for image tracking, motion segmentation, and other motion processing applications. A nearest neighbor field typically provides only a very noisy estimate of the true optical flow field for the images. A motion field can be computed between two images, where the direction and magnitude of optical flow at each location is represented by the direction and length of arrows in the motion field. A motion determination between images can be utilized to track object motion, such as in video frames. For example, in a robotics application, cameras may capture two or more separate images of a scene and/or subject from slightly different perspectives and combine the separate images into one image to reduce or eliminate noise in the images. The noise effect will be different in each of the images, and the combined image is a smoothed combination of the separate images that reduces or eliminates the noise effect of each image.
Although optical flow algorithms can enforce the spatial coherency of pixels and/or patches between images, the conventional algorithms assume that the pixel motion of objects (e.g., object displacement) from one image to the next is very small. Further, optical flow registration algorithms that extract feature points do not provide a dense motion field of the images. Additionally, the conventional algorithms often produce incorrect results because they are initialized to start from an initial motion field that is typically set to zero everywhere, and therefore cannot account for a large motion between two images.
To account for large motions between images, techniques that utilize a coarse-to-fine framework are used, such as by running optical flow over an image pyramid and initializing the flow for the next most detailed level to be the up-sampled flow computed from the current coarser level. At each pyramid level, the flow is refined locally using a differential formulation that is valid only for small motions. Another conventional solution to account for large motions is to form a large discrete labeling problem, where labels at pixels represent 2D motion vectors and a dense sampling of the possible motions, including large motions, defines the label set. Another solution can be used to combine the results of a traditional continuous optical flow algorithm with a sparse flow defined by scale invariant feature transform (SIFT) matching. Matching SIFT features, or descriptors, can be extracted at each pixel and/or patch of an image to characterize image objects and encode contextual information of the pixels and/or matching patches.
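The coarse-to-fine framework described above can be sketched as follows. This is only an outline under stated assumptions: images are lists of rows with even dimensions at every level, the pyramid is built by 2x2 averaging, the flow is up-sampled by replication, and `refine` is a hypothetical caller-supplied function standing in for the local small-motion refinement, which the text does not specify.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (assumes even dimensions)."""
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(len(img[0]) // 2)
        ]
        for y in range(len(img) // 2)
    ]


def upsample_flow(flow):
    """Double the resolution of a flow field of (dx, dy) vectors.

    Each vector is replicated into a 2x2 block and scaled by 2, because a
    displacement of one pixel at the coarse level spans two pixels at the
    next, more detailed level.
    """
    out = []
    for row in flow:
        wide = []
        for dx, dy in row:
            wide.extend([(2 * dx, 2 * dy)] * 2)
        out.append(wide)
        out.append(list(wide))
    return out


def coarse_to_fine(img_a, img_b, levels, refine):
    """Run `refine` over an image pyramid, from the coarsest to the finest level.

    `refine(a, b, flow)` is a hypothetical callback that returns an updated
    flow given two images and an initial flow of matching size.
    """
    pyramid = [(img_a, img_b)]
    for _ in range(levels - 1):
        a, b = pyramid[-1]
        pyramid.append((downsample(a), downsample(b)))

    # The coarsest level starts from a flow that is zero everywhere.
    a, b = pyramid[-1]
    flow = refine(a, b, [[(0.0, 0.0)] * len(a[0]) for _ in a])

    # Each finer level is initialized with the up-sampled coarser flow.
    for a, b in reversed(pyramid[:-1]):
        flow = refine(a, b, upsample_flow(flow))
    return flow
```

Because `refine` only makes small local updates, the final flow at each pixel is largely determined by what was found at the coarser levels.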
However, each of these conventional solutions to account for large motions between images has limitations. For example, with the image pyramid solution, down-sampling the input images to create the pyramid removes details that are needed for accurate patch matching, particularly for small or thin objects. In fact, small or thin objects may be completely removed or become so obscured at a coarse level of the pyramid that it is not possible for an optical flow algorithm to correctly match those areas at that level. The flow at a next, more detailed level of the pyramid is initialized to the up-sampled flow from the coarser level. Because the flow is then updated with only small motion changes for local matching, the correct matches for small or thin areas will not be discovered in this next step if the correct flow is very different from the initial flow. This will be the case even if the details of the small or thin object are visible at this next most detailed level.
The image pyramid solution can provide sub-pixel level accuracy and capture motion boundaries, but over-smooths small structures. The flow details of small or thin structures are often missed, and the flow in those areas is typically computed as the flow of the surrounding background. Using nearest neighbor fields that match patches independently at the desired output resolution can capture fine details of the flow, handle large motions, and be computed quickly. However, using nearest neighbor fields does not provide fine sub-pixel accuracy, returns poor flows for patches containing motion boundaries (e.g., because there is actually more than one motion present), and does not match repetitive patterns and textureless regions very well due to the ambiguity of matching and the lack of a global optimization formulation with smoothness to resolve these ambiguities.
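The loss of thin structures under down-sampling can be illustrated with a small numerical sketch. It assumes 2x2 block averaging as the down-sampling filter and a synthetic image containing a single one-pixel-wide bright line on a dark background; both are illustrative assumptions, and the function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (assumes even dimensions)."""
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(len(img[0]) // 2)
        ]
        for y in range(len(img) // 2)
    ]


def contrast(img):
    """Difference between the brightest and darkest pixel values."""
    values = [v for row in img for v in row]
    return max(values) - min(values)


def thin_line_contrast(size=8, column=3, levels=3):
    """Contrast of a one-pixel-wide vertical line at each pyramid level."""
    img = [[1.0 if x == column else 0.0 for x in range(size)] for _ in range(size)]
    contrasts = []
    for _ in range(levels):
        contrasts.append(contrast(img))
        img = downsample(img)
    return contrasts


print(thin_line_contrast())  # [1.0, 0.5, 0.25]
```

In this sketch the contrast of the line halves at each level, so after a few levels the line is hard to distinguish from the background, which is consistent with the matching difficulty described above.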
The large discrete labeling solution avoids the problem of working on down-sampled images to compute large motions by considering a large set of discrete motions at each pixel that can account for the largest motion present. The difficulty, however, is that it can lead to large labeling problems which require a very long computation time to solve. For example, to account for just ten pixels of motion in a direction up, down, left, or right, and if only integer motions are considered, the solution would require 441 labels (i.e., 21×21) for v=(dx,dy), where both dx and dy are −10, −9, . . . , −1, 0, 1, . . . , 9, 10. This also illustrates that the solution restricts the accuracy of the flow solution to a discrete set of possibilities. Sub-pixel accuracy can be obtained by considering more labels, but this leads to even larger labeling problems that quickly become impractical to solve. For example, to determine accuracy to 0.25 pixels in the previous example, both dx and dy would need to be considered for −10, −9.75, −9.50, −9.25, −9, −8.75, . . . , 9, 9.25, 9.5, 9.75, 10. This would require 6561 labels (i.e., 81×81), which is approximately fifteen times more labels than when just using the integer flows.
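The label counts in this example can be checked with a short calculation. The helper name `label_count` is hypothetical, and the sketch assumes the same motion range and step size in both dx and dy, as in the example above.

```python
from fractions import Fraction


def label_count(max_motion, step):
    """Number of (dx, dy) labels for motions in [-max_motion, max_motion].

    Each axis is sampled every `step` pixels, which gives
    2 * max_motion / step + 1 values per axis; the label set is the
    Cartesian product of the two axes. Fractions avoid rounding issues
    for sub-pixel steps such as 0.25.
    """
    per_axis = int(2 * Fraction(max_motion) / Fraction(step)) + 1
    return per_axis * per_axis


integer_labels = label_count(10, 1)          # 21 * 21 = 441
quarter_labels = label_count(10, "0.25")     # 81 * 81 = 6561
print(integer_labels, quarter_labels, round(quarter_labels / integer_labels, 2))
# 441 6561 14.88
```

The ratio of about 14.9 matches the "approximately fifteen times" figure, and it shows that halving the step size roughly quadruples the number of labels.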
The solution to combine the results of a traditional continuous optical flow algorithm with a sparse flow defined by SIFT matching avoids the large set of potential labels as described in the solution above, while obtaining potentially large motion matches for some pixels in the input images. However, SIFT matching can only provide a sparse flow, and the difficult problem remains of how to interpolate the flow in areas where the SIFT features are not detected. Additionally, SIFT feature matching is not effective for repetitive patterns and deforming objects in the input images.