The present application is directed to a computer-operable system and method which incorporates a software program and algorithm for finding a target picture or picture set in a large image collection based on a query image which is an imperfect copy of the target picture image.
The query image may be captured by a device such as, but not limited to, a digital camera, personal digital assistant, document scanner, text reader, video camera, motion picture camera, computer, cell phone camera or other device capable of generating an image representation of the target image.
The target image may be displayed on a monitor or a computer screen and captured directly by one of the above devices, or the target image may first be printed on a printer or a similar output device and the printed reproduction captured by one of the above devices. Alternatively, the query image could be reproduced from a stored electronic version of a query image.
Due to the manner and devices used to capture the query image, the captured query image will often be of lower resolution, blurry, distorted by rotation and perspective viewing conditions, and of uneven lightness as compared to the target image.
Thus, the present application is directed to finding or matching similar images in large image collections, although it can also make use of additional types of image content such as text and line drawings. Finding natural pictures is potentially a more difficult problem than finding or matching text or line-art images in a collection, since the content of such pictures is continuous in the luminance/grayscale domain, and it is therefore far more challenging to identify robust and reliable keypoints.
A typical method for matching image correspondence is composed of the following steps:
(1) In a first step, keypoints are identified for distinctive locations in the image such as corners, junctions, and/or light or dark blobs. The goal is to reliably find the same keypoints under different viewing conditions, noise, and various image degradations. One of the many existing methods is the Scale-Invariant Feature Transform (SIFT) method discussed by D. G. Lowe in “Distinctive Image Features From Scale-invariant Keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004. Another method is the PCA-SIFT method described by Ke and Sukthankar in “PCA-SIFT: A More Distinctive Representation For Local Image Descriptors,” in Conference on Computer Vision and Pattern Recognition, pp. 111-119, 2000. Both methods require a considerable amount of computation, which limits performance for large image collections.
(2) In the second step, a feature vector called a “descriptor” is calculated from the local neighborhood of every keypoint. The descriptor has to be highly distinctive in order to identify its corresponding keypoint with high probability relative to all the other keypoints in the image. In addition, the descriptor must be robust to noise, keypoint identification errors (e.g., missing or extra keypoints), camera-to-target image geometry and the common image degradations. In order to make the descriptor scale and rotation invariant, a scale-normalized image neighborhood is selected, and its primary orientation is determined and applied to rotate the image around the keypoint to bring it into alignment. Multiple descriptors may be generated from a single keypoint in cases when there are multiple possible primary orientations of similar likelihood.
(3) Finally, at query time the descriptor vectors of the query image are compared with the descriptor vectors of all the various images in the collection to determine a possible match. The matching is usually based on a distance measure between two feature vectors, such as the L1 or L2 (Euclidean) distance. In many cases it is not possible to obtain the exact same keypoint order for the two images (for example, when the two images are arbitrarily rotated with respect to each other). Thus, all possible descriptor pair combinations need to be compared unless additional sorting and indexing of descriptors is applied. Depending on the desired sensitivity, a typical image may give rise to thousands of keypoints (and descriptors). It is therefore desirable to minimize the descriptor dimensionality (the number of features), since the descriptor dimensionality directly impacts the performance (the time it takes to compute distances).
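The exhaustive comparison described in step (3) can be illustrated with a minimal sketch. It assumes descriptors are plain lists of floats and uses the L2 distance. The function names and the `max_distance` acceptance threshold are hypothetical and introduced only for illustration; no sorting or indexing of descriptors is applied.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(query, target, max_distance):
    """For each query descriptor, find the nearest target descriptor.

    Every query/target pair is compared, so the cost grows as
    len(query) * len(target) * dimensionality.
    Returns (query_index, target_index, distance) tuples for nearest
    neighbors no farther than max_distance (a hypothetical threshold).
    """
    matches = []
    for qi, q in enumerate(query):
        best_index, best_distance = None, float("inf")
        for ti, t in enumerate(target):
            d = l2_distance(q, t)
            if d < best_distance:
                best_index, best_distance = ti, d
        if best_index is not None and best_distance <= max_distance:
            matches.append((qi, best_index, best_distance))
    return matches

# Toy 4-dimensional descriptors; the target lists the same two
# keypoints in the opposite order, with small perturbations.
query = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
target = [[0.0, 0.9, 0.1, 0.0], [0.9, 0.0, 0.0, 0.1]]
matches = match_descriptors(query, target, max_distance=0.5)
for qi, ti, d in matches:
    print(f"query {qi} -> target {ti} (distance {d:.4f})")
```

Because the keypoint order differs between the two toy images, the correct pairing is recovered only by comparing every pair, which is the cost that the descriptor dimensionality multiplies.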
As mentioned, a wide choice of keypoint identification techniques already exists in the literature. An even wider variety of descriptors has been proposed, based on various approaches, including Gaussian derivatives, moments, complex features, steerable filters, and phase features, among others. One particular class of feature descriptors, introduced by D. G. Lowe in the International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004 article, has been demonstrated to outperform most others in terms of accuracy and speed. This class of descriptors (i.e., SIFT descriptors) is based on the distribution of local small-scale features within the keypoint neighborhood. The SIFT descriptor computes a histogram of the local spatial intensity gradients at 8 different orientations in a 4×4 grid around the keypoint and stores the result in a 128-dimensional vector.
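The 4×4×8 layout of such a descriptor can be sketched as follows. This is a simplified illustration and not Lowe's implementation: it assumes an 18×18 grayscale patch (a 16×16 interior plus a one-pixel border for central differences), assigns each sample to a single orientation bin, and omits Gaussian weighting, interpolation between bins, and rotation to the keypoint orientation. The function name `sift_like_descriptor` is hypothetical.

```python
import math

def sift_like_descriptor(patch, grid=4, bins=8):
    """Gradient-orientation histograms over a grid x grid layout of cells.

    `patch` is a square list of rows of grayscale values. Its interior
    (all pixels except a 1-pixel border) is split into grid x grid cells,
    and each cell accumulates gradient magnitude into `bins` orientation
    bins. The result has grid * grid * bins elements, normalized to unit length.
    """
    size = len(patch) - 2  # interior pixels with valid central differences
    if size <= 0 or size % grid != 0 or any(len(row) != len(patch) for row in patch):
        raise ValueError("patch must be square with an interior divisible by grid")
    cell = size // grid
    hist = [0.0] * (grid * grid * bins)
    for y in range(1, size + 1):
        for x in range(1, size + 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)        # in [0, 2*pi)
            o = int(angle / (2 * math.pi) * bins) % bins      # orientation bin
            cy, cx = (y - 1) // cell, (x - 1) // cell          # cell in the grid
            hist[(cy * grid + cx) * bins + o] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

# A horizontal intensity ramp: every gradient points along +x (bin 0).
ramp = [[float(x) for x in range(18)] for _ in range(18)]
desc = sift_like_descriptor(ramp)
print(len(desc))  # 4 * 4 * 8 = 128
```

For the ramp patch, all of the gradient energy falls into orientation bin 0 of each of the 16 cells, which makes the grid-by-orientation layout of the 128 elements easy to inspect.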
Among all the available methods, the SIFT descriptor seems to be the most widely used. It offers a distinctive descriptor that is relatively fast to compute for matching a modest number of images. However, the high dimensionality of the SIFT descriptor makes it impractical for use in real-time applications involving large image collections.
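A rough back-of-the-envelope calculation suggests why the descriptor size matters at scale. The figures below are assumptions chosen only for illustration: 2,000 keypoints per image, one 128-element descriptor per keypoint stored as 4-byte floats, and a collection of one million images. The function names are hypothetical.

```python
def descriptor_storage_bytes(num_images, keypoints_per_image, dims=128, bytes_per_value=4):
    """Bytes needed to hold one descriptor per keypoint for every image."""
    return num_images * keypoints_per_image * dims * bytes_per_value

def pairwise_comparisons(query_keypoints, target_keypoints):
    """Descriptor pairs compared when matching two images without indexing."""
    return query_keypoints * target_keypoints

KEYPOINTS_PER_IMAGE = 2000   # assumed; actual counts depend on detector sensitivity
COLLECTION_SIZE = 1_000_000  # assumed collection size

per_image_bytes = descriptor_storage_bytes(1, KEYPOINTS_PER_IMAGE)
collection_bytes = descriptor_storage_bytes(COLLECTION_SIZE, KEYPOINTS_PER_IMAGE)
pairs_per_image = pairwise_comparisons(KEYPOINTS_PER_IMAGE, KEYPOINTS_PER_IMAGE)

print(f"per image:  {per_image_bytes / 1e6:.3f} MB")
print(f"collection: {collection_bytes / 1e12:.3f} TB")
print(f"descriptor pairs per image-to-image comparison: {pairs_per_image:,}")
```

Under these assumptions, each image carries about one megabyte of descriptor data, and an unindexed comparison of a single query image against a single collection image involves millions of 128-dimensional distance computations.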
Other shortcomings of the SIFT method and its variants include:
(1) Floating point descriptors: Each SIFT descriptor is a 128-element floating-point feature vector that captures a substantial amount of local intensity gradients and orientations in a region around the current keypoint. Depending on the desired detection sensitivity, a typical image may give rise to thousands of keypoints, some of which generate multiple descriptors (for example, when there are multiple primary orientations). This leads to a large amount of information that must be stored in memory for image matching, which can quickly overwhelm the system even for modest image collection sizes. Ideally, it is preferable to have a discrete measure (easily quantizable and of small finite range) instead of a full floating point range.
(2) Time-consuming orientation histogram: The SIFT method does not use rotation-invariant measures. Instead, the SIFT method relies on the assignment of a consistent primary orientation to each image keypoint. The SIFT method achieves invariance to image rotation by taking the local descriptor intensity gradients relative to the particular keypoint orientation. However, the SIFT orientation assignment process is complex and time consuming. The scale of the keypoint is used to select the Gaussian-smoothed image at the closest scale, so that all computations are done in a scale-invariant manner. For each image sample at this scale, the 2D gradient magnitude and orientation are computed from pixel differences. An orientation histogram is created from the gradient orientations of sample points in a circular region around the keypoint. The orientation histogram has 36 bins covering the 360 degree range. Each histogram sample point is further weighted by a Gaussian-weighted circular window with a standard deviation of 1.5 times the keypoint scale. Peaks in the orientation histogram correspond to dominant directions of the local gradients.
The highest histogram peak is detected and its orientation is used for determining the keypoint orientation provided no other local histogram peak is within 80% of the highest peak. A parabola is fit to the three histogram values closest to the peak in order to interpolate the peak position for better accuracy, and the resulting output is assigned to be the final keypoint orientation.
(3) Multiple keypoint orientations: Some keypoints occasionally have multiple peaks in the orientation histogram. Any additional local peak that is within 80% of the magnitude of the highest peak is also used to create another possible orientation for the same keypoint. Therefore, for keypoints that give rise to multiple histogram peaks of similar magnitude, there will be multiple keypoints created at the same location and scale but with different orientations. According to the literature, only about 15% of the keypoints are assigned multiple orientations, but this contributes significantly to the matching stability. However, the existence of multiple keypoint orientations increases the amount of descriptor information that has to be stored per keypoint. In addition, it slows down the performance by requiring multiple matching operations per keypoint.
(4) High dimensionality of SIFT descriptor: The SIFT descriptor is created by sampling the gradient magnitude and orientation around the keypoint, using the scale to select the level of Gaussian blur for the image and rotating the descriptor coordinates relative to the keypoint orientation. A Gaussian weight with a standard deviation of one half the width of the descriptor window is applied to stabilize the descriptor against small changes in window position. The samples are accumulated by summing the content over 4×4 sub-regions, using 8 directions for each orientation histogram. A tri-linear interpolation is used to distribute each sample into adjacent histogram bins.
The resulting SIFT descriptor is formed by concatenating the normalized values of all orientation histograms in a 4×4 grid (of 4×4 sub-regions each) around the keypoint into a single 4×4×8=128 floating point element vector. The high dimensionality of the SIFT descriptor has a direct impact on the matching performance due to the need to calculate distances to candidate descriptors in high dimensional space. Thus, the matching performance quickly deteriorates as the number of images in the collection increases.
(5) Non-compact descriptor storage: The SIFT descriptor is made more distinctive by recording the values of many local gradient magnitudes and orientations around the keypoint. No attempt is made to minimize the information content of the descriptor. The descriptors are typically stored in memory for future image matching. With thousands of keypoints in a typical image, each giving rise to one or more 128-element feature descriptors, the amount of overall information that needs to be stored in memory quickly becomes impractical for even modest image collection sizes.
(6) Poor matching performance for large image collections: The high dimensionality of the SIFT descriptor in item (4) above, combined with the large amount of descriptor information per item (5) above, limits the applicability of the existing method for large image collection sizes due to slow matching performance and the increasingly large amount of memory required.

Incorporation by Reference
The disclosures of U.S. patent application Ser. No. 12/147,624, filed Jun. 27, 2008 for “Method And System For Finding A Document Image In A Document Collection Using Localized Two-Dimensional Visual Fingerprints”, by Doron Kletter et al.; and U.S. patent application Ser. No. 12/147,867, filed Jun. 27, 2008 for “System And Method For Finding Stable Keypoints In A Picture Image Using Localized Scale Space Properties”, by Doron Kletter, are each hereby incorporated herein in their entireties.