Efficiently managing large sets of data is difficult and usually very time consuming. One reason is that the notion of a “large” set of data continues to grow with time: in 1980, a megabyte of data was “large”; in 1990, a gigabyte; and in 2000, a terabyte. As data grows, technologies are developed for specific applications of these large data sets. One such application is image comparison.
Image comparison has a number of applications, including image-based searching. Text-based image searching has been prominent for years. In a text-based search, images are generally stored in databases with corresponding text phrases such as titles, keywords or captions. The user enters a keyword, and the search returns the images whose text phrases match it. With larger sets of image data, however, it becomes impractical to store a text index corresponding to every image, and it is highly burdensome for someone to manually attribute specific titles, keywords and captions to each one. Furthermore, text-based searches have inherent drawbacks of their own.
In recent years, image-based searching has become a possible alternative. Before image-based searching is possible, however, the data to be searched must exist in a searchable format. One approach to putting the data in such a format is compression by partitioning the underlying space into fixed-size bins; the resulting structure is called a histogram. Histograms have a number of drawbacks. For an image dominated by only a few colors, a finely quantized histogram is highly inefficient, since most bins remain empty. For a complex image with many colors, a coarsely quantized histogram is inadequate. Because histograms are fixed-size structures, they cannot strike the proper balance between expressiveness and efficiency.
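The fixed-bin structure described above, and its expressiveness/efficiency trade-off, can be sketched as follows. This is an illustration only, not code from any cited reference; the function name and the example image values are assumptions.

```python
# Illustrative fixed-size-bin histogram (not from any cited reference).
def histogram(values, num_bins, lo=0, hi=256):
    """Partition the range [lo, hi) into num_bins equal-width bins
    and count how many values fall into each bin."""
    bins = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        bins[idx] += 1
    return bins

# A (hypothetical) image dominated by just two intensity values:
two_color_image = [10] * 500 + [200] * 500

fine = histogram(two_color_image, 256)    # 254 of 256 bins are empty: wasteful
coarse = histogram(two_color_image, 4)    # compact, but merges colors 0-63 into one bin
```

With fine quantization almost every bin is empty for this image, while coarse quantization loses the ability to distinguish nearby colors, which is the fixed-size dilemma the text describes.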
Another technique was developed to improve image comparison and image searching while avoiding some of the drawbacks of histograms. Earth Mover's Distance (EMD) is a distance between two distributions that reflects the minimal amount of work required to transform one distribution into the other by moving “distribution mass” around. Computing EMD generally requires substantial computation power because large sets of data are being compared and many computations must occur. There are many ways of determining EMD, but generally they are accomplished in linear time. As datasets grow and images become more complex with higher definition, even a method that runs in linear time is far too inefficient. A number of attempts have been made to improve the efficiency of image comparison.
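For intuition about “moving distribution mass,” the one-dimensional special case of EMD can be sketched in a few lines: for equal-mass distributions on a line, the total work equals the accumulated surplus carried between adjacent bins. This is a minimal sketch for the 1-D case only, not the general EMD computation discussed in the text, and the function name is an assumption.

```python
# Illustrative 1-D Earth Mover's Distance between two equal-mass
# distributions over the same bins (bin spacing taken as 1).
def emd_1d(p, q):
    """Accumulate the signed surplus that must be carried from each bin
    to the next; the total absolute carry is the work performed."""
    assert abs(sum(p) - sum(q)) < 1e-9, "distributions must have equal mass"
    work, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi      # surplus (or deficit) pushed to the next bin
        work += abs(carry)    # moving that surplus one bin costs |carry|
    return work
```

For example, moving one unit of mass a distance of one bin costs 1.0: `emd_1d([1, 0], [0, 1])` returns `1.0`. The general multi-dimensional EMD is a transportation problem and is considerably more expensive, which motivates the efficiency concerns above.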
U.S. Pat. No. 5,999,653 to Rucklidge discloses fast, low-overhead implementations of a powerful, reliable image matching engine based on the Hausdorff distance. In one such implementation, a method is provided in which a processor receives two inputs: the first is a pattern to be recognized in an image; the second, a digital image in which the pattern is to be recognized. The digital image is preprocessed with the processor using various morphological dilation operations so as to produce a set of preprocessed digital images. Thereafter, the processor performs a hierarchical search for the pattern in the digital image. The hierarchical search is performed over a search space and includes a series of decisions, each decision indicating whether a portion of the search space can be eliminated from the search. Each decision is made by performing a plurality of comparisons between the pattern and the preprocessed digital images of the set and analyzing the results of these comparisons. Once the search is complete, the processor outputs a search outcome indicating whether (and where) the pattern has been found in the image.
U.S. Pat. No. 6,748,115 to Gross discloses an improved distance measure for determining a match between models; it modifies the Hausdorff measure by limiting that measure to a single quadrant, resulting in a lower mismatch rate.
Other approaches to improving the efficiency of the distance calculations have approximated the composite distances, or have avoided them altogether, at a cost in accuracy. Many Hausdorff distance calculators sift through all of the data multiple times to obtain the components of the distance function, which is very inefficient.
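The inefficiency described above is easiest to see in the brute-force computation of the Hausdorff distance between two point sets, which repeatedly scans one set for every point of the other. The following sketch is a generic textbook formulation for illustration, not the method of either cited patent; the function names are assumptions.

```python
# Illustrative brute-force Hausdorff distance between two 2-D point sets.
def directed_hausdorff(a, b):
    """h(A, B): for each point of A, find its nearest point in B,
    then take the largest such nearest-neighbor distance."""
    return max(
        min(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in b)
        for ax, ay in a
    )

def hausdorff(a, b):
    """Symmetric Hausdorff distance: max of the two directed distances.
    Each directed pass scans all of B for every point of A, so the
    cost is O(|A| * |B|) -- the repeated sifting criticized above."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

Because every query point triggers a full scan of the other set, the work grows quadratically with the number of points, which is why more sophisticated (e.g., hierarchical) search strategies were pursued.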
Calculating Hausdorff distances and EMD is generally very computationally expensive and involves complex and costly hardware. Although attempts have been made to improve the calculation of Hausdorff distances and EMD, these calculations remain very inefficient.