As is generally known, fuzzy search involves a manner of searching where two strings (contiguous groupings of characters) are considered to be matched if their differences lie within predetermined bounds. Thus, an exact match is not essential as long as, e.g., two strings are similar within a predetermined quantitative parameter. Such a parameter, which can quantify differences between strings, could be represented by edit distance. Edit distance indicates the minimum number of edit operations required to convert one string to another; atomic edit operations used in such a calculation can include, e.g., adding or deleting a character, or replacing a character. Thus, the edit distance between “cat” and “cart” is 1, as it is between “cat” and “can”, or “cat” and “at”. Generally, for given strings of lengths n1 and n2, edit distance calculation can be done in O(n1·n2) time using a conventional dynamic programming algorithm.
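By way of illustration only, the conventional dynamic programming calculation can be sketched as follows in Python. The function names `edit_distance` and `fuzzy_match` and the parameter `max_distance` are hypothetical and chosen merely for this sketch; the three atomic operations are assumed to each have a cost of 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and replace operations.

    Runs in O(len(a) * len(b)) time, keeping only one previous table row.
    """
    # prev[j] is the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] is the distance between a[:i] and the empty string.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            replace_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                 # delete ca from a
                curr[j - 1] + 1,             # insert cb into a
                prev[j - 1] + replace_cost,  # replace ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical helper: two strings match if within max_distance edits."""
    return edit_distance(a, b) <= max_distance
```

Under these assumptions, `edit_distance("cat", "cart")`, `edit_distance("cat", "can")` and `edit_distance("cat", "at")` each evaluate to 1, consistent with the examples given above.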
Fuzzy search, which can be considered to include fuzzy matching, may be of use in a variety of instances, such as in entity resolution. In entity resolution, an instance of an entity mentioned in a structured record or unstructured document is identified; thus, fuzzy matching can be of help when accommodating unstructured documents that may be of low quality (e.g., may include various typographical errors).
Distributed parallel computing provides another viable platform for fuzzy search. Here, processing tasks are generally dispersed across multiple processors operating on one or more computing devices such that the tasks may be executed in parallel. Important implementations of large-scale distributed parallel computing systems are MapReduce by Google®, Dryad by Microsoft®, and the open-source Hadoop® MapReduce implementation. (Google® is a registered trademark of Google Inc. Microsoft® is a registered trademark of the Microsoft Corporation in the United States, other countries, or both. Hadoop® is a registered trademark of the Apache Software Foundation.) In such systems, different machines may each process part of a query independently, with results then being aggregated. One application thus arising involves fuzzy join, where principles similar to those used in fuzzy search are employed.
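Merely as an illustrative sketch, and not as a depiction of any of the systems named above, the partition-then-aggregate pattern can be imitated on a single machine in Python. Worker threads stand in for separate machines; the names `match_partition`, `fuzzy_join`, `max_distance` and `workers` are hypothetical. The sketch is deliberately naive in that every pair of strings is compared.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_partition(partition, right, max_distance):
    """Work done independently by one worker on its share of the left input."""
    return [
        (l, r)
        for l in partition
        for r in right
        if edit_distance(l, r) <= max_distance
    ]


def fuzzy_join(left, right, max_distance=1, workers=2):
    """Naive fuzzy join: split the left input, match each part, then aggregate."""
    # Disperse the left-hand strings across the workers (round-robin split).
    partitions = [left[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(
            lambda part: match_partition(part, right, max_distance),
            partitions,
        )
        # Aggregate the per-worker results into one sorted list of pairs.
        return sorted(chain.from_iterable(partial_results))
```

For instance, `fuzzy_join(["cat", "dog"], ["cart", "dot", "bird"])` would pair “cat” with “cart” and “dog” with “dot”, while all six candidate pairs are nonetheless evaluated.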
Generally, conventional methods and arrangements have fallen short in providing an efficient manner of fuzzy match or fuzzy join that avoids generating a considerable number of superfluous results.