When finding similar records within a set of records (e.g., entity resolution), one of the challenges is overcoming the N-squared problem. That is, it is often undesirable or impracticable to compare each record to every other record in the set to determine which records are similar to each other, because the number of pairwise comparisons grows quadratically with the number of records. Such exhaustive comparison is practical only for relatively small sets of records.
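As an illustration of the scale involved, the following minimal sketch counts the unordered pairs that an exhaustive comparison would have to examine. The function name `pair_count` and the sample names are hypothetical and used only for illustration.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical sample records, for illustration only.
records = ["Ann Lee", "Anne Lee", "Bob Ray", "Robert Ray"]

# Exhaustive comparison enumerates every unordered pair of records.
all_pairs = list(combinations(records, 2))
assert len(all_pairs) == pair_count(len(records))

# The pair count grows quadratically with the number of records.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10,} records -> {pair_count(n):,} comparisons")
```

For example, 1,000 records require 499,500 comparisons, while 100,000 records already require 4,999,950,000.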
Some conventional techniques use record blocking to reduce the number of comparisons. Record blocking places two limits on how fuzzy the comparison of data can be. First, some value must match exactly between the two records, and, second, the data to be matched must appear in a fixed location within the record. Another conventional technique is the nearest neighborhood/sliding window approach, which sorts the data in various ways and compares only those records within a certain distance (window) of the current record. The sort requires a predefined key. However, these conventional approaches may require intimate knowledge of the data in order to pick precisely the parts of the records that are required to match. With the democratization of data, more and more users expect to be able to work with data without having such technical knowledge. Also, these conventional approaches may have difficulty handling free-form data (e.g., relatively unstructured data). For instance, record blocking restricts where the data to be matched needs to appear within the record, and the nearest neighborhood/sliding window approach uses a predefined key, which is difficult to define for free-form data.
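The two conventional techniques described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the record layout, the field names (`name`, `zip`), and the helper names `blocking_pairs` and `window_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001"},
    {"id": 2, "name": "Anne Lee", "zip": "10001"},
    {"id": 3, "name": "Bob Ray", "zip": "94105"},
    {"id": 4, "name": "Ann Lee", "zip": "1001"},  # typo in the zip code
]


def blocking_pairs(records, key_field):
    """Record blocking: pair only records that share an exact value
    in one fixed field (the blocking key)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key_field]].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def window_pairs(records, sort_key, window):
    """Sliding window: sort on a predefined key, then pair each record
    with the next (window - 1) records in sorted order."""
    ordered = sorted(records, key=sort_key)
    pairs = set()
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs


by_zip = blocking_pairs(records, "zip")
by_name = window_pairs(records, sort_key=lambda r: r["name"], window=2)
print(by_zip)   # candidate pairs from blocking on the zip field
print(by_name)  # candidate pairs from a window of 2 over sorted names
```

In this sketch, blocking on `zip` yields only the pair (1, 2): record 4 is never compared with record 1 because its zip code does not match exactly. The window of 2 over sorted names yields (1, 4), (2, 4), and (2, 3) but misses (1, 2). Both functions also require the caller to name a specific field or sort key in advance, which is the kind of knowledge of the data that free-form records make difficult to supply.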