Accurately determining similarity between records may be computationally expensive when trying to find exact and inexact matches to a query in a large table. Traditionally, B-Trees, hashes, and inverted indexes of values or fields may be used (so-called “blocking strategies”) to find candidate matches. These candidates may then be refined in later processing stages using more precise methods.
Hashes or keys may be any value computed from a given record. For example, the first letter of the first name, followed by the first letter of the last name and the last four digits of the social security number, may be a hash or key value. Other examples may include SOUNDEX codes of data fields (SOUNDEX being a phonetic algorithm that indexes names by a representation of their sound) or cryptographic hashes (to preserve confidentiality in the processing of matches). A system based on such hashes or keys may be limited in that it considers only the similarity patterns covered by the keys that the system designer or user explicitly chose. Real-world data variations that do not follow these patterns are at risk of being missed. This may lead to false negatives (missed matches), which may be unacceptable for many queries. Further, it may be difficult to find a good key for the data, for example when dealing with complex product descriptions.
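The key-based blocking described above may be sketched as follows. This is a minimal illustration only, assuming records are dictionaries with hypothetical `first`, `last`, and `ssn` fields; the function names and sample data are illustrative and not part of any particular system. It shows how a single spelling variation in the first name may change the key and cause the candidate lookup to miss the matching record.

```python
from collections import defaultdict


def blocking_key(first_name, last_name, ssn):
    """First initial + last initial + last four SSN digits (illustrative key)."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return (first_name[:1] + last_name[:1]).upper() + digits[-4:]


def build_blocks(records):
    """Group records by blocking key so only same-key records are compared."""
    blocks = defaultdict(list)
    for rec in records:
        key = blocking_key(rec["first"], rec["last"], rec["ssn"])
        blocks[key].append(rec)
    return blocks


# Hypothetical sample data.
records = [
    {"first": "Catherine", "last": "Smith", "ssn": "123-45-6789"},
    {"first": "Robert", "last": "Jones", "ssn": "987-65-4321"},
]
blocks = build_blocks(records)

# Same spelling as the stored record: the key matches.
exact_query = blocking_key("Catherine", "Smith", "123-45-6789")
candidates_exact = blocks.get(exact_query, [])

# "Katherine" differs in the first letter, so the key differs and the
# stored record is never proposed as a candidate (a false negative).
variant_query = blocking_key("Katherine", "Smith", "123-45-6789")
candidates_variant = blocks.get(variant_query, [])
```

In this sketch the exact query retrieves one candidate while the variant query retrieves none, even though both refer to the same person; the later, more precise comparison stage is never given the chance to score the pair.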
B-Trees and other tree-based data structures allow range queries and may find the longest prefix match between the query and the records. But finding only a prefix match may exhibit many of the same issues as match keys, because a variation near the start of a value shortens or eliminates the shared prefix. For example, false negatives may be a risk and may be unacceptable for many queries.
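The prefix behavior described above may be illustrated with a sorted list and binary search standing in for a B-Tree; this substitution is an assumption made for brevity, and the helper names and sample keys are hypothetical. The sketch relies on the property that, in sorted order, the key sharing the longest prefix with a query is adjacent to the query's insertion point. The range helper further assumes that no key contains the character U+FFFF.

```python
import bisect


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_range(sorted_keys, prefix):
    """All keys starting with prefix, found as a contiguous sorted range."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


def longest_prefix_match(sorted_keys, query):
    """Return (key, shared_prefix_length) for the best prefix match."""
    i = bisect.bisect_left(sorted_keys, query)
    neighbors = sorted_keys[max(i - 1, 0): i + 1]
    if not neighbors:
        return None, 0
    best = max(neighbors, key=lambda k: common_prefix_len(k, query))
    return best, common_prefix_len(best, query)


# Hypothetical sample keys.
keys = sorted(["johnson", "johnston", "jonson", "smith"])

in_range = prefix_range(keys, "john")

# A variation late in the value still shares a long prefix.
late_variation = longest_prefix_match(keys, "johnsen")

# A dropped first letter leaves no shared prefix at all, so the
# intended record "johnson" is not found by prefix matching.
early_variation = longest_prefix_match(keys, "ohnson")
```

A variation at the end of the query still lands next to the intended key, whereas a variation in the first character moves the query to an unrelated region of the sorted order, which is the false-negative risk noted above.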
Inverted indexes are full-text indexes that may use a list of all tokens which occur in the data, along with the locations where each token occurs. This method may often be used for finding text documents or emails but may also be applied to structured data. This approach may lead to missed matches when the data and query are not tokenized identically, or when values are split across fields or merged from several fields. Examples include dissimilar data fielding between query and record, and foreign names which can be hard to tokenize and field in a table that was designed for Western names (e.g., AbuAli al-Husayn ibnSina, xiaolongqui).
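The tokenization mismatch described above may be sketched with a small in-memory inverted index. This is an illustrative sketch under stated assumptions: the tokenizer (lowercasing and splitting on non-alphanumeric characters), the all-tokens-must-match query semantics, and the `first` and `last` field names are hypothetical choices, not a description of any particular product.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on any non-alphanumeric character (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(records):
    """Map each token to the set of (record_id, field) locations where it occurs."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, value in fields.items():
            for tok in tokenize(value):
                index[tok].add((rec_id, field))
    return index


def search_all_tokens(index, query):
    """Record ids that contain every query token, in any field."""
    hits = None
    for tok in tokenize(query):
        ids = {rec_id for rec_id, _ in index.get(tok, set())}
        hits = ids if hits is None else hits & ids
    return hits if hits is not None else set()


# Hypothetical sample data; the name is fielded with merged tokens.
records = {
    1: {"first": "AbuAli", "last": "al-Husayn ibnSina"},
    2: {"first": "John", "last": "Smith"},
}
index = build_inverted_index(records)

# Query tokenized the same way as the stored data: record 1 is found.
same_tokens = search_all_tokens(index, "al-Husayn ibnSina")

# Query spells the same name with separated tokens ("abu", "ali",
# "ibn", "sina"), none of which exist in the index: no record is found.
split_tokens = search_all_tokens(index, "Abu Ali al-Husayn ibn Sina")
```

Both queries name the same person, but only the one whose tokens happen to align with the stored tokens retrieves the record.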