The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
A match rule takes two objects as input and returns true if the two objects match and false if they do not. One use case of a match rule is to de-duplicate database objects, such as business contacts, leads, or accounts, which may number into the millions in some databases. If the objects have fields, a match rule may be a Boolean formula on a subset F1, F2, . . . Fn of these fields.
A match rule is intentionally agnostic to the specifics of how a match of two objects is defined on any given field. For any field Fi, an Fi-specific matcher returns true when two given objects match on the field and false when they do not. This matcher may internally do an exact match or some form of fuzzy match. Different fields may, and typically do, have different matchers. In the following match rule examples, “+” denotes “OR,” and the implicit “.” denotes “AND.” Match rule 1 specifies F1F2+F3F4, which means that two objects match if and only if the two objects match either in both F1 and F2 or in both F3 and F4. Match rule 2 specifies F1+F2+F3+F4, which means that two objects match if and only if the two objects match in at least one of their four fields. Match rule 3 specifies (F1+F2)(F3+F4), which means that two objects match if and only if they match in F1 or F2, and also match in F3 or F4.
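The three example match rules can be sketched as Boolean formulas over per-field matchers. The following is a minimal illustration only; the matcher functions (`exact`, `case_insensitive`), the `MATCHERS` assignment, and the dictionary representation of objects are assumptions made for the sketch and are not specified above.

```python
from typing import Callable, Dict, Mapping

Record = Mapping[str, str]
FieldMatcher = Callable[[str, str], bool]


def exact(a: str, b: str) -> bool:
    """Exact field matcher."""
    return a == b


def case_insensitive(a: str, b: str) -> bool:
    """A simple stand-in for a fuzzy matcher (hypothetical)."""
    return a.strip().lower() == b.strip().lower()


# Hypothetical assignment of matchers to fields; different fields may
# use different matchers, and the match rule does not care which.
MATCHERS: Dict[str, FieldMatcher] = {
    "F1": exact,
    "F2": case_insensitive,
    "F3": exact,
    "F4": exact,
}


def field_matches(x: Record, y: Record) -> Dict[str, bool]:
    """Evaluate every field matcher once for the pair (x, y)."""
    return {f: m(x[f], y[f]) for f, m in MATCHERS.items()}


def rule_1(x: Record, y: Record) -> bool:
    """F1F2 + F3F4"""
    m = field_matches(x, y)
    return (m["F1"] and m["F2"]) or (m["F3"] and m["F4"])


def rule_2(x: Record, y: Record) -> bool:
    """F1 + F2 + F3 + F4"""
    return any(field_matches(x, y).values())


def rule_3(x: Record, y: Record) -> bool:
    """(F1 + F2)(F3 + F4)"""
    m = field_matches(x, y)
    return (m["F1"] or m["F2"]) and (m["F3"] or m["F4"])
```

For two objects that agree only on F1 and F3, rule 1 returns false, while rules 2 and 3 return true.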
When a set of objects to be matched, such as a database to be de-duplicated, is large, such as in the millions, comparing every pair of objects using a match rule is too slow. To speed up processing in this situation, some database systems resort to an approach called blocking, which involves generating one or more keys for each object in the collection. The keys are generated in such a way that objects that are likely to match tend to have the same value for at least one of the keys. For example, a database system receives a new object, denoted as a probe, which is being considered for insertion into the system's database, and needs to check if the probe is a duplicate of any of the millions of database objects.
The blocking approach generates suitable keys from the probe and finds all objects, denoted as candidates, in the database having at least one key value in common with the probe's keys. The candidates are then, one by one, compared with the probe using a specified match rule. Using suitable keys, this process typically reduces the number of comparisons from millions of objects in the database to only hundreds of candidates which share a key value with the probe.
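The probe lookup described above can be sketched as follows. This is an illustrative outline rather than a definitive implementation; `keys_for`, `build_index`, `find_duplicates`, and the in-memory dictionary used as the database are hypothetical names and structures chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Mapping, Set, Tuple

Record = Mapping[str, str]
Key = Hashable


def build_index(
    db: Mapping[int, Record],
    keys_for: Callable[[Record], Iterable[Key]],
) -> Dict[Key, Set[int]]:
    """Map each blocking key value to the ids of the objects that carry it."""
    index: Dict[Key, Set[int]] = defaultdict(set)
    for oid, obj in db.items():
        for key in keys_for(obj):
            index[key].add(oid)
    return index


def find_duplicates(
    probe: Record,
    db: Mapping[int, Record],
    index: Mapping[Key, Set[int]],
    keys_for: Callable[[Record], Iterable[Key]],
    match_rule: Callable[[Record, Record], bool],
) -> Tuple[List[int], Set[int]]:
    """Return (ids matching the probe, full candidate set)."""
    # Candidates: every object sharing at least one key value with the probe.
    candidates: Set[int] = set()
    for key in keys_for(probe):
        candidates |= index.get(key, set())
    # Only the candidates are compared with the probe, one by one.
    matches = sorted(oid for oid in candidates if match_rule(probe, db[oid]))
    return matches, candidates
```

The match rule is invoked once per candidate rather than once per database object, which is where the reduction in comparisons comes from.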
One of the simplest blocking approaches is to create a key for each field Fi, i=1, 2, . . . n. Let O=(v1, v2, . . . vn) denote an object, where vi is the value of field Fi. The object is placed in n keys, Fi=ci(vi), i=1, 2, . . . n. Here Fi is the key name, ci(vi) is the key value, and ci is a field-specific coarsening function applied to the field value vi. A non-identity ci is used for fuzzy matching. In the examples below, ci(vi) is assumed to equal vi. Table 1 is a simple example, with n=4:
TABLE 1
Object Id   F1   F2   F3   F4
1           a    b    g    h
2           c    d    j    P
3           e    f    l    R
4           a    d    g    Y
The key map for the data in Table 1 is depicted in Table 2.
TABLE 2
Key name   Key value   Object Id(s)
F1         a           {1, 4}
F1         c           {2}
F1         e           {3}
F2         b           {1}
F2         d           {2, 4}
F2         f           {3}
F3         g           {1, 4}
F3         j           {2}
F3         l           {3}
F4         h           {1}
F4         p           {2}
F4         r           {3}
F4         y           {4}
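The construction of such a key map from the Table 1 data can be sketched as below. The names `TABLE_1`, `COARSEN`, and `build_key_map` are hypothetical; the coarsening functions are set to the identity, as assumed in the text, and the F4 values keep the capitalization shown in Table 1.

```python
from collections import defaultdict
from typing import Callable, Dict, Mapping, Set, Tuple

FIELDS = ("F1", "F2", "F3", "F4")

# Object Id -> (v1, v2, v3, v4), copied from Table 1.
TABLE_1: Dict[int, Tuple[str, str, str, str]] = {
    1: ("a", "b", "g", "h"),
    2: ("c", "d", "j", "P"),
    3: ("e", "f", "l", "R"),
    4: ("a", "d", "g", "Y"),
}


def identity(v: str) -> str:
    return v


# One coarsening function ci per field; identity here, so ci(vi) == vi.
COARSEN: Dict[str, Callable[[str], str]] = {f: identity for f in FIELDS}


def build_key_map(
    objects: Mapping[int, Tuple[str, ...]],
) -> Dict[Tuple[str, str], Set[int]]:
    """Map (key name, key value) to the set of object ids placed under it."""
    key_map: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for oid, values in objects.items():
        for field, value in zip(FIELDS, values):
            key_map[(field, COARSEN[field](value))].add(oid)
    return dict(key_map)


key_map = build_key_map(TABLE_1)
for (name, value), ids in sorted(key_map.items()):
    print(name, value, sorted(ids))
```

Replacing an entry of `COARSEN` with a non-identity function, such as one that lowercases the value, would group objects whose field values differ only in the coarsened-away detail.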
The main drawback of this approach is that when a database is large, the candidate list for the probe can be very large. For example, the candidate list for a probe which has a first name value of John and a last name value of Smith will contain all contacts in the database with a first name of John, plus all contacts in the database whose last name is Smith, and probably additional contacts that share a value with the probe on some other field. For a database which includes millions of contacts, this is a very large candidate list.
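The effect can be illustrated on a toy scale. The contact data below is made up purely for illustration, and `candidates_for` is a hypothetical helper; the point is only that per-field keys yield the union of the per-field buckets, of which few members may be true matches.

```python
from collections import defaultdict
from typing import Dict, Mapping, Set, Tuple

Contact = Tuple[str, str]  # (first name, last name)

# Illustrative data only, not taken from any real database.
contacts: Dict[int, Contact] = {
    1: ("John", "Smith"),
    2: ("John", "Doe"),
    3: ("John", "Lee"),
    4: ("Ann", "Smith"),
    5: ("Raj", "Smith"),
    6: ("Ann", "Lee"),
}


def candidates_for(probe: Contact, db: Mapping[int, Contact]) -> Set[int]:
    """Union of the first-name bucket and the last-name bucket of the probe."""
    by_first: Dict[str, Set[int]] = defaultdict(set)
    by_last: Dict[str, Set[int]] = defaultdict(set)
    for cid, (first, last) in db.items():
        by_first[first].add(cid)
        by_last[last].add(cid)
    return by_first[probe[0]] | by_last[probe[1]]


probe = ("John", "Smith")
cands = candidates_for(probe, contacts)
true_matches = {cid for cid in cands if contacts[cid] == probe}
print(len(cands), len(true_matches))  # 5 candidates, 1 exact match
```

Here five of the six contacts become candidates even though only one is an exact match on both names.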