Deduplication is the process of identifying references in data records that refer to the same real-world entity. A common example is the case where two data records refer to the same person but use different name spellings, or have related address records with minor differences. Deduplication is often used in data cleaning processes. Collective deduplication generalizes this task to finding multiple types of real-world entities in a set of related records. For example, in a collective deduplication scenario, given a database of paper (e.g., journal article) references, an ideal system would identify all records that refer to the same paper. Such a system might also produce a duplicate-free set of all conferences in which a paper was published. It might be assumed that the same paper is not published in several conferences, and this constraint might be expected to hold in the deduplicated output. In general, the output of collective deduplication may be a set of several partitions of the input records (one per entity type) that satisfy constraints in the data.
FIG. 1 shows a concrete example of a deduplication scenario. A database may have a table of authors 100 and a table of papers 102, which may be related to the table of authors 100 by a common key or identifier (“id”). A deduplication process might identify papers and/or authors that are determined to be duplicates (a group of duplicates may be referred to herein as a cluster).
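The paper-and-conference constraint described above can be illustrated with a minimal sketch. The record ids, the cluster labels, and the helper name `violations` are hypothetical and are not taken from FIG. 1. The sketch assumes a deduplication step has already produced one partition per entity type, and it only checks whether those partitions satisfy the constraint that a paper maps to a single conference.

```python
from collections import defaultdict

# Assumed output of a deduplication step (hypothetical data): one partition
# per entity type, expressed as record id -> cluster label.
paper_cluster = {1: "p1", 2: "p1", 3: "p1"}
conference_cluster = {1: "c1", 2: "c1", 3: "c2"}


def violations(paper_cluster, conference_cluster):
    """Return the paper clusters whose records span more than one conference cluster."""
    seen = defaultdict(set)
    for record_id, paper in paper_cluster.items():
        seen[paper].add(conference_cluster[record_id])
    # A paper cluster violates the constraint if it maps to several conferences.
    return {paper: confs for paper, confs in seen.items() if len(confs) > 1}


bad = violations(paper_cluster, conference_cluster)
```

In this hypothetical input, record 3 is clustered with the same paper as records 1 and 2 but with a different conference, so the paper partition and the conference partition are inconsistent with each other. Such a conflict cannot be detected when each entity type is clustered in isolation.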
Prior approaches to deduplication have focused on string similarity, working under the assumption that two strings that are sufficiently similar to each other are likely to correspond to the same real-world entity. However, string similarity alone may not capture constraints that naturally hold in a given set of data. As a result, many clustering algorithms do not incorporate constraints. Furthermore, most prior approaches to database deduplication are inflexible because: (1) they may only allow clustering of a single entity type (e.g., author, or paper, or conference) in isolation, which makes it difficult to answer queries that refer to multiple entity types, such as “how many distinct papers were in ICDE 2008?”; (2) they may ignore constraints, which prevents users from encoding valuable domain knowledge; or (3) they may use constraints in an ad-hoc way, which prevents users from flexibly combining constraints to suit their application needs.
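For illustration, a similarity-only approach of the kind discussed above might be sketched as follows. This is not any particular prior system: the function names, the example strings, and the 0.85 threshold are assumptions, and Python's standard `difflib.SequenceMatcher` stands in for whatever string similarity measure a given system uses. Strings whose similarity ratio meets the threshold are merged into one cluster with a simple union-find structure.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """True if two strings are at least `threshold` similar, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_by_similarity(strings, threshold=0.85):
    """Group strings into clusters of transitively similar strings."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the clusters of similar strings.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if similar(strings[i], strings[j], threshold):
                parent[find(j)] = find(i)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


# Hypothetical author names of a single entity type.
names = ["Jon Smith", "John Smith", "Mary Jones"]
clusters = cluster_by_similarity(names)
```

The sketch handles one entity type at a time and has no place to express a constraint, which reflects limitations (1) and (2) listed above.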
Described below is a declarative framework for collective deduplication of entity references in the presence of constraints, including a language for specifying deduplication constraints, and algorithms for executing programs in the language.