When data from multiple sources is integrated, erroneous duplication invariably results whenever those sources store overlapping information. For example, two organizations may each store information about publications, authors and conferences. Owing to data entry errors, varying conventions and a variety of other reasons, the same data may be represented in multiple ways; for example, an author's name may appear as “Edsger Dijkstra” or “E. W. Dijkstra”. A similar phenomenon occurs in enterprise data warehouses that integrate data from different departments, such as sales and billing, which sometimes store overlapping information about customers.
Such duplicated information can cause significant problems for the users of the data. For instance, errors in data can lead to lost revenue when customers are not charged for services provided. This motivates revenue recovery applications, which reconcile the billing and services databases in an enterprise by integrating them to ensure that every service is billed. Duplicated data can also increase direct mailing costs, because several customers may each be sent multiple catalogs, and can produce incorrect results for analytic queries, leading to erroneous data mining models. Hence, a significant amount of time and money is spent on the task of detecting and eliminating duplicates.
The problem of detecting and eliminating multiple distinct records that represent the same real-world entity is traditionally called the deduplication problem. The problem is challenging because the same entity may be represented in different ways, which renders simple duplicate elimination using “select distinct” queries inadequate.
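As a brief illustration of why exact-match duplicate elimination falls short, the following sketch loads the two name variants mentioned earlier into an in-memory SQLite table and runs a “select distinct” query. The table name `authors` and its single `name` column are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical single-column table holding author names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (name TEXT)")
conn.executemany(
    "INSERT INTO authors VALUES (?)",
    [("Edsger Dijkstra",), ("E. W. Dijkstra",), ("Edsger Dijkstra",)],
)

# SELECT DISTINCT removes only byte-for-byte identical rows.
distinct_names = [
    row[0]
    for row in conn.execute("SELECT DISTINCT name FROM authors ORDER BY name")
]
conn.close()

print(distinct_names)
```

The exact repeat of “Edsger Dijkstra” is removed, but both spellings of the author's name remain, even though they refer to the same person.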
Conventional methods exploit the textual similarity between records. Textual similarity is measured using a similarity function that, for every pair of records, returns a number between zero and one; a higher value indicates a better match, and one corresponds to equality. The task of deduplication is to translate this pairwise information into a partition of the input relation.
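A minimal sketch of such a similarity function is given below. It assumes the character-based ratio from Python's standard `difflib` module as the measure; no particular function is prescribed above, and many others (edit distance, token overlap and so on) would serve equally well. The helper names `similarity` and `pairwise_scores` and the sample records are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 is returned only for identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def pairwise_scores(records):
    """Score every unordered pair of records, keyed by their index pair."""
    return {
        (i, j): similarity(records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
    }


# Illustrative records: two variants of one name and one unrelated name.
records = ["Edsger Dijkstra", "E. W. Dijkstra", "Donald Knuth"]
scores = pairwise_scores(records)

for (i, j), score in scores.items():
    print(f"{records[i]!r} vs {records[j]!r}: {score:.2f}")
```

The two variants of the same name score higher against each other than either does against the unrelated name. These pairwise scores are the input that a deduplication method must turn into a partition.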
While duplication in the real world is an equivalence relation, the relation induced by the similarity function is not necessarily one; for example, it may not be transitive. Conventional work has therefore proceeded by modeling the individual records as nodes and the pairwise matches as edges in a graph, and using a graph partitioning algorithm to find sets of records to be collapsed. However, these methods are inefficient and time-consuming.
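The lack of transitivity, and the graph-based remedy, can be sketched as follows. This example makes several assumptions: token-set Jaccard similarity as the measure, a hypothetical match threshold of 0.6, and connected components (computed with union-find) as a deliberately simple stand-in for a graph partitioning algorithm. The conventional methods referred to above may use different, more elaborate partitioning schemes, and the sample records are invented for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]; an illustrative choice only."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def connected_components(records, threshold):
    """Group records that are linked by a chain of pairwise matches."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # An edge exists between two records when their score meets the threshold.
    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


THRESHOLD = 0.6  # hypothetical cut-off for declaring a pairwise match
records = ["E Dijkstra", "E W Dijkstra", "W Dijkstra", "D Knuth"]

# Records 0 and 1 match, and 1 and 2 match, yet 0 and 2 do not.
print(jaccard(records[0], records[1]) >= THRESHOLD)
print(jaccard(records[1], records[2]) >= THRESHOLD)
print(jaccard(records[0], records[2]) >= THRESHOLD)

clusters = connected_components(records, THRESHOLD)
print(clusters)
```

Here the first and third records are not a direct match, but both match the second, so the component-based grouping places all three in one set. Note that this sketch compares every pair of records, which grows quadratically with the size of the input relation.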