Decision support analysis on data warehouses influences important business decisions; therefore, accuracy of such analysis is important. However, data received at the data warehouse from external sources usually contains errors (e.g., spelling mistakes, inconsistent conventions across data sources). These errors often result in duplicate entry of tuples. Hence, significant time and money are spent on data cleaning, the task of detecting and correcting errors in data.
The problem of detection and elimination of duplicated tuples in a database is one of the major problems in the broad area of data cleaning and data quality. It is often the case that the same logical, real-world entity may have multiple representations in the data warehouse.
For example, when a customer named Isabel purchases products from SuperMegaMarket twice, her name might appear as two different records: [Isabel Christie, Seattle, Wash., USA, 98025] and [Christy Isabel, Seattle, Wash., United States, 98025]. The discrepancy may be due to data entry errors and/or preferences of the salesperson who enters the data.
Such duplicated information can significantly increase direct mailing costs because several customers, like Isabel, may receive multiple catalogs. In direct mailing campaigns with tight budget constraints such errors can be the difference between success and failure of the campaign. Moreover, such errors can cause incorrect query results (e.g., How many SuperMegaMarket customers are there in Seattle?) as well as erroneous analysis model creation.
Ridding a database of seemingly distinct, yet duplicate, entries is the fuzzy duplicate elimination problem. Herein, “fuzzy duplicates” are seemingly distinct tuples (i.e., records) that are not exact matches but nevertheless represent the same real-world entity or phenomenon.
This problem is different from the standard exact duplicate elimination problem, where two tuples are considered duplicates only when they match exactly on all attributes. Unless the context clearly indicates otherwise, assume hereinafter that references to duplicate detection and elimination are focused on the fuzzy duplicate elimination problem.
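The distinction can be illustrated with a minimal sketch of exact duplicate elimination applied to the SuperMegaMarket example. The five-field tuple layout, the third record, and the function name `eliminate_exact_duplicates` are assumptions made for illustration only.

```python
# Records based on the Isabel example; the field layout is assumed.
records = [
    ("Isabel Christie", "Seattle", "Wash.", "USA", "98025"),
    ("Christy Isabel", "Seattle", "Wash.", "United States", "98025"),
    ("Isabel Christie", "Seattle", "Wash.", "USA", "98025"),  # exact copy of the first
]


def eliminate_exact_duplicates(tuples):
    """Keep the first occurrence of each tuple.

    Two tuples count as duplicates only if every attribute is equal.
    """
    seen = set()
    unique = []
    for t in tuples:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


deduped = eliminate_exact_duplicates(records)
```

Only the verbatim copy is removed. The two differently spelled records for Isabel both survive, which is the case that fuzzy duplicate elimination is meant to handle.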
Previous solutions to fuzzy duplicate elimination can be classified into supervised and unsupervised approaches. Supervised approaches learn rules characterizing pairs of duplicates from training data consisting of known duplicates. Further, these approaches assume that the training data exhibit the variety and distribution of errors observed in practice. It is difficult, if not impossible, to obtain such comprehensive training data. This issue was addressed, to a limited extent, by active learning approaches, which have the drawback of requiring interactive manual guidance. In many real data integration scenarios, it is not possible to obtain good training data or interactive user guidance.
The problems of unsupervised duplicate elimination are similar to those of clustering, in that both attempt to partition a dataset into disjoint groups. However, there are some distinct differences between standard clustering formulations and the duplicate elimination problem. These differences will be discussed later.
Current unsupervised approaches tend to ignore these differences and, instead, rely on standard textual similarity functions (e.g., edit distance and cosine metric) between multi-attribute tuples, combined with threshold-based constraints (as in well-known single-linkage clustering algorithms), for detecting duplicate pairs. However, such threshold-based approaches result in large numbers of false positives (tuples that are not true duplicates but are predicted to be so) or large numbers of false negatives (tuples that truly are duplicates but are not recognized as such).
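A threshold-based approach of this kind might be sketched as follows. This is an illustrative sketch only, not any particular prior system: it assumes that each tuple is compared as a single lowercased string, that similarity is edit distance normalized by the longer string's length, and that pairs at or above a global threshold are merged transitively (single linkage). The function names and sample records are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(t1, t2):
    """Similarity in [0, 1]; joining attributes into one string is an assumption."""
    s1, s2 = " ".join(t1).lower(), " ".join(t2).lower()
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s1, s2) / longest


def single_linkage_groups(tuples, threshold):
    """Group tuple indices: any pair with similarity >= threshold is linked,
    and links are closed transitively using union-find."""
    parent = list(range(len(tuples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(tuples)), 2):
        if similarity(tuples[i], tuples[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tuples)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample data; the third record is a different customer.
records = [
    ("Isabel Christie", "Seattle", "Wash.", "USA", "98025"),
    ("Christy Isabel", "Seattle", "Wash.", "United States", "98025"),
    ("Robert Jones", "Portland", "Ore.", "USA", "97201"),
]
groups = single_linkage_groups(records, threshold=0.8)
```

The single global `threshold` is the weak point. Set too low, it links tuples that merely share common tokens such as city and state, and those false links propagate through transitive merging. Set too high, it misses true duplicates whose attributes were reordered or abbreviated.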