Decision support analysis on data warehouses influences important business decisions so that accuracy of data in the warehouses is very important. A significant amount of time and money are spent on data cleaning, the task of detecting and correcting errors in data.
The problem of detecting and elimination duplicated is one of the major problems in the broad area of data cleaning and data quality. It is often the case that the same logical real world entity may have multiple representations in the data warehouse. For example, when a customer named Lisa buys purchases products from Circuit City twice, her name might appear as two different records: [Lisa Simpson, Seattle, Wash., USA, 98025] and [Lisa Simson, Seattle, Wash., United States, 98025]. The discrepancy is due to data entry errors and/or preferences of the salesperson who enters the data. Such duplicated information can significantly increase direct mailing costs because several customers like Lisa may receive multiple catalogs. In direct mailing campaigns with tight budget constraints such errors can be the difference between success and failure of the campaign. Moreover, such errors can cause incorrect results in analysis queries (How many Circuit City customers are there in Seattle?) as well as erroneous analysis models to be built.
The process of detecting and eliminating multiple distinct records representing the same real world entity is a fuzzy duplicate elimination problem. The modifier fuzzy is used since the records are not exact matches. This is in contrast to several other terms—merge/purge, dedup, record linkage. This problem is different from the standard duplicate elimination problem for answering “select distinct” queries, in relational database systems which consider two tuples to be duplicates if they match exactly on all attributes. In the following discussion the term fuzzy is dropped for clarity in expression.
Duplicate elimination is a difficult problem because it is caused by several types of errors. These errors include typographical errors, and equivalence errors—different (non-unique and non-standard) representations of the same logical value. For instance, a user's preferences may dictate whether he or she enters “WA, United States” or “Wash., USA” or “WA, United States of America.” Equivalence errors in product tables (“winxp” for “windows XP Professional”) are different from those in bibliographic tables (“VLDB” for “very large databases”), etc. The number of duplicate tuples due to equivalence errors can be high because an equivalence error may result in several duplicate tuples.
The prior art has addressed the class of equivalence errors by building sets of rules. For instance, there are commercially available cleaning software packages that use such rules to detect errors. Previous domain-independent methods for duplicate elimination rely on threshold-based textual similarity functions (e.g., edit distance or cosine metric), which predict that two tuples whose similarity is greater than a pre-specified threshold are duplicates. However, using these functions to detect duplicates due to equivalence errors (say, “US” and “United States”) requires that the threshold be dropped low enough, resulting in a large number of false positives—pairs of tuples incorrectly detected to be duplicates. For instance, tuple pairs with values “USSR” and “United States” in the country attribute are also likely to be declared duplicates if the software detects “US” and “United States” as duplicates using textual similarity.
Several earlier proposals exist for the problem of duplicate elimination. See I. P. Felligi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183-1210, 1969. B. Kilss and W. Alvey. Record linkage techniques—1985. Statistics of income division. Internal revenue service publication, 1985. M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, pages 127-138, San Jose, Calif., May 1995. Alvaro Monge and Charles Elkan. The field matching problem: Algorithms and applications. In Proceedings of the second international conference on knowledge discovery and databases (KDD), 1996. A. Monge and C. Elkan. An efficient domain independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Ariz., May 1997. W. Cohen. Integration of heterogeneous databases without common domains using queries based in textual similarity. In Proceedings of ACM SIGMOD, pages 201-212, Seattle, Wash., June 1998.
As mentioned above, these methods rely on threshold-based textual similarity functions to detect duplicates, and hence do not detect equivalence errors unless the thresholds are sufficiently lowered; lower thresholds result in an explosion of the number of false positives. The record linkage literature also focuses on automatically determining appropriate thresholds, but still suffers from the false positive explosion. Gravano et al. proposed an algorithm for approximate string joins between relations, which in principle can be adapted to detect duplicate records. Since they use edit distance function to measure closeness between tuples, their technique suffers from the drawbacks of strategies relying only on textual similarity functions.
Most work on data cleaning can be classified into three categories. The first category views the process of cleaning a database as a transformational operation using a set of rules and related domain specific information. The second category consists of domain specific standardization and auditing techniques mostly driven by immediate requirements in the industry. A survey of all available commercial tools is available from Helena Galhardas. Data cleaning commercial tools. http://caravel.inria.fr/˜galharda/cleaning.html. The third category of work focuses on preventing errors from entering the database. Borkar et al. address the problem of minimizing errors in databases due to formatting mistakes and inconsistencies (e.g., the country name appears in the zip code attribute). The important issue of designing and deploying “good” business practices and process flows to prevent problems of deteriorating data quality has been receiving attention as well. Automatic detection of integrity constraints (functional dependencies and key—foreign key relationships) so that they can be enforced in future also has a similar goal. Cleaning operations on existing erroneous databases complement these approaches, which prevent data quality from deteriorating.