Decision support analysis on data warehouses influences important business decisions; therefore, the accuracy of such analysis is crucial. However, data received at the data warehouse from external sources usually contains errors, e.g., spelling mistakes and inconsistent conventions across data sources. Hence, a significant amount of time and money is spent on data cleaning, the task of detecting and correcting errors in data.
Identifying and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality. Many times, the same logical real world entity has multiple representations in the data warehouse. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers—[Lisa Simpson, Seattle, Wash., USA, 98025] and [Lisa Simson, Seattle, Wash., United States, 98025]—due to data entry errors. Such duplicated information can significantly increase direct mailing costs because several customers like Lisa may be sent multiple catalogs. Moreover, such duplicates can cause incorrect results in analysis queries (say, the number of SuperMart customers in Seattle) and erroneous data mining models to be built. The problem of detecting and eliminating multiple distinct records representing the same real world entity is referred to as the fuzzy duplicate elimination problem, which is sometimes also called the merge/purge, dedup, or record linkage problem. See M. Hernandez and S. Stolfo. “The merge/purge problem for large databases.” In Proceedings of the ACM SIGMOD, pages 127-138, San Jose, Calif., May 1995. See also A. Monge and C. Elkan. “An efficient domain independent algorithm for detecting approximately duplicate database records.” In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Ariz., May 1997. And also I. P. Fellegi and A. B. Sunter. “A theory for record linkage.” Journal of the American Statistical Association, 64:1183-1210, 1969.
This problem differs from the standard duplicate elimination problem in relational database systems (say, for answering “select distinct” queries), which considers two tuples to be duplicates only if they match exactly on all attributes. Data cleaning, in contrast, deals with fuzzy duplicate elimination. In the following, the term duplicate elimination is used to mean fuzzy duplicate elimination.
Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and equivalence errors: different (non-unique and non-standard) representations of the same logical value. For instance, a user may enter “WA, United States” or “Wash., USA” for “WA, United States of America.” Equivalence errors are also domain specific: those in product tables (“winxp pro” for “windows XP Professional”) are different from those encountered in bibliographic tables (“VLDB” for “very large databases”), etc.
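The distinction between the two error types can be made concrete with a small sketch (an illustrative Python fragment, not part of any cited system): a typographical error such as “Simson” for “Simpson” yields a small edit distance, whereas an equivalence error such as “Wash., USA” for “WA, United States” remains textually far apart.

```python
def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

# A typographical error is textually close ...
print(edit_distance("Simpson", "Simson"))  # 1
# ... but an equivalence error is textually distant,
# even though both strings denote the same logical value.
print(edit_distance("Wash., USA", "WA, United States"))
```

This is why purely textual similarity, however it is thresholded, struggles with equivalence errors.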
A common approach for identifying duplicates is the use of a duplication function, a function indicating whether two given tuples are fuzzy duplicates or not. Most common duplication functions are thresholded similarity functions. That is, the duplication function has two components: a similarity function sim and a threshold value c. If the similarity sim(v1, v2) between two tuples v1 and v2 is greater than or equal to c, then v1 is identified as a duplicate of v2.
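As a minimal sketch of such a thresholded duplication function (the choice of Jaccard similarity over word tokens as sim, and the threshold c = 0.5, are illustrative assumptions, not prescribed here):

```python
def jaccard(v1: str, v2: str) -> float:
    """Similarity function sim: Jaccard coefficient of the two token sets."""
    t1, t2 = set(v1.lower().split()), set(v2.lower().split())
    return len(t1 & t2) / len(t1 | t2)

def is_duplicate(v1: str, v2: str, c: float = 0.5) -> bool:
    """Duplication function: v1 is a duplicate of v2 iff sim(v1, v2) >= c."""
    return jaccard(v1, v2) >= c

# The two Lisa records share 4 of 6 distinct tokens, so sim = 4/6 >= 0.5.
print(is_duplicate("Lisa Simpson Seattle WA 98025",
                   "Lisa Simson Seattle WA 98025"))  # True
```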
In the prior art, techniques addressing the problem of identifying fuzzy duplicates assumed that the relationship “is a duplicate of” is transitive. That is, if tuple v1 is a duplicate of v2, and v2 is a duplicate of v3, then v1 is also considered a duplicate of v3. However, this approach lacks robustness if the duplication function is not precise. That is, if the duplication function is prone to mistakes, labeling two tuples that are not duplicates as duplicates and vice versa, then the assumption of transitivity further increases the amount of error. In practice, most duplication functions are based on the similarity, according to a chosen similarity function, between two tuples being higher than a threshold. See M. Hernandez et al. and A. Monge et al. cited above. This class of duplication functions is not transitive.
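The non-transitivity of thresholded similarity is easy to demonstrate. In the hypothetical sketch below (the strings, the Jaccard similarity function, and the threshold are all contrived for illustration), v1 is a duplicate of v2 and v2 of v3, yet v1 and v3 share no tokens at all:

```python
def jaccard(v1: str, v2: str) -> float:
    t1, t2 = set(v1.split()), set(v2.split())
    return len(t1 & t2) / len(t1 | t2)

C = 0.33  # illustrative threshold
v1, v2, v3 = "alpha beta", "beta gamma", "gamma delta"

print(jaccard(v1, v2) >= C)  # True  (sim = 1/3)
print(jaccard(v2, v3) >= C)  # True  (sim = 1/3)
print(jaccard(v1, v3) >= C)  # False (sim = 0): transitivity fails
```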
Several proposals have been made for the problem of duplicate elimination. See Fellegi et al. cited above; B. Kilss and W. Alvey. “Record linkage techniques—1985.” Statistics of Income Division, Internal Revenue Service publication, 1985 (available from http://www.bts.gov/fcsm/methodology/); Hernandez et al. cited above; Alvaro Monge and Charles Elkan. “The field matching problem: Algorithms and applications.” In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), 1996; and W. Cohen. “Integration of heterogeneous databases without common domains using queries based on textual similarity.” In Proceedings of ACM SIGMOD, pages 201-212, Seattle, Wash., June 1998.
As mentioned earlier, these methods rely on threshold-based textual similarity functions to detect duplicates, and hence do not detect equivalence errors unless the thresholds are sufficiently lowered; lower thresholds result in an increase in the number of false positives. The record linkage literature also focuses on automatically determining appropriate thresholds (see Fellegi et al. and Kilss et al.), but still suffers from the false positive explosion while detecting equivalence errors. Gravano et al. proposed an algorithm for approximate string joins, which in principle can be adapted to detect duplicate records. See L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. “Approximate string joins in a database (almost) for free.” In Proceedings of the 27th International Conference on Very Large Databases, Roma, Italy, 2001. Since they use the edit distance function to measure closeness between tuples, their technique suffers from the drawbacks of strategies relying only on textual similarity functions. Aslam et al. develop an algorithm for grouping tuples in a relation into “star” clusters, each with one central tuple connected to several fringe tuples. See J. Aslam, K. Pelekhov, and D. Rus. “Static and Dynamic Information Organization with Star Clusters.” In Proceedings of CIKM, pages 208-217, 1998. However, the input to their algorithm already consists of all pairs of fuzzy duplicates, the computation of which is the most expensive part of duplicate identification.
A significant amount of work exists on other related aspects of data cleaning: the development of transformational cleaning operations (see Vijayshankar Raman and Joe Hellerstein. “Potter's Wheel: An interactive data cleaning system.” In Proceedings of VLDB, pages 381-390, Roma, Italy, 2001; and Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian Saita. “Declarative data cleaning: Language, model, and algorithms.” In Proceedings of VLDB, pages 371-380, Roma, Italy, 2001); the detection and correction of formatting errors in address data (see Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. “Automatic segmentation of text into structured records.” In Proceedings of ACM SIGMOD, Santa Barbara, Calif., May 2001); and the design of “good” business practices and process flows to prevent problems of deteriorating data quality (see Felix Naumann and Claudia Rolker. “Do metadata models meet IQ requirements?” In Proceedings of the International Conference on Data Quality (IQ), MIT, Cambridge, 1999). Automatic detection of integrity constraints (functional dependencies and key-foreign key relationships), so that they can be enforced in the future to improve data quality, is complementary to techniques for cleaning existing data. Because of the commercial importance of the data cleaning problem, several domain-specific industrial tools exist; Galhardas provides a nice survey of many commercial tools. See Helena Galhardas. “Data cleaning commercial tools.” http://caravel.inria.fr/˜galharda/cleaning.html.
Other work has exploited the existence of dimensional hierarchies to identify equivalence errors at various levels of the hierarchy. See R. Ananthakrishna, S. Chaudhuri, and V. Ganti. “Eliminating fuzzy duplicates in data warehouses.” In Proceedings of VLDB, Hong Kong, 2002.