Decision support analysis on data warehouses influences important business decisions; therefore, the accuracy of such analysis is important. However, data received at the data warehouse from external sources usually contains errors, e.g., spelling mistakes, inconsistent conventions across data sources, and missing fields. Consequently, significant time and money are spent on data cleaning, the task of detecting and correcting errors in data. A prudent alternative to the expensive periodic cleaning of an entire data warehouse is to avoid introducing errors while new data is added to the warehouse. This approach requires input tuples to be validated and corrected before they are added to the database.
A known technique validates incoming tuples against reference relations consisting of known-to-be-clean tuples in the database. The reference relations may be internal to the data warehouse (e.g., customer or product relations) or obtained from external sources (e.g., valid address relations from postal departments). For example, an enterprise maintaining a relation consisting of all its products may ascertain whether or not a sales record from a distributor describes a valid product by matching the product attributes (e.g., Part Number and Description) of the sales record with the Product relation; here, the Product relation is the reference relation. If the product attributes in the sales record match exactly with a tuple in the Product relation, then the described product is likely to be valid. However, due to errors in sales records, the input product tuple often does not match exactly with any tuple in the Product relation. In that case, errors in the input product tuple need to be corrected before it is stored. The information in the input tuple is still very useful for identifying the correct reference product tuple, provided the matching is resilient to errors in the input tuple. Error-resilient matching of input tuples against the reference table is referred to as a fuzzy match operation.
Suppose an enterprise wishes to ascertain whether or not a sales record describes an existing customer by fuzzily matching the customer attributes of the sales record against the Customer relation. The reference relation, Customer, contains tuples describing all current customers. If the fuzzy match returns a target customer tuple that is either exactly equal or “reasonably close” to the input customer tuple, then the input tuple has been validated or corrected. Closeness between tuples is usually measured by a similarity function. If the similarity between an input customer tuple and its closest reference tuple is higher than some threshold, then the correct reference tuple is loaded. Otherwise, the input is routed for further cleaning before it is considered as referring to a new customer. A fuzzy match operation that is resilient to input errors can effectively prevent the proliferation of fuzzy duplicates in a relation, i.e., multiple tuples describing the same real world entity. See Hernandez et al., “The merge/purge problem for large databases,” in Proceedings of the ACM SIGMOD, San Jose, Calif., May 1995.
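The threshold decision described above can be sketched as follows. This is only a minimal illustration, not the similarity function of any cited method: the per-column character ratio from Python's difflib, the function names, the customer tuples, and the 0.8 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def tuple_similarity(a, b):
    """Stand-in similarity (an assumption): mean per-column character ratio in [0, 1]."""
    ratios = [
        SequenceMatcher(None, (x or "").lower(), (y or "").lower()).ratio()
        for x, y in zip(a, b)  # None (NULL) columns are treated as empty strings
    ]
    return sum(ratios) / len(ratios)


def fuzzy_match(input_tuple, reference, threshold=0.8):
    """Return (closest reference tuple, score), or (None, score) if not above threshold."""
    scored = [(tuple_similarity(input_tuple, r), r) for r in reference]
    score, best = max(scored, key=lambda pair: pair[0])
    if score > threshold:
        return best, score   # validated or corrected: load the reference tuple
    return None, score       # route the input for further cleaning


# Hypothetical reference relation and input tuples.
customers = [
    ("Boeing Company", "Seattle", "WA"),
    ("Bon Corporation", "Seattle", "WA"),
]
match, score = fuzzy_match(("Beoing Company", "Seattle", "WA"), customers)
print(match, round(score, 3))
```

A linear scan over the reference relation, as here, is only suitable for illustration; it says nothing about how an efficient fuzzy match would locate the closest tuple.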
Several methods for approximate string matching over dictionaries or collections of text documents have been proposed (e.g., Gravano et al., “Approximate string joins in a database (almost) for free,” in Proceedings of VLDB, Roma, Italy, Sep. 11-14, 2001, and Navarro et al., “Indexing methods for approximate string matching,” in IEEE Data Engineering Bulletin, 24(4):19-27, 2001). All of the above methods use edit distance as the similarity function and do not consider the crucial aspect of differences in the importance of tokens when measuring similarity.
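The limitation of plain edit distance can be seen in a small sketch. The strings below are hypothetical and the function is a textbook unit-cost Levenshtein distance, not the implementation of any cited method. Because every character edit costs the same, an input differing only in a frequent, uninformative token can end up closer to the wrong reference string.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cs
                curr[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical input and reference strings.
query = "boeing corporation"
for ref in ("boeing company", "bon corporation"):
    print(ref, edit_distance(query, ref))
```

Here the query shares the distinctive token "boeing" with the first reference string, yet edit distance ranks the second string closer because "corporation" versus "company" costs more character edits than "boeing" versus "bon".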
Approximate string matching methods [e.g., R. Baeza-Yates and G. Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997; and G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000), LNCS 1848, 2000.] preprocess the set of dictionary/text strings to build q-gram tables containing a tuple for every string s of length q that occurs as a substring of some reference text string; each such tuple also contains the list of identifiers (or locations) of the strings of which s is a substring. The error tolerant index relation ETI we build from the reference relation is similar in that we also store q-grams along with the list of record identifiers in which they appear, but the ETI (i) is smaller than a full q-gram table because we only select (probabilistically) a subset of all q-grams per tuple, and (ii) encodes column boundaries specific to relational domains.
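A full q-gram table of the kind described above can be sketched as an inverted index from each q-gram to the identifiers of the strings containing it. This is a simplified illustration only: it is not the ETI, since it neither samples a subset of q-grams nor encodes column boundaries, and the function names, the choice q = 3, and the sample strings are assumptions.

```python
from collections import defaultdict


def qgrams(s, q=3):
    """Set of all length-q substrings of s (the whole string if shorter than q)."""
    s = s.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_qgram_table(strings, q=3):
    """Map each q-gram to the set of identifiers of strings that contain it."""
    table = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            table[g].add(sid)
    return table


def candidates(table, query, q=3):
    """Rank string identifiers by the number of q-grams shared with the query."""
    counts = defaultdict(int)
    for g in qgrams(query, q):
        for sid in table.get(g, ()):
            counts[sid] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical reference strings; identifiers are list positions.
reference = ["boeing company", "bon corporation", "microsoft"]
table = build_qgram_table(reference)
print(candidates(table, "beoing company"))
```

Because a misspelled query still shares most of its q-grams with the intended string, the lookup tolerates the transposition in "beoing" without scanning every reference string.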
The information retrieval community has successfully exploited inverse document frequency (IDF) weights for differentiating the importance of tokens or words. However, the IR application assumes that all input tokens in the query are correct, and does not deal with errors therein. Only recently have some search engines (e.g., Google's “Did you mean?” feature) begun to consider even simple spelling errors. In the fuzzy match operation, we deal with tuples containing very few tokens (often about 10 or fewer) and hence cannot afford to ignore erroneous input tokens, as they could be crucial for differentiating amongst many thousands of reference tuples. For example, the erroneous token ‘beoing’ in the input tuple [beoing corporation, Seattle, Wash., NULL] is perhaps the most useful token for identifying the target from among all records of companies in the Seattle area. Clustering and reference matching algorithms [e.g., W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of ACM SIGMOD, Seattle, Wash., June 1998. W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288-321, July 2000. E. Cohen and D. Lewis. Approximating matrix multiplication for pattern recognition tasks. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1997.] using the cosine similarity metric with IDF weighting also share the limitation of ignoring erroneous input tokens. Further, these algorithms improve efficiency by probabilistically choosing a subset of tokens from each document, a step that itself relies on the assumption that input tokens are correct.
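The effect of IDF weighting, and why a misspelled token defeats an exact-token lookup, can be sketched as follows. This is a hedged illustration: the formula log(N / df) is one common IDF variant rather than the weighting of any cited method, and the function name and reference strings are hypothetical.

```python
import math
from collections import Counter


def idf_weights(reference_strings):
    """IDF weight log(N / df) for each whitespace-delimited token (one common variant)."""
    n = len(reference_strings)
    df = Counter()
    for s in reference_strings:
        df.update(set(s.lower().split()))   # count each token once per string
    return {tok: math.log(n / f) for tok, f in df.items()}


# Hypothetical reference strings.
reference = [
    "boeing corporation",
    "bon corporation",
    "acme corporation",
    "seattle lighting company",
]
weights = idf_weights(reference)
print(weights["boeing"], weights["corporation"])
print(weights.get("beoing"))   # the misspelled token has no weight at all
```

The rare token "boeing" receives a higher weight than the frequent token "corporation", but the erroneous input token "beoing" is absent from the reference vocabulary, so a method that looks up exact tokens would simply discard the most informative part of the input.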
As discussed earlier, almost all solutions for the nearest neighbor problem are targeted at data in Euclidean/normed spaces and hence are inapplicable to the present invention. See V. Gaede and O. Gunther. “Multidimensional access methods.” ACM Computing Surveys, 30(2):170-231, 1998. There has been some recent work on general metric spaces [e.g., P. Ciaccia, M. Patella, P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. VLDB 1997. G. Navarro. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28-46, 2002.]. The complexity and performance of these methods are not suitable for the high-throughput systems of interest here. Moreover, many of these solutions cannot be deployed easily over current data warehouses because they require specialized index structures (e.g., M-trees, tries) to be persisted.
Some recent techniques address a related problem of eliminating “fuzzy duplicates” in a relation by using a similarity function to identify highly similar tuples as duplicates. Some are based on the use of edit distance [e.g., M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, San Jose, Calif., May 1995.] and some on cosine similarity with IDF weights [e.g., W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288-321, July 2000.]. Such techniques are designed for use in an offline setting and do not satisfy the efficiency requirements of an online fuzzy match operation, where input tuples have to be quickly matched with target reference tuples before being loaded into the data warehouse. A complementary need is to first clean a relation by eliminating fuzzy duplicates and then pipe further additions through the fuzzy match operation to prevent the introduction of new fuzzy duplicates.