Data cleaning is critical to developing effective business intelligence applications. Failure to ensure data quality can negatively affect downstream data analysis and, ultimately, key business decisions. An important data cleaning operation is record matching, which identifies records that refer to the same real-world entity. For example, owing to errors in the data and to differences in the conventions used to represent it, product names in sales records may not exactly match the corresponding entries in master product catalog tables. In these situations, it is desirable to match similar records across relations.
Record matching is of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distance) and on external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario. In particular, the number of options, in terms of both string matching operations and the choice of external sources, can be daunting.
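As an illustration of the approximate string matching mentioned above, the following minimal sketch computes the Levenshtein edit distance between two strings and derives a normalized similarity score from it. The function names `edit_distance` and `similarity`, and the normalization by the longer string's length, are assumptions made for this example; they are only one of the many possible string matching operations.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings,
    approaching 0.0 as the strings diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Under this sketch, a misspelled product name such as "Micrsoft Office" is one edit away from "Microsoft Office", so the pair scores close to 1.0 even though an exact comparison would fail.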
This problem of matching similar records has been studied in the context of record linkage and of identifying approximate duplicate entities in databases. Given two relations R and S, the record matching problem is to identify pairs of records in R×S that “represent” the same real-world entity. Typically, both R and S can be quite unclean, with many incorrect, abbreviated, and missing values; the task of designing a program that accurately matches records is therefore challenging. Such challenges arise in record matching scenarios across a variety of domains, including, but not limited to, matching customers across two sales databases, matching patient records across two databases in a large hospital, and matching product records across two catalogs. In all these scenarios, a primary requirement in addition to accuracy is that the record matching programs also be efficiently executable over very large relations.
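The problem statement above can be sketched naively as a filter over the cross product R×S. The example below is an illustrative assumption rather than a method described here: it treats each record as a single string, uses Jaccard similarity over lowercase word tokens as a stand-in for "represents the same entity", and applies a hypothetical threshold of 0.5. Because it compares every pair, its cost grows with |R|·|S|, which is exactly why the efficiency requirement over very large relations is hard to meet.

```python
def tokens(record: str) -> set[str]:
    """Split a record into a set of lowercase word tokens."""
    return set(record.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two records."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def match_records(R: list[str], S: list[str],
                  threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return all pairs in R x S whose similarity meets the threshold.

    Naive nested loop: every record in R is compared with every
    record in S, so this does not scale to very large relations.
    """
    return [(r, s) for r in R for s in S if jaccard(r, s) >= threshold]


# Hypothetical sample data for illustration only.
sales = ["Microsoft Office 2007 Professional", "Apple iPod Nano 8GB"]
catalog = ["microsoft office professional 2007",
           "ipod nano 8gb apple",
           "Canon PowerShot"]
matches = match_records(sales, catalog)
```

Here each sales record pairs with the catalog entry that contains the same words in a different order and case, while the unrelated catalog entry is left unmatched.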