Record matching, or record linking, is the task of identifying records that correspond to the same entity, whether in the same data store or in different ones. Record matching is useful for improving data quality and standardization. Accordingly, record matching can be employed in data scrubbing or data cleaning, for example in data warehousing applications or the like.
Data cleaning is an essential step in populating and maintaining data warehouses and central data repositories. A significant data cleaning operation is that of “joining” similar data. For example, consider a sales data warehouse. Owing to errors in the data, such as typing mistakes and differences in conventions or formats, product names and customer names in sales records may not match exactly with a master catalog and reference customer records, respectively. In these situations, it would be desirable to match similar records across relations. This problem of matching similar records has been studied in the context of record linkage and of identifying approximate duplicate entities in databases.
Given two relations R and S, the goal of the record matching or linking problem is to identify pairs of records in R×S that represent the same real-world entity. Most conventional approaches to this problem compare pairs of tuples according to one or more similarity functions and then declare pairs with high similarity to be matches. In one conventional approach, the similarity function determines how many deletions, substitutions or insertions are needed to transform one string into another. For example, “California” may be sufficiently similar (within a threshold) to a mistyped variant such as “Californa” to be deemed a match, as all that is needed is to insert the letter “i.” The main conventional focus is thus on identifying similarity functions and efficient implementations thereof.
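The edit-based similarity function described above can be sketched as follows. This is a minimal illustration, assuming unit cost for each insertion, deletion and substitution (the standard Levenshtein distance); the names `edit_distance` and `match_pairs`, the `max_edits` threshold, and the sample values are hypothetical and not taken from any particular system.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def match_pairs(r, s, max_edits=1):
    """Return the pairs in R x S whose edit distance is within the threshold."""
    return [(x, y) for x, y in product(r, s) if edit_distance(x, y) <= max_edits]


print(edit_distance("Californa", "California"))  # 1
print(match_pairs(["Californa", "Texas"], ["California", "Nevada"]))
```

Comparing every pair in R×S is quadratic in the number of records, which is one reason efficient implementations of similarity functions receive so much attention.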
Moreover, it is often unclear whether any single similarity function will be best in all scenarios. Hence, recent work has focused on identifying and utilizing a combination of similarity functions. For instance, if function A produces a value greater than a threshold and function B yields a result greater than another threshold, then the entities can be treated as matching.
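The combination rule described above, in which each function must exceed its own threshold, might look like the following sketch. The two similarity functions (`char_similarity`, built on the standard library's `difflib.SequenceMatcher`, and `token_jaccard`), the `combined_match` helper and the threshold values in `RULES` are illustrative assumptions only; the text does not prescribe specific functions or thresholds.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive (hypothetical function A)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (hypothetical function B)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0  # treat two empty strings as having no evidence of a match
    return len(ta & tb) / len(union)


def combined_match(a: str, b: str, rules) -> bool:
    """Declare a match only if every function exceeds its own threshold."""
    return all(func(a, b) > threshold for func, threshold in rules)


# Illustrative thresholds; real values would be tuned for the data at hand.
RULES = [(char_similarity, 0.8), (token_jaccard, 0.5)]

print(combined_match("Acme Corp", "acme corp", RULES))
print(combined_match("Acme Corp", "Zenith Ltd", RULES))
```

Requiring every function to pass is only one way to combine them; a disjunction of such conjunctions, or a weighted score, would fit the same structure.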