Record matching, the task of finding records in a database that refer to the same entity or are semantically related, is a challenging problem in many increasingly important data management applications. It is often considered a critical part of data cleaning tools and ETL (Extract, Transform, Load) technologies. In addition, there is a growing need for record matching in semantic data management and the Semantic Web. Accurate and efficient matching of data records allows high-quality data sources to be published and maintained, and avoids the creation of “islands of data” or “data silos”, a problem well recognized in the Semantic Web community.
Existing record matching techniques perform matching based on string similarity, (ontology-based) semantic relationships, the existence of co-occurrence information, or limited combinations thereof. However, these techniques often fail to capture similarities that occur in real-world matching and linking scenarios, or they produce false positives (i.e., they match records that do not refer to the same entity). Our experience in matching and linking records in many real data sets suggests that a major source of these failures is the lack of a flexible matching operator that, in addition to using string similarities and semantic relationships between full record values, can use semantic information about the different parts of the values stored in the records.
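To make the distinction concrete, the following minimal sketch contrasts a similarity computed over full record values with one that first applies semantic knowledge to the individual parts (tokens) of each value. It is an illustration only, not the operator proposed in this work: the function names, the `TOKEN_SYNONYMS` table and the example values are all hypothetical, and Jaccard overlap is used merely as a simple stand-in for a token-level measure.

```python
from difflib import SequenceMatcher

# Hypothetical token-level semantic knowledge (assumed for illustration only).
TOKEN_SYNONYMS = {
    "intl": "international",
    "corp": "corporation",
    "st": "street",
}


def full_value_similarity(a: str, b: str) -> float:
    """String similarity computed over the full record values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize_tokens(value: str) -> set:
    """Split a value into tokens and map each token to a canonical form."""
    tokens = value.lower().replace(".", "").split()
    return {TOKEN_SYNONYMS.get(token, token) for token in tokens}


def token_aware_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the canonicalized token sets of two values."""
    tokens_a, tokens_b = normalize_tokens(a), normalize_tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    left = "Intl Business Machines Corp."
    right = "International Business Machines Corporation"
    print("full-value :", round(full_value_similarity(left, right), 3))
    print("token-aware:", round(token_aware_similarity(left, right), 3))
```

Under these assumptions, the two example values receive a perfect token-aware score because every token maps to the same canonical form, whereas the full-value string similarity is penalized by the abbreviations. A practical operator would also need to weight tokens and guard against the false positives that overly permissive synonym tables can introduce.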