Many increasingly important data management and mining tasks require the integration and reconciliation, or fusion, of data that reside in multiple large, heterogeneous data sources. Data integration is generally defined as combining data that reside in different sources and providing users with a unified view of the data. In data fusion, duplicates are merged, and conflicting attribute values are identified and possibly repaired, in order to provide a single consistent value for each data attribute. Data fusion therefore involves duplicate detection, also known as entity resolution or record linkage, where the goal is to identify data records that refer to the same entity.
The first step in a data integration or fusion system is the identification of “linkage points” between the data sources, i.e., correspondences between attributes of the data sources that can be used to link their records or entities. Traditionally, this is performed by schema matching, where the goal is to identify the schema elements of the input data sources that are semantically related. However, the massive growth in the amount of unstructured and semi-structured data in data warehouses and on the Web has created new challenges for this task. With the increasing size and heterogeneity of data sources, the task can no longer be performed manually using simple user interfaces, or with specific heuristics that work well only for a certain type of data or domain. In addition, the noise and errors present in data extracted from text documents or large legacy repositories make the task even more challenging.
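To make the notion of a linkage point concrete, the following minimal sketch scores every attribute pair across two small sources by the Jaccard overlap of their normalized value sets and keeps the pairs above a threshold. It is an illustration only, not the method proposed in this paper: the function names, the sample records, the lowercase-and-trim normalization, and the 0.3 threshold are all assumptions made for the example.

```python
from itertools import product


def value_set(records, attribute):
    """Collect the normalized (trimmed, lowercased) values of one attribute."""
    return {
        str(r[attribute]).strip().lower()
        for r in records
        if r.get(attribute) is not None
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_linkage_points(source_a, source_b, threshold=0.3):
    """Return (attr_a, attr_b, score) triples whose value overlap meets the threshold.

    Hypothetical helper: a naive value-overlap heuristic, assumed for illustration.
    """
    attrs_a = sorted({k for r in source_a for k in r})
    attrs_b = sorted({k for r in source_b for k in r})
    scored = []
    for attr_a, attr_b in product(attrs_a, attrs_b):
        score = jaccard(value_set(source_a, attr_a), value_set(source_b, attr_b))
        if score >= threshold:
            scored.append((attr_a, attr_b, score))
    # Highest-scoring attribute pairs first.
    return sorted(scored, key=lambda t: t[2], reverse=True)


# Made-up sample data with differing schemas and inconsistent formatting.
companies = [
    {"name": "Acme Corp", "ticker": "ACM", "city": "Boston"},
    {"name": "Globex", "ticker": "GLX", "city": "Austin"},
    {"name": "Initech", "ticker": "INT", "city": "Dallas"},
]
filings = [
    {"symbol": "acm", "filer": "ACME Corporation", "year": 2011},
    {"symbol": "GLX ", "filer": "Globex Inc.", "year": 2012},
    {"symbol": "int", "filer": "Initech LLC", "year": 2012},
]

links = candidate_linkage_points(companies, filings)
print(links)
```

In this toy input, only the `ticker`/`symbol` pair survives, because exact matching after simple normalization cannot relate "Acme Corp" to "ACME Corporation". That gap is one small instance of the noise and heterogeneity problem described above: a heuristic tuned to one kind of value tends to miss correspondences in another.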