The semantic web is an extension of the world wide web that incorporates semantics into the data or web pages that are accessed and downloaded across the internet. Discovering links between data elements of different data sources is a fundamental problem for the emerging semantic web, as well as for traditional data integration systems. These links are the building blocks for searching, querying, and other higher-level services. Link discovery techniques fall into several families: syntactic methods, many derived from similarity measures developed in Information Retrieval (IR); structural methods, e.g., using foreign key relationships between relational database tables to establish links; semantic methods, which use dictionaries, taxonomies, and ontologies to determine relationships between data elements; and instance-based methods, which compare the similarity of the values of model entities, e.g., values of columns in a relational database or instances of concepts in ontologies. Many tools and frameworks have been developed to facilitate link discovery; however, even the best and most automated of these require their users to specify both the subsets of the data sources to attempt to link and the link discovery methods to use.
Therefore, users need a reasonably deep understanding of the data sources and their elements. This limits both which data sources will be linked (only those with which a user is familiar) and how many data sources will be linked, since these specifications take time to create. Furthermore, the ability of existing linking systems to scale to a large number of medium to large data sources is limited by the quadratic number of comparisons that must be performed to exhaustively check for links between pairs of elements of different data sources. As a result, some standard analytic techniques that assume complete access to all of the data are not applicable. To alleviate this burden, more automatic, dynamic, and scalable methods are needed for determining which data elements to link across data sources.
Previous attempts to link data sources propose methods based on structural similarity, with many requiring user interaction. Very few existing methods even attempt to use instance values for matching data elements, and such instance-based methods have been described as useful but prohibitively expensive. Methods that do use instance values include: methods that use values only for validation; methods that look at the distribution and other properties of the values to infer metadata about the elements; methods that use only a selected sample of instance values; methods that are expensive and do not avoid the quadratic number of comparisons; and methods that rely on particular properties of instance values and therefore work only in a limited set of applications. For example, recent work proposes relying on a domain order and sorting of the values to improve efficiency. This approach works only in a limited number of domains, and even in those domains it could suffer from other shortcomings such as low accuracy, with both false positives and false negatives.
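To make the cost of exhaustive instance-based matching concrete, the following is a minimal sketch of comparing every pair of columns drawn from different data sources. It is not any particular system's method: the Jaccard set similarity, the 0.5 threshold, the function names, and the toy `crm` and `sales` sources are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two value collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def exhaustive_links(columns, threshold=0.5):
    """Compare every pair of columns that come from different sources.

    `columns` maps a (source, column) key to its list of instance values.
    The threshold is an illustrative assumption, not a recommended value.
    Returns the candidate links and the number of comparisons performed.
    """
    links, comparisons = [], 0
    for (k1, v1), (k2, v2) in combinations(columns.items(), 2):
        if k1[0] == k2[0]:
            continue  # skip pairs of columns from the same source
        comparisons += 1
        score = jaccard(v1, v2)
        if score >= threshold:
            links.append((k1, k2, score))
    return links, comparisons


# Hypothetical toy data: two sources with two columns each.
columns = {
    ("crm", "country"): ["US", "DE", "FR", "JP"],
    ("crm", "customer_id"): ["c1", "c2", "c3"],
    ("sales", "nation"): ["US", "DE", "FR", "BR"],
    ("sales", "order_id"): ["o1", "o2", "o3"],
}
links, comparisons = exhaustive_links(columns)
print(links, comparisons)
```

In this sketch the number of column pairs examined grows quadratically with the total number of columns across sources, and each comparison touches the full set of instance values, which is why exhaustive checking becomes impractical for many medium to large data sources.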