The problem of integrating data from multiple sources is becoming more acute, with the increasing spread of electronic data storage. According to the foreword of the recent IJCAI-03 Workshop on Information Integration on the Web:
“Effective integration of heterogeneous databases and information sources has been cited as the most pressing challenge in spheres as diverse as corporate data management, homeland security, counter-terrorism and the human genome project. An important impediment to scaling up integration frameworks to large-scale applications has been the fact that the autonomous and decentralized nature of the data sources constrains the mediators to operate with very little information about the structure, scope, profile, quality and inter-relations of the information sources they are trying to integrate.”
(See: www.isi.edu/info-agents/workshops/ijcai03/proceedings.htm)
The problem has a long history and has been considered from two perspectives: instance (or record)-based approaches and schema (or ontology)-based approaches. The term “schema” can be taken to mean a framework for representing information about real world objects (for example, employees) in a computerised information storage system. A schema comprises (in general) a number of attributes applicable to each object (such as payroll number, first name, surname, age, etc.), and possibly information about restrictions on the values of attributes. A data source is a representation of a set of objects by means of their associated attribute values.
The problem of record linkage was identified in the USA public health area, when combining different records that (possibly) referred to the same patient. Newcombe [1] proposed a frequency-based approach which was later formalised by Fellegi and Sunter [2]. These approaches assume that the two data sources have common attributes, and are commonly applied to the so-called “merge/purge” problem in business databases to filter out duplicate entries. The methods focus on calculating a weight for each attribute in the database, according to the likelihood of finding matching values within that attribute's domain (i.e. the set of all values appearing in the column).
The initial formulation treated binary matches (true/false) but was extended to categorical matches (one of a small set of values) and continuous matches (e.g. a number in the interval [0, 1]). By assuming conditional independence between records matching on different attributes, it is possible to estimate the conditional probability of a match on each attribute, given that the records are (or are not) identical, and hence to find thresholds for classifying two records as matching or not according to the weighted sum of matches. The estimation can be on the basis of minimum error probabilities, expectation maximisation, utility (cost of an incorrect decision), etc.; see [3] for an overview.
These methods implicitly take into account knowledge of the database schema, as they assume each record consists of the same set of attributes.
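The weighting scheme described above can be illustrated with a minimal sketch, assuming binary (agree/disagree) comparisons and conditional independence between attributes. The attribute names, the m and u probabilities and the two thresholds are hypothetical values chosen for illustration only; in practice they would be estimated from data, and the sketch is not taken from any of the cited references.

```python
import math

# Hypothetical per-attribute probabilities (illustrative values only):
#   m = P(attribute agrees | records refer to the same entity)
#   u = P(attribute agrees | records refer to different entities)
M_U = {
    "surname": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "age": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum the log-likelihood-ratio weights over all attributes.

    Conditional independence lets the per-attribute weights be added.
    """
    total = 0.0
    for attr, (m, u) in m_u.items():
        if rec_a[attr] == rec_b[attr]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, lower=0.0, upper=6.0):
    """Classify a record pair using two (hypothetical) thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


a = {"surname": "Smith", "first_name": "John", "age": 40}
b = {"surname": "Smith", "first_name": "Jon", "age": 41}
print(classify(match_weight(a, b)))
```

Note that the sketch compares attributes by name, so it relies on both records sharing the same set of attributes.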
The record linkage problem was extended to analytic linkage (also referred to as entity matching) by considering the combination of data taken from two or more sources, e.g. the integration of heterogeneous databases. Dey et al [4] give a summary of probabilistic approaches, based on the same framework as the record linkage work outlined above. Again, knowledge of the schema is assumed in that matching pairs of attributes are known.
These methods use several techniques to try to match attributes, such as standardising the form of names and addresses, and applying heuristics (for example, a match on the first n characters, common substrings, or an edit distance below a specified threshold). Bilenko, Mooney et al [5] describe “SoftTF-IDF”, an adaptive matching function, which takes account of the frequencies of similar and identical words within a domain.
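The three heuristics mentioned above can be sketched as follows. This is an illustrative sketch only: the function names, the default prefix length and the default distance threshold are assumptions, and the adaptive SoftTF-IDF function of [5] is not implemented here.

```python
from difflib import SequenceMatcher


def first_n_match(a, b, n=3):
    """True if the first n characters agree (case-insensitive)."""
    return a[:n].lower() == b[:n].lower()


def longest_common_substring(a, b):
    """Return the longest contiguous block of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def within_edit_threshold(a, b, max_dist=2):
    """True if the edit distance does not exceed a (hypothetical) threshold."""
    return edit_distance(a, b) <= max_dist


print(first_n_match("Johnson", "Johnston"))
print(longest_common_substring("Johnson", "Johnston"))
print(within_edit_threshold("Johnson", "Johnston"))
```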
The problem can also be approached at the schema level, by looking at labels (i.e. attribute names) and constraints associated with allowed values.
Several tools have been proposed to aid in the automation of this problem, including Cupid [6], GLUE [7], OntoBuilder [8] and Prompt [9].
Rahm and Bernstein [10] survey some of these tools and classify schema-matching into three main groups, with methods arising from the fields of:
information retrieval: using distance-based matching techniques such as the edit distance to overcome the inadequacy of exact, “keyword-based” matching. These assume the use of fairly simple mappings between attribute domains;
machine learning: using algorithms to create a mapping between attributes based on the similarity among their associated values. Bayesian classifiers are the most common approaches (e.g. GLUE [7] and Autoplex [11]);
graph theory: representing schemata in tree or graph form, e.g. the TreeMatch algorithm [6], which estimates the similarity of leaf nodes in an XML DTD by estimating the similarity of their ancestors.
There are also a number of hybrid approaches to schema-matching which combine methods from the above categories.
Gal et al [12] recognised a need to include uncertainty in the matching process, and outlined a fuzzy framework for schema integration. Gal has also looked at the problem of evaluating the matching between schemata, compared to a notional “ideal” matching that would be produced by a human.
Search Software America, now using the name “Identity Systems”, markets a name and address matching package which:
“automatically overcomes the vast majority of problems arising from spelling, typing and transcription errors; nicknames, synonyms and abbreviations; foreign and anglicized words; prefix and suffix variations; the concatenation and splitting of words; noise words and punctuation; casing and character set variations”
(See http://www.identitysystems.com/)
Although full technical details are not available, this software appears to implement a matching service based on the standard probabilistic record-linkage algorithms outlined above.
Two further papers by Gal et al ([13] and [14]) look at mappings between schemata by combining mappings between an attribute in one schema and a “similar” attribute in a second schema. The mapping is represented as a fuzzy relation; one consequence of this is that the mapping must be symmetric. These papers suggest using a simple weighted average to combine mappings between pairs of attributes into a mapping between schemata. In some cases they consider a wider range of factors in matching attributes, taking account of attribute names as well as attribute values. They are not concerned with mappings between entities. Indeed, it does not appear from the experiments (Gal et al [13], section 6) that they have considered mappings between entities, focussing instead on the relation between each approximate mapping (between attribute pairs) and a human-defined “best mapping” (Gal et al [13], section 6.3).
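As an illustration of combining attribute-level mappings by a simple weighted average, a minimal sketch is given below. The function name, the attribute pairs, the similarity scores and the weights are all hypothetical; the sketch shows only the general form of a weighted average and is not the specific formulation of [13] or [14].

```python
def schema_mapping_score(pair_scores, weights=None):
    """Combine attribute-pair similarities (each in [0, 1]) into one score.

    pair_scores: dict mapping (attr_in_schema_1, attr_in_schema_2) -> similarity
    weights:     optional dict with a non-negative weight for each pair;
                 equal weights are assumed if omitted.
    """
    if not pair_scores:
        raise ValueError("at least one attribute pair is required")
    if weights is None:
        weights = {pair: 1.0 for pair in pair_scores}
    total_weight = sum(weights[pair] for pair in pair_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[pair] * score
                       for pair, score in pair_scores.items())
    return weighted_sum / total_weight


# Hypothetical attribute-pair similarities between two schemata
scores = {("surname", "last_name"): 0.9, ("age", "date_of_birth"): 0.5}
print(schema_mapping_score(scores, {("surname", "last_name"): 2.0,
                                    ("age", "date_of_birth"): 1.0}))
```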
The paper by Ying Ding and Schubert Foo [15] is a survey, focussing on the ontology mapping problem in the world wide web (regarding an ontology as roughly equivalent to a schema). The methods surveyed rely on manual input (see table 2) and do not address the issue of uncertainty in the mapping between attribute values and in the mapping between objects. Much of the focus is on the problem of ontology maintenance and evolution.
Prior art patent documents include the following:
US2005060332 (Bernstein et al), which describes a method for schema matching (rather than object matching). It uses mappings between attributes but then combines these into an overall measure for a mapping between schemata using an arbitrary formula;
US2004158567 (Dettinger et al), which describes a system for assisting the manual development of mappings between schemata, by examining constraints associated with an attribute from one schema and only proposing candidate attributes (from the second schema) whose values obey those constraints. The mappings between attributes are crisp, and do not take account of uncertainty; and
US2005055369 (Gorelik et al), which relates to a schema matching problem in relational databases and produces a mapping between objects represented in different databases and a “universal” set of objects (UDO). The mappings between attributes are crisp, i.e. do not involve any uncertainty, and a mapping is chosen if the proportion of entities it links is greater than some threshold. Accepted mappings between attributes are combined to give a mapping between objects using join operations on the database, i.e. by using crisp equality with no scope for any partial matching.
A problem remains of how best to create a mapping between two (or more) data sources which represent the same, approximately the same, or at least partially overlapping sets of objects, but which use different schemata, i.e. the two sources have different sets of attributes.
In general, where prior art approaches are based on record matching, they assume at least some knowledge of the schema, i.e. it is necessary to specify at least some attributes which correspond to those in another database.