A schema describes a database supported by a database management system (DBMS). For example, in a relational database, the schema defines elements such as tables, fields, relationships, views, indexes, procedures, and functions. Schemas are generally stored in a dictionary. In essence, the elements of the schema define objects in the database.
When there are multiple databases, each database can have its own associated schema, which may differ from the schemas of the other databases. In this case, elements in different schemas can correspond to the same object. It is often necessary to discover and establish an explicit mapping between such elements. Applications include data migration, when one database has to be merged with another; virtual databases, where a single interface is used to access the multiple databases; and data analysis, when the multiple databases are stored in a data warehouse with a single schema.
Given two schemas, an objective of an automatic schema matching (ASM) method is to indicate which pairs of elements from the two schemas are likely to match, and to assign a similarity value to each such match. An ASM method typically uses a matcher. Numerous different types of matchers are known. The ASM problem is usually very difficult, because database designers rarely provide full and unambiguous information about the elements in the schemas, and if any such information exists, it is not meant for the ASM. Rather, database designers usually select suitable words or abbreviations for the names of elements, so as to facilitate maintenance.
In such cases, a lexical analysis of the names of the elements is an important approach for ASM. A different type of information that might be useful for the ASM is the structure of the schemas. In many cases, schemas are not represented by a flat list of element names; rather, the elements are organized in a hierarchy. For example, a parent element “CustomerName” can have three child elements “FirstName”, “MiddleInitial”, and “FamilyName”. Using such structural information is another technique for the ASM. Many more techniques are known. For example, when the values of two database fields come from the same statistical distribution, e.g., over names or numbers, this can serve as evidence that the corresponding elements in the schemas match. Dictionaries, thesauri, and other auxiliary data sources have also been used for ASM purposes.
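The lexical approach described above can be illustrated with a minimal Python sketch. It is not a matcher taken from any particular ASM system: the function names `tokenize` and `name_similarity`, the tokenization rules, and the choice of Jaccard similarity over word tokens are all assumptions made for illustration.

```python
import re


def tokenize(name):
    """Split an element name into a set of lowercase word tokens.

    Assumed conventions: camelCase boundaries, hyphens, underscores,
    and whitespace all separate words.
    """
    # Insert a space at each lowercase-or-digit to uppercase boundary.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return {t.lower() for t in re.split(r"[\s\-_]+", spaced) if t}


def name_similarity(a, b):
    """Jaccard similarity of the two token sets, a value in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0  # two empty names carry no evidence of a match
    return len(ta & tb) / len(ta | tb)


# Hypothetical element names from two schemas.
print(name_similarity("FirstName", "first_name"))      # same tokens
print(name_similarity("FamilyName", "LastName"))       # one shared token of three
print(name_similarity("MiddleInitial", "FamilyName"))  # no shared tokens
```

A purely lexical matcher of this kind gives "FamilyName" and "LastName" only a partial score even though they denote the same object, which is one reason auxiliary sources such as thesauri are also used.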
Due to the difficulty of the problem, no single matcher is known to perform best on all ASM tasks. This has led to the idea that multiple matchers of the types described above can be used in combination in a composite matcher. The purpose of the composite matcher is to combine the output of the individual matchers and provide a more accurate set of matching elements.
In most cases, the output of an individual matcher k for a pair of elements S1·Ei and S2·Ej is a similarity value vk in the interval [0, 1], where vk=0 means no similarity at all and vk=1 means an exact match. For a library of K different matchers, an objective is to determine a composite similarity value v that is a function of the individual similarity values vk, k=1 to K.
Several methods for combining similarity values are known. Those methods are largely heuristic approaches to the fundamental problem of combining evidence from multiple sources. One method uses machine learning to estimate weighting coefficients wk such that the final similarity value v is a weighted average of the individual similarity values:
$$v = \sum_{k=1}^{K} w_k v_k.$$
Another method extends the above approach with minimum and maximum operators: $v_{\min} = \min_k w_k v_k$ and $v_{\max} = \max_k w_k v_k$. Those methods for combining similarity values lead to a higher matching accuracy than that of the individual matchers.
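The three combination rules above can be written out as a short Python sketch. The function names and the numeric values are hypothetical, and the weights are assumed to be non-negative and to sum to 1 so that the weighted sum is a weighted average; in the method described above the weights would be estimated by machine learning rather than chosen by hand.

```python
def weighted_average(values, weights):
    """Composite v = sum over k of w_k * v_k."""
    return sum(w * v for w, v in zip(weights, values))


def weighted_min(values, weights):
    """Composite v_min = min over k of w_k * v_k."""
    return min(w * v for w, v in zip(weights, values))


def weighted_max(values, weights):
    """Composite v_max = max over k of w_k * v_k."""
    return max(w * v for w, v in zip(weights, values))


# Hypothetical outputs of K = 3 matchers for one pair of elements.
values = [0.8, 0.5, 0.2]
# Hypothetical weights, assumed to sum to 1.
weights = [0.5, 0.3, 0.2]

print(weighted_average(values, weights))  # about 0.59
print(weighted_min(values, weights))      # about 0.04
print(weighted_max(values, weights))      # about 0.40
```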
When combining evidence from multiple sources as described above, the improper modeling of correlation and other forms of statistical dependence between variables in the problem domain can be a major cause for errors. For example, when two very similar matchers k and l are applied to the ASM problem, their similarity values vk and vl are highly correlated. When vk is large, then vl is also large, and vice versa.
If the above weighted average of the two similarity values is used, then the same evidence is effectively counted twice. In practice, this results in a phenomenon known as over-confidence. One of the matchers is almost redundant, and including the redundant matcher in the composition process can actually decrease the accuracy of the matching. This effect has been observed in other fields where statistical evidence has to be combined, such as medical diagnosis.
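The double-counting effect can be made concrete with a small numeric sketch. The matcher names and similarity values below are invented for illustration, and equal weights are assumed; the sketch shows only the mechanism and makes no claim about any particular matcher library.

```python
def equal_weight_average(values):
    """Weighted average with all weights equal to 1/K (an assumption)."""
    return sum(values) / len(values)


# Two matchers that look at different kinds of evidence.
independent = {"lexical": 0.9, "structural": 0.3}

# The same library after adding a near-duplicate of the lexical matcher.
# Its output is highly correlated with "lexical" and adds no new evidence.
with_duplicate = {"lexical": 0.9, "lexical_copy": 0.9, "structural": 0.3}

v_two = equal_weight_average(list(independent.values()))
v_three = equal_weight_average(list(with_duplicate.values()))

print(v_two)    # about 0.6
print(v_three)  # about 0.7
```

The composite value rises from about 0.6 to about 0.7 although no new information about the pair of elements has been introduced; the lexical evidence is simply weighted twice.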
It is desired to correct this over-confidence problem when matching elements in schemas.