A portion of the disclosure of this patent document may contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright © 2001, Microsoft Corp.
The present invention relates to model or schema matching, or more generally to the matching of separate hierarchical data sets. More particularly, the present invention relates to methods and systems for matching models, or schemas, that discover similarity coefficients between schema elements, including analyses based on one or more of schema names, schema data types, schema constraints and schema structure.
Match is a schema manipulation operation that takes two schemas, models or otherwise hierarchically represented data as input and returns a mapping that identifies corresponding elements in the two schemas. Schema matching is a critical step in many applications. For example, in e-business, match helps to map messages between different extensible markup language (XML) formats. In data warehousing, match helps to map data sources into warehouse schemas. In mediators, match helps to identify points of integration between heterogeneous databases. Schema matching thus far has primarily been studied as a part of other applications. For example, schema integration uses matching to find similar structures in heterogeneous schemas, which are then used as integration points. Data translation uses matching to find simple data transformations. Given the continued evolution and importance of XML and other message mapping, match solutions are likely to become increasingly important in the future.
Schema matching is challenging for many reasons. First and foremost, schemas for identical concepts may have structural and naming differences. In addition, schemas may model similar, yet slightly different, content. Schemas may be expressed in different data models. Schemas may use similar words that nonetheless have different meanings, etc.
Given these problems, schema matching today is done manually by domain experts, sometimes using a graphical tool that depicts a first schema according to its hierarchical structure on one side and a second schema according to its hierarchical structure on the other side. The graphical tool enables a user to select and visually represent a chosen mapping to see how it plays out vis-à-vis the remaining schema elements. At best, some tools can detect exact matches automatically, although even minor name and structure variations may lead them astray. Despite match being such a pervasive, important and difficult problem, model matching has not yet been studied independently except as it may apply to other, narrower problems, such as those named above, and thus a generic solution for schema matching that can apply to many different data models and application domains remains to be provided. Moreover, so wide a variety of tools would benefit from a matching solution that an independent match component or module, one that can be incorporated into or downloaded for such tools, would be of great utility.
For a more detailed definition, a schema consists of a set of related elements, such as tables, columns, classes, XML elements or attributes, etc. The result of the match operation is a mapping between elements of two schemas. Thus, a mapping consists of a set of mapping elements, each of which indicates that certain elements of schema S1 are related to certain elements of schema S2. For example, as illustrated in FIG. 1, a mapping between purchase order schemas PO and POrder may include a mapping element that relates element Lines.Item.Line of S1 to element Items.Item.ItemNumber of S2, as shown by the dotted line. While a mapping element may have an associated expression that specifies its semantics, mappings are treated herein as nondirectional.
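By way of illustration only, the definition above may be sketched as a small data structure. The class name MappingElement and the helper functions make_element and related_in_s2 are hypothetical names introduced for this sketch and do not appear in the disclosure; the element paths are those of the FIG. 1 example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class MappingElement:
    """Hypothetical record: certain elements of S1 are related to certain elements of S2."""
    s1_elements: FrozenSet[str]
    s2_elements: FrozenSet[str]


def make_element(s1: Iterable[str], s2: Iterable[str]) -> MappingElement:
    """Build a mapping element from two collections of element paths."""
    return MappingElement(frozenset(s1), frozenset(s2))


def related_in_s2(mapping: Set[MappingElement], s1_path: str) -> Set[str]:
    """Return every S2 element that the mapping relates to the given S1 element."""
    related: Set[str] = set()
    for element in mapping:
        if s1_path in element.s1_elements:
            related |= element.s2_elements
    return related


# A mapping is a set of mapping elements; this one holds the single
# element described for the PO and POrder schemas.
mapping = {make_element(["Lines.Item.Line"], ["Items.Item.ItemNumber"])}

print(related_in_s2(mapping, "Lines.Item.Line"))
```

Because mappings are treated as nondirectional, a symmetric lookup from S2 to S1 could be written the same way; an associated expression specifying semantics is omitted from this sketch.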
A model or schema is thus a complex structure that describes a design artifact. Examples of models are Structured Query Language (SQL) schemas, XML schemas, Unified Modeling Language (UML) models, interface definitions in a programming language, Web site maps, make scripts, object models, project models or any hierarchically organized data sets. Many uses of models require building mappings between models. For example, a common application is mapping one XML schema to another, to drive the translation of XML messages. Another common application is mapping a SQL schema into an XML schema to facilitate the export of SQL query results in an XML format, or to populate a SQL database with XML data based upon an XML schema. Today, a mapping is usually produced by a human designer, often using a visual modeling tool that can graphically represent the models and mappings. To reduce the effort of the human designer, it would be desirable to provide a tool that at a minimum provides an intelligent initial mapping as a starting point for the designer. Thus, it would be desirable to provide a robust algorithm that automatically creates a mapping between two given models.
Also, there is a related problem of query discovery, which operates on mapping expressions to obtain queries for actual data translation. Both types of discovery are needed. Each is a rich and complex problem that deserves independent study. Query discovery is already recognized as an independent problem, where it is usually assumed that a mapping either is given or is trivial. Herein, the problem of schema matching is analyzed.
It is recognized that the problem of schema matching is inherently subjective. Schemas may not completely capture the semantics of the data they describe, and there may be several plausible mappings between two schemas, making the concept of a single best mapping ill defined. This subjectivity makes it valuable to have user input to guide the match and to have the user validate the result. This guidance may come via an initial mapping, a dictionary or thesaurus, a library of known mappings, etc. Thus, the goal of schema matching, and one not yet adequately achieved by today's algorithms, is: given two input schemas in any data model, optional auxiliary information and an input mapping, compute a mapping between schema elements of the two input schemas that passes user validation.
The following is a taxonomy of currently known matching techniques. Schema matchers can be characterized by the following orthogonal criteria. With respect to schema-based vs. instance-based criteria, schema-based matchers consider only schema information, not instance data. Schema information includes names, descriptions, relationships, constraints, etc. Instance-based matchers either use metadata and statistics collected from data instances to annotate the schema, or directly find correlated schema elements, e.g., using machine learning.
With respect to element vs. structure granularity, an element-level matcher computes a mapping between individual schema elements, e.g., an attribute matcher. A structure-level matcher compares combinations of elements that appear together in a schema, e.g., classes or tables whose attribute sets only match approximately.
With respect to linguistic-based matching, a linguistic matcher uses names of schema elements and other textual descriptions. Name matching involves: putting the name into a canonical form by stemming and tokenization, comparing equality of names, comparing synonyms and hypernyms using generic and domain specific thesauri and matching substrings. Information retrieval (IR) techniques can be used to compare descriptions that annotate some schema elements.
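The name matching steps listed above may be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of any system described herein: the THESAURUS table is a hypothetical stand-in for a generic or domain specific thesaurus, the stemmer merely strips a trailing plural "s", hypernyms are not handled, and the scoring rule (full credit for equal tokens, half credit for a substring match) is an arbitrary choice made for the example.

```python
import re

# Hypothetical abbreviation/synonym table standing in for a real thesaurus.
THESAURUS = {"qty": "quantity", "num": "number", "no": "number"}


def tokenize(name):
    """Split a name on delimiters and camelCase boundaries; lowercase the tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def stem(token):
    """Very naive stemmer (an assumption): strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def canonical(name):
    """Canonical form: tokenize, expand via the thesaurus, then stem."""
    return [stem(THESAURUS.get(t, t)) for t in tokenize(name)]


def token_match(t, u):
    """1.0 for equal tokens, 0.5 for a substring match, otherwise 0.0."""
    if t == u:
        return 1.0
    if len(t) >= 3 and len(u) >= 3 and (t in u or u in t):
        return 0.5
    return 0.0


def name_similarity(a, b):
    """Average best-token-match score over both names, in [0, 1]."""
    ta, tb = canonical(a), canonical(b)
    if not ta or not tb:
        return 0.0
    best_a = sum(max(token_match(t, u) for u in tb) for t in ta)
    best_b = sum(max(token_match(t, u) for t in ta) for u in tb)
    return (best_a + best_b) / (len(ta) + len(tb))


print(canonical("ItemNumber"), canonical("item_no"))
print(name_similarity("ItemNumber", "item_no"))
```

In this sketch "ItemNumber" and "item_no" reduce to the same canonical tokens, so they score as fully similar, whereas unrelated names such as "Line" and "City" score zero.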
With respect to constraint-based matching, a constraint-based matcher uses schema constraints, such as data types and value ranges, uniqueness, requiredness, cardinalities, etc. A constraint-based matcher might also use intraschema relationships, such as referential integrity.
With respect to matching cardinality, schema matchers differ in the cardinality of the mappings they compute. Some only produce one-to-one mappings between schema elements. Others produce n-to-one mappings, e.g., matchings that map the combination of DailyWages and WorkingDays in the source schema to MonthlyPay in the target.
With respect to auxiliary information, schema matchers differ in their use of auxiliary information sources such as dictionaries, thesauri and input match or mismatch information. Reusing past match information can also help, for example, to compute a mapping that is the composition of mappings that were performed earlier.
With respect to individual vs. combinational matching, an individual matcher uses a single algorithm to perform the match. Combinational matchers can be one of two types: hybrid matchers and composite matchers. Hybrid matchers use multiple criteria to perform the matching. Composite matchers run independent match algorithms on the two schemas and combine the results.
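The distinction between hybrid and composite matchers may be illustrated with the following sketch. All function names, the dictionary representation of a schema element, and the weighted-average combination rule are assumptions made for this example; other combination rules are equally possible.

```python
def name_matcher(a, b):
    """Independent matcher 1: case-insensitive name equality."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def type_matcher(a, b):
    """Independent matcher 2: data type equality."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_match(matchers, weights, a, b):
    """Composite: run each independent matcher, then combine the results.

    A weighted average is used here purely as an illustrative combination rule.
    """
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total


def hybrid_match(a, b):
    """Hybrid: a single algorithm that applies several criteria together.

    Here the type is consulted only once the names already agree.
    """
    if a["name"].lower() != b["name"].lower():
        return 0.0
    return 1.0 if a["type"] == b["type"] else 0.5


src = {"name": "Qty", "type": "int"}
dst = {"name": "qty", "type": "decimal"}

print(composite_match([name_matcher, type_matcher], [3, 1], src, dst))
print(hybrid_match(src, dst))
```

In the composite case the two matchers know nothing of each other and only their scores are merged, whereas in the hybrid case the criteria are interleaved inside one algorithm.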
In light of the above taxonomy, there are a number of known matching algorithms. The SEMINT system is an instance-based matcher that associates attributes in the two schemas with match signatures. The SEMINT system includes 15 constraint-based and 5 content-based criteria derived from instance values and normalized to the [0,1] interval, so that each attribute is a point in 20-dimensional space. Attributes of one schema are clustered with respect to their Euclidean distance. A neural network is trained on the cluster centers and then is used to obtain the most relevant cluster for each attribute of the second schema. SEMINT is a hybrid element-level matcher, but does not utilize schema structure, as the latter cannot be mapped into a numerical value.
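The signature-based comparison described above may be sketched as follows. This is a simplified illustration and not the SEMINT implementation: three-dimensional signatures are used in place of twenty criteria, the cluster centers and their labels are invented for the example, and a plain nearest-center lookup is substituted for the trained neural network.

```python
import math


def normalize(value, lo, hi):
    """Scale a raw criterion value into the [0, 1] interval."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def euclidean(p, q):
    """Euclidean distance between two signatures of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_center(signature, centers):
    """Return the label of the cluster center closest to the signature.

    Stand-in for the neural network, which is not reproduced here.
    """
    return min(centers, key=lambda label: euclidean(signature, centers[label]))


# Hypothetical cluster centers obtained from the attributes of the first schema.
centers = {
    "identifier-like": (1.0, 0.0, 0.1),
    "free-text": (0.0, 1.0, 0.9),
}

# Hypothetical signature of one attribute of the second schema.
signature = (0.9, 0.1, normalize(2, 0, 10))

print(nearest_center(signature, centers))
```

Each attribute of the second schema is thereby assigned to the cluster of first-schema attributes whose center lies nearest in the normalized space.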
The DELTA system groups all available metadata about an attribute into a text string and then applies IR techniques to perform matching. Like SEMINT, the DELTA system does not make much use of schema structure.
The LSD system uses a multilevel learning scheme to perform one-to-one matching of XML Document Type Definition (DTD) tags. A number of base learners that use different instance-level matching schemes are trained to assign tags of a mediated schema to data instances of a source schema. A metalearner combines the predictions of the base learners. LSD is thus a multistrategy instance-based matcher.
The SKAT prototype implements schema-based matching following a rule-based approach. Rules are formulated in first order logic to express match and mismatch relationships and methods are defined to derive new matches. The SKAT prototype supports name matching and simple structural matches based on isA hierarchies.
The TranScm prototype uses schema matching to drive data translation. The schema is translated to an internal graph representation. Multiple handcrafted matching rules are applied in order at each node. The matching is done top down, with the rules at higher level nodes typically requiring the matching of descendants. This top down approach performs well only when the top level structures of the two schemas are quite similar. The TranScm prototype represents an element-level, schema-based matcher.
The DIKE system integrates multiple Entity Relationship (ER) schemas by exploiting the principle that the similarity of schema elements depends on the similarity of elements in their vicinity. The relevance of elements is inversely proportional to their distance from the elements being compared, so nearby elements influence a match more than ones farther away. Linguistic matching is based on manual inputs. DIKE is a hybrid schema-based matcher utilizing both element and structure-level information.
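The vicinity principle described above may be sketched as follows. This is not the DIKE algorithm itself; it is a minimal illustration that assumes a schema is given as an adjacency list, that distance is measured in hops, that relevance is taken as 1/distance, and that neighboring elements correspond only when their names are identical. All function names are hypothetical.

```python
from collections import deque


def distances_from(graph, start):
    """Breadth-first search returning the hop count from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def vicinity_weights(graph, element):
    """Relevance of each other element: inversely proportional to its distance."""
    return {n: 1.0 / d for n, d in distances_from(graph, element).items() if d > 0}


def vicinity_similarity(g1, e1, g2, e2):
    """Share of distance-weighted vicinity that the two elements have in common."""
    w1, w2 = vicinity_weights(g1, e1), vicinity_weights(g2, e2)
    denom = max(sum(w1.values()), sum(w2.values()))
    if denom == 0:
        return 0.0
    shared = set(w1) & set(w2)  # simplification: neighbors match by identical name
    return sum(min(w1[n], w2[n]) for n in shared) / denom


g1 = {"Order": ["Customer", "Item"], "Customer": ["Order", "Address"],
      "Item": ["Order"], "Address": ["Customer"]}
g2 = {"Order": ["Customer"], "Customer": ["Order", "Address"],
      "Address": ["Customer"]}

print(vicinity_weights(g1, "Order"))
print(vicinity_similarity(g1, "Order", g2, "Order"))
```

An element two hops away contributes half the weight of an immediate neighbor, so nearby elements dominate the score.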
ARTEMIS, the schema integration component of the MOMIS mediator system, matches classes based on their name affinity and structure affinity. MOMIS has a description logic engine to exploit constraints. The classes of the input schemas are clustered to obtain global classes for the mediated schema. Linguistic matching is based on manual inputs using an interface with WordNet. ARTEMIS is a hybrid schema-based matcher utilizing both element and structure-level information.
However, none of the above solutions provides an adequate solution to the generic problem of matching schemas. While some of the above solutions may be adequate for a given matching task, owing to a design for that particular task, such a solution is not a general all purpose approach to model matching. Others were not designed for matching per se, but rather were designed for some other purpose such as schema integration, and thus the techniques applied to matching for these solutions make compromises that do not generalize adequately. Still other existing algorithms are too slow on today's hardware for interactive use, as a result of exhaustive calculations and the like.
There is thus a need for a mechanism or component that provides a complete general purpose schema matching solution. There is further a need for a general solution that considers all of the issues surrounding the above-described taxonomy, and includes a plurality of optimally combined algorithms. There is further a need for a method that automatically generates similarity coefficients for use in mapping two models. There is still further a need for a solution that is as consistent as possible with a given set of similarity relationships between elements of the two models. There are additional needs to be as consistent as possible with key and foreign key definitions in the two models, to relate objects of similar structure, to relate objects that have similar subtree structure and to relate objects that have similar leaf sets. There is also a need for an algorithm that achieves the above needs, but that is also fast enough to be used in real-time, e.g., by an interactive design tool.
In view of the foregoing, the present invention provides systems and methods for automatically and generically matching models, such as may be provided in a matching application or matching component, or provided in a general purpose system for managing models. The methods are generic since the methods apply to hierarchical data sets outside of any particular data model or application. Similarity coefficients are calculated for, and mappings can be discovered between, schema elements based on their names, data types, constraints, and schema structure, using a broad set of techniques. Some of these techniques include the integrated use of linguistic and structural matching, context dependent matching of shared types, and a bias toward subtree structure where much of the schema content resides.
Other features and embodiments of the present invention are described below.