The present invention relates to database applications, and in particular, to schema matching.
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
A recurring manual task in data integration, ontology alignment, or model management is finding mappings between complex meta data structures. To reduce the manual effort, many matching algorithms for semi-automatically computing mappings have been introduced.
Unfortunately, current matching systems perform poorly when matching large schemas. Recently, some systems have tried to tackle the performance problem within individual matching approaches. However, none of them has developed solutions at the level of matching processes.
Finding mappings between complex meta data structures as required in data integration, ontology evolution or model management is a time-consuming and error-prone task. This task may be called schema matching, but it can also be labeled ontology alignment or model matching. Schema matching is non-trivial for several reasons. Schemas are often very large and complex. They contain cryptic element names, their schema documentation is mostly scarce or missing, and the original authors are unknown. Some estimate that up to 40% of the work in enterprise IT departments is spent on data mapping.
For that reason, many algorithms (so-called matchers) were developed that try to partly automate the schema matching task. They compute correspondences between elements of schemas based on syntactical, linguistic, and structural schema and instance information, and provide the user with the most likely mapping candidates. Many systems combine the results of a number of these matchers. The idea is to combine the complementary strengths of different matchers for different sorts of schemas. This balances the weaknesses of individual matchers, so that better mapping results can be achieved.
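The combination of matcher results described above can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed here. The two matchers, the averaging strategy, the threshold value, and all function names are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def name_matcher(a: str, b: str) -> float:
    """Hypothetical string-based matcher: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_matcher(a: str, b: str) -> float:
    """Hypothetical token-based matcher: Jaccard overlap of underscore-separated tokens."""
    tokens_a = set(a.lower().split("_"))
    tokens_b = set(b.lower().split("_"))
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combine(source, target, matchers, threshold=0.5):
    """Average the scores of all matchers and keep candidates at or above the threshold.

    Averaging is only one possible aggregation strategy (an assumption here).
    Returns a dict: source element -> list of (target element, score), best first.
    """
    candidates = {}
    for s in source:
        scored = []
        for t in target:
            score = sum(m(s, t) for m in matchers) / len(matchers)
            if score >= threshold:
                scored.append((t, score))
        candidates[s] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return candidates


if __name__ == "__main__":
    result = combine(
        ["cust_name"],
        ["customer_name", "order_id"],
        [name_matcher, token_matcher],
    )
    for element, suggestions in result.items():
        print(element, suggestions)
```

In this sketch, a matcher that scores a pair poorly can be offset by another matcher that scores it well, which is the balancing effect described above. The cost is that every additional matcher adds another full pass over all element pairs.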
A wealth of schema matching techniques can be found in the literature. See, e.g., E. Rahm and P. A. Bernstein, A survey of approaches to automatic schema matching, in The VLDB Journal, 10 (2001); and P. Shvaiko and J. Euzenat, A Survey of Schema-Based Matching Approaches, in Journal on Data Semantics IV (2005). Some techniques primarily rely on available schema information, whereas others rely on instance data and additional sources like thesauri or dictionaries. The way in which the input information is processed strongly influences the performance properties of a matching algorithm. Element-level techniques, such as string-based edit distance, n-gram, and soundex code, consider schema elements only in isolation. These techniques are simpler than structure-based approaches and can thus be executed faster.
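To make the notion of an element-level technique concrete, the following Python sketch shows two such matchers: one based on edit distance and one based on n-grams. The normalization of the edit distance to a similarity value, the use of trigrams, and the Dice coefficient are assumptions made for this illustration and are not prescribed by the cited surveys.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical names."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def ngrams(name: str, n: int = 3) -> set:
    """Set of character n-grams of a lower-cased element name."""
    name = name.lower()
    if len(name) < n:
        return {name} if name else set()
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the n-gram sets of two element names."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

Both functions look only at the two element names passed in and at no surrounding schema structure, which is why such matchers are comparatively cheap per element pair.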
Examples of schema matching techniques include the ASMOV-System (Automated Semantic Mapping of Ontologies with Validation), the RiMOM-System (Risk Minimization based Ontology Mapping), the SimFlooding approach, Cupid (a schema matching tool from Microsoft Research Labs), the B-Match-Approach, Apfel, eTuner, and the MatchPlanner-System.
All currently promoted matching systems, as understood, use a combination of different matching techniques to improve the quality of matching results. The topology of the matching system has a major impact on the performance and the quality of a matching task.
In a recent product release, SAP introduced a new business process modeling tool integrating automatic schema matching for the task of mapping large service interfaces. Computed correspondences are used as a recommendation and starting point for a manual mapping of service interfaces. Suggestions therefore need to be of good quality in order to avoid extra work for correcting wrongly identified correspondences. At the same time, the computation of mapping suggestions must be fast so that the user is not interrupted in the modeling process. After having spent too much time waiting, some users will not use the automatic matching recommendations again. Unfortunately, current state-of-the-art matching systems perform poorly when matching large schemas. For that reason, only a small set of matchers is currently used, which restricts the achievable result quality.
The reasons for these performance problems are theorized as follows. Schema matching is a combinatorial problem with at least quadratic complexity with respect to schema size. Even naive algorithms can be highly inefficient on large schemas. Matching two schemas of average size N using k match algorithms results in a runtime complexity of O(kN²). Thus, schema matching complexity can easily explode if multiple matchers are applied to larger schemas. Even the most effective schema matching tools in the recent OAEI (Ontology Alignment Evaluation Initiative) Ontology Alignment Contest (OAEI2008) suffered from performance issues.
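The quadratic growth can be made tangible with a small Python sketch that counts the pairwise comparisons performed when every matcher is applied to every pair of elements. The function name and the placeholder matcher are hypothetical and serve only to illustrate the count of k times N times N.

```python
def count_comparisons(source, target, matchers) -> int:
    """Apply every matcher to every (source, target) pair and count the calls."""
    comparisons = 0
    for matcher in matchers:
        for s in source:
            for t in target:
                matcher(s, t)  # result ignored; only the amount of work is of interest
                comparisons += 1
    return comparisons


def exact_match(a: str, b: str) -> float:
    """Placeholder matcher (an assumption): 1.0 for identical names, else 0.0."""
    return 1.0 if a == b else 0.0


if __name__ == "__main__":
    source = [f"s{i}" for i in range(100)]
    target = [f"t{i}" for i in range(100)]
    # k = 3 matchers on two schemas of N = 100 elements each: 3 * 100 * 100 comparisons.
    print(count_comparisons(source, target, [exact_match] * 3))
```

Under the same assumptions, two schemas of 1,000 elements each and five matchers already require 5,000,000 pairwise comparisons, before any structure-based processing is taken into account.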
As discussed in more detail below, only a few systems have addressed the performance problem for schema matching. Unfortunately, most of the proposed techniques are built for individual matchers or are hard wired within specific matching processes.