1. Field of the Invention
The embodiments of the invention generally relate to the data integration problem associated with mappings, and more particularly to the problem of efficiently answering queries posed against a target schema, given a set of mappings between the data source schema(s) and the target schema.
2. Description of the Related Art
The data inter-operability problem arises from the fact that data, even within a single domain of application, is available at many different sites, in many different schemas, and even in different data models (e.g., relational and Extensible Markup Language (XML)). The integration and transformation of such data has become increasingly important for many modern applications that need to support their users with informed decision making. As a rough classification, there are two basic forms of data inter-operability: data exchange and data integration. Data exchange (also known as data translation) is the problem of moving and restructuring data from one (or more) source schema(s) into a target schema. A relational schema is a description of a database structure, expressed in a textual database language, that defines the tables, the fields in each table, and the relationships between the fields and tables. An XML schema is a description of the structure of an XML document, whereby an XML schema defines the XML elements that can appear in an XML document, the sub-elements and/or the attributes of each XML element, and the relationships between the XML elements. An XML schema is expressed in XML Schema Definition (XSD), a standardized language for defining the structure, content, and semantics of XML documents. Data exchange appears in many tasks that require data to be transferred between independent applications that do not necessarily agree on a common data format. In contrast, data integration is the problem of uniformly querying many different sources through one common interface (target schema). There is no need to materialize a target instance in this case. Instead, the emphasis is on answering queries over the common schema. In both data exchange and data integration, relationships or mappings must first be established between the source schemas and the target schema.
Mappings are often specified as high-level, declarative assertions that state how groups of related elements in a source schema correspond to groups of related elements in the target schema. Mappings can be given by a human user or they can be derived semi-automatically based on the outcome of schema matching algorithms. Mappings have been used for query rewriting in relational data integration systems, in the form of GAV (global-as-view), LAV (local-as-view) or, more generally, GLAV (global-and-local-as-view) assertions. They have also been used to formally specify relational data exchange systems. A more general form of GLAV that accounts for XML-like structures has been used to give semantics for mappings between XML schemas and to generate the data transformation scripts (in Structured Query Language (SQL), XML Query Language (XQuery), or Extensible Stylesheet Language Transformations (XSLT)) that implement the desired data exchange.
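For orientation, the three kinds of assertions named above are commonly written as implications between a source-side formula and a target-side formula. The notation below is an illustrative assumption and does not come from the text above: $S$ and $T$ stand for a single source and a single target relation, $\varphi_S$ for a conjunction of atoms over the source schema, and $\psi_T$ for a conjunction of atoms over the target schema, all under the usual "sound" (open-world) reading.

```latex
\begin{align*}
\text{GAV:}\quad  & \forall \bar{x}\,\bar{z}\,\bigl(\varphi_S(\bar{x},\bar{z}) \rightarrow T(\bar{x})\bigr) \\
\text{LAV:}\quad  & \forall \bar{x}\,\bigl(S(\bar{x}) \rightarrow \exists \bar{y}\,\psi_T(\bar{x},\bar{y})\bigr) \\
\text{GLAV:}\quad & \forall \bar{x}\,\bar{z}\,\bigl(\varphi_S(\bar{x},\bar{z}) \rightarrow \exists \bar{y}\,\psi_T(\bar{x},\bar{y})\bigr)
\end{align*}
```

In this reading, GAV restricts the right-hand side to a single target atom, LAV restricts the left-hand side to a single source atom, and GLAV allows a conjunction on both sides, which is why it subsumes the other two.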
There has been considerable work on XML and semistructured query rewriting, with a focus on query optimization by rewriting semistructured queries in terms of materialized views. Some of the conventional solutions address the problem of publishing SQL data in XML by rewriting XML queries into SQL queries. The conventional techniques also provide solutions for XML-to-SQL translation. In most of the above cases, the source (materialized views, relational store, etc.) to target (XML logical schema, XML view, etc.) mapping is lossless; i.e., it consists of statements (whether explicit or implicit) each asserting that some portion of the XML data is equal to some portion of the relational (store) data. Hence, query rewriting is equivalence preserving. In contrast, most real-life mappings in data integration are lossy and generally offer an incomplete and partial view of the data sources. The conventional techniques for XML query rewriting generally fail to work in the presence of such lossy mappings. Additionally, because they assume that the mappings are lossless, the conventional techniques can generally retrieve only a limited subset of the possible answers.
When a user wants to query multiple heterogeneous data sources, the user typically formulates a query in terms of a user or target interface (or schema). However, the data resides in different formats or schemas, namely the source schemas. Typically, the relationships between the source schemas and the target schema are given in the form of mappings between the sources and the target. In order to answer queries over the target, the query processing system generally has to translate the query from the target schema into queries that use the source schemas. The latter queries can then be evaluated on the data sources to retrieve the answers.
A known solution to this problem is the federated relational database approach, in which the mapping between the target schema and the source schemas is specified by letting the target be a view over the source databases. The query processing subsystem then applies query composition techniques to compose the user query with the view to obtain queries formulated in terms of the sources. The main drawback of this approach is that the known techniques are typically confined to the relational model, wherein the sources and the target must be relational, and the queries must be SQL. Other existing solutions generally cannot provide XML views over data sources. Moreover, an important problem in accessing heterogeneous data sources with overlapping information is data merging.
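The view-based approach described above can be sketched with an in-memory SQLite database. This is a minimal illustration only, not the mechanism of any particular federated system: the table names `src_employees` and `src_departments`, the view name `target_staff`, and the sample rows are all hypothetical, and a single SQLite database stands in for what would normally be separate source databases.

```python
import sqlite3


def build_federated_demo():
    """Create two hypothetical source tables and a target view over them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_employees (emp_id INTEGER, name TEXT, dept_id INTEGER);
        CREATE TABLE src_departments (dept_id INTEGER, dept_name TEXT);
        INSERT INTO src_employees VALUES (1, 'Ada', 10), (2, 'Grace', 20);
        INSERT INTO src_departments VALUES (10, 'Research'), (20, 'Systems');

        -- The target schema is specified as a view over the source tables.
        CREATE VIEW target_staff AS
            SELECT e.name AS name, d.dept_name AS department
            FROM src_employees AS e
            JOIN src_departments AS d ON e.dept_id = d.dept_id;
        """
    )
    return conn


def query_target(conn, department):
    """Pose a user query against the target view only.

    The database engine composes this query with the view definition,
    so it is effectively evaluated over the source tables.
    """
    rows = conn.execute(
        "SELECT name FROM target_staff WHERE department = ? ORDER BY name",
        (department,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = build_federated_demo()
    print(query_target(connection, "Research"))
```

The user query never mentions the source tables; the join needed to reach them lives entirely in the view definition. This also shows the limitation noted above: both the view and the query are relational and written in SQL.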
Existing solutions generally do not provide any support for automatic data merging. Under existing approaches, users have to explicitly create views that “know” how to merge the data by joining the data sources. Such an approach does not work well in dynamic environments where new data sources may often appear, since in such a case the views have to be rewritten by a human user in order to account for the new data sources. These views are often complex, and the effort required to design them is considerable. Therefore, due to the drawbacks of the conventional approaches, there remains a need for a novel XML query technique used for data integration.