Information integration applications take data that is stored in two or more data sources and build from those sources one large database, possibly a virtual database, containing information from all of the sources, so that the data can be queried as a unit. For example, enterprise accounting data may be stored within a relational database, and enterprise inventory data may be stored within XML documents. Information integration enables an enterprise to access its various data sources from within a single data store application.
Information integration is discussed in Chap. 20 of Garcia-Molina, H., Ullman, J. D. and Widom, J., “Database Systems: The Complete Book”, Prentice-Hall, New Jersey, 2002. As pointed out in Sec. 20.1 of this reference, there are three basic modes of information integration: (i) federated databases, (ii) data warehousing, and (iii) mediation.
In a federated database architecture, data sources are independent, but one source can call on others to supply information.
In a data warehousing architecture, data from several sources is extracted and combined into a global schema. The data is then stored at the warehouse, which appears to the user like an ordinary database. Once data is in the warehouse, queries are issued by a user exactly as they would be issued to any database. However, user updates to the warehouse are generally forbidden, since they are not reflected in the underlying sources, and thus can make the warehouse inconsistent with the sources.
A data warehouse is updated periodically, by reconstructing it from current data in the data sources. Typically, a data warehouse is updated once a night, when the system can be shut down, so that queries are not issued while the warehouse is being constructed. Alternatively, the data warehouse may be incrementally updated based on changes that have been made to the data sources since the last time the warehouse was modified.
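By way of illustration only, the incremental alternative may be sketched as follows. The sketch assumes, hypothetically, that each source row carries a numeric modified_at value and that both stores expose an accounts table; these names, and the use of Python's built-in sqlite3 module with in-memory databases, are assumptions made for the example and are not taken from any particular warehouse product. Deletions in the source are not propagated by this sketch.

```python
import sqlite3

def incremental_refresh(source, warehouse, last_sync):
    """Copy rows changed since last_sync from the source into the warehouse.

    Assumes (hypothetically) a numeric 'modified_at' column on every row.
    Returns the new high-water mark to use for the next refresh.
    """
    rows = source.execute(
        "SELECT id, amount, modified_at FROM accounts WHERE modified_at > ?",
        (last_sync,),
    ).fetchall()
    # INSERT OR REPLACE acts as an upsert keyed on the primary key.
    warehouse.executemany(
        "INSERT OR REPLACE INTO accounts (id, amount, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    warehouse.commit()
    return max((r[2] for r in rows), default=last_sync)

# Two in-memory databases stand in for a data source and the warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (source, warehouse):
    db.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, amount REAL, modified_at INTEGER)"
    )

source.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)", [(1, 100.0, 10), (2, 250.0, 20)]
)
mark = incremental_refresh(source, warehouse, last_sync=0)  # initial load

# A later change in the source; only the changed row is copied on refresh.
source.execute("UPDATE accounts SET amount = 300.0, modified_at = 30 WHERE id = 2")
mark = incremental_refresh(source, warehouse, last_sync=mark)
```

Tracking a high-water mark keeps each refresh proportional to the number of changed rows rather than to the size of the source, which is the practical motivation for the incremental approach.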
Conventional data warehouses are generally expensive and inflexible. In addition, because they are refreshed only periodically, such data warehouses generally do not provide real-time operation.
In a mediation architecture, a software component, referred to as a mediator, supports a virtual database, which a user may query as if it were physically constructed. The mediator stores no data of its own. Rather, it translates a query into one or more queries to its sources, synthesizes the answer to the query from the responses of the sources, and returns an answer to the user. A mediator supports a virtual view, or collection of views, that integrates several sources.
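The behavior of a mediator may be sketched, under simplifying assumptions, as follows. In this sketch each source is modeled as a function that returns rows for a product identifier, and the virtual view is a single merged record per product; the class name Mediator, the two sample sources, and their field names are hypothetical and serve only to illustrate the flow of translating a query, collecting responses, and synthesizing an answer.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
# A source is modeled as a function answering "rows for this product id".
Source = Callable[[str], Iterable[Row]]

class Mediator:
    """Stores no data of its own; answers queries by consulting its sources."""

    def __init__(self, sources: List[Source]):
        self.sources = sources

    def query(self, product_id: str) -> Row:
        merged: Row = {"product_id": product_id}
        for source in self.sources:
            # One sub-query per source; responses are folded into one answer.
            for row in source(product_id):
                merged.update(row)
        return merged

# Hypothetical back-end data, held by the sources and not by the mediator.
_ACCOUNTING = {"p1": {"unit_price": 9.5}}
_INVENTORY = {"p1": {"in_stock": 40}}

def accounting_source(product_id: str) -> Iterable[Row]:
    return [_ACCOUNTING[product_id]] if product_id in _ACCOUNTING else []

def inventory_source(product_id: str) -> Iterable[Row]:
    return [_INVENTORY[product_id]] if product_id in _INVENTORY else []

mediator = Mediator([accounting_source, inventory_source])
result = mediator.query("p1")
```

Because the mediator holds only references to its sources, each call to query reflects whatever the sources contain at that moment, in contrast to a warehouse that is refreshed periodically.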
An example of a mediation system is the Enterprise Information Integrator (EII) of IBM Corporation, which generates a virtual warehouse. EII supports integrated querying across multiple data sources, including IBM DB2 relational databases, Microsoft SQL Server relational databases, and XML document databases.
All three of the approaches to information integration described above use transformers, referred to as wrappers or extractors, to transform data when it is extracted from a data source. Wrappers are used to pass ad-hoc queries to data sources, receive responses from those sources, and pass the resulting information to an information integrator.
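The role of a wrapper may be illustrated by the following minimal sketch, in which two wrappers expose the same fetch method over a relational source and an XML source, respectively. The class names, the ledger table, and the XML element names are hypothetical; the sketch relies only on Python's standard sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

class RelationalWrapper:
    """Passes a query to a relational source and returns plain records."""

    def __init__(self, conn):
        self.conn = conn

    def fetch(self, item_id):
        cur = self.conn.execute(
            "SELECT item_id, cost FROM ledger WHERE item_id = ?", (item_id,)
        )
        return [{"item_id": row[0], "cost": row[1]} for row in cur]

class XmlWrapper:
    """Extracts matching elements from an XML document as plain records."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def fetch(self, item_id):
        return [
            {"item_id": el.get("id"), "quantity": int(el.findtext("qty"))}
            for el in self.root.iter("item")
            if el.get("id") == item_id
        ]

# Hypothetical sample sources.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (item_id TEXT, cost REAL)")
conn.execute("INSERT INTO ledger VALUES ('A7', 12.5)")
xml_doc = (
    "<inventory>"
    "<item id='A7'><qty>3</qty></item>"
    "<item id='B2'><qty>9</qty></item>"
    "</inventory>"
)

# The integrator sees one interface regardless of the underlying source.
wrappers = [RelationalWrapper(conn), XmlWrapper(xml_doc)]
records = [rec for w in wrappers for rec in w.fetch("A7")]
```

Since both wrappers return records in the same form, an information integrator can consume them without knowledge of whether a given record originated in a table or in a document.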
Two drawbacks of conventional information integration are the lack of uniformity in semantics and the lack of traceability back to individual data sources. Each database accessed by a warehouse generally has its own semantics, including, inter alia, names for tables and their fields, names for XML complex types and their elements, and data formats. It may happen that the same name is used in different contexts within different databases, or that multiple names are used for the same construct, perhaps with a different data format under each name. Further complications that can arise with non-uniform semantics include, inter alia, different inter-relationships between data constructs, different business rules relating the same data constructs, redundancies, and inconsistencies.
It is thus desirable to introduce a common semantic foundation for all of the data sources accessed within a data warehouse, and to provide a translation layer that enables a user to access data using queries expressed in common and meaningful semantics, and that insulates the user from the individual semantics of the individual data sources.
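One simple way to picture such a translation layer, offered only as an illustrative sketch and not as a description of any particular implementation, is a mapping from common field names to the names used by each source. The source names, table names, and field names below are hypothetical.

```python
# Hypothetical mapping from common semantics to source-specific names.
COMMON_TO_SOURCE = {
    "accounting_db": {
        "table": "CUST_MSTR",
        "fields": {"customer_id": "CID", "customer_name": "CNAME"},
    },
    "crm_db": {
        "table": "clients",
        "fields": {"customer_id": "client_no", "customer_name": "full_name"},
    },
}

def translate(common_fields, source):
    """Rewrite a request for common field names as SQL for one source.

    Each source column is aliased back to its common name, so results
    from different sources come back under the same headings.
    """
    mapping = COMMON_TO_SOURCE[source]
    columns = ", ".join(
        f'{mapping["fields"][field]} AS {field}' for field in common_fields
    )
    return f'SELECT {columns} FROM {mapping["table"]}'

crm_sql = translate(["customer_id", "customer_name"], "crm_db")
acct_sql = translate(["customer_id", "customer_name"], "accounting_db")
```

The user writes the request once, in the common vocabulary, and the mapping records which source-specific construct answers each part of it, which also provides a basis for tracing a result back to its source.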