Databases typically use permanent, high capacity storage devices to store large volumes of data. In general, the data is managed by one or more computers using a software application with a database engine, which provides an interface to facilitate database queries. A database query specifies the data that should be retrieved. Structured Query Language (SQL) is commonly used to specify database queries. The data specified by a query is delivered to a program or to a user who issued the query.
It is often useful to access data from different sources without having to aggregate its storage or its management. This may involve, for example, studying a problem by regularly updating an “operational dashboard” within a company or group of companies. This may also involve pooling different company databases, while retaining flexibility by allowing different subsidiaries to manage and update its own data. One way of doing this is to produce a “virtual database”. As used herein, the term virtual database refers to a system that facilitates direct access to data from a number of discrete databases. A virtual database comprises one or more target tables, which can be consulted in a similar way to a conventional database. In the same way as in a data table within a non-virtual database, each of these target tables has a schema of its own, i.e., a structure and an organization defining the types of data which it supplies. The records in the target tables are created or instantiated from data originating from the various sources only when requirements dictate, typically at the time of processing a query.
For a target table, instantiation is carried out by means of a query, e.g., an SQL query, the code of which itself comprises the access instructions consulting the different sources involved in order to obtain the necessary data. This query then defines a “view” which supplies the virtual database data and allows access in a similar way to a conventional database. Unfortunately, the SQL code which constitutes the query for this view may demonstrate considerable complexity, resulting in significant work, in particular for analysis, programming and checking. Moreover, if the user wishes to obtain new categories of data by defining new target tables from heterogeneous sources, the schema of the virtual database must be redefined and new queries must be programmed. In addition, the data which populates a database is frequently collected or generated during the activity of an organization. The structure and the organization of this database is in general determined in accordance with the needs or the capacities of the organization. This data is then organised to be easily accessible and usable in accordance with the criteria of the organization. Therefore, various entities often have data in their databases that is not directly compatible or coherent. Such diversity is often the origin of a heterogeneity between different sources of data accessible by a single person or entity, for example between different departments in the same organization or after a merger or a pooling of resources by several organizations. Such sources can be heterogeneous between each other from the point of view of hardware or software and from the point of view of the structure or the organization of the data. This heterogeneity can also include an absence of data in some fields or inconsistencies or contradictions between data listed as representing the same information. In order to have data available that is usable from heterogeneous sources from the hardware or software point of view, compatibility and location tools exist or can be programmed to enable communication between given types of computer or database platforms.
Additional complexity arises from the fact that each source of data often has its own schema, i.e., structure and organization of data. Therefore, it is difficult to prepare standardized interface tools. The combined use of several heterogeneous sources often requires a significant customized development task, depending on the data which the user wishes to obtain.
Therefore, the production of a virtual database, in particular from sources which are heterogeneous from the point of view of their schemas, often represents a significant and delicate task, needing a high degree of skill. In addition, in order to include all the different source schemas, a programmer may have to start with a complete, in-depth study of all these sources, which is also a long and difficult task, and represents a significant cost. For such a study, it is common for example to ask various experts for the sources in question to assemble detailed documentation covering the specifications of the sources which they are familiar with. It is a significant task to study the resultant documentation. Moreover, the subsequent programming for a custom virtual database requires significantly detailed and complex code. The test and validation operations then represent an onerous and complex task, both for the programmer and for the users of the virtual database.
For the same reasons, all subsequent operations to maintain or upgrade this code require caution and an awareness of complex dependencies. It has been proposed that part of this type of task should be automated by means of software tools using a global approach to automatically provide a relationship between the source data and a target schema. After identifying possible relationships a complete list of possible relationships is prepared. By systematic tests covering a certain number of examples, and given the results obtained by these different relationships, the user must then intuitively determine the relationships which the virtual database will produce. Unfortunately, such a technique can provide a very large number of options. Moreover, the fact of choosing from tests on sample data poses the risk that the choices made will not be sufficiently valid in a general way for all the source data, current or future.
It is also known to automatically execute a semantic analysis of the internal structure of a source schema from its XML model. The results of this analysis are then used to define an association between the source schema and the target schema. However, this technique requires possession of an XML model of the schemas and is not directly usable with a source in the form of a relational data table. In addition, it does not take into consideration any association between groups of different or heterogeneous sources.
Finally, such methods do not significantly resolve the difficult problems, which are the extent of the knowledge to be collected in order to produce such a virtual database or the volume and complexity of the code obtained.
In view of the foregoing, it would be desirable to provide an improved technique for producing virtual databases from heterogeneous data sources.