1. Field of the Invention
The present invention generally relates to data processing and more particularly to issuing a query against at least one database that may not contain all the data entities involved in the query.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system (RDBMS) is a database management system (DBMS) that uses relational techniques for storing and retrieving data. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented programming database is one that is congruent with the data defined in object classes and subclasses.
Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application or the operating system) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL). Illustratively, SQL is used to make interactive queries for getting information from and updating a database such as International Business Machines' (IBM) DB2, Microsoft's SQL Server, and database products from Oracle, Sybase, and Computer Associates. The term “query” denominates a set of commands for retrieving data from a stored database. Queries take the form of a command language that lets programmers and programs select, insert, and update data, among other operations.
With advances in information technology (IT), the data accessible by queries is becoming more distributed and diversified (i.e., located in more than one database). For example, a patient's records (diagnosis, treatment, etc.) may be stored in one database, while clinical trial information relating to a drug used to treat the patient may be stored in another database. Accordingly, it is becoming increasingly necessary to access data from multiple databases and integrate the information retrieved into a representation which meets the needs of the application and end users of the application.
Unfortunately, in many circumstances, the structure and content of the different databases being accessed may not be consistent. For example, there may be overlap in the information found across multiple databases (e.g., the same fields occur in more than one database). More troubling, however, is the situation where information is available in one data source, but not another. This complicates the task of accessing and correlating information across these databases since queries must be constructed that contend with this missing data (or “incomplete schema”) problem.
One approach to this incomplete schema problem is to gather the data that is available from each of the databases and subsequently group the data together as desired, filling in the missing data from one database with data from another database, where possible. However, query languages like SQL are very rigid in their requirement that the schema (e.g., organization of data into fields or “entities” within the database) of the underlying database match all entities referenced by the query. Therefore, this approach conventionally requires a significant level of understanding of the underlying schema for each database and requires unique query statements to be written for each database to be queried.
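By way of illustration only, the rigidity described above may be sketched as follows. The sketch assumes two small in-memory SQLite databases standing in for the databases being queried; the table name patients, its columns, and the helper run_query are hypothetical and do not come from any particular system. The same query statement succeeds against the database whose schema contains every referenced entity and is rejected by the database that lacks one, so a separate statement must be written for the second database.

```python
import sqlite3

# Hypothetical database A: contains every entity the query references.
db_a = sqlite3.connect(":memory:")
db_a.execute("CREATE TABLE patients (patient_id INTEGER, name TEXT, diagnosis TEXT)")
db_a.execute("INSERT INTO patients VALUES (1, 'Ann', 'asthma')")

# Hypothetical database B: same table, but no "diagnosis" column.
db_b = sqlite3.connect(":memory:")
db_b.execute("CREATE TABLE patients (patient_id INTEGER, name TEXT)")
db_b.execute("INSERT INTO patients VALUES (2, 'Bob')")

QUERY = "SELECT patient_id, name, diagnosis FROM patients"


def run_query(conn, sql):
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.OperationalError as exc:
        return None, str(exc)


# The identical statement works on A but is rejected outright by B.
rows_a, err_a = run_query(db_a, QUERY)
rows_b, err_b = run_query(db_b, QUERY)

# Conventional workaround: a second statement written specifically for B's
# schema, with a NULL placeholder for the missing entity.
QUERY_FOR_B = "SELECT patient_id, name, NULL AS diagnosis FROM patients"
rows_b_fixed, err_b_fixed = run_query(db_b, QUERY_FOR_B)

# Group the results, leaving the missing value unfilled.
combined = rows_a + rows_b_fixed
```

The author of QUERY_FOR_B must know in advance that database B lacks the diagnosis column, which is the level of schema understanding that this approach conventionally requires.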
Another approach to this incomplete schema problem involves “joining” multiple databases together based on common keys that associate an item in one database with an item in another. However, this approach adds complexity to the query, which must include not only predicates for data selection but also predicates and join logic for combining data from the multiple databases. This approach is also inefficient if the databases involved are large and distributed, as the processing cost of joining information from large databases may be prohibitive.
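The added query complexity of the joining approach may likewise be sketched, again purely as an assumption-laden illustration. Here SQLite's ATTACH DATABASE statement stands in for whatever federation mechanism a given system provides, and the table names treatments and drug_trials, the key drug_id, and the sample rows are all hypothetical. The query must name both databases and carry a join condition on the common key in addition to its data-selection predicate.

```python
import sqlite3

# One connection with a second, separate in-memory database attached as "trials".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS trials")

# Hypothetical patient-treatment data in the main database.
conn.execute(
    "CREATE TABLE main.treatments (patient_id INTEGER, drug_id INTEGER, diagnosis TEXT)"
)
conn.executemany(
    "INSERT INTO main.treatments VALUES (?, ?, ?)",
    [(1, 10, "asthma"), (2, 20, "diabetes")],
)

# Hypothetical clinical-trial data in the attached database.
conn.execute("CREATE TABLE trials.drug_trials (drug_id INTEGER, trial_name TEXT)")
conn.executemany(
    "INSERT INTO trials.drug_trials VALUES (?, ?)",
    [(10, "Trial A"), (30, "Trial C")],
)

# The ON clause is the join logic on the common key (drug_id);
# the WHERE clause is the ordinary data-selection predicate.
JOIN_QUERY = """
SELECT t.patient_id, t.diagnosis, d.trial_name
FROM main.treatments AS t
JOIN trials.drug_trials AS d
  ON d.drug_id = t.drug_id
WHERE t.diagnosis = ?
"""

joined = conn.execute(JOIN_QUERY, ("asthma",)).fetchall()

# A patient whose drug has no matching trial record drops out of the join.
unmatched = conn.execute(JOIN_QUERY, ("diabetes",)).fetchall()
```

In this sketch both databases are tiny and local; with large, distributed databases the same join would require moving and matching far more data, which is the processing cost noted above.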
Accordingly, there is a need for an improved method for issuing queries against multiple databases, particularly when the queries involve data missing from one or more of the databases.