1. Field of the Invention
The present invention relates to computer databases. More particularly, the present invention relates to techniques for creating a database abstraction model over of a set of individual, differently formatted databases used to store logically similar or related data.
2. Description of the Related Art
Databases are well known systems for information storage and retrieval. The most prevalent type of database used today is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A relational database management system (DBMS) uses relational techniques for storing and retrieving data.
A database schema describes the structure of a database. For example, a relational schema describes a set of tables, columns, and primary and foreign keys defining relationships between the tables. Applications are developed that query data according to the database schema. For example, relational databases are commonly accessed using a front-end query application that is configured to perform data access routines, including searching, sorting, and query composition routines. Front-end applications query data according to the relational schema. At the back-end, software programs control data storage and respond to requests (queries) sent by users interacting with the front-end.
Every year, entities spend millions of dollars on data-modeling experts to design data warehouses. A data warehouse provides a central repository for an aggregation of data from many sources into a single, centralized database. Data warehousing, however, is often not an appropriate or even workable model. In some cases, the amount of data is too large or too rapidly changing to be aggregated into a single location. This is the case, for example, with many large bio-informatics databases. In other cases, the number of data sources can be very large, making the cost of creating a data warehouse prohibitive.
One alternative to data warehousing is database federation. Database federation allows relational tables on remote systems to appear as tables on a local system. A federated database is logical union of multiple underlying databases running on independent servers that do not share common resources and are connected across a network. Applications may then be configured to access data from the remote systems without the need to aggregate the data in a data warehouse.
Commonly assigned U.S. patent application Ser. No. 10/083,075 (the '075 application), filed Feb. 26, 2002, entitled “Application Portability and Extensibility through Database Schema and Query Abstraction”, discloses a framework that provides an abstract view of an underlying database system and schema. The framework of the '075 application provides a requesting entity (i.e., an end-user or front-end application) with an abstract representation of data stored in an underlying physical storage mechanism, such as a relational database. In this way, the requesting entity is decoupled from the underlying physical data when accessing the underlying DBMS. Abstract queries based on the framework can be constructed without regard for the makeup of the physical data. Further, changes to the physical data schema do not also require a corresponding change in the front-end query application; rather, the abstraction provided by the framework can be modified to reflect the changes. Commonly assigned, U.S. patent application filed herewith entitled “Abstract Query Plan” discloses techniques for processing an abstract query that include generating an intermediate representation of an abstract query then used to generate a resolved query.
Using the database abstraction model, logical fields are defined that describe the data stored in an underlying database, including a federated environment where data is stored in multiple database systems. Users compose abstract queries by selecting logical fields and specifying conditions. The system then retrieves data by generating a resolved query of the underlying databases from the abstract query. Each logical field includes an access method that specifies how to retrieve data from an underlying database system corresponding to the logical field. If there is more than one source for data corresponding to the logical field, then multiple access methods may be defined for a single logical field, each one mapping it to a different database.
Simply having more than one access method per logical field and merging query results retrieved for each one, however, is inadequate to always retrieve the correct query results. That is, this approach does not necessarily ensure that the desired information is retrieved. Moreover, performing a query multiple times (once per data source), is inefficient, and also fails to account for the relationships data has with other tables in the underlying database. Relationships between tables, however, are fundamental to the organization of data in any relational database. Thus, if query results depend on the relationships about data spread across multiple data sources, iteratively performing a query for differently formatted databases will fail to retrieve the correct results. For example, a first logical field may map to demographic data stored in a single table of a database. Data from one column of the table is related to data in other columns on a row-by-row basis. A second logical field might map to logically similar demographic data stored in another database using two tables, and data is related between the two tables using a defined relationship, such as a patient ID stored in both tables. Without an indication of this relationship, linking data between the first database (using a single table) and the second database (using multiple tables) becomes impossible.
These kinds of differences in the format and schema of multiple database systems present challenges to creating a database abstraction model. Consider a research institution with multiple locations. Each location may maintain its own database system used to store logically similar information. Thus, each location may maintain a database that includes a demographic table (storing information about patients), a test table (storing results from tests), and so forth. Further, the database schema used at each location may use differently formatted tables. For example, a first location may use a single table to store patient demographic data, where a second location may subdivide patient demographic data over multiple tables. In such a scenario, the same patient may have data stored in databases at the multiple locations of the research institution. A logical field may be defined that retrieves data from each respective demographic table using multiple access methods. An abstract query that includes such a logical field may, therefore, map to multiple tables that have desired data. Because the multiple tables are organized differently, a simple union of data from each table may fail to produce desired query results.
Accordingly, there remains a need for techniques to construct a database abstraction model over multiple databases that store logically related or similar information using differently formatted physical tables.