1. Field of the Invention
The present invention relates to computer databases. More particularly, the present invention relates to techniques for creating a data abstraction model over a set of individual databases that includes constraints on how logically related data sets are joined together and presented to a user.
2. Description of the Related Art
Databases are well known systems for information storage and retrieval. The most prevalent type of database used today is the relational database, i.e., a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A relational database management system (DBMS) uses relational techniques for storing and retrieving data.
A database schema describes the structure of a database. For example, a relational schema describes a set of tables, columns, and primary and foreign keys that define relationships between different tables. Applications are developed that query data according to the database schema. For example, relational databases are commonly accessed using a front-end query application that is configured to perform data access routines, including searching, sorting, and query composition routines. At the back-end, software programs control data storage and respond to requests (queries) sent by users interacting with the front-end application.
One issue faced by data mining and database query applications, however, is their close relationship with a given database schema. This relationship makes it difficult to support an application as changes are made to the corresponding underlying database schema. Further, this tightly bound relationship inhibits the migration of a query application to alternative data representations.
Commonly assigned U.S. patent application Ser. No. 10/083,075 (the '075 application), filed Feb. 26, 2002, entitled “Improved Application Flexibility Through Database Schema and Query Abstraction,” discloses a framework that provides an abstract view of a physical data storage mechanism. The framework of the '075 application provides a requesting entity (i.e., an end-user or front-end application) with an abstract representation of data stored in an underlying physical storage mechanism, such as a relational database. In this way, the requesting entity is decoupled from the underlying physical data when accessing the underlying DBMS. Abstract queries based on the framework can be constructed without regard for the makeup of the physical data. Further, changes to the physical data schema do not require a corresponding change in the front-end query application; rather, the abstraction provided by the framework can be modified to reflect the changes. Commonly assigned U.S. patent application Ser. No. 11/005,418, filed Dec. 6, 2004, entitled “Abstract Query Plan,” discloses techniques for processing an abstract query that include generating an intermediate representation of the abstract query, which is then used to generate a resolved query that is consistent with the underlying database.
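The decoupling described above can be illustrated with a minimal sketch. It is not the framework of the '075 application; the mapping structure, the `resolve` function, and every table and column name below are hypothetical and chosen only to show how a schema change can be absorbed by the abstraction rather than by the query.

```python
# Hypothetical mapping from logical field names to physical (table, column)
# locations. None of these names come from the '075 application.
FIELD_MAP = {
    "Patient Name": ("patients", "name"),
    "Test Result": ("tests", "result"),
}


def resolve(logical_fields, field_map):
    """Translate logical field names into qualified physical column names."""
    return [f"{field_map[f][0]}.{field_map[f][1]}" for f in logical_fields]


# An abstract query names only logical fields.
abstract_query = ["Patient Name", "Test Result"]
physical_columns = resolve(abstract_query, FIELD_MAP)

# If the physical schema changes (here, an assumed table and column rename),
# only the mapping is updated; the abstract query itself is unchanged.
UPDATED_MAP = dict(FIELD_MAP)
UPDATED_MAP["Test Result"] = ("lab_tests", "value")
updated_columns = resolve(abstract_query, UPDATED_MAP)
```

In this sketch the same `abstract_query` resolves against either mapping, which is the property the abstraction is meant to provide.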
Oftentimes, relationships exist between data elements that are not captured by the table structure of a relational database. For example, consider a set of tests that make up a test suite (e.g., a set of toxicity tests given to a patient brought to the emergency room). Although each test is independent of or distinct from the others, the multiple tests are related and collectively form a set. Another relationship not captured by a relational database may be independent events that together form a series. A series of events may be ordered based on the sequence of individual events included in the series. The events may be different, but may also be the same event type repeated multiple times. For example, many treatment regimens or research experiments may be conducted sequentially. In addition, researchers often wish to identify patterns present in data. For example, a researcher may wish to form a set: event “A,” event “B,” and event “C” to seek a correlation to outcome “X.” Similarly, a series (e.g., event “A,” then event “B,” and then event “C”) may be defined as a sequence of events used to identify a possible outcome.
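The difference between a set and a series can be sketched as follows. This is an illustrative assumption about how the two relationships might be checked, not a method taken from the source; the function names and the use of integer timestamps are hypothetical.

```python
def matches_set(events, required):
    """True if every required event type occurs, in any order."""
    return set(required) <= {name for name, _ in events}


def matches_series(events, required):
    """True if the required event types occur in the given order
    (possibly with other events in between) once sorted by time."""
    ordered = iter(name for name, _ in sorted(events, key=lambda e: e[1]))
    # Membership tests on an iterator consume it, so each required step
    # must be found after the previous one.
    return all(step in ordered for step in required)


# Each event is (event type, timestamp); timestamps are hypothetical.
in_order = [("B", 2), ("A", 1), ("C", 3)]
out_of_order = [("C", 1), ("A", 2), ("B", 3)]

set_in_order = matches_set(in_order, ["A", "B", "C"])
series_in_order = matches_series(in_order, ["A", "B", "C"])
set_out_of_order = matches_set(out_of_order, ["A", "B", "C"])
series_out_of_order = matches_series(out_of_order, ["A", "B", "C"])
```

Both event lists satisfy the set relationship, but only the first satisfies the series relationship, because a series also constrains the order of the events.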
Data from the tests may be stored in a single column of a test table, with an additional column that indicates the test type. Table I, below, is an example of such a table. This tabular arrangement allows results from new tests to be added without requiring a structural change to the relational schema. Users, however, are often surprised to find that related test results are not stored together as a result set in the database. Table II illustrates the arrangement that users might expect, because it is consistent with the users' logical perspective of the physical data.
TABLE I
Example Table - Actual

ID   Result   Type     Date             Test Run
1    12       Test 1   Nov. 3, 2004     1
1    45       Test 2   Nov. 4, 2004     1
1    203      Test 3   Nov. 5, 2004     1
1    9        Test 1   Nov. 20, 2004    2
1    47       Test 2   Nov. 21, 2004    2
1    198      Test 3   Nov. 22, 2004    2

TABLE II
Example Table - Expected

ID   Name   Test 1   Test 2   Test 3
1    Dave   12       45       203
1    Dave   9        47       198
However, arranging a relational table consistent with the users' logical view of these relationships (e.g., as in Table II) leads to an inefficient or unmaintainable database design. A new table would need to be added for each new test or test regimen. Presenting the tests as they are stored in Table I, on the other hand, makes it difficult for users to interpret the data. Accordingly, it may not always be possible or desirable to make the physical environment consistent with the users' logical perspective. In other cases, the disparity between the physical environment and the users' logical perspective of the physical data is accidental (i.e., due to poor development of the physical environment) rather than an intentional design choice. Regardless of the cause, the disparity inhibits users' ability to compose queries that return expected results.
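The gap between the stored form (Table I) and the expected form (Table II) can be made concrete with a short sketch that groups the Table I rows by ID and test run and presents one row per run. This is only an illustration under stated assumptions: the `pivot` function is hypothetical, and because Table I has no name column, the `names` lookup that supplies "Dave" is assumed to come from elsewhere.

```python
from collections import defaultdict

# Rows as stored in Table I: (ID, Result, Type, Date, Test Run).
rows = [
    (1, 12, "Test 1", "Nov. 3, 2004", 1),
    (1, 45, "Test 2", "Nov. 4, 2004", 1),
    (1, 203, "Test 3", "Nov. 5, 2004", 1),
    (1, 9, "Test 1", "Nov. 20, 2004", 2),
    (1, 47, "Test 2", "Nov. 21, 2004", 2),
    (1, 198, "Test 3", "Nov. 22, 2004", 2),
]

# Assumption: names are held in a separate lookup keyed by ID.
names = {1: "Dave"}


def pivot(rows, names):
    """Group results by (ID, test run) and emit one row per group,
    with one column per test type, in the layout of Table II."""
    grouped = defaultdict(dict)
    for pid, result, test_type, _date, run in rows:
        grouped[(pid, run)][test_type] = result
    test_types = sorted({r[2] for r in rows})
    return [
        (pid, names[pid]) + tuple(results.get(t) for t in test_types)
        for (pid, run), results in sorted(grouped.items())
    ]


table_ii = pivot(rows, names)
```

The physical rows are left untouched; only the presentation changes, so a new test type adds a column to the presented result without requiring a change to the stored table.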
Accordingly, there remains a need to extend the capabilities of an abstract database to account for the logical relationships between logical fields that may not be reflected by the underlying physical database schema.