1. Field of the Invention
The present invention generally relates to data processing and more particularly to maintaining a data warehouse containing fields of data originating from one or more data sources.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system is a computer database management system (DBMS) that uses relational techniques for storing and retrieving data. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented programming database is one that is congruent with the data defined in object classes and subclasses.
Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application or the operating system) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL). Illustratively, SQL is used to make interactive queries for getting information from and updating a database such as International Business Machines' (IBM) DB2, Microsoft's SQL Server, and database products from Oracle, Sybase, and Computer Associates. The term “query” denotes a set of commands for retrieving data from a stored database. Queries take the form of a command language that lets programmers and programs select, insert, update, locate data, and so forth.
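By way of illustration only, the select, insert, and update operations described above may be sketched as follows. The sketch assumes Python's standard sqlite3 module and an in-memory database; the patients table, its columns, and the run_demo function are hypothetical names chosen for the example and are not drawn from any particular database product.

```python
import sqlite3

def run_demo():
    """Issue insert, update, and select requests against a hypothetical table."""
    conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
    cur = conn.cursor()
    # Hypothetical schema: one table of patient records.
    cur.execute(
        "CREATE TABLE patients ("
        "id INTEGER PRIMARY KEY, name TEXT, test_result TEXT)"
    )
    # Add a record.
    cur.execute(
        "INSERT INTO patients (name, test_result) VALUES (?, ?)",
        ("A. Smith", "pending"),
    )
    # Change a record.
    cur.execute(
        "UPDATE patients SET test_result = ? WHERE name = ?",
        ("negative", "A. Smith"),
    )
    conn.commit()
    # Read records back.
    rows = cur.execute(
        "SELECT id, name, test_result FROM patients"
    ).fetchall()
    conn.close()
    return rows
```

Each statement passed to execute is a database access request in the sense used above; the requesting entity in this sketch is the Python program itself.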
Historically, the creation of even relatively simple queries required intimate knowledge of both the database and the underlying query language and, therefore, typically required the assistance of a database expert. Recent advances in database technology have made it relatively easy for non-expert users to build complex queries designed to return required data. These advances, however, have given rise to a number of new problems. For example, making the database easy to access increases the likelihood that many people will be running queries, which may create a heavy load on system resources. As an illustration, at a large healthcare research facility, thousands of users may have the ability to run queries at any given time, placing a large demand on a server running the database against which the queries are run. Such a demand may prevent or slow daily transactions accessing the same database (e.g., entering or updating patient records and test results).
One approach to preventing this demand from crippling daily operations involving such transactions is to replicate data from database servers used for daily operations to a data warehouse dedicated to receiving research queries. One challenge to this approach is determining which data should be replicated. Because it may be very difficult to determine what data is important (i.e., likely to be queried) and what data is not important, every possible queryable field may be replicated to the data warehouse. However, because the data warehouse may accept data from a number of different data sources, the data warehouse may have to be prohibitively large in order to keep pace with the combined growth of all these sources. Further, as the sources grow, replicating data from each source may place a tremendous strain on a network as updates are sent to keep the data warehouse current.
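The replicate-everything approach described above may be sketched, under stated assumptions, as follows. The sketch assumes SQLite databases accessed through Python's standard sqlite3 module, a trusted table name, and a full refresh on each run; the function replicate_all_fields is a hypothetical name introduced for illustration and does not represent any particular replication product.

```python
import sqlite3

def replicate_all_fields(source, warehouse, table):
    """Copy every column of `table` from source to warehouse (full refresh).

    Assumes `table` is a trusted identifier; it is interpolated into SQL.
    Returns the number of rows sent to the warehouse.
    """
    # Discover every column of the source table (index 1 is the column name).
    cols = [row[1] for row in source.execute(f"PRAGMA table_info({table})")]
    col_list = ", ".join(cols)
    placeholders = ", ".join("?" for _ in cols)
    # Mirror the table in the warehouse and discard any stale copy.
    warehouse.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_list})")
    warehouse.execute(f"DELETE FROM {table}")
    # Every field of every row is transferred, whether or not it is ever queried.
    rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
    warehouse.executemany(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows
    )
    warehouse.commit()
    return len(rows)
```

Because the sketch copies all columns of all rows on every run, the volume of data transferred and stored grows with each source, which is the storage and network burden noted above.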
Accordingly, there is a need for an improved method for maintaining a data warehouse of replicated data, such that network traffic and the size of the data warehouse are optimized.