The invention relates generally to database management. More specifically, the invention relates to a method and system for maintaining the history of database content where the database is modified by a sequence of events.
An institutional database often contains thousands to billions of individual records. An update to an institutional database is often performed through a process termed a load. A load is performed by taking input data from an input table or file and populating the database with the data, where the data in the input and in the database is correlated using a primary key.
Generally, throughout the life of a database, multiple loads are performed. After each load, the database reflects the information populated into the database by the most recent load. When a load is performed, the load may have various effects on the database.
First, the load may insert rows into the database with new key values. Second, the load may update rows with existing key values, but different non-key values. Third, the load may delete rows with those key values, which are not present in the load, but which are present in the database prior to the load.
After a load has been completed, there is generally no way to “undo” the load, other than by restoring the database to some state that it had prior to the load, and then partially repeating the sequence of events performed to arrive at the present database. That is, if the database has been updated with ten loads, and it is realized that the seventh load (or any other load) contained bad data, the only way to undo the seventh load is to start from the state of the database as it existed before the first load, and then repeat loads one through six, followed by loads eight through ten. This process becomes exceedingly difficult when the number or size of loads is large or the need to undo loads is frequent.
It is known that backups of the database may be stored for archival and restore purposes periodically and even after each load. However, with large databases, restoring a database from a previously stored backup and then partially repeating remaining loads just to undo the load may consume an unacceptable amount of time.
Implications of the difficulties of addressing problems with loads are shown through the following example. Data warehousing companies often receive information from multiple data sources and maintain a single database, or data warehouse, containing combined data. However, there is a problem when one of the data suppliers not only stops supplying data, but also informs the data warehousing company that they can no longer use the data that the supplier has previously supplied. In such a case, the loads performed using data received from the supplier must be retracted, such that the database reflects a state as if the retracted load(s) never occurred. It is possible to load an archived copy of the database as it previously existed immediately before the load to be retracted occurred. However, the subsequent sequence of events (with the exception of the retracted loads) must still be performed.
This problem is further aggravated by the fact that each data supplier often provides in its loads only a subset of the information collected by the data warehousing company. For example, while one supplier of demographic data might provide information such as name, address, city, state, zip code, and telephone number, a second supplier might provide name, zip code, age, sex, and race. Data warehousing companies often want to use this data from multiple data suppliers to provide useful information based on the combined data, but are often restricted by one or more suppliers in their use of the data supplied by that individual supplier. Data warehousing companies need an efficient system that can retract data supplied by one or more suppliers on one or more occasions.
There is also a problem when the state of a database just before a load was retracted needs to be known. For instance, a data supplier supplied a load containing bad data. The load containing bad data is discovered, and the load is retracted in a subsequent event. A load containing good data is then performed. However, reports (sent to clients, customers, etc.) may have been generated based on the bad data before the load with the bad data was retracted. In such a case, the database manager may want to duplicate the reports based on the bad data for audit or management purposes. Presently, there is no way to do this other than to reconstruct the database using the initial sequence up to and including the load containing bad data, as load retractions cannot subsequently be ignored. Thus, an improved database load management method and system is needed that addresses the above shortcomings.