Relational Database Management Systems (RDBMS) have become an integral part of enterprise information processing infrastructures throughout the world. An RDBMS 100, as shown in FIG. 1, maintains relational data structures called “relational tables,” or simply “tables” 105. Tables 105 consist of related data values known as “columns” (or “attributes”) which form “rows” (or “tuples”).
An RDBMS “server” 110 is a hardware and/or software entity responsible for supporting the relational paradigm. As its name implies, the RDBMS server provides services to other programs, i.e., it stores, retrieves, organizes and manages data. A software program that uses the services provided by the RDBMS server is known as a “client” 115.
In many cases, an enterprise will store real-time data in an operational data store (ODS) 200, illustrated in FIG. 2, which is designed to efficiently handle a large number of small transactions, such as sales transactions, in a short amount of time. If the enterprise wishes to perform analysis of the data stored in the ODS, it may move the data to a data warehouse 205, which is designed to handle a relatively small number of very large transactions that require reasonable, but not necessarily instantaneous, response times.
To accomplish this, data is “imported,” or “loaded” (block 210) from various external sources, such as the ODS 200, into the data warehouse 205. Once the data is inside the data warehouse 205, it can be manipulated and queried. Similarly, the data is sometimes “unloaded” or “exported” from the data warehouse 205 into the ODS 200 or into another data store. Because load and unload processes are similar in the processing they perform, both will be referred to hereinafter as “database loads” or “loads.”
A database load is typically performed by a special purpose program called a “utility.” In most cases, the time required to perform a database load is directly proportional to the amount of data being transferred. Consequently, loading or unloading “Very Large Databases” (i.e., databases containing many gigabytes of data) creates an additional problem: an increased risk of failure. The longer a given load runs, the higher the probability that it will be unexpectedly interrupted by a sudden hardware or software failure on either the client 115 or the server 110. If such a failure occurs, some or all of the data being loaded or unloaded may be lost or unsuitable for use, and it may be necessary to restart the load or unload process.
“Parallel Processing,” a computing technique in which computations are performed simultaneously by multiple computing resources, can reduce the amount of time necessary to perform a load by distributing the processing associated with the load across a number of processors. Reducing the load time reduces the probability of failure. Even using parallel processing, however, the amount of data is still very large and errors are still possible.
One traditional approach to handling errors in non-parallel systems is called “mini-batch” or “checkpointing.” Using this approach, the overall processing time for a task is divided into a set of intervals. At the end of each interval, the task enters a “restartable state” called a “checkpoint” and makes a permanent record of this fact. A restartable state is a program state from which processing can be resumed as if it had never been interrupted. If processing is interrupted, it can be resumed from the most recent successful checkpoint without introducing any errors into the final result.
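The checkpointing approach described above can be illustrated with a minimal, non-parallel sketch. It is not taken from any particular utility: the function names (save_checkpoint, load_checkpoint, checkpointed_load), the JSON checkpoint file, the row-count interval, and the fail_at parameter used to simulate an interruption are all hypothetical, and an in-memory list stands in for the destination table.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_index):
    """Make a permanent record of the index of the next unprocessed row."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
        f.flush()
        os.fsync(f.fileno())  # force the record to stable storage
    # Rename over the old record so a reader never sees a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the row index to resume from, or 0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def checkpointed_load(rows, target, checkpoint_path, interval, fail_at=None):
    """Copy rows into target, entering a checkpoint every `interval` rows.

    `target` is a list standing in for the destination table (assumption).
    `fail_at` is a hypothetical hook that simulates an interruption just
    before the row with that index is loaded.
    """
    start = load_checkpoint(checkpoint_path)
    # Discard rows written after the most recent checkpoint, so the load
    # resumes from a restartable state without duplicating any row.
    del target[start:]
    for i in range(start, len(rows)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        target.append(rows[i])
        if (i + 1) % interval == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(rows))
    return target
```

In this sketch, a load that is interrupted can simply be invoked again with the same arguments. It resumes from the row recorded at the most recent checkpoint, and the final contents of the target match those of an uninterrupted load.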
Applying checkpointing to a parallel process is a significant challenge.