1. Field of the Invention
The present invention relates to data processing apparatus and methods for the storage and retrieval of data stored in computerized database management systems. More particularly, the invention relates to deterministically controlling the recovery of the data after a crash has disrupted the system.
2. Description of the Prior Art
A major problem today is providing high availability (HA) of a database management system (DBMS) to its users. Increasingly, organizations with DBMSs, such as banks, brokerages, e-tailers, etc., are finding that they cannot tolerate long outages while a DBMS is unavailable or only available at a reduced performance level.
There are two aspects to this problem. The first and more obvious aspect is the prevention of disruptions or “crashes” of a DBMS to begin with. This is significant, but it is not the subject of this discussion. Rather, we here address the second aspect, the pragmatic fact that disruptions in a DBMS will occur and that the DBMS needs to be recovered rapidly to its full performance level.
FIG. 1 (background art) is a block diagram conceptually depicting the basic elements and operation of a representative DBMS 10. The DBMS 10 includes a database engine 12, a database 14, a buffer pool 16, and a transaction log 18. In operation, pages of data are “paged into” and “paged out” of the buffer pool 16 or cache memory. A “page fault” occurs when a page has to be paged into the buffer pool 16 because it is not already there. When a page contains updates that are not yet recorded in the database 14 it is a “dirty page.” The operation of paging out a dirty page from the buffer pool 16 into the database 14 is often referred to as “flushing.” Conversely, when a page with no updates is paged out, this operation is often referred to as “replacing.” Page faults and having to flush dirty pages are generally undesirable because they slow down operation of the DBMS 10.
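By way of illustration only, the paging terminology above can be sketched in a few lines of Python. The class name BufferPool, the least-recently-used eviction policy, and the counters are assumptions introduced for this sketch; they are not taken from FIG. 1 or from any particular DBMS.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool with LRU eviction (an assumed policy, for illustration)."""

    def __init__(self, database, capacity):
        self.database = database      # dict: page_id -> contents, stands in for database 14
        self.capacity = capacity      # maximum number of pages held in the pool
        self.pages = OrderedDict()    # page_id -> contents, least recently used first
        self.dirty = set()            # page_ids with updates not yet in the database
        self.page_faults = 0
        self.flushes = 0
        self.replacements = 0

    def _page_in(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # already resident: mark as recently used
            return
        self.page_faults += 1                 # page fault: page is not in the pool
        if len(self.pages) >= self.capacity:
            self._page_out()
        self.pages[page_id] = self.database[page_id]

    def _page_out(self):
        victim, contents = self.pages.popitem(last=False)
        if victim in self.dirty:
            self.database[victim] = contents  # "flushing" a dirty page
            self.dirty.discard(victim)
            self.flushes += 1
        else:
            self.replacements += 1            # "replacing" a page with no updates

    def read(self, page_id):
        self._page_in(page_id)
        return self.pages[page_id]

    def update(self, page_id, contents):
        self._page_in(page_id)
        self.pages[page_id] = contents
        self.dirty.add(page_id)
```

With a capacity of two pages, updating page 1 and then reading pages 2 and 3 causes three page faults and forces page 1 to be flushed; a subsequent read of page 1 replaces the clean page 2 without any write to the database.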
The buffer pool 16 resides in high speed, volatile memory and the rationale for using it, rather than simply working directly with the database 14, is to increase the efficiency of the DBMS 10 based on the principles of probability. If the DBMS 10 is being used by an application to perform a query or to update a record in a page, that page is likely to contain other records that will soon also be the subject of queries or updates. Accordingly, it is usually desirable to not page out a page after a query or update finishes with it, but rather to retain it in the buffer pool 16 until it has not been accessed for some period of time or until the application using it ends.
Of particular present interest, when an update is performed the database engine 12 needs to page out dirty pages at some point and this is where things get complicated. Unplanned disruptions in the DBMS 10 can occur, causing the contents of the buffer pool 16 to not get properly flushed to the database 14. Such an event is termed a “crash” and the process of restoring the data stored in the database 14 to a transactionally consistent state after such a crash is often referred to as “crash recovery.”
To facilitate crash recovery, a logical representation of each of the updates applied to the pages in the buffer pool 16 is entered into a transaction log 18 that resides in persistent storage. In the unfortunate event of a crash, the transaction log 18 can be replayed to redo all of the committed updates that were applied to pages in the buffer pool 16 but not flushed to the database 14. If there is a large number of records in the transaction log 18 to replay during crash recovery or if the records are expensive in terms of system resources to replay, crash recovery can take a long time.
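A minimal sketch of such a replay follows. The LogRecord fields, the use of whole-page values as the "logical representation," and the function name replay are hypothetical and chosen only for illustration. For simplicity the sketch redoes every committed update in log order rather than only those not yet flushed, which is harmless here because reapplying an update of this form is idempotent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lsn: int            # position of the record in the log (assumed field)
    txn: str            # transaction identifier (assumed field)
    kind: str           # "update" or "commit"
    page_id: int = 0    # page touched by an update record
    value: str = ""     # new page contents written by an update record


def replay(log, database):
    """Redo committed updates in log order; return how many records were replayed."""
    committed = {r.txn for r in log if r.kind == "commit"}
    replayed = 0
    for r in sorted(log, key=lambda rec: rec.lsn):
        if r.kind == "update" and r.txn in committed:
            database[r.page_id] = r.value   # redo the update against the database
            replayed += 1
    return replayed
```

The returned count makes the cost relationship visible: the more records that qualify for replay, the more work crash recovery has to perform.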
FIG. 2 (background art) is a block diagram depicting the transaction log 18 as a series of log records 20. Here, “recn” represents the log record 20 that dirtied the oldest unflushed page in the buffer pool 16 and the series “recn, recn+1, . . . recn+m” then represents the log records 20 that need to be replayed.
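The selection of the records to be replayed can be sketched as follows, assuming (hypothetically) that each log record carries an increasing sequence number and that the system tracks, for every unflushed page, the sequence number of the record that first dirtied it. Neither assumption is stated in FIG. 2; both are introduced only for this illustration.

```python
def replay_start(first_dirty_lsn):
    """Sequence number of recn, the record that dirtied the oldest unflushed page.

    first_dirty_lsn maps page_id -> sequence number of the record that first
    dirtied that page. Returns None when there are no unflushed pages.
    """
    return min(first_dirty_lsn.values(), default=None)


def records_to_replay(log_lsns, first_dirty_lsn):
    """Sequence numbers from recn through the end of the log, in log order."""
    start = replay_start(first_dirty_lsn)
    if start is None:
        return []                      # nothing is dirty, so nothing to replay
    return [lsn for lsn in log_lsns if lsn >= start]
```

For example, with records numbered 100 through 109 and two unflushed pages first dirtied by records 104 and 106, replay begins at record 104 and covers six records. Flushing the page dirtied by record 104 would move the starting point forward to record 106.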
In passing, it should be noted that recovery time after a crash also includes time to roll back any uncommitted transactions that were open at the time of the crash, but this time is generally negligible compared to the time to do the roll forward portion of recovery. This is because in online transaction processing (OLTP) systems, transactions tend to be very short, so only a few seconds' worth of rollback is needed, while in decision support systems (DSS), transactions tend to be long but also read-only, and read-only transactions do not generate any log records and require no rollback.
Various technologies have been developed in attempts to improve crash recovery handling. For example, U.S. Pat. No. 5,625,820 and U.S. Pat. No. 5,574,897 by Hermsmeir et al. disclose methods wherein a user chooses a length of time (an external threshold) that he or she is willing to spend recovering a database, and the system dynamically manages the logging of objects to meet this demand. The available CPU to run the process, the number of I/Os the process generates, and the quantity of lock wait time are taken into consideration. The shorter the time the user chooses, the more objects the system will log, but the more system performance is otherwise degraded as a tradeoff. As such, these references teach resource management to achieve a desired recovery time, but the resource management they teach is rigid.
U.S. Published App. No. 2003/0084372 by Mock et al. discloses a “Method and apparatus for data recovery optimization in a logically partitioned computer system.” This is a method wherein a user may specify the maximum recovery time which can be tolerated for compiled data in a computer system having dynamically configurable logical partitions, and a protection utility manages the logging of indexes so that this maximum recovery time is not exceeded yet unnecessary logging is not performed. The compiled data may be multiple database indexes, which are selectively logged to reduce recovery time. Logging is selectively discontinued or extended responsive to changes in partition configuration, allowing a gradual migration to the target recovery time using the new set of configured resources. As such, however, this invention addresses recompiling database indexes, rather than the whole process of database recovery, and it does so in the context of a computer system having dynamically configurable logical partitions.
U.S. Pat. No. 6,351,754 by Bridge, Jr. et al. discloses a method for controlling recovery downtime. A checkpoint value is maintained that indicates which records of a plurality of records have to be processed after the failure. The target checkpoint value is determined by calculating a maximum number of records that should be processed after the failure. As such, however, this approach is not deterministic and its checkpoint value may result in an undue allocation of system resources because of this.
U.S. Pat. No. 5,758,359 discloses a method for performing retroactive backups in a computer system. The retroactive backup mechanism employs a backup policy dictating that certain backups are to be performed at set times and for set backup levels. The total size of a save set is compared to the maximum size threshold and the backup is activated when the threshold is reached. As such, however, this approach is also clearly not deterministic.
In view of the current state of affairs, however, the present inventors determined that users of DBMSs who have high availability (HA) requirements would still benefit greatly from the ability to specify a maximum crash recovery time (Rmax) that they are willing to tolerate, and to then have the DBMS automatically and more efficiently adjust its work of flushing dirty pages to the database as needed to guarantee that recovery after a crash will not take longer than the specified Rmax.