The present invention concerns database management systems (DBMS) and pertains particularly to recovery from crashes in DBMSs.
DBMSs have had crash recovery for many years. For background information on architectures and algorithms that support database recovery, see, for example, Gray, Jim and Reuter, Andreas, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, Inc., 1993. While variations on techniques for database recovery exist, for the most part they can broadly be described as being based on write-ahead logging. Write-ahead logging means that database updates are first written to a database log file on a disk storage system before being applied to the database. If the database crashes, the state of the database as of the time of the crash can be recovered by analyzing the database log file and performing both undo recovery and redo recovery. Undo recovery involves rolling back the database updates for all transactions that were in progress but uncommitted at the time of the crash. Redo recovery involves re-applying any database updates for transactions that committed between the time of the last checkpoint and the time of the crash.
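The undo and redo passes described above can be illustrated with a minimal sketch. It is not drawn from any particular DBMS; the `LogRecord` type, the `recover` function, and the sample log are hypothetical names introduced only for illustration. The sketch assumes that a checkpoint flushes every earlier update to disk, that each update record carries both a before-image and an after-image, and that transactions do not overwrite each other's uncommitted data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical log record; real log formats differ."""
    kind: str                     # "update", "commit", or "checkpoint"
    txn: Optional[int] = None     # transaction identifier
    key: Optional[str] = None     # item that was updated
    before: Optional[int] = None  # before-image, used by undo recovery
    after: Optional[int] = None   # after-image, used by redo recovery


def recover(disk: dict, log: list) -> dict:
    """Rebuild a consistent state from the on-disk state and the log."""
    db = dict(disk)
    committed = {r.txn for r in log if r.kind == "commit"}
    # Assumption: everything logged before the last checkpoint is on disk.
    last_checkpoint = max(
        (i for i, r in enumerate(log) if r.kind == "checkpoint"), default=-1
    )
    # Redo pass: re-apply committed updates logged after the last checkpoint.
    for r in log[last_checkpoint + 1:]:
        if r.kind == "update" and r.txn in committed:
            db[r.key] = r.after
    # Undo pass: walk the log backwards and restore the before-images of
    # updates made by transactions that never committed.
    for r in reversed(log):
        if r.kind == "update" and r.txn not in committed:
            if r.before is None:
                db.pop(r.key, None)   # the item did not exist before
            else:
                db[r.key] = r.before
    return db


# Sample log: transaction 3 is still in progress when the crash occurs.
log = [
    LogRecord("update", txn=1, key="a", before=1, after=10),
    LogRecord("commit", txn=1),
    LogRecord("checkpoint"),
    LogRecord("update", txn=2, key="b", before=2, after=20),
    LogRecord("update", txn=3, key="c", before=3, after=30),
    LogRecord("commit", txn=2),
]
# On-disk state at the crash: transaction 2's update had not yet reached
# disk, while transaction 3's uncommitted update had.
disk_at_crash = {"a": 10, "b": 2, "c": 30}
recovered = recover(disk_at_crash, log)
```

In this sample, redo recovery restores the committed update to `b` and undo recovery removes the uncommitted update to `c`. Both passes scan the log, so the work performed grows with the amount of logged activity since the last checkpoint.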
Most commercial database management systems utilize some form of write-ahead logging and perform conventional disk log-based recovery, as described above. The time required to perform crash recovery depends on several factors, such as the frequency of checkpoints, the size of the buffer pool, and the rate of page updates. Database recovery can take less than a minute if there was no update activity at the time of the crash and the database storage was in a consistent state at that time. Other than that special case, database recovery time ranges from a small number of minutes to tens of minutes to over an hour, depending on the above-listed factors as well as other factors.
The technique of write-ahead logging and performing conventional disk log-based recovery for a database poses a challenge to DBMS vendors and platform vendors who are striving to reduce the time required to make a database available after a system crash. For example, it is desirable to be able to guarantee less than one-minute end-to-end client transparent database failover. A database failover is a complete recovery from a failure in a database. Further, it is desirable that the end-to-end client transparent database failover be independent of the database workload. In order to do this, techniques other than conventional disk log-based recovery have been considered.
Database failover techniques which attempt to provide fast (i.e., less than one minute) end-to-end client transparent database failover fall into three categories: conventional DBMSs with frequent checkpoints, parallel databases with mutual recovery/takeover, and specialized fast-failover DBMSs based on proprietary hardware and operating systems.
In order to implement a conventional DBMS with frequent checkpoints it is necessary only to run an available database product, such as Oracle 7 available from Oracle Corporation, having a business address of 100 Oracle Parkway, Redwood Shores, Calif. 94065, Informix ODS available from Informix Software, Inc., having a business address of 4100 Bohannon Drive, Menlo Park, Calif. 94025, or Sybase System 11 available from Sybase, Inc., having a business address of 6475 Christie Avenue, Emeryville, Calif. 94608, with the checkpoint interval set so that checkpoints occur very frequently. The above-listed database products are "single-instance" database products, i.e., they run as a single database instance on a single computer node, which may be a uni-processor or a symmetric multi-processor (SMP). While this technique reduces the time required for crash recovery, it does not guarantee recovery within a small number of minutes, particularly if there is a high rate of updates. Also, this approach has a substantial impact on runtime performance due to increased disk utilization by a page cleaner daemon. This overhead may be unacceptable for high-throughput database applications.
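Why frequent checkpoints alone cannot bound recovery time can be seen from a rough back-of-the-envelope model. The model and all of its parameters are assumptions introduced here for illustration, not measurements of any product: it supposes that updates arrive at a steady rate, that checkpoints occur at a fixed time interval, and that redo recovery replays logged updates at a steady rate.

```python
def worst_case_redo_seconds(
    updates_per_sec: float,
    checkpoint_interval_sec: float,
    redo_updates_per_sec: float,
) -> float:
    """Rough upper bound on redo time under the simplified model.

    In the worst case the crash occurs just before the next checkpoint,
    so a full interval's worth of updates must be replayed.
    """
    backlog = updates_per_sec * checkpoint_interval_sec
    return backlog / redo_updates_per_sec


# Hypothetical figures: same checkpoint interval, different update rates.
light_load = worst_case_redo_seconds(100, 30, 1000)
heavy_load = worst_case_redo_seconds(5000, 30, 1000)
```

Under this model the worst-case redo time scales linearly with the update rate for a fixed checkpoint interval, which is consistent with the observation that recovery time depends on the workload.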
Parallel databases with mutual recovery/takeover utilize multiple database instances that mutually recover a failed instance. For example, Oracle Parallel Server (OPS) available from Oracle Corporation implements mutual recovery/takeover utilizing multiple database instances using a shared-disk model. Informix XPS available from Informix Software, Inc., Sybase MPP available from Sybase, Inc., and IBM DB2 Parallel Edition, available from IBM Corporation, having a business address of 650 Harry Road, San Jose, Calif. 95120, implement mutual recovery/takeover utilizing multiple database instances without using a shared disk. Parallel databases with mutual recovery/takeover have the advantage of providing fast recovery in the sense that the surviving instances initiate recovery of the failed instance as soon as they are notified of the failure. However, the recovery is still based on conventional disk-based log recovery, and therefore is still not guaranteed to complete in a small number of minutes for any database workload.
One example of the use of specialized fast-failover DBMSs based on proprietary hardware and operating systems is the family of systems developed by Tandem Computers, Inc., having a business address of 18922 Forge Way, Cupertino, Calif. 95014. Tandem has developed its own specialized DBMS software, proprietary hardware, and operating system to achieve a fast-failover database product. These systems utilize redundant hardware and software components. Fault tolerance is provided by coordinating primary and backup processes on nodes that are connected by a high-speed interconnect. This process-pair technology allows for a fast takeover of database operations by the backup system if the primary should fail.
Although this use of redundant hardware and software components is considered a fault-tolerant system, the problem with this approach is that it is based on proprietary hardware and a proprietary operating system, and is therefore less competitive from a price/performance perspective. The industry trend is rapidly moving toward building scalable, highly available (HA) DBMS systems from commodity off-the-shelf (COTS) technology. As this trend evolves, the scalability and HA characteristics of these systems may approach those of the proprietary fault-tolerant systems, and at a substantially reduced price. An example of an HA infrastructure is the HA cluster products available from Hewlett-Packard Company, having a business address of 3000 Hanover Street, Palo Alto, Calif. 94304. In these products, the HA infrastructure is provided by the MC/ServiceGuard high availability clustering system operating on the HP-UX operating system, both also available from Hewlett-Packard Company.