In a stand-alone system, system failure recovery, also known as crash recovery, usually consists of two standard processing phases: a forward redo phase and a backward undo phase, shown representatively at FIG. 1. Log files, which record every operation that changes the database state, are replayed to recreate the event sequence for each transaction. The log files for stand-alone systems are stored on local disks, while multi-node systems may have log files stored locally (i.e., where generated) or at a central location. If a commit log record associated with a transaction is found, then the transaction is committed. If no such record is found, the transaction is aborted.
The two phases of the recovery process are commonly referred to as forward recovery and backward recovery. In the forward recovery phase (steps 101-103) of FIG. 1, the node scans the log files forward from a point determined by the checkpoint records, at 101, and redoes all operations stored in the local log files (also referred to as "repeating history") to establish the state of the database immediately before the system crashed. To redo the operations, the node reapplies the log to the database and refreshes the transaction table, at 102. Once a check, at 103, determines that there are no more logs to process, the backward recovery phase (steps 104-108) is conducted. In the backward recovery phase, all interrupted transactions (also known as "in-flight" transactions) are rolled back (i.e., aborted). A list of all interrupted transactions is obtained at step 104. If the list is empty, as determined at step 105, crash recovery is complete. If the list is not empty, the node scans the logs backward and undoes (i.e., aborts) the interrupted transactions at 106, and then updates the list at 107. The procedure is repeated until the list is empty and the crash recovery is done, as indicated at 108.
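By way of illustration only, the forward and backward phases described above may be sketched as follows. The sketch assumes an in-memory log, a database represented as a dictionary, and a checkpoint at which no transaction was active; the names LogRecord and recover, and the "begin" record type, are hypothetical and do not appear in FIG. 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    txn: int
    kind: str                    # "begin", "update", or "commit" (hypothetical record types)
    key: Optional[str] = None
    old: Optional[int] = None    # before-image, used by the undo phase
    new: Optional[int] = None    # after-image, used by the redo phase


def recover(db, log, checkpoint=0):
    """Redo forward from the checkpoint, then undo interrupted transactions."""
    # Forward phase (steps 101-103): repeat history and refresh the
    # transaction table, here the set of transactions with no commit record.
    active = set()
    for rec in log[checkpoint:]:
        if rec.kind == "begin":
            active.add(rec.txn)
        elif rec.kind == "update":
            db[rec.key] = rec.new
        elif rec.kind == "commit":
            active.discard(rec.txn)

    # Backward phase (steps 104-108): scan the log in reverse and restore
    # before-images until the list of interrupted transactions is empty.
    rolled_back = set(active)
    pending = set(active)
    for rec in reversed(log):
        if not pending:
            break
        if rec.txn not in pending:
            continue
        if rec.kind == "update":
            if rec.old is None:
                db.pop(rec.key, None)   # the key did not exist before the update
            else:
                db[rec.key] = rec.old
        elif rec.kind == "begin":
            pending.discard(rec.txn)    # transaction fully undone; update the list
    return rolled_back


log = [
    LogRecord(1, "begin"),
    LogRecord(1, "update", "a", old=1, new=10),
    LogRecord(2, "begin"),
    LogRecord(2, "update", "b", old=2, new=20),
    LogRecord(1, "commit"),
    LogRecord(2, "update", "a", old=10, new=30),
]
db = {"a": 1, "b": 20}           # partially flushed state found on disk after the crash
rolled_back = recover(db, log)
```

In this example, transaction 1 has a commit record and its update survives, while transaction 2 has none and is rolled back.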
In a stand-alone system, the database will become consistent after these two phases of recovery. In a parallel system, however, node failures or other types of severe errors that occur during commit processing will cause transactions to be out-of-sync across multiple nodes. Recovery across the multiple nodes is not as straightforward as it is in a stand-alone system. Although the standard recovery process for multi-node systems does involve each node independently executing the two-step process, database consistency cannot be guaranteed across nodes, due to the nature of the commit protocol.
In what is referred to herein as the "standard two-part commit protocol," a coordinating node, at which a transaction is executing, first issues a "prepare to commit" message to all participating, or subordinate, nodes. After receipt of responses from all participating nodes, the coordinating node then issues an outcome message in the second phase of the protocol, either a "commit" message if all nodes have sent affirmative responses, or an "abort" message otherwise. All participating nodes and the coordinating node must vote "yes" for the coordinating node to commit/complete the transaction. Any "no" response received will result in the aborting of the transaction. In response to the outcome message ("commit" or "abort") generated by the coordinating node, each participating node either performs its local commit procedure or aborts the transaction. Before issuing a "yes" reply to the coordinating node, each participating node writes a "prepare" log to its local disk. Similarly, before sending the "commit" message to all participating nodes, the coordinating node writes a "commit" log to its local disk. Finally, after local commit processing has been completed, a participating node writes a "commit" log to its local disk and acknowledges completion of the commit to the coordinating node. In addition, a transaction table entry for the corresponding transaction is updated at each local node after voting or performing a commit procedure. When the coordinating node has received an acknowledgement from all participating nodes, it removes the corresponding entry from the transaction table, writes a "forget" log record to disk, and "forgets" about the transaction.
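By way of illustration only, the message and logging sequence described above may be sketched as follows. The sketch assumes a single process in which method calls stand in for messages and a list stands in for each node's local disk log; the names Node and run_commit are hypothetical, and the handling of an "abort" outcome at a participating node (dropping the transaction table entry without writing a log record) is an assumption rather than part of the protocol as described.

```python
class Node:
    def __init__(self, name, vote=True):
        self.name = name
        self.vote = vote
        self.log = []      # stands in for the node's local disk log
        self.table = {}    # transaction table: transaction id -> state

    def prepare(self, txn):
        """Handle "prepare to commit"; return the node's vote."""
        if self.vote:
            self.log.append(("prepare", txn))   # written before replying "yes"
            self.table[txn] = "prepared"
        return self.vote

    def finish(self, txn, outcome):
        """Handle the outcome message; return an acknowledgement."""
        if outcome == "commit":
            # Local commit processing would take place here.
            self.log.append(("commit", txn))
            self.table[txn] = "committed"
        else:
            self.table.pop(txn, None)           # assumed abort handling
        return True


def run_commit(coordinator, participants, txn):
    # First phase: collect a vote from every participating node.
    votes = [p.prepare(txn) for p in participants]
    outcome = "commit" if coordinator.vote and all(votes) else "abort"

    # The coordinating node logs "commit" before sending the commit message.
    if outcome == "commit":
        coordinator.log.append(("commit", txn))
        coordinator.table[txn] = "committed"

    # Second phase: send the outcome and gather acknowledgements.
    acks = [p.finish(txn, outcome) for p in participants]
    if outcome == "commit" and all(acks):
        coordinator.table.pop(txn, None)
        coordinator.log.append(("forget", txn))
    return outcome


coord, p1, p2 = Node("coord"), Node("p1"), Node("p2")
result = run_commit(coord, [p1, p2], txn=1)
```

After this run, the coordinating node's log holds a "commit" record followed by a "forget" record, and each participating node's log holds a "prepare" record followed by a "commit" record.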
For aborted transactions, the protocol typically will not require that each participating node generate an acknowledgement message to the coordinating node, although such can readily be implemented. Before a forget log is written at the coordinating node, a transaction can be in the committed state, but not yet in the forgotten state. Similarly, a participating node can have prepared to commit, and yet not have received the outcome message from the coordinating node. If a crash occurs before the transactions are resolved, the interrupted transactions cannot readily be traced and replayed under the prior art two-phase recovery procedure. Moreover, the transaction may have been committed at one node, and not at another, resulting in database inconsistency across the nodes. What is needed is a process by which a given transaction can be traced to the point of interruption, and also may be "resurrected" for completion.
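By way of illustration only, the two unresolved states described above can be recognized from the last log record a node holds for each transaction. The following sketch assumes log records represented as (kind, transaction id) pairs; the function name unresolved and the state labels are hypothetical.

```python
def unresolved(log, is_coordinator):
    """Return {transaction id: state} for transactions the log leaves unresolved."""
    last = {}
    for kind, txn in log:
        last[txn] = kind            # keep only the most recent record per transaction

    states = {}
    for txn, kind in last.items():
        if kind == "prepare":
            # A participating node voted "yes" but never logged an outcome.
            states[txn] = "prepared, outcome unknown"
        elif kind == "commit" and is_coordinator:
            # The coordinating node logged "commit" but never wrote "forget".
            states[txn] = "committed, not forgotten"
    return states


coordinator_log = [("commit", 1), ("forget", 1), ("commit", 2)]
participant_log = [("prepare", 1), ("commit", 1), ("prepare", 2)]
coordinator_states = unresolved(coordinator_log, is_coordinator=True)
participant_states = unresolved(participant_log, is_coordinator=False)
```

Here transaction 2 is committed at the coordinating node yet still undecided at the participating node, which is the cross-node inconsistency a crash at this point would leave behind.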
It is therefore an objective of the invention to provide an improved crash recovery mechanism for database recovery across multiple nodes.
It is another objective of the invention to provide crash recovery which can effectively identify and resolve interrupted transactions.
Yet another objective of the invention is to provide a mechanism by which the database can be accessed before completion of the crash recovery process.