1. Field of the Invention
This invention relates to failure recovery in a computer system and more particularly relates to computer failure recovery in a transactional processing system.
2. Description of the Related Art
Computer systems that perform transactional processing often consist of a cluster of computers that are logically connected to each other through a shared memory controller and that share disks and data. Such clusters often support high transaction rates and high availability for on-line transaction processing (OLTP) and other applications. A cluster of multiple computers may execute both on-line transactions and non-interactive work. Non-interactive work, such as batch jobs that include updates, can share data concurrently with on-line transaction processing. Multiple batch jobs and on-line transactions can be run against the same files. The computer system ensures data reliability and availability for batch updates, while the OLTP server ensures data reliability and availability for on-line updates. A computer or OLTP server may lock a resource, such as a portion of a disk, while accessing that resource.
A computer generally provides a recovery function that automatically restores updated resources to their before-update states and releases resource locks. This recovery function is generally initiated following the termination of a batch job conducting transactional processing. The recovery function uses a system undo log, recorded before resources were changed, to back out transactions that were active at the time of failure. Unfortunately, when the computer itself fails, recovery can take a long time and is not automatic. In-flight transaction updates can thus remain in place for a long time, making locked resources unavailable to on-line transaction processing and to other non-interactive jobs on active peer computers in the cluster. In such cases, even an active peer computer cannot back out the in-flight transaction updates of the failed computer, because the peer computer cannot normally access the private undo log maintained by the failed computer. Furthermore, the failed computer may try to restart by itself, compounding the recovery problem.
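By way of illustration only, the undo-log backout described above may be sketched as follows. The sketch is a simplified, single-computer model under stated assumptions: the names ResourceStore, UndoRecord, update, and back_out are hypothetical and are not drawn from any particular system, and resources are modeled as entries in an in-memory dictionary rather than as portions of a disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UndoRecord:
    """Hypothetical undo-log entry: the before-update state of one resource."""
    txn_id: str
    resource: str
    before_image: Optional[str]  # None means the resource did not exist yet


@dataclass
class ResourceStore:
    """Hypothetical model of one computer's resources, locks, and undo log."""
    data: Dict[str, str] = field(default_factory=dict)
    locks: Dict[str, str] = field(default_factory=dict)  # resource -> txn_id
    undo_log: List[UndoRecord] = field(default_factory=list)

    def update(self, txn_id: str, resource: str, new_value: str) -> None:
        # Lock the resource for this transaction, or fail if another holds it.
        holder = self.locks.get(resource)
        if holder is not None and holder != txn_id:
            raise RuntimeError(f"{resource} is locked by {holder}")
        self.locks[resource] = txn_id
        # Record the before-update state BEFORE changing the resource.
        self.undo_log.append(
            UndoRecord(txn_id, resource, self.data.get(resource)))
        self.data[resource] = new_value

    def back_out(self, txn_id: str) -> None:
        # Apply the transaction's undo records newest-first so that the
        # oldest before-image is the one that finally remains.
        for rec in reversed(self.undo_log):
            if rec.txn_id != txn_id:
                continue
            if rec.before_image is None:
                self.data.pop(rec.resource, None)
            else:
                self.data[rec.resource] = rec.before_image
        # Discard the applied records and release the transaction's locks.
        self.undo_log = [r for r in self.undo_log if r.txn_id != txn_id]
        for res in [r for r, t in self.locks.items() if t == txn_id]:
            del self.locks[res]
```

In this model, the undo log and the lock table are both held by the computer that made the updates. If that computer fails with a transaction in flight, the locks remain set until back_out is run, and a peer computer without access to undo_log has no before-update states from which to perform the backout.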
What is needed is a method, apparatus, and system that allow recovery from a computer failure to be performed expeditiously by one and only one peer computer, enable the peer computer to access log records privately held by the failed computer for a transaction backout, and prevent the failed computer from restarting until after the peer recovery is complete. Beneficially, such a method, apparatus, and system would accelerate computer failure recovery.