Modern computer systems often involve multiple individual processors, or nodes, which are interconnected via a communication network. Large amounts of information are often stored and processed in such systems. In addition to processing equipment, each node typically has digital storage devices (e.g., magnetic disks) for storing the information. The information is often arranged as a database that occupies the available storage space at the various nodes in the system.
The techniques employed for arranging the storage of, and access to, a database in a computer system with multiple nodes depend on the requirements of the specific system. However, certain requirements are common to most systems. All data in the database should be available for access from any node in the system. The storage overhead and processing overhead must be kept to a minimum to allow the system to operate efficiently, and the storage/access strategy must generally be immune to a failure occurring at any one node.
Two general techniques for database storage, or partitioning, are employed in modern systems. The first, data sharing, involves providing physical access to all disks from each node in the system. However, to maintain coherency of the database, global locking or change lists are necessary to ensure that no two nodes inconsistently change a portion of the database.
The second technique involves physically partitioning the data and distributing the resultant partitions to owner nodes in the system, each of which becomes responsible for transactions involving its own corresponding partition.
This "shared nothing" architecture requires additional communication overhead to offer access to all of the data to all nodes. A requesting node must issue database requests to the owner node. The owner node then either: (i) performs the requested database operation on its corresponding partition (i.e., function shipping) or (ii) transfers the data itself to the requesting node (i.e., I/O shipping).
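The difference between function shipping and I/O shipping can be illustrated with a minimal sketch. The `Node` class, the modulo placement rule in `owner_of`, and the sample rows are hypothetical and are introduced only for illustration; they are not drawn from any particular database system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical shared-nothing node that owns one partition."""
    name: str
    partition: dict = field(default_factory=dict)  # key -> row

    def function_ship(self, predicate):
        # The owner evaluates the request locally; only matching rows travel back.
        return [row for row in self.partition.values() if predicate(row)]

    def io_ship(self):
        # The owner returns its raw data; the requester does the processing.
        return list(self.partition.values())


def owner_of(key, nodes):
    # Assumed placement rule: partition by key modulo the node count.
    return nodes[key % len(nodes)]


def is_large(row):
    return row["balance"] > 100


nodes = [Node("n0"), Node("n1")]
for key, balance in [(0, 50), (1, 500), (2, 900), (3, 20)]:
    owner_of(key, nodes).partition[key] = {"id": key, "balance": balance}

# Function shipping: each owner filters its own partition.
shipped_results = [row for n in nodes for row in n.function_ship(is_large)]

# I/O shipping: each owner sends every row; the requesting node filters.
shipped_rows = [row for n in nodes for row in n.io_ship()]
local_results = [row for row in shipped_rows if is_large(row)]
```

Both paths yield the same answer. In this sketch, function shipping moves only the two matching rows across the network, whereas I/O shipping moves all four rows and leaves the filtering to the requesting node.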
A problem with the shared nothing approach is the potential for failure at any one node and the resultant inability of that node to accept or process database requests relating to its partition.
Two principal methods are currently known for recovery from a node failure in a shared nothing database system: (i) asynchronous replication, where updates to the data are sent to a replica asynchronously (see e.g., "An Efficient Scheme for Providing High Availability," A. Bhide, A. Goyal, H. Hsiao and A. Jhingran, SIGMOD '92, pgs. 236-245, incorporated herein by reference); and (ii) recovery on a buddy node to which the disks of the failed node are twin-tail connected. Twin-tailing disk units to buddy processing nodes is known in the art, and involves a physical connection between a single disk and more than one processing node. In one mode of twin-tailing, only one node is active and accesses the disk at any one time. In another mode of twin-tailing, both nodes are allowed to access the disk simultaneously, and conflict prevention/resolution protocols are provided to prevent data corruption.
The primary advantage of method (i) is that it can recover from either disk or node failures. Its primary disadvantages are that the data is mirrored, consuming twice the disk capacity, and that propagating data to the replica imposes overhead during normal, failure-free operation. The primary advantage of method (ii) is that there is no overhead during normal operation. Its primary disadvantage is that, after a failure, twice the load is imposed on the buddy node. This can halve the throughput of the entire cluster, because query scans or transaction function calls directed to the buddy node of the failed node become the bottleneck for the entire cluster.
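The halving of throughput under method (ii) can be checked with a simple model. The sketch below assumes that every node processes data at the same rate, that each node normally serves one equally sized partition, and that a parallel scan completes only when the slowest node finishes. The function names and the model itself are illustrative assumptions, not measurements.

```python
def scan_time(partitions_per_node, rate=1.0):
    """Time for a parallel scan; the most heavily loaded node sets the pace."""
    return max(count / rate for count in partitions_per_node)


def relative_throughput_after_failure(num_nodes):
    """Cluster scan throughput after one node fails, relative to normal operation."""
    if num_nodes < 2:
        raise ValueError("a buddy pair needs at least two nodes")
    normal = [1] * num_nodes
    # One node has failed; its buddy now serves two partitions.
    degraded = [1] * (num_nodes - 2) + [2]
    return scan_time(normal) / scan_time(degraded)


ratio_8 = relative_throughput_after_failure(8)
ratio_64 = relative_throughput_after_failure(64)
```

Under these assumptions the ratio is one half regardless of cluster size, since the buddy node alone determines when the scan completes.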
What is required, therefore, is a technique for recovery from a processing node failure in a shared nothing database processing system, which does not incur significant processing overhead during normal operation, or storage space overhead for full data replication.