This invention relates generally to database transactions on fault-tolerant multi-processor systems. In particular, this invention relates to methods for flushing in the commit phase of database transactions on cluster computer systems.
FIG. 1 illustrates a network node 100 in a multi-node system of the prior art. In FIG. 1, the node 100 includes loosely coupled processors 110 containing execution spaces 120 connected by a bus 130. The system 100 is a flat arrangement of the processors 110.
This bus-and-processor arrangement constitutes a single network node 100 on a network 140. The constituent processors 110 of the network node 100 have no shared memory processor (SMP) characteristics, e.g., memory sharing between some of the processors 110, and have no separate network presence.
The systems 100 and a subset of the processes thereon cooperate to provide a transaction service. The transaction service includes three elements: a commit coordinator, a resource manager and a log. Each of the elements is a fault-tolerant process pair having primary and backup processes.
The primary and backup of each process pair are located at the same network address, i.e., at the address of the single network node 100 running both processes. Thus, for example, if the node 100 of the primary commit coordinator process becomes unavailable to the network 140, the backup commit coordinator process goes offline as well. Process pairs implementing transaction services are described in the book entitled "TRANSACTION PROCESSING: CONCEPTS AND TECHNIQUES", by Gray et al., 1993, Morgan Kaufmann Publishers, Inc., San Mateo, Calif., at pages 132-138.
A standard two-phase commit algorithm is described at pages 562-568 of the above-referenced book by Gray et al. The two-phase commit algorithm involves the following steps:
PREPARE: send a flush broadcast invoking each resource manager involved in the transaction to vote on whether to commit;
DECIDE: collect the flush results of the voting and, if all vote yes, write the transaction commit log record;
COMMIT: invoke each involved resource manager telling it the commit decision; and
COMPLETE: when all acknowledge the commit message, force-write a commit completion record to the log.
The prepare phase is also called phase 1 of the commit, and the commit phase is called phase 2.
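By way of illustration only, the four steps listed above may be sketched as follows. The sketch is not taken from the Gray et al. reference or from any prior art system. The names ResourceManager, prepare, finish and two_phase_commit are hypothetical, an in-memory list stands in for the durable log, and the "abort" record written when a vote fails is an assumption, since the steps above describe only the commit case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceManager:
    """Hypothetical resource manager that votes, then applies the decision."""
    name: str
    will_vote_yes: bool = True
    outcome: str = "pending"

    def prepare(self) -> bool:
        # Flush the transaction's work and vote on whether to commit.
        return self.will_vote_yes

    def finish(self, commit: bool) -> bool:
        # Apply the coordinator's decision and acknowledge it.
        self.outcome = "committed" if commit else "aborted"
        return True


def two_phase_commit(managers: List[ResourceManager], log: List[str]) -> bool:
    # PREPARE: broadcast the flush request and gather every vote.
    votes = [rm.prepare() for rm in managers]
    # DECIDE: commit only if all vote yes, and record the decision.
    # (Writing an "abort" record is an assumption of this sketch.)
    commit = all(votes)
    log.append("commit" if commit else "abort")
    # COMMIT: tell each involved resource manager the decision.
    acks = [rm.finish(commit) for rm in managers]
    # COMPLETE: once all acknowledge, write the completion record.
    if all(acks):
        log.append("complete")
    return commit
```

In this sketch a single "no" vote in the prepare phase causes every resource manager to be told to abort, and the completion record is written only after every acknowledgement has been collected.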
In a prior art system, a primary and a backup commit coordinator are both located on a single network node. Any processor failure or other node-related failure causes the entire node to become inoperative, i.e., the granularity of failure is the entire node. The sharing of a network address between the primary and backup commit coordinator processes in the prior art system 100 prevents that system from being non-blocking, because a failure of the node at the shared network address disables the commit operation. The flushing of resource managers in such an arrangement is not truly non-blocking in the classic network sense.
Accordingly, one goal of the invention is a transaction processor in which processors are either connected to each other using SMP memory sharing with tightly-coupled synchronization primitives (first tier) or connected across the network (second tier).
Such a configuration is two-tiered, with "near processor" and "far processor/node" relationships. The prior art configuration has two execution space contexts: here and there. The new configuration has three execution contexts: here, near-there, and far-there.
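By way of illustration only, the three execution contexts may be distinguished as sketched below. The ExecutionSpace type, its node and processor fields, and the context function are hypothetical names introduced for this sketch. It assumes that an execution space is identified by its network node and by a processor index within that node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionSpace:
    node: str       # network node hosting the processor (hypothetical identifier)
    processor: int  # processor index within that node


def context(src: ExecutionSpace, dst: ExecutionSpace) -> str:
    """Classify the relationship between two execution spaces."""
    if src.node != dst.node:
        return "far-there"   # second tier: reached across the network
    if src.processor != dst.processor:
        return "near-there"  # first tier: SMP memory sharing within the node
    return "here"            # the same execution space
```

For example, two processors on the same node are classified as near-there, while processors on different nodes are classified as far-there regardless of their processor index.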
According to one aspect of the invention, a transaction service includes a three-phase algorithm requiring a backup commit coordinator process at a different network location than the primary.
According to one aspect of the invention, the primary and backup commit coordinator processes in the process pair are executing on different nodes having different network processes. Upon receiving the flush results, the primary commit coordinator synchronizes the results to the backup commit coordinator utilizing a network message system so that the flush results are durably recorded at separate network nodes. Thus, the failure of any systems on either node will not result in a loss of the flush results.
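By way of illustration only, the synchronization of flush results between the primary and backup commit coordinators may be sketched as follows. The Coordinator type and the sync_flush_results function are hypothetical. A direct method call stands in for the network message system, and an in-memory dictionary stands in for durable storage, so the sketch shows only the ordering and the separate-node requirement, not an actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Coordinator:
    """Hypothetical commit coordinator process attached to one network node."""
    node: str
    flush_results: Dict[str, Dict[str, bool]] = field(default_factory=dict)

    def record(self, txn: str, results: Dict[str, bool]) -> None:
        # Keep a private copy of the votes for this transaction.
        self.flush_results[txn] = dict(results)


def sync_flush_results(primary: Coordinator, backup: Coordinator,
                       txn: str, results: Dict[str, bool]) -> None:
    # The pair must not share a node, or one node failure loses both copies.
    if primary.node == backup.node:
        raise ValueError("primary and backup must run on different nodes")
    primary.record(txn, results)
    # Stand-in for a network message carrying the results to the backup.
    backup.record(txn, results)
```

After synchronization in this sketch, the votes for a transaction remain available at the backup even if the primary's copy is lost.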
According to another aspect of the invention, all processors in the node are coupled to a shared memory. Messages between processors in a node are implemented by memory copying. Each processor has an associated execution space in the shared memory, with processes being attached to an execution space. During synchronization, the messages are transferred from the execution space having the primary commit coordinator attached in a first node to the execution space having the backup commit coordinator attached in a second node.
According to another aspect of the invention, all processes of a transaction service are implemented as process pairs having primary and backup processes executing on different nodes having a different network presence.
Other features and advantages of the invention will be apparent in view of the following detailed description and appended drawings.