In data processing systems, access and updates to system resources are typically carried out by the execution of discrete transactions (or units of work). A transaction is a sequence of coordinated operations on system resources such that either all of the changes take effect or none of them does. These operations are typically changes made to data held in storage in the transaction processing system; system resources include databases, data tables, files, data records and so on. This characteristic of a transaction being accomplished as a whole or not at all is also known as atomicity.
In this way, resources are prevented from becoming inconsistent with each other. If one of the set of update operations fails, then the others must also not take effect. A unit of work thus transforms a consistent state of resources into another consistent state, without necessarily preserving consistency at all intermediate points.
The atomic nature of transactions is maintained by means of a transaction synchronization procedure commonly called the commit procedure. Logical points of consistency at which resource changes are synchronized within transaction execution are called commit points or syncpoints. An application ends a unit of work by declaring a syncpoint, or by the application terminating.
Atomicity of a transaction is achieved by resource updates made within the transaction being held in-doubt (uncommitted) until a syncpoint is declared at completion of the transaction. If the transaction succeeds, the results of the transaction are made permanent (committed); if the transaction fails, all effects of the unsuccessful transaction are removed (backed out). That is, the resource updates are made permanent and visible to applications other than the one which performed the updates only on successful completion. For the duration of each unit of work, all updated resources must therefore be locked to prevent further update access. Conversely, when a transaction backs out (or rolls back), the resources are restored to the consistent state which existed before the transaction began.
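By way of illustration only, the commit and backout behaviour described above can be sketched as follows. The class name `UnitOfWork`, its methods, and the in-memory dictionary standing in for a protected resource are all hypothetical and are not taken from any particular transaction processing system; locking is omitted for brevity.

```python
class UnitOfWork:
    """Hypothetical unit of work: updates stay uncommitted until a syncpoint."""

    def __init__(self, resources):
        self.resources = resources  # dict standing in for a protected resource
        self.pending = {}           # uncommitted updates, invisible to others

    def update(self, key, value):
        # Record the change without touching the shared resource.
        self.pending[key] = value

    def syncpoint(self):
        # Commit: make every pending update permanent and visible.
        self.resources.update(self.pending)
        self.pending.clear()

    def backout(self):
        # Back out: discard every pending update; the resource is unchanged.
        self.pending.clear()


accounts = {"A": 100, "B": 50}

# A successful unit of work: both updates take effect together.
transfer = UnitOfWork(accounts)
transfer.update("A", 70)
transfer.update("B", 80)
visible_before_syncpoint = dict(accounts)
transfer.syncpoint()
committed_state = dict(accounts)

# A failed unit of work: neither update takes effect.
failed = UnitOfWork(accounts)
failed.update("A", 0)
failed.update("B", 150)
failed.backout()
```

In this sketch, other readers of `accounts` see the original balances until `syncpoint()` is called, and a call to `backout()` leaves the committed balances untouched.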
There are a number of different transaction processing systems commercially available; an example of an on-line transaction processing system is the CICS system developed by International Business Machines Corporation (IBM is a registered trademark and CICS is a trademark of International Business Machines Corporation).
In a transaction processing system which either includes only a single node where transaction operations are executed, or which permits such operations to be executed at only one node during any transaction, atomicity is enforced by a single-phase synchronization operation. When the transaction is completed, the node, in a single phase, either commits to make the changes permanent or backs out.
In distributed systems encompassing a multiplicity of nodes, a transaction may cause changes to be made at more than one of those nodes. In such a system, atomicity can be guaranteed only if all of the nodes involved in the transaction agree on its outcome. A simple example is a financial application to carry out a funds transfer from one account to another account in a different bank, thus involving two basic operations to critical resources: the debit of one account and the credit of the other. It is important to ensure that either both of these operations take effect or neither does.
Distributed systems typically use a transaction synchronization procedure called the two-phase commit (2PC) protocol to guarantee atomicity. In this regard, assume that a transaction ends successfully at an execution node and that all node resource managers (or agents) are requested to commit operations involved in the transaction. In the first phase of the protocol (prepare phase), all involved agents are requested to prepare to commit. In response, the agents individually decide, based upon local conditions, whether to commit or back out their operations. The decisions are communicated to a synchronization location, called the coordinator, where the votes are counted. In the second phase (commit phase), if all agents vote to commit, a request to commit is issued, in response to which all of the agents commit their operations. On the other hand, if any agent votes to back out its operation, all agents are instructed to back out their operations. In a large system with a high volume of transactions, the two-phase commit process may arrange the agents in a tree-like manner in which one of a subset of agents acts as an intermediary, coordinating the votes of the subset and sending a combined vote to the main coordinator.
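The two phases described above can be sketched, purely as an illustration, in the following way. The names `Vote`, `Agent` and `two_phase_commit` are hypothetical, the agents run in a single process, and logging, timeouts and message transport are deliberately left out.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    BACKOUT = "backout"


class Agent:
    """Hypothetical resource manager taking part in a unit of work."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # stands in for "local conditions"
        self.state = "active"

    def prepare(self):
        # Phase 1: decide locally and report the vote to the coordinator.
        if self.can_commit:
            self.state = "prepared"
            return Vote.COMMIT
        return Vote.BACKOUT

    def commit(self):
        self.state = "committed"

    def backout(self):
        self.state = "backed_out"


def two_phase_commit(agents):
    """Coordinator: collect all votes, then issue one outcome to every agent."""
    votes = [agent.prepare() for agent in agents]
    unanimous = all(vote is Vote.COMMIT for vote in votes)
    decision = Vote.COMMIT if unanimous else Vote.BACKOUT
    # Phase 2: every agent receives the same instruction.
    for agent in agents:
        if decision is Vote.COMMIT:
            agent.commit()
        else:
            agent.backout()
    return decision


debit, credit = Agent("debit"), Agent("credit")
all_yes = two_phase_commit([debit, credit])

debit2, credit2 = Agent("debit"), Agent("credit", can_commit=False)
one_no = two_phase_commit([debit2, credit2])
```

In the second run a single vote to back out is enough to make the coordinator instruct every agent to back out, which is the property that keeps the debit and the credit consistent with each other.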
Distributed systems are organized in order to be largely recoverable from system failures, either communication failures or node failures. A communication failure and a failure in a remote node generally manifest themselves by the cessation of messages to one or more nodes. Each node affected by the failure can detect it by various mechanisms, including a timer in the node which detects when a unit of work has been active for longer than a preset maximum time. A node failure is typically due to a software failure requiring restarting of the node or a deadlock involving pre-emption of the transaction running on the node.
System failures are managed by a recovery procedure requiring resynchronization of the nodes involved in the unit of work. Since a node failure normally results in the loss of information in volatile storage, any node that becomes involved in a unit of work must write state changes (checkpoints) to non-volatile storage synchronously with the transmission of messages during the two-phase commit protocol. These checkpoint data (or log messages) are written to a stable storage medium as the protocol proceeds to allow the same protocol to be restarted from a consistent state in the case of a failure of the node. This is known as resynchronization.
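A minimal sketch of such logging, assuming a plain append-only file as the stable storage medium, is given below. The function names `force_log` and `last_logged_states`, the JSON record layout and the state names are assumptions made for illustration and do not describe the log format of any actual system.

```python
import json
import os
import tempfile


def force_log(path, uow_id, state):
    """Append a state change and force it to disk before any message is sent."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"uow": uow_id, "state": state}) + "\n")
        log.flush()
        os.fsync(log.fileno())  # do not return until the record is on disk


def last_logged_states(path):
    """After a restart, rebuild the last known state of each unit of work."""
    states = {}
    if not os.path.exists(path):
        return states
    with open(path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            states[record["uow"]] = record["state"]
    return states


with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "syncpoint.log")
    force_log(log_path, "uow-1", "prepared")
    force_log(log_path, "uow-1", "committed")
    force_log(log_path, "uow-2", "prepared")
    # Simulate a restart: volatile state is gone, only the log remains.
    recovered = last_logged_states(log_path)
```

In this sketch `uow-2` is recovered in the prepared state, so the restarted node would know that it must still learn the outcome of that unit of work from its coordinator.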
U.S. Pat. No. 5,311,773 describes how a commit procedure can be resynchronized asynchronously after a failure while allowing an initiating application to proceed with other tasks. It does not, however, address the problem of interruption of communication to multiple partner nodes involved in a distributed unit of work.
The IBM Systems Network Architecture (IBM SNA) LU 6.2 syncpoint architecture, developed by International Business Machines Corporation, is known to coordinate commits between two or more protected resources. The LU 6.2 architecture supports a syncpoint manager (SPM) which is responsible for resource coordination, syncpoint logging and recovery. A description of the communication protocol used in this architecture is found in the book "SNA Peer Protocols for LU6.2" (ref. SC31-6868-1, IBM Corporation).
A problem with known protocols for two-phase commit across networks is that they do not cater adequately for the case where contact with the coordinator of the unit of work is lost. In such cases, it is not possible to immediately tell other partners of the distributed unit of work what the outcome is. The decision is only known later when contact is made with the coordinator.
If contact is lost, partners can be kept waiting indefinitely until contact is re-established. Each of the partners may hold resource locks and keep application code and end users waiting for a long time. Operator action is then required to release locks, applications and end user screens.
A known solution to this problem is to break the communication with the partners and to enter a timed retry loop between all partners. This prior art approach has some drawbacks. Retry loops are very inefficient, particularly where many agents issue them or where they are issued frequently. In addition, operational problems can arise from breaking communications.
A delay in the resolution of the unit of work outcome is produced, dependent on the timing of the retry loop, causing a considerable reduction in concurrency of resource update processing (particularly if many resources are involved). At the restart of a node, there may be many resynchronization tasks which can overload the system; if many communicating nodes are restarted simultaneously, deadlock can occur.
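The timed retry loop discussed above can be sketched as follows, to make its cost visible. The function `resolve_with_retry`, its parameters and the simulated coordinator are hypothetical; the `sleep` argument is injected only so that the example runs without actually waiting.

```python
import time


def resolve_with_retry(ask_coordinator, interval_seconds, max_attempts,
                       sleep=time.sleep):
    """Poll the coordinator for the outcome, sleeping between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ask_coordinator(), attempt
        except ConnectionError:
            if attempt < max_attempts:
                sleep(interval_seconds)  # locks stay held during this wait
    return None, max_attempts  # outcome still unknown


# Simulated coordinator: unreachable twice, then reports the outcome.
replies = iter([ConnectionError(), ConnectionError(), "commit"])


def flaky_coordinator():
    reply = next(replies)
    if isinstance(reply, Exception):
        raise reply
    return reply


waits = []
outcome, attempts = resolve_with_retry(
    flaky_coordinator, interval_seconds=30, max_attempts=5, sleep=waits.append
)
```

In this sketch the partner learns the outcome only on its third attempt, after two full retry intervals, and would hold its locks throughout. A shorter interval reduces the delay but multiplies the number of polls, which is the trade-off noted above.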