1. Field of the Invention
The present invention relates generally to computer programs that must operate reliably and continuously as they interact with remote computer programs over a communication network, and more particularly to a method for providing fault tolerance by the unification of replication and transaction processing.
2. Description of the Background Art
With the increase in business-to-business interactions over the Internet, many computer systems must provide reliable and continuous operation despite faults. Traditional fault tolerance for computer systems has focused on protection of the processing operations of individual computers against faults. Many business computer systems use transaction processing to protect the data of the computer system against faults. In the event of a fault that prevents a transaction from being completed (i.e., committed), the transaction is aborted and the data are restored to their state at the start of the transaction.
Transaction processing protects the data of the computer system by ensuring that the data are left in a consistent final state, after the transaction commits, or a consistent initial state, after the transaction aborts, but not in an inconsistent partially processed state. In the event of a fault, all of the processing that the transaction performs on the data is lost and the client that initiated the transaction can retry the transaction.
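The commit-or-abort behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory data store and a hypothetical `Transaction` helper; neither name comes from the art described here, and a real transaction processing system would use logging and locking rather than a full snapshot.

```python
import copy


class Transaction:
    """Hypothetical all-or-nothing wrapper around an in-memory dict."""

    def __init__(self, store):
        self.store = store
        self.snapshot = None

    def __enter__(self):
        # Record the consistent initial state before any processing.
        self.snapshot = copy.deepcopy(self.store)
        return self.store

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Abort: discard all processing and restore the initial state.
            self.store.clear()
            self.store.update(self.snapshot)
        # Commit needs no action here: the updates are already in place.
        return False  # let the fault propagate so the client can retry


def transfer(accounts, src, dst, amount):
    """Move `amount` between two accounts inside one transaction."""
    with Transaction(accounts) as data:
        data[src] -= amount
        if data[src] < 0:
            # Simulated fault in the middle of the transaction.
            raise ValueError("insufficient funds")
        data[dst] += amount
```

After a successful call the store holds the consistent final state; after a failed call it holds the consistent initial state, and the partially processed state is never left visible to later readers.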
Transactions work well for computer systems that act as servers to human clients. The human client can understand that a transaction has been aborted, that the processing has been lost, and that the transaction must be retried from the start. The human client can also understand that, when the transaction is retried, the results of the transaction might be different from the results that would have been obtained had the first attempt succeeded.
However, transactions are less effective when two computers, within different enterprises or within different divisions of the same enterprise, interact with each other over a communication network, such as the Internet or a virtual private network. The computer that is acting as the client does not have the intelligence of a human client. It is difficult to program the client computer to act appropriately when a transaction is aborted by the server, and it is difficult to program the client computer to handle any differences that might result between its first attempt to use the server and its retry after an abort.
In theory, it is possible to include the client computer of one enterprise and the server computer of another enterprise in a single distributed transaction. The transaction is initiated by the client computer, which acts as the coordinator of the transaction. If the client fails, or if communication between the two computers is lost at a critical moment during the committing of the transaction, the server cannot decide on its own whether to commit or abort, and it hangs until the client recovers. Consequently, in practice, distributed transactions that span enterprises are not used.
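The hang described above can be sketched as follows. The sketch assumes a two-phase-commit-style protocol, which the text does not name, and uses hypothetical `Participant` and `run_commit` names; it models only the server's state, not real network communication.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical server taking part in a client-coordinated transaction."""

    def __init__(self):
        self.state = State.ACTIVE

    def prepare(self):
        # After voting yes, the server may no longer decide on its own.
        self.state = State.PREPARED
        return True

    def receive_decision(self, decision):
        if decision is None:
            # Coordinator failed or the link was lost: no decision arrives.
            return
        self.state = State.COMMITTED if decision else State.ABORTED

    def is_blocked(self):
        # A prepared server holds its resources until a decision arrives.
        return self.state is State.PREPARED


def run_commit(server, coordinator_crashes):
    """Run one commit attempt; the client acts as the coordinator."""
    vote = server.prepare()
    decision = None if coordinator_crashes else vote
    server.receive_decision(decision)
    return server.state
```

In this sketch, a crash of the coordinating client between the prepare step and the decision leaves the server in the prepared state indefinitely, which is the critical moment referred to above.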
In the current state of the art, a cluster of computers can participate in a transaction, communicating over a network. However, the transaction is under the control of a central transaction coordinator. Similarly, the technology exists to allow several copies of a database to coexist, but a central controller must designate one of those copies as the primary copy and the other copies as backup copies. Technology also exists that allows several computers to participate in a transaction, potentially over a network, without a central coordinator; rather, any processor can act as coordinator for the transactions that it initiates. However, each transaction is still managed by only one coordinator. Consequently, in an activity that spans the computers of several enterprises, each enterprise must allow transactions within its computers to be managed by a coordinator on a computer of another enterprise. Most enterprises would not permit other enterprises to use their computers in such a manner.
Therefore, a need exists for a fault tolerance technology that does not rely on a central controller to control activities that span the computers of several enterprises. The present invention satisfies that need, as well as others, and overcomes deficiencies in the current state of the art of fault tolerance technology.