1. Field of the Invention
The present invention relates to computer network systems. More particularly, it relates to high-availability on-line transaction processing systems.
2. The Prior Art
Alsberg and Day [1976] introduced the idea of "System Pairs" to protect against simultaneous faults through duplication and geographical separation of on-line systems. They reasoned that weather, power failures, computer operators and sabotage were unlikely to fault both systems at the same time. A fundamental problem with System Pairs is that when the backup system stops receiving data from the primary system, there is no automatic way to determine whether the primary system failed, the circuit(s) connecting the primary and backup systems failed, or both.
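The ambiguity described above can be illustrated with a minimal sketch. The function name and the two-state model (each component either operable or failed) are assumptions made for illustration only and do not appear in the cited work. The sketch enumerates every combination of primary-system and circuit state and groups the combinations by what the backup system observes.

```python
from itertools import product

def backup_receives_data(primary_up: bool, circuit_up: bool) -> bool:
    """Hypothetical model: data reaches the backup only if the primary
    system and the connecting circuit are both operable."""
    return primary_up and circuit_up

# Group every (primary_up, circuit_up) combination by the backup's observation.
observations = {}
for primary_up, circuit_up in product([True, False], repeat=2):
    seen = backup_receives_data(primary_up, circuit_up)
    observations.setdefault(seen, []).append((primary_up, circuit_up))

# Three distinct fault conditions produce the same observation (no data),
# so the backup cannot tell them apart from silence alone.
for state in observations[False]:
    print("no data received; (primary_up, circuit_up) =", state)
```

Under this model, a circuit failure alone, a primary failure alone, and a failure of both all look identical to the backup system.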
Gray and Reuter (Transaction Processing: Concepts and Techniques, 1993) proposed that the computer operators at the primary and backup sites consult one another to determine whether the primary system failed. If it did, the operator at the backup site instructs the backup system to take over the role of the primary system. While the operators sort out the problem, the system may be down. If the operators cannot communicate for any reason, for example, because of fire, flood or earthquake, the system may be down for a longer period of time.
Digital Equipment Corp. markets a "Disaster-Tolerant System," which requires computer operator action or an additional computer at a third site to cast the deciding "vote" if the primary and backup systems cannot communicate. In addition, the primary and backup sites must be located within 24.8 miles of each other. This does not provide enough geographical separation to protect both sites against the effects of an earthquake, hurricane or flood.
Another important issue affecting availability is how the client computers communicate with the primary and backup systems. The system shown by Gray and Reuter depicts the client computers with separate connections to the primary and backup systems. The client computer sends all its messages to the primary system while the primary system is operable. If the primary system fails, the backup system takes over and the client communicates with the backup system via its connection to the backup system. Unfortunately, if the circuit to the primary system fails while the primary system remains operable, the client is denied service.
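The denial-of-service case described above can be sketched as follows. This is a simplified model, not code from Gray and Reuter: the function name, its parameters, and the assumption that the backup system takes over only when the primary system itself has failed are all hypothetical and introduced for illustration.

```python
def client_served(primary_up: bool, backup_up: bool,
                  circuit_to_primary_up: bool,
                  circuit_to_backup_up: bool) -> bool:
    """Hypothetical model of the prior-art client connection scheme."""
    if primary_up:
        # The primary keeps its role, so the client can be served only
        # over its circuit to the primary system.
        return circuit_to_primary_up
    # Assumption: the backup takes over only after the primary system fails;
    # the client then uses its separate circuit to the backup system.
    return backup_up and circuit_to_backup_up

# Primary system fails: the backup takes over and the client is served.
print(client_served(False, True, True, True))   # True
# Circuit to the primary fails while the primary is operable:
# the backup never takes over, so the client is denied service.
print(client_served(True, True, False, True))   # False
```

In the second call, both systems and the circuit to the backup system are operable, yet the client receives no service because nothing triggers a takeover.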