1. Field of the Invention
The invention relates to fault-tolerant computer systems and, in particular, to replica coordinators for such fault-tolerant systems.
2. Description of the Related Art
A need exists for "fault-tolerant" computer systems that can continue to operate despite a failure of one of their components. Examples include inter-bank electronic funds-transfer systems and airline flight reservation systems. To achieve fault-tolerance, these systems typically employ redundancy in the components that are likely to fail, i.e., they employ "replicas" and replace a failed component with a non-faulty replica ("spare"). These systems coordinate the replicas so all replicas are in the same state and, therefore, the spare is able to take over for the failed component. Coordinating these replicas poses a key problem in the design of a fault-tolerant system.
One approach to replica coordination, known as the "state machine approach," calls for each replica to be a deterministic state machine that reads a sequence of commands, each command causing a state transition that is completely determined by the command and the current state of the machine. The state transitions can produce outputs to an environment, e.g., I/O requests. Each replica of the state machine starts in the same state and reads an identical sequence of commands. Each replica, therefore, undergoes an identical sequence of state transitions and produces an identical sequence of outputs. This approach ensures that, when a failure occurs, the spare is in the same state as the failed component at the time of, or just prior to, the failure. The spare can therefore interact with the environment in a manner consistent with past interactions between the now-failed component and the environment.
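The state machine approach described above can be illustrated with a minimal sketch in Python, assuming a hypothetical replicated state consisting of a single account balance. The class name Replica, the command names "deposit" and "withdraw", and the helper run are assumptions introduced only for illustration and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Deterministic state machine: the next state depends only on
    the current state and the command being applied."""
    balance: int = 0  # hypothetical state: one account balance

    def apply(self, command):
        """Apply one command and return the output destined for the environment."""
        op, amount = command
        if op == "deposit":
            self.balance += amount
        elif op == "withdraw" and amount <= self.balance:
            self.balance -= amount
        # Any other command (or an overdraft) leaves the state unchanged.
        return f"balance={self.balance}"


def run(commands):
    """Start a replica in the initial state and feed it the command sequence."""
    replica = Replica()
    outputs = [replica.apply(c) for c in commands]
    return replica, outputs


# Both replicas start in the same state and read an identical command sequence.
commands = [("deposit", 100), ("withdraw", 30), ("withdraw", 500)]
primary, primary_outputs = run(commands)
spare, spare_outputs = run(commands)
```

Because each transition is determined solely by the command and the current state, the two replicas in this sketch end in the same state and produce the same output sequence, which is the property that would let the spare take over for a failed component.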
Fault-tolerant systems mask replica failures by combining the output sequences from multiple replicas into a single output sequence that appears to have come from a single, non-faulty state machine. According to this approach, each replica produces a sequence of outputs, but a replica coordination mechanism allows only one sequence of outputs to reach the environment. For example, accepting outputs from 2t+1 replicas, a "majority voter" can selectively provide the outputs to the environment and thereby mask as many as t faulty replicas. In a "primary/backup" method of replica coordination, all non-faulty replicas perform the same computations, but only one replica (the primary) interacts with the environment. In either case, the fault-tolerant system must ensure that each replica reads the same sequence of commands and that the environment receives only a single output sequence.
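The majority voter mentioned above can be sketched as follows, assuming each of the 2t+1 replicas reports one hashable output value per step. The function name majority_vote and its error handling are assumptions made for illustration; the text does not specify how a voter is implemented.

```python
from collections import Counter


def majority_vote(outputs, t):
    """Return the output reported by at least t+1 of the 2t+1 replicas.

    With at most t faulty replicas, at least t+1 replicas are non-faulty
    and agree, so their common output always forms a majority.
    """
    if len(outputs) != 2 * t + 1:
        raise ValueError("expected exactly 2t+1 replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < t + 1:
        # More than t replicas disagree, which exceeds the assumed fault bound.
        raise RuntimeError("no majority among replica outputs")
    return value


# t = 1: three replicas, one of which is faulty.
result = majority_vote(["write block 7", "write block 7", "write block 9"], t=1)
```

In this sketch only the majority value would be forwarded to the environment, so the single faulty output is masked.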
A computer system is composed of layers, as shown in FIG. 1. An application program makes calls to application support routines (DLLs, RTLs, etc.), which in turn call operating system software. Optionally, the application software can bypass the support routines and call the operating system directly. Lower layers include computer hardware (CPU, memory, bus, network, etc.) and I/O components (disks, user terminals, etc.).
Prior-art systems provide replica coordination by adding a layer or significantly modifying one or more of the existing layers. Each such approach poses problems, however. For example, in hardware-layer replica coordination, the CPU, memory, operating system and application are replicated and the hardware chooses which replica interacts with the environment. This requires no changes to the operating system or application software, but each new hardware realization requires a separate design, and consequently these systems lag behind the hardware cost/performance curve.
Adding replica coordination to an existing operating system is difficult because a developer must identify the state transitions implemented in the operating system, and mature operating systems are very complex. Furthermore, modifying an operating system and then maintaining the modified operating system is very costly. Some prior-art systems provide an additional application support layer between the application and the operating system. This requires application developers to learn and use a new interface, and existing applications must be rewritten to be made fault-tolerant. Fault-tolerance can also be built into an application. However, this shifts the problem of replica coordination to the application's developers, and the same problems must be solved anew for each application. Furthermore, application developers generally are not acquainted with the complexity and nuances of replica coordination.
Some prior-art systems, such as a so-called "hypervisor" system, impose a heavy performance penalty because of the high frequency with which the replica coordinator must be invoked.