1. Field of the Invention
This invention generally relates to replicated state machines, and more specifically, to error recovery in replicated state machines. Even more specifically, the preferred embodiment of the invention relates to containment and recovery of software exceptions in interacting, replicated-state-machine-based, fault-tolerant components.
2. Background Art
Replicated state machines can be used to provide fault-tolerant services as described in F. B. Schneider, "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial," ACM Computing Surveys, 22(4), December 1990, pp. 299-319. The above-mentioned reference defines distributed software as often being structured in terms of clients and services. Each service includes one or more servers and exports operations that clients invoke by making requests. Using a single centralized server is the simplest way to implement a service; however, the resulting service can only be as fault-tolerant as the processor executing that server. Multiple servers that fail independently can be used to provide a fault-tolerant service. This is done by replicating the single server and executing the replicas on separate processors of a distributed processing system.
The state machine approach refers to a method of implementing a fault-tolerant service by replicating servers and coordinating client interactions with the server replicas. With the replicated state machine approach, the service is expressed as a deterministic state machine and copies of the state machine are executed in a number of different failure domains in parallel. For example, the copies of the state machine may be executed on several different computers in parallel. Clients express their requests in terms of state machine stimuli that are committed to a sequence of such stimuli using a distributed consensus protocol. An example of a distributed consensus protocol is the PAXOS protocol as described in L. Lamport, "The Part-Time Parliament," Technical Report 49, DEC SRC, Palo Alto, 1989.
The distributed consensus protocol ensures that all state machine replicas receive the same sequence of stimuli. Since, by design, the replicas all start off with the same state and are deterministic, the state machines continue to execute as replicas of one another indefinitely. Fault tolerance is achieved because each replica holds one copy of the state of the service, so it does not matter if a subset of the replicas fails: a copy of the service state will be retained in a surviving replica.
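The following is a minimal sketch of that property, not an implementation of any consensus protocol. It assumes the sequence of stimuli has already been agreed upon; the names `CounterMachine` and `run_replicas` are hypothetical and chosen only for illustration.

```python
class CounterMachine:
    """A deterministic state machine: the same stimuli in the same order
    always produce the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, stimulus):
        key, delta = stimulus
        self.state[key] = self.state.get(key, 0) + delta


def run_replicas(num_replicas, committed_sequence):
    # Every replica starts from the same (empty) state and is fed the
    # same committed sequence, standing in for the consensus output.
    replicas = [CounterMachine() for _ in range(num_replicas)]
    for stimulus in committed_sequence:
        for replica in replicas:
            replica.apply(stimulus)
    return replicas


committed = [("a", 1), ("b", 5), ("a", -3)]
replicas = run_replicas(3, committed)
survivors = replicas[1:]  # pretend the first replica's node has failed
```

Because every survivor holds the full service state, losing one replica in this sketch loses no information.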
The exact number of survivable failures and the type of failure that is survivable (fail-stop or Byzantine) are functions of the choice of distributed consensus protocol.
Recovery from a hardware failure, and a return to normal operating condition, is achieved by copying a snapshot of the state of a surviving replica to the replaced or repaired node. The node is then included back into the distribution of the input sequence at the point in the sequence corresponding to the snapshot that was restored to the node.
In general, it is also necessary to restore availability after simultaneous power loss to all nodes. Power failure is a special kind of failure because data committed by a node to stable storage is expected to be preserved across the power outage and can be used for recovery. The stable storage makes it possible to restore availability when power is restored, even if the power failure affected all nodes simultaneously.
The messages that make up the sequence of inputs to the software process are generally passed through the distributed consensus protocol in batches for efficiency. The replicas cannot actually execute in stable storage, but it is possible to run an input batch through each replica in two phases with an intermediate commit. If power fails before the commit, then when power is restored the replica is rolled back to the previous input batch boundary and the input batch is retried. If power fails after the commit, the replica is rolled forwards to the committed state and the input batch is discarded. This mechanism makes use of the stable storage to store a snapshot of the replica state at the commit boundary. For correct interaction with the world outside the replica, there is a requirement that the state is never rolled back after a response (an output from the state machine to the outside world) is made. This requirement may be satisfied in one of two ways: responses may be blocked during the first phase and allowed to proceed when the input batch is repeated on a second copy of the state for the second phase, or alternatively, responses may be buffered during the first phase and released in the second phase after the commit.
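The buffered-response variant can be sketched as follows. This is an illustration under stated assumptions, not the mechanism of any particular product: `StableStorage`, `Replica`, the `outbox` list and the `power_fails_before_commit` flag are all hypothetical, and only the failure-before-commit case is modelled.

```python
import copy


class StableStorage:
    """Stand-in for storage that survives a power outage (hypothetical)."""

    def __init__(self, initial_state):
        self.snapshot = copy.deepcopy(initial_state)


class Replica:
    def __init__(self, storage, outbox):
        # On start or restart, volatile state is rebuilt from the last
        # committed snapshot, i.e. the previous input batch boundary.
        self.storage = storage
        self.outbox = outbox  # responses visible to the outside world
        self.state = copy.deepcopy(storage.snapshot)

    def run_batch(self, batch, power_fails_before_commit=False):
        buffered = []
        # Phase 1: apply the batch to volatile state, buffering responses.
        for value in batch:
            self.state["total"] += value
            buffered.append(self.state["total"])
        if power_fails_before_commit:
            raise RuntimeError("power lost before commit")
        # Commit: snapshot the state at the batch boundary.
        self.storage.snapshot = copy.deepcopy(self.state)
        # Phase 2: only now are the buffered responses released.
        self.outbox.extend(buffered)


storage = StableStorage({"total": 0})
outbox = []
replica = Replica(storage, outbox)
replica.run_batch([1, 2])
try:
    replica.run_batch([10], power_fails_before_commit=True)
except RuntimeError:
    replica = Replica(storage, outbox)  # power restored: rolled back
    state_after_restart = copy.deepcopy(replica.state)
    replica.run_batch([10])  # the interrupted batch is retried
```

No response from the interrupted batch reaches `outbox` before its commit, so the rollback never undoes state that the outside world has already observed.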
The replicated state machine approach solves the problem of maintaining availability across hardware failures, but it does not solve the problem of maintaining availability across failures in the software due to programming errors. In fact, all replicas will encounter a programming error at approximately the same time and all will fail approximately simultaneously. The SAN Volume Controller (SVC) product of the International Business Machines Corporation (IBM) uses the replicated state machine approach to implement a central core of configuration and control code, which coordinates the behavior of a cluster of agents offering storage services. SVC solves the problem of software bugs in the replicated core using a mechanism called cluster recovery.
Software errors can be recovered only if they are detected. As part of a general fail-fast approach, the SVC implementation therefore makes use of ‘assert’ statements, and the failure of an assertion causes a software exception.
SVC's cluster recovery mechanism works generally as follows. An exception in the replicated code is detected. All agents are stopped. The two-phase commit implementation is used to roll back each replica state to the previous input batch commit point. Subsequent input batches (including the one that would cause the exception if replayed) are flushed from the system. The replica state is reduced to a canonical form by discarding transient state associated with the (now reset) dynamic system operation and preserving any configuration state required to restart system operation. Communication between the replicated core and the cluster of agents is reset (to reflect that the agent-to-core messages in the flushed input batches have been lost). The system resumes execution in the same way it would ordinarily if power had failed and then been restored to all nodes simultaneously by restarting the agents and resuming the storage services.
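The state-handling part of these steps can be sketched as below. This is a loose illustration only and does not reflect SVC's actual code: the function `cluster_recover`, the split of the replica state into `config` and `transient` keys, and the sample data are assumptions made for the example. Stopping and restarting agents and resetting communication are omitted.

```python
import copy


def cluster_recover(committed_snapshot, pending_batches):
    # Roll back to the state saved at the previous batch commit point.
    state = copy.deepcopy(committed_snapshot)
    # Flush subsequent input batches, including the one that would
    # trigger the exception again if replayed.
    flushed = len(pending_batches)
    pending_batches.clear()
    # Reduce to canonical form: discard transient state, keep the
    # configuration needed to restart system operation.
    state["transient"] = {}
    return state, flushed


snapshot = {
    "config": {"volumes": ["v1", "v2"]},
    "transient": {"in_flight_io": 7},
}
pending = [["msg-a", "msg-b"], ["msg-c"]]
recovered, flushed = cluster_recover(snapshot, pending)
```

In this sketch the whole transient state is discarded, which is what makes the reset simple and also what makes it affect every service at once.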
Although there is no guarantee of successful recovery, in practice this mechanism generally works because the software exception was generally caused by an unusual combination of input message and dynamic state of the core, which forced the software down an unusual path that had never been tested with that specific set of parameters. When the system is resumed after cluster recovery, the input and the dynamic state have been discarded, so the problem does not immediately recur and availability can be restored while the problem is debugged and a fix issued.
The drawback of this solution is that failure in any component in the replicated core is promoted to a cluster recovery event, which results in a temporary loss of availability of all storage services, including those not responsible for the software exception.
A solution is required which allows the software exception to be contained within the component responsible for the problem and recovered with minimal impact to the availability of other services.
A very difficult aspect of this problem is that the services in the fault-tolerant core are generally interrelated and may call on each other to perform actions which result in complex changes to the replicated state. With the SVC cluster recovery solution, the reset of the replicated state to canonical form is relatively simple because all of the state is reset.
When the goal is to contain the failure, it is not possible to reset all the state since some of the state is clearly required for the ongoing dynamic operation of the components that must be protected from the contained failure. Furthermore, since an exception can happen at any time, it is possible that a failing component is halfway through a series of requests to other components and the dynamic state of the other components would be left in an incompletely modified, inconsistent condition even if the state of the failing component was itself reset.
The generally accepted solution to this kind of problem is to use transactions, which allow components to group together requests that they make of other components such that they are guaranteed that if they fail, the set of changes will either all be committed or all be rolled back.
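The all-or-nothing guarantee can be illustrated with a small in-memory sketch. It assumes the component state fits in a dictionary that can be deep-copied; the `transaction` context manager, the `map_volume` helper and the component names are hypothetical.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(components):
    # Save a copy of every component's state before the grouped requests.
    saved = copy.deepcopy(components)
    try:
        yield components
    except Exception:
        # The caller failed part-way through: undo all of its changes.
        components.clear()
        components.update(saved)
        raise


def map_volume(components, volume, host, known_hosts):
    if host not in known_hosts:
        raise ValueError(f"unknown host {host!r}")
    components["mappings"][volume] = host


components = {"volumes": {"v1": "online"}, "mappings": {}}
try:
    with transaction(components) as c:
        c["volumes"]["v2"] = "online"                     # first request succeeds
        map_volume(c, "v2", "host1", known_hosts=set())   # second request fails
except ValueError:
    pass
```

After the failure, the half-completed change to `volumes` has been undone along with everything else in the group. Note that every caller has to wrap its requests explicitly, which hints at the API burden that transactions introduce.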
The significant disadvantage of transactions is that they complicate the APIs between components and therefore complicate the implementation.
Software transactional memory is a different existing technique which is used for concurrency control and eliminates the API complexity of transactions (which are used for concurrency control in addition to their use for restoring consistency after a software crash).
The Erlang programming language uses a concept called hierarchical supervision whereby a parent process watches its children to see if they encounter an exception. The parent process is responsible for performing an appropriate recovery action such as restarting the child process or generating an exception itself to force recovery at a higher level. Erlang is used in conjunction with the Mnesia database, which has a transactional API.
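The supervision idea can be approximated in a few lines. The sketch below is not Erlang and does not use any Erlang/OTP API; it models a child as a plain function, and the names `supervise`, `max_restarts` and `make_flaky_child` are hypothetical.

```python
def supervise(child, max_restarts):
    """Run child(); restart it on an exception, and escalate by
    re-raising once the restart budget is exhausted."""
    restarts = 0
    while True:
        try:
            return child(), restarts
        except Exception:
            if restarts == max_restarts:
                raise  # force recovery at a higher level
            restarts += 1


def make_flaky_child(failures):
    # Build a child that fails a fixed number of times, then succeeds.
    remaining = {"count": failures}

    def child():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("child crashed")
        return "ok"

    return child


result, restarts_used = supervise(make_flaky_child(2), max_restarts=3)
```

A parent supervisor could wrap this call in the same way, giving the hierarchy described above.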
N-version programming is an approach that may be used in conjunction with a distributed consensus protocol to solve the problem of availability across software failures. This relies on the software bugs in multiple different implementations of the same software function being different and there being a quorum of correct implementations when a bug is encountered.
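A minimal voting sketch follows. It assumes a simple majority quorum and hashable results, and it runs the versions in one process rather than through a consensus protocol; `n_version_call` and the three sample implementations are hypothetical.

```python
from collections import Counter


def n_version_call(implementations, argument):
    """Call every independently written version and return the result
    that a majority of all versions agree on."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(argument))
        except Exception:
            continue  # a crashed version simply casts no vote
    quorum = len(implementations) // 2 + 1
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes >= quorum:
            return value
    raise RuntimeError("no quorum of agreeing implementations")


def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_buggy(x):
    return x + x  # wrong for most inputs


answer = n_version_call([square_a, square_b, square_buggy], 3)
```

The buggy version is outvoted only because its error differs from the behavior of the other two, which is exactly the assumption the approach depends on.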