The invention relates to maintaining synchronized execution by loosely coupled processors in fault resilient, fault tolerant and disaster tolerant computer systems.
Computer systems that are capable of surviving "faults," or hardware failures, generally fall into three categories: fault resilient, fault tolerant, and disaster tolerant. Fault resilient computer systems can continue to function, often in a reduced capacity, in the presence of hardware failures. These systems operate in either an availability mode or an integrity mode, but not both. A system is "available" when a hardware failure does not cause unacceptable delays in user access. Accordingly, a system operating in an availability mode is configured to remain online, if possible, when faced with a hardware error. A system has data integrity when a hardware failure causes no data loss or corruption. Accordingly, a system operating in an integrity mode is configured to avoid data loss or corruption, even if the system must go offline to do so.
Fault tolerant systems stress both availability and integrity. A fault tolerant system remains available and retains data integrity when faced with a single hardware failure, and, under some circumstances, when faced with multiple hardware failures.
Disaster tolerant systems go beyond fault tolerant systems and require that the loss of a computing site due to a natural or man-made disaster neither interrupt system availability nor corrupt or lose data.
For all three cases, to manage a failure of a component, there must be an alternative component which continues to function in the presence of the failure. Thus, redundancy of components is a fundamental prerequisite for a disaster tolerant, fault tolerant or fault resilient system that recovers from or masks failures. Redundancy can be provided through passive redundancy or active redundancy, each of which has different consequences.
A passively redundant system provides access to alternative components that are not associated with the current task and must be either activated or modified in some way to account for a failed component. The consequent transition may cause a significant interruption of service. Subsequent system performance also may be degraded. Examples of passively redundant systems include stand-by servers and clustered systems. The mechanism for handling a failure in a passively redundant system is to "fail-over", or switch control, to an alternative server. The current state of the failed application may be lost, and the application may need to be restarted in the other system. The fail-over and restart processes may cause some interruption or delay in service to the users. Because of this delay, passively redundant systems such as stand-by servers and clusters provide "high availability" but do not deliver the continuous processing usually associated with "fault tolerance."
An actively redundant system provides an alternative processor that concurrently processes the same task and, in the presence of a failure, provides continuous service. The mechanism for handling failures is to compute through a failure on the remaining processor. Because at least two processors are looking at and manipulating the same data at the same time, the failure of any single component should be invisible both to the application and to the user.
Failures in systems can be managed in two different ways that each provide a different level of availability and different restoration processes. The first is to recover from failures, as in passively redundant systems, and the second is to mask failures so they are invisible to the user, as in actively redundant systems.
Systems that recover from failures employ a single system to run user applications until a failure occurs. The failure may be detected by a user, a system operator or a second system that is monitoring the status of the first, and detection may occur several seconds to several minutes after the failure. Once the failure is detected, the recovery process begins. In the simplest type of recovery system, the system operator physically moves the disks from the failed system to a second system and boots the second system. In more sophisticated systems, the second system, which has knowledge of the applications and users running on the failed system, and a copy of or access to the users' data, automatically restarts the applications and gives the users access. In both cases, the users see a pause in operation and lose the results of any work from the last save to the time of the failure. Examples of systems that recover from failures include automatic backup systems, in which selected files are copied periodically onto another system that can be rebooted if the first system fails; stand-by servers that copy files from one system to another and keep track of applications and users; and clusters, such as a performance scaling array of computers with a fault tolerant storage server and a distributed lock manager.
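The loss of work between the last save and the failure can be illustrated with a minimal sketch. The function name run_with_recovery, the integer state, and the parameters save_every and fail_after are hypothetical and serve only to model the behavior described above; they are not drawn from any particular recovery system.

```python
from typing import List, Tuple


def run_with_recovery(updates: List[int], save_every: int,
                      fail_after: int) -> Tuple[int, int]:
    """Model a primary system that saves its state periodically and then fails.

    Hypothetical model: the state is a running sum of the updates, a save is
    taken after every `save_every` updates, and the primary fails once it has
    applied `fail_after` updates.

    Returns (state the second system restarts from, number of updates lost).
    """
    state = 0      # live state on the primary, lost when the primary fails
    saved = 0      # last state copied to the second system
    saved_at = 0   # number of updates covered by the last save
    applied = 0    # number of updates the primary applied before failing

    for count, update in enumerate(updates, start=1):
        if count > fail_after:
            break  # the primary has failed; later updates are never applied
        state += update
        applied = count
        if count % save_every == 0:
            saved, saved_at = state, count

    # The second system can only restart from the last save, so every update
    # applied after that save is lost.
    return saved, applied - saved_at


restart_state, lost = run_with_recovery([1] * 10, save_every=4, fail_after=6)
```

In this sketch the second system restarts from the state saved after the fourth update, and the two updates applied after that save are lost, which corresponds to the pause and loss of work that users of a recovery-based system observe.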
Systems that mask failures employ the concept of parallel components. At least two components are deployed to do the same job at the same time. If one should fail, the other continues. This improves overall system reliability. An example of a simple and common parallel technique places two power supplies in one system. If one power supply fails, the other keeps the system operating. More robust masking systems replicate everything in the system to make failures transparent to users for all single failures. These fault tolerant systems may detect failures in less than a second and may offer other features that facilitate constant operation, such as on-line repair and upgrade capabilities.
To provide fault tolerance, a system must uniquely identify any single error or failure, and, having identified the error or failure, must isolate the failed component in a way that permits the system to continue to operate correctly. Identification and isolation must take place in a short time to maximize continuous system availability. In addition, a redundant system must be repairable while the system continues to function, and without disrupting the applications running on the system. Finally, once repaired, the system should be able to be brought back to full functionality with minimal interruption of a user's work. Systems that do not acceptably accomplish one or more of these steps may be unable to provide continuous operation in the event of a failure.
Previous fault tolerant systems have used tightly coupled, synchronized hardware with strong support from the systems' operating system and the applications to deal with fault handling and recovery. In general, commercial fault tolerant systems use at least two processors and custom hardware in a "fail-stop" configuration as the basic building block. A typical fail-stop system runs two processors in cycle-to-cycle lockstep and uses hardware comparison logic to detect a disagreement in the outputs of the two processors. As long as the two processors agree, operation is allowed to continue. When the outputs disagree (i.e., a failure occurs), the system is stopped. Because they are operated in cycle-to-cycle lockstep, the processors are said to be "tightly coupled".
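The fail-stop behavior can be sketched in software as follows. This is an illustrative model only: real fail-stop systems perform the comparison in hardware on every clock cycle, whereas the sketch compares the results of two Python callables once per input. The names FailStop and run_fail_stop_pair are hypothetical.

```python
from typing import Callable, Iterable, List


class FailStop(Exception):
    """Raised when the two processors disagree and the pair must stop."""


def run_fail_stop_pair(proc_a: Callable[[int], int],
                       proc_b: Callable[[int], int],
                       inputs: Iterable[int]) -> List[int]:
    """Run two replicas step by step and stop at the first disagreement."""
    outputs = []
    for cycle, value in enumerate(inputs):
        out_a = proc_a(value)
        out_b = proc_b(value)
        # Comparison logic: operation continues only while the outputs agree.
        if out_a != out_b:
            raise FailStop(
                f"outputs disagree at cycle {cycle}: {out_a!r} != {out_b!r}")
        outputs.append(out_a)
    return outputs


def healthy(value: int) -> int:
    return value + 1


def faulty(value: int) -> int:
    # Hypothetical fault: a wrong result for one particular input.
    return 0 if value == 2 else value + 1
```

Calling run_fail_stop_pair(healthy, healthy, [1, 2, 3]) completes normally, while run_fail_stop_pair(healthy, faulty, [1, 2, 3]) raises FailStop at the second input. Note that a two-way comparison detects that a failure has occurred but cannot, by itself, indicate which of the two processors is at fault, which is why the pair is stopped as a unit.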
One example of a fail-stop system is a pair and spare system in which two pairs of processors running in clock cycle lockstep are configured so that each pair backs up the other pair. In each pair, the two processors are constantly monitored by special error detection logic and are stopped if an error or failure is detected, which leaves the other pair to continue execution. Each pair of processors also is connected to an I/O subsystem and a common memory system that uses error correction to mask memory failures. Thus, two processors, memory and an I/O subsystem reside in each half of the pair and spare system. The operating system software provides error handling, recovery and resynchronization support after repair.
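The pair and spare arrangement can be modeled with a short sketch in which each pair is a fail-stop unit and the surviving pair continues execution. The sketch is a simplified software analogy under stated assumptions: the names PairStopped, run_pair and pair_and_spare are hypothetical, processors are modeled as Python callables, and the memory and I/O subsystems are omitted.

```python
from typing import Callable, List, Sequence, Tuple

Processor = Callable[[int], int]
Pair = Tuple[Processor, Processor]


class PairStopped(Exception):
    """Raised when the two processors of one pair disagree."""


def run_pair(pair: Pair, value: int) -> int:
    """Run one lockstep pair on one input; stop the pair on disagreement."""
    out_a, out_b = pair[0](value), pair[1](value)
    if out_a != out_b:
        raise PairStopped()
    return out_a


def pair_and_spare(pairs: Sequence[Pair],
                   inputs: Sequence[int]) -> Tuple[List[int], int]:
    """Return (outputs, number of pairs still running at the end)."""
    active = list(pairs)
    outputs = []
    for value in inputs:
        results, survivors = [], []
        for pair in active:
            try:
                results.append(run_pair(pair, value))
                survivors.append(pair)
            except PairStopped:
                pass  # a stopped pair is removed; the other pair continues
        if not survivors:
            raise RuntimeError("both pairs stopped; no processor remains")
        active = survivors
        outputs.append(results[0])
    return outputs, len(active)


def good(value: int) -> int:
    return value + 1


def bad(value: int) -> int:
    # Hypothetical fault: a wrong result for one particular input.
    return 0 if value == 2 else value + 1
```

With pairs [(good, good), (good, bad)] and inputs [1, 2, 3], the second pair is stopped at the second input and the first pair alone produces the remaining outputs, so the failure is not visible in the results.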
Triple modular redundancy is another method for providing fault tolerance. In a triple modular redundant system, the results of simultaneous execution by three processors are passed through a voter and the majority result is the one used by the system. Because the voter is the weak point in these systems, special attention is paid to making the voter fast and extremely reliable, or multiple voters are used. The voter can be thought of as an extension of the output comparison logic in the pair and spare architecture. In general, the operating system software accounts for the voter in normal operation, as well as in recovery and resynchronization.
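The voting step can be sketched as a simple majority function. This is a minimal software illustration, assuming that the processor results are hashable Python values; the name majority_vote is hypothetical, and a practical voter would typically be implemented in hardware.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the processors.

    Raises ValueError when no value is produced by more than half of the
    processors, in which case the voter cannot mask the failure.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among processor results")
    return value


# One of three processors returns a wrong result; the voter masks it.
voted = majority_vote([7, 7, 9])
```

Unlike the two-way comparison of a fail-stop pair, a three-way vote both detects a single disagreeing processor and identifies the result to use, so execution can continue without stopping.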