Computers are used to operate critical applications for millions of people every day. These critical applications may include, for example, maintaining a fair and accurate trading environment for financial markets, monitoring and controlling air traffic, operating military systems, regulating power generation facilities and assuring the proper functioning of life-saving medical devices and machines. Because of the mission-critical nature of applications of this type, it is crucial that their host computers remain operational virtually all of the time.
Despite attempts to minimize failures in these applications, their host computer systems still occasionally fail. Hardware or software glitches can slow or completely halt a computer system. When such events occur on typical home or small-office computers, there are rarely life-threatening ramifications. Such is not the case with mission-critical computer systems. Lives can depend upon the constant availability of these systems, and therefore there is very little tolerance for failure.
To address this challenge, mission-critical systems often employ redundant hardware or software to guard against catastrophic failures and to provide some tolerance for unexpected faults within a computer system. For example, when one computer fails, another computer, often identical in form and function to the first, is brought online to handle the mission-critical application while the first is replaced or repaired. Many fault-tolerant systems provide redundant computer subsystems that operate in lockstep, with each executing identical instructions at the same time.
Exemplary fault-tolerant systems are provided by Stratus Technologies International of Maynard, Mass. In particular, Stratus' ftServers provide better than 99.999% availability, being offline only two minutes per year of continuous operation, through the use of parallel hardware and software typically running in lockstep. During lockstep operation, the processing and data management activities are synchronized on multiple computer subsystems within an ftServer. Instructions that run on the processor of one computer subsystem generally execute in parallel on another processor in a second computer subsystem, with neither processor moving to the next instruction until the current instruction has been completed on both. Redundant, fault-tolerant computer systems which employ two subsystems operating in lockstep are referred to as Dual Modular Redundant (DMR), and provide means by which each subsystem may check the operations of the other subsystem. Similarly, fault-tolerant computer systems which employ three subsystems operating in lockstep are referred to as Tri Modular Redundant (TMR), and provide means by which a result is deemed correct if it is obtained independently by two of the three subsystems.
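The DMR and TMR checks described above can be illustrated with a minimal software sketch. This is not how an ftServer implements lockstep comparison, which takes place in hardware. The function names `dmr_check` and `tmr_vote` are hypothetical and are used here only to show the logic: a DMR pair can detect a disagreement, while a TMR triple can also identify the result on which two of the three subsystems agree.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def dmr_check(a: Hashable, b: Hashable) -> bool:
    """Return True when the two subsystems produced the same result.

    With only two subsystems a mismatch can be detected, but the
    comparison alone cannot tell which subsystem is at fault.
    """
    return a == b


def tmr_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the result obtained by at least two of three subsystems.

    Returns None when all three results differ, in which case no
    result can be deemed correct.
    """
    if len(results) != 3:
        raise ValueError("TMR voting expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


if __name__ == "__main__":
    print(dmr_check(42, 42))      # True
    print(tmr_vote([42, 42, 7]))  # 42
    print(tmr_vote([1, 2, 3]))    # None
```

In the second call, the third subsystem's result is outvoted, so the computation can continue with the agreed value.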
The processing subsystems are typically joined by a bridge, which in turn is linked to a bus. Various Input/Output (I/O) devices are then attached to the bus, and may include disk storage, network communications, graphical interfaces, and so forth. In the event of a failure, the failed subsystem may be taken offline while the remaining subsystem continues executing. The failed subsystem is then repaired or replaced, brought back online, and synchronized with the still-functioning subsystem. Thereafter, the two subsystems resume lockstep operation.
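The recovery sequence just described can be sketched as a small state machine for a DMR pair. The state and event names below are hypothetical labels chosen for illustration and do not come from any product. The sketch models only the order of the steps: failure, repair or replacement, resynchronization, and return to lockstep.

```python
from enum import Enum, auto


class State(Enum):
    LOCKSTEP = auto()  # both subsystems executing in lockstep
    SIMPLEX = auto()   # failed subsystem offline, survivor still executing
    RESYNC = auto()    # repaired subsystem online, being synchronized


# Allowed transitions, keyed by (current state, event). Event names are assumed.
TRANSITIONS = {
    (State.LOCKSTEP, "failure"): State.SIMPLEX,
    (State.SIMPLEX, "repaired"): State.RESYNC,
    (State.RESYNC, "synchronized"): State.LOCKSTEP,
}


def step(state: State, event: str) -> State:
    """Return the next state, rejecting events that are invalid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None


if __name__ == "__main__":
    state = State.LOCKSTEP
    for event in ("failure", "repaired", "synchronized"):
        state = step(state, event)
        print(event, "->", state.name)
```

Rejecting out-of-order events, such as "synchronized" while still in lockstep, reflects that resynchronization is meaningful only after a failed subsystem has been brought back online.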
Existing systems have also occasionally allowed for an administrator-controlled splitting of a DMR or TMR system into two or more simplex subsystems. In this mode of operation, or split mode, each subsystem typically operates independently, with access to its own network, keyboard, display, and other I/O components. While in split mode, administrators often attempt to upgrade the software, and in particular the operating system software, on each side.