Vendors of fault-tolerant systems attempt to achieve increased system availability, continuous processing, and correctness of data even in the presence of faults. Depending upon the particular system architecture, application software (“processes”) running on the system either continues to run despite failures, or is automatically restarted from a recent checkpoint when a fault is encountered. Some fault-tolerant systems are provided with sufficient component redundancy to reconfigure around failed components, but processes running in the failed modules are lost. Vendors of commercial fault-tolerant systems have extended fault tolerance beyond the processors and disks. To make large improvements in reliability, all sources of failure must be addressed, including power supplies, fans, and intercomponent links.
In some network architectures, multiple processor systems are designed to continue operation despite the failure of any single hardware component. Each processor system has its own memory that contains a copy of a message-based operating system. Each processor system controls one or more input/output (I/O) interconnect attachments, such as data communication busses. Dual-porting of I/O controllers and devices provides multiple communication paths to each device. External storage to the processor system, such as disk storage, may be mirrored to maintain redundant permanent data storage.
Redundancy is likewise necessary in the network communication paths and components connecting end nodes, such that no single path or component is critical to a connection. Typically, this redundancy is realized in the form of multiple switching fabrics, on which a processor can communicate with another processor or peripheral component as long as at least one communication path to that processor or peripheral component along a fabric is available and fully operative.
Also, application software (also referred to as “processes”) may run under the operating system as “process-pairs” including a primary process and a backup process. The primary process runs on one of the multiple processors while the backup process runs on a different processor. The backup process is usually dormant, but periodically updates its state in response to checkpoint messages from the primary process. The content of a checkpoint message can take the form of a complete state update, or one that communicates only the changes from the previous checkpoint message.
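The two forms of checkpoint message described above can be illustrated with a minimal sketch. The class names (`Checkpoint`, `PrimaryProcess`, `BackupProcess`), the dictionary representation of process state, and the `force_full` parameter are assumptions made for illustration only and are not taken from any described system. The sketch also assumes state keys are only added or changed, never deleted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Checkpoint:
    """Hypothetical checkpoint message sent from primary to backup."""
    full: bool              # True: complete state; False: changes only
    data: Dict[str, Any]    # the full state, or just the changed keys


@dataclass
class BackupProcess:
    """Dormant backup that only applies checkpoints until it takes over."""
    state: Dict[str, Any] = field(default_factory=dict)

    def apply(self, msg: Checkpoint) -> None:
        if msg.full:
            self.state = dict(msg.data)    # replace the whole state
        else:
            self.state.update(msg.data)    # merge changes since last checkpoint


@dataclass
class PrimaryProcess:
    state: Dict[str, Any] = field(default_factory=dict)
    _last_sent: Optional[Dict[str, Any]] = None

    def checkpoint(self, force_full: bool = False) -> Checkpoint:
        """Build a full checkpoint the first time, a delta afterwards."""
        if self._last_sent is None or force_full:
            msg = Checkpoint(full=True, data=dict(self.state))
        else:
            # Assumption: keys are never deleted, so a delta is the set of
            # keys that are new or whose value differs from the last send.
            changed = {
                k: v for k, v in self.state.items()
                if k not in self._last_sent or self._last_sent[k] != v
            }
            msg = Checkpoint(full=False, data=changed)
        self._last_sent = dict(self.state)
        return msg
```

In this sketch the backup holds an identical copy of the primary's state after each checkpoint is applied, so it could resume from that point if the primary's processor failed. A delta checkpoint trades smaller messages for the requirement that no earlier checkpoint was lost.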
To detect processor failures, each processor periodically broadcasts an “IamAlive” message for receipt by all the processors of the system, including itself, informing the other processors that the broadcasting processor is still functioning. When a processor fails, that failure will be announced and identified by the absence of the failed processor's periodic IamAlive message. In response, the operating system will direct the appropriate backup processes to begin primary execution from the last checkpoint. New backup processes may be started in another processor, or the process may run without a backup until the hardware has been repaired.
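The detection and takeover sequence described above can be sketched as follows. This is a simplified, single-observer illustration: the names `AliveMonitor` and `promote_backups`, the timeout value, and the table layout for process-pairs are all hypothetical, and time is passed in explicitly rather than read from a clock.

```python
from typing import Dict, Iterable, List, Optional


class AliveMonitor:
    """Tracks when an IamAlive message was last seen from each processor."""

    def __init__(self, processors: Iterable[int], timeout: float, now: float = 0.0):
        self.timeout = timeout
        self.last_seen: Dict[int, float] = {p: now for p in processors}

    def on_i_am_alive(self, sender: int, now: float) -> None:
        self.last_seen[sender] = now

    def failed(self, now: float) -> List[int]:
        """Processors whose IamAlive message has been absent too long."""
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


def promote_backups(pairs: Dict[str, Dict[str, Optional[int]]],
                    failed_cpu: int) -> List[str]:
    """Promote backups whose primary ran on failed_cpu; return their names.

    Pairs that only lose their backup keep running without one.
    """
    promoted = []
    for name, pair in pairs.items():
        if pair["primary"] == failed_cpu:
            # The backup resumes from the last checkpoint as the new primary.
            pair["primary"], pair["backup"] = pair["backup"], None
            promoted.append(name)
        elif pair["backup"] == failed_cpu:
            pair["backup"] = None
    return promoted
```

After promotion, a pair with `backup` set to `None` corresponds to a process running without a backup until a new one is started on another processor or the hardware is repaired.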
In addition to providing hardware fault tolerance, the processor pairs of the above-described architecture provide some measure of software fault tolerance. When a processor fails due to a software error, the backup processor frequently is able to successfully continue processing without encountering the same error.
When a time interval passes without a processor receiving an IamAlive message from a given processor, the processor that detects the timeout can assume that the silent processor has failed, and informs the other processors in the system of that fact. The other processors then ignore the content of messages from the processor presumed to have failed. Ultimately, many or all of the other processors could end up ignoring the affected processor, and the ostracized processor then functions outside of the system even though it may still be running. This condition is sometimes called the split-brain problem, as further described in U.S. Pat. No. 5,991,518, issued Nov. 23, 1999, entitled “Method and Apparatus for Split-Brain Avoidance in a Multi-Processor System,” naming as inventors Robert L. Jardine, Murali Basavaiah, and Karoor S. Krishnakumar.
Situations such as that described in the preceding paragraph can cause both the primary and backup processes of a process-pair, one running in the ostracized processor and the other in another processor of the system, to regard themselves as the primary process, thereby destroying the ability to perform backup functions and possibly corrupting files and system tables. Further, all of the processors in a system can become trapped in infinite loops while contending for common resources. This problem can be avoided by supplementing the IamAlive mechanism with a regroup process as described in U.S. Pat. No. 5,884,018, entitled “Method And Apparatus For Distributed Agreement On Processor Membership In A Multi-Processor System.” The regroup process determines a consensus among the processors' individual views of the state of all processors in the system and of the state of the connectivity among the processors. The regroup process ensures agreement among all processors on a set of surviving processors that are still communicatively coupled as a system. Processors that are not part of the surviving group selected by the regroup process cease their operations by halting.
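The idea of agreeing on a surviving group from per-processor connectivity views can be illustrated with a deliberately simplified sketch. It is not the regroup process of the referenced patent: it assumes every processor's view has already been collected in one place, it uses a brute-force search suitable only for a small number of processors, and the function name `surviving_group` and the tie-breaking rule are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def surviving_group(views: Dict[int, Set[int]]) -> FrozenSet[int]:
    """Pick the largest set of processors that can all reach one another.

    views[p] is the set of processors that p believes it can communicate
    with. Two processors count as connected only if each lists the other.
    Ties are broken deterministically in favor of lower processor numbers.
    """
    procs = sorted(views)

    def linked(a: int, b: int) -> bool:
        return b in views[a] and a in views[b]

    # Try the largest candidate groups first (exponential; sketch only).
    for size in range(len(procs), 0, -1):
        for group in combinations(procs, size):
            if all(linked(a, b) for a, b in combinations(group, 2)):
                return frozenset(group)
    return frozenset()


def must_halt(views: Dict[int, Set[int]]) -> FrozenSet[int]:
    """Processors outside the surviving group cease operation."""
    return frozenset(views) - surviving_group(views)
```

Because every processor that evaluates the same collected views computes the same surviving group, no two disjoint groups can both continue as the system, which is the property that prevents the split-brain condition in this simplified model.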