This invention relates generally to fault-tolerant multiple processor systems, and in particular to a technique that permits the system to recover from momentary or very short drops in primary power that may be noticed by fewer than all of the processors.
Fault-tolerant computing, evolving as it did from early specialized military and communications systems, is found today in a variety of commercial system designs. Fault-tolerant designs seek to provide the advantages of increased system availability and continuous processing together, if possible, with the ability to maintain the integrity of the data being processed. Designs for achieving fault tolerance range from providing sufficient redundancy to reconfigure around failed components to using "hot backups" that sit and wait for a failure of a primary unit before being called into action. Also included in many fault-tolerant designs are methods of protecting data in the face of the inevitable: a fault that may bring down the system.
One fault-tolerant design approach, an example of which can be found in U.S. Pat. No. 4,817,091, is a fault-tolerant multiple processor system in which the individual processors, in addition to performing individual and independent tasks, are provided the ability to communicate with one another. Using this communication ability, each processor periodically broadcasts its well-being by sending a message (called an "I'm Alive" message) to all the other processors of the system. The absence of an I'm Alive message from any processor is an indication that the silent processor might have failed and may be unable to recover. When the other processors of the system note the absence of an expected I'm Alive message, they initiate a "regroup" operation to determine which processors are still present and operating in the system, and to confirm that the silent processor is no longer available. The regroup operation involves each processor broadcasting multiple messages telling its companion processors its view of the system (i.e., which processors it sees as still operating). If a processor has failed, and does not participate in the regroup operation, it will be ostracized from further communication in the system, so that even if the failed processor begins to send messages at some subsequent time, they will be ignored. (Actually, an implementation of this prior art technique does send a reply in the form of a "poison packet" which, in effect, informs the ostracized processor that it has been excluded from the system and that it should shut itself down.) The processes (i.e., programs) running on the failed processor can be taken over by another processor in the system.
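The I'm Alive monitoring and regroup operation described above can be illustrated with a minimal Python sketch. It is not taken from the referenced patent: the names `Processor`, `silent_peers`, `regroup` and `run_demo`, the timeout value, and in particular the agreement rule (a processor remains a member only if it participates and appears in every participant's view) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    pid: int
    # Peer id -> time at which that peer's last I'm Alive message arrived.
    last_heard: dict = field(default_factory=dict)

    def receive_im_alive(self, sender: int, now: float) -> None:
        self.last_heard[sender] = now

    def silent_peers(self, now: float, timeout: float) -> list:
        """Peers whose last I'm Alive message is older than `timeout`."""
        return sorted(p for p, t in self.last_heard.items() if now - t > timeout)


def regroup(views: dict) -> set:
    """Combine the views broadcast by the participating processors.

    `views` maps each participating processor to the set of processors it
    sees as still operating. Assumed rule (hypothetical): a processor stays
    in the system only if it participates and is in every participant's view.
    """
    members = set(views)
    for view in views.values():
        members &= set(view)
    return members


def run_demo():
    """Four processors; processor 3 stops broadcasting after time 0."""
    procs = {pid: Processor(pid) for pid in range(4)}
    for now in range(4):
        for sender in procs:
            if sender == 3 and now > 0:
                continue  # processor 3 has gone silent
            for receiver in procs.values():
                if receiver.pid != sender:
                    receiver.receive_im_alive(sender, now)

    timeout = 2  # assumed detection threshold, in the same units as `now`
    silent = procs[0].silent_peers(now=3, timeout=timeout)
    # Each remaining processor's view: itself plus the peers it still hears.
    views = {
        pid: {pid} | (set(procs[pid].last_heard)
                      - set(procs[pid].silent_peers(3, timeout)))
        for pid in (0, 1, 2)
    }
    return silent, regroup(views)
```

In this sketch `run_demo()` has processor 0 flag processor 3 as silent, after which the three participating processors agree on a membership that excludes it. The poison packet reply is not modeled.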
Such fault-tolerant systems also seek to prevent loss of data, and to provide quicker and more complete recovery from unavoidable shut-downs of operation caused by, for example, loss of operating power, ranging from total loss to momentary loss. Some fault-tolerant systems provide backup power in the form of batteries so that, in the event primary power is lost, the system can maintain memory-stored data. Accordingly, if advance warning of an impending power loss is provided, a processor may have time to store its operating state and data before the loss of primary power puts the processor in "hibernation."
While a processor is preparing for hibernation, and thereafter while it is restoring its pre-hibernation state, it does not send the periodic I'm Alive messages. The time taken is greater than the interval between expected I'm Alive transmissions. This creates a potential problem: if a momentary power drop causes only one or a few of the system's processors to receive a warning and go into hibernation, those processors will cease their I'm Alive broadcasts, causing the processors that did not receive a power warning to regroup and ostracize those that did. Thus, even though all processors of the multiple processing system may be in proper working order, a momentary drop in primary power sensed by fewer than all the processors of the system can decrease the overall operating availability and/or efficiency of the system.
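The failure scenario just described can be expressed as a small Python model. This is a simplified sketch under stated assumptions, not a description of any actual implementation: the function name `membership_after_power_dip`, its parameters, and the single `detection_timeout` threshold are hypothetical, and the model assumes that when every processor hibernates, none remains awake to start a regroup.

```python
def membership_after_power_dip(processors, warned, silent_time, detection_timeout):
    """Return the processors still in the system after a momentary power dip.

    processors        -- ids of all processors in the system
    warned            -- ids that received the power warning and hibernated
    silent_time       -- how long a hibernating processor sends no I'm Alive
                         messages (preparing, hibernating, and restoring)
    detection_timeout -- assumed silence the awake processors tolerate
                         before they regroup
    """
    processors = set(processors)
    warned = set(warned) & processors
    awake = processors - warned

    # No one hibernated, or (assumption) everyone did: nobody is ostracized.
    if not warned or not awake:
        return processors

    # The silence was short enough that no regroup was triggered.
    if silent_time <= detection_timeout:
        return processors

    # The awake processors regroup and ostracize the silent ones, even
    # though the silent processors are in proper working order.
    return awake
```

For example, with four processors of which two receive the warning, a `silent_time` of 5.0 against a `detection_timeout` of 2.0 leaves only the two unwarned processors in the system, which is the loss of availability described above.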