Distributed, shared-nothing multi-processor architectures and fault-tolerant software using process pairs require that all processors in a system have a consistent image of the processors making up the system. (The NONSTOP® KERNEL operating system (NONSTOP® is a registered trademark and NONSTOP® KERNEL is a trademark of Tandem Computers Incorporated), available from the assignee of this application, is an example of such fault-tolerant software.) This consistent system image is crucial for maintaining global system tables required for system operation and for preventing data corruption caused by, say, an input/output process pair (IOP) of primary and backup processes on different processors accessing the same I/O device through dual-ported I/O controllers or a shared bus (such as SCSI).
Detection of processor failures occurs quickly with an IamAlive message scheme. Each processor periodically sends IamAlive packets to each of the other processors in the system. Each processor in a system determines whether another processor is operational by timing packets from it. When the time interval passes without receipt of a packet from a given processor, the timing processor (the first processor) decides that the silent processor (the second processor) might have failed.
In older systems, before regrouping was implemented, the following could occur when the second processor then sent a packet to the first. The first processor judged the second to be functioning improperly and responded with a poison packet. The first processor ignored the content of the packet from the second.
Ultimately, many or all of the other processors could end up ignoring the affected processor (except to try to stop it). The affected processor was, in effect, outside of the system and functioning as if it were an independent system. This condition was sometimes called the split-brain problem.
Without regrouping, the following situations can occur: Both of the processes in a process pair running on different processors can regard themselves as the primary, destroying the ability to perform backup functions and possibly corrupting files. All system processors can become trapped in infinite loops, contending for common resources. System tables can become corrupted.
Regrouping supplements the IamAlive/poison packet method. Regrouping uses a voting algorithm to determine the true state of each processor in the system. Each processor volunteers its record of the state of all other processors, compares its record with records from other processors and updates its record accordingly. When the voting is complete, all processors have the same record of the system's state. The processors will have coordinated among themselves to reintegrate functional but previously isolated processors and to correctly identify and isolate nonfunctional processors.
Regrouping works only when physical communication among processors remains possible, regardless of the logical state of the processors. If a processor loses all of its communications paths with other processors, that processor cannot be regrouped. It remains isolated until communications are restored and the system is cold loaded. (Such a processor usually stops itself because its self-checking code cannot send and receive message system packets to and from itself.)
A processor's logical state and its condition are distinguished. A processor has two logical states in a properly configured system: up or down. However, a processor has three conditions: dead, which is the same as the down logical state; healthy, which is the same as the up logical state; and malatose, which is described further below.
A processor is dead if it does not communicate with the rest of the system. Dead processors include those, for example, that execute a HALT or a system freeze instruction, that encounter low-level self-check errors such as internal register parity errors, that execute infinite loops with all interrupts disabled, that execute non-terminating instructions due to data corruption or that are in a reset state.
Dead processors are harmless, but the regrouping algorithm removes them from the system configuration. Other processors detect dead processors and declare them down.
A processor is healthy if it is running its operating system (preferably, the NONSTOP® KERNEL operating system available from the assignee of the instant application) and can exchange packets with other processors (preferably, over a redundant high-speed bus or switching fabric) within a reasonable time. The regrouping algorithm prevents a processor from declaring a healthy processor down.
A malatose processor is neither dead nor healthy. Such a processor either is not responding in a timely manner (perhaps because of missing timer ticks) or is temporarily frozen in some low-level activity. A malatose processor might be, for example, flooded with highest-priority interrupts such that the processor cannot take lower-priority interrupts or might be flooded with lower-priority interrupts such that the processor falls behind in issuing IamAlive packets. A malatose processor might be waiting for a faulty hardware device on which the clocks have stopped or might be running too long with interrupts disabled by the mutual exclusion mechanism.
The regrouping algorithm detects a malatose processor and forces it to become either healthy or dead, that is to say, either up or down. Correspondingly, a processor halts itself when another processor that it has not declared down declares it down.
With regard to regrouping, each processor in the system is either stable (that is, waiting for the need to act) or perturbed, including several states described below.
While a processor is stable, the IamAlive message scheme continues to operate. If a predetermined amount of time, say, 2.4 seconds, passes without an IamAlive message from another processor, the processor becomes perturbed.
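The timeout rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the operating system's implementation: the class `IamAliveMonitor` and its method names are hypothetical, and only the 2.4-second interval comes from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field

IAMALIVE_TIMEOUT = 2.4  # seconds; the example interval given in the text


@dataclass
class IamAliveMonitor:
    """Hypothetical helper that times IamAlive packets from each peer."""

    peers: set[int]
    last_seen: dict[int, float] = field(default_factory=dict)

    def record_iamalive(self, peer: int, now: float) -> None:
        # Remember the arrival time of the latest IamAlive packet from `peer`.
        self.last_seen[peer] = now

    def silent_peers(self, now: float) -> set[int]:
        # A peer is suspect if no packet arrived within the timeout window.
        return {
            p for p in self.peers
            if now - self.last_seen.get(p, 0.0) > IAMALIVE_TIMEOUT
        }

    def is_perturbed(self, now: float) -> bool:
        # The processor leaves the stable state when any peer has gone silent.
        return bool(self.silent_peers(now))


monitor = IamAliveMonitor(peers={1, 2})
monitor.record_iamalive(1, now=10.0)
monitor.record_iamalive(2, now=10.0)
monitor.record_iamalive(1, now=12.0)
```

With these timestamps, processor 2 has been silent for 2.5 seconds at time 12.5, so the monitoring processor would become perturbed.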
While perturbed, a processor exchanges specially marked packets with other perturbed processors to determine the current processor configuration of the system. When that configuration is agreed upon, the processor becomes stable again.
Processors spend most of their time stable.
A regrouping incident begins when a processor becomes perturbed and ends when all processors become stable again. Each regrouping incident has a sequence number that is the number of regrouping incidents since the last system cold load.
Each processor also maintains variables to store two configurations, one old and one new. While a processor is stable, bit-map variables called OUTER_SCREEN and INNER_SCREEN both contain the old configuration.
While a processor is stable, it knows that every processor in the old configuration is up and every processor not in the old configuration is down. Each processor in the old configuration has the same regrouping sequence number.
While a processor is perturbed, it broadcasts its view of the configuration (and its own status) on its busses or fabrics. It sends this view periodically, for example, every 0.3 seconds, to all other processors in the old configuration. Receiving such a broadcast perturbs any stable processor in the configuration.
The four stages of the regrouping protocol described further below make all perturbed processors create the same view of the system configuration. When regrouping completes, all processors in the system are stable and contain the same new configuration. Also, every processor in the new configuration has the same regroup sequence number that is greater than the number in the old configuration.
The new configuration contains no processor that was not in the old configuration. All processors that remained healthy throughout the incident are in the new configuration.
Any processor that was dead when the incident began or that became dead during the incident is not in the new configuration. Regrouping restarts if a processor becomes dead during an incident.
Correspondingly, processors that were malatose when the incident began are in the new configuration as healthy processors if they participated in the complete incident.
The regrouping method ensures that all processors in the new configuration have included and excluded the same processors.
Processor Stages of Pre-Existing Regroup
Each processor regrouping according to the preexisting algorithm maintains an EVENT_HANDLER() procedure and a data structure herein termed the regroup control template #_700 shown in FIG. 7. A variable herein termed SEQUENCE_NUMBER contains the current regroup sequence number.
Each processor passes through the following stages while running: Stage 0, Stage 5 and Stages 1 through 4. Stage 0 is a special stage defined in the process control block at system generation. Stage 5 is the stable state described above. Stages 1 through 4 together make up the perturbed state also described above.
A processor maintains the current stage in the variable STAGE. Also, the processor maintains the variables KNOWN_STAGE_1 through KNOWN_STAGE_4 for each of Stages 1 through 4, respectively. Each of these variables is a bit mask that records the processor numbers of all processors known to the maintaining processor to be participating in a regroup incident in the stage corresponding to the variable.
A processor enters Stage 0 when it is cold loaded. While it is in Stage 0, the processor does not participate in any regrouping incident. Any attempt to perturb the processor in this state halts the processor. The processor remains in Stage 0 until its integration into the inter-process and inter-processor message system is complete. Then the processor enters Stage 5. FIGS. 8A and 8B summarize subsequent actions.
A regrouping incident normally begins when a processor fails to send an IamAlive packet in time, step #_810. This failure perturbs the processor that detects the failure.
When a processor is perturbed, step #_805, it enters Stage 1. Stage 1 synchronizes all participating processors as part of the same regrouping incident, step #_830. Because a new incident can start before an older one is finished, a method is needed to ensure that the participating processors process only the latest incident.
FIG. 9 summarizes the transition from Stage 5 to Stage 1. The processor increments the SEQUENCE_NUMBER #_710, sets the Stage #_720 to 1, sets the KNOWN_STAGE_n variables to zero, and then sets its own bit in KNOWN_STAGE_1 #_750a to 1. (The processor does not yet know which processors other than itself are healthy.)
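As an illustration of the Stage 5 to Stage 1 transition, the regroup control template can be modeled as a small Python record. The field names mirror the variables named in the text; the class name `RegroupControl`, the method name, and the use of Python integers as bit masks are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RegroupControl:
    """Hypothetical model of the regroup control template."""

    sequence_number: int = 0   # SEQUENCE_NUMBER
    stage: int = 5             # STAGE (5 is the stable state)
    outer_screen: int = 0      # OUTER_SCREEN bit mask
    inner_screen: int = 0      # INNER_SCREEN bit mask
    # known_stage[0] .. known_stage[3] stand for KNOWN_STAGE_1 .. KNOWN_STAGE_4.
    known_stage: list = field(default_factory=lambda: [0, 0, 0, 0])

    def enter_stage_1(self, my_cpu: int) -> None:
        self.sequence_number += 1          # a new incident gets a new number
        self.stage = 1
        self.known_stage = [0, 0, 0, 0]    # forget everything from older incidents
        self.known_stage[0] |= 1 << my_cpu # only this processor is known so far


ctl = RegroupControl(sequence_number=7, outer_screen=0b0111, inner_screen=0b0111)
ctl.enter_stage_1(my_cpu=1)
```

Note that the two screens are deliberately left alone: they still hold the old configuration at this point.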
The message system awakens the processor periodically, every 0.3 seconds in one embodiment, so the processor can make three to six attempts to receive acceptable input. More than three attempts occur if more than one processor in the old configuration remains unrecognized, if a power up has occurred, or if the algorithm was restarted as a new incident.
When awakened, the processor broadcasts its status to the old configuration of processors, step #_830. Its status includes its regroup control template #_700.
Typically, status packets from other perturbed processors eventually arrive. If a packet arrives from a processor that was not in the old configuration as defined by the OUTER_SCREEN #_730, this processor ignores the packet and responds with a poison packet.
For a packet that it does not ignore, the processor compares the sequence number in the packet with the SEQUENCE_NUMBER #_710. If the packet sequence number is lower, then the sender is not participating in the current incident. Other data in the packet is not current and is ignored. The processor sends a new status packet to that processor to synchronize it so that it participates in the current incident.
If the sequence number in the packet is higher than the SEQUENCE_NUMBER #_710, then a new incident has started. The SEQUENCE_NUMBER #_710 is set to the sequence number in the packet. The processor reinitializes its data structures and accepts the rest of the packet data.
If the sequence number in the packet is the same as the SEQUENCE_NUMBER #_710, then the processor simply accepts the packet data. Accepting the data consists of logically ORing the KNOWN_STAGE_n fields in the packet with the corresponding processor variables #_750 to merge the two processors' knowledge into one configuration.
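The three sequence-number cases can be sketched as follows. This is a hedged illustration: the `Status` record and `handle_status` function are hypothetical names, and the reinitialization on a higher sequence number is assumed to mirror the Stage 1 entry (all KNOWN_STAGE_n cleared, own bit set), which the text does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Status:
    """Hypothetical subset of the regroup control template carried in a packet."""

    sequence_number: int
    # known_stage[0] .. known_stage[3] stand for KNOWN_STAGE_1 .. KNOWN_STAGE_4.
    known_stage: list[int] = field(default_factory=lambda: [0, 0, 0, 0])


def handle_status(local: Status, my_cpu: int, packet: Status) -> str:
    if packet.sequence_number < local.sequence_number:
        # Stale sender: ignore its data and send it a fresh status packet.
        return "resync_sender"
    if packet.sequence_number > local.sequence_number:
        # A newer incident has started: adopt its number and start over.
        local.sequence_number = packet.sequence_number
        local.known_stage = [0, 0, 0, 0]
        local.known_stage[0] = 1 << my_cpu  # assumption: same reset as Stage 1 entry
    # Accept the data: OR each KNOWN_STAGE_n field into the local copy.
    local.known_stage = [
        mine | theirs for mine, theirs in zip(local.known_stage, packet.known_stage)
    ]
    return "accepted"


local = Status(sequence_number=5, known_stage=[0b001, 0, 0, 0])
r_old = handle_status(local, 0, Status(4, [0b110, 0, 0, 0]))
after_old = list(local.known_stage)
r_same = handle_status(local, 0, Status(5, [0b010, 0, 0, 0]))
after_same = list(local.known_stage)
r_new = handle_status(local, 0, Status(6, [0b100, 0, 0, 0]))
```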
Stage 1 ends in either of two ways. First, all processors account for themselves. That is to say, when a processor notices that its KNOWN_STAGE_1 variable #_750a includes all processors previously known (that is, equals the OUTER_SCREEN #_730), then the processor goes to Stage 2. However, in the event of processor failure(s), the processors never all account for themselves. Therefore, Stage 1 also ends on a time out. The time limit is different for cautious and non-cautious modes, but the processor proceeds to Stage 2 when that time expires, whether all processors have accounted for themselves or not.
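The two exit conditions reduce to a single predicate. In this sketch the function name and the `time_limit` parameter are hypothetical; the text gives no numeric limits for cautious and non-cautious modes, so the caller is assumed to supply one.

```python
def stage_1_done(known_stage_1: int, outer_screen: int,
                 elapsed: float, time_limit: float) -> bool:
    """True when Stage 1 may end and the processor may proceed to Stage 2."""
    everyone_accounted_for = known_stage_1 == outer_screen
    timed_out = elapsed >= time_limit
    return everyone_accounted_for or timed_out


# Processors 0..2 made up the old configuration; processor 2 never reports.
still_waiting = stage_1_done(0b011, 0b111, elapsed=0.6, time_limit=2.0)
gave_up = stage_1_done(0b011, 0b111, elapsed=2.0, time_limit=2.0)
all_present = stage_1_done(0b111, 0b111, elapsed=0.3, time_limit=2.0)
```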
FIG. 10 summarizes the transition from the beginning of Stage 1 to the end of Stage 1. At the end of Stage 1, KNOWN_STAGE_1 #_750a identifies those processors that this processor recognizes as valid processors with which to communicate during the current incident. In later stages, the processor accepts packets only from recognized processors.
Stage 2 builds the new configuration by adding to the set of processors recognized by the processor all of those processors recognized by recognized processors, step #_850. In effect, the new configuration is a consensus among communicating peers.
FIG. 11 summarizes conditions at the beginning of Stage 2. The processor sets the Stage #_720 to 2, records its status in KNOWN_STAGE_2, and copies KNOWN_STAGE_1 to the INNER_SCREEN #_740. The processor continues checking for input and broadcasting status periodically, testing incoming packets for acceptance against the OUTER_SCREEN and INNER_SCREEN #_730, #_740, step #_850.
Packets from old-configuration processors that did not participate in Stage 1 are identified by the INNER_SCREEN #_740 and ignored. Packets from recognized processors are accepted, and their configuration data is merged into the KNOWN_STAGE_n variables. When a packet from a recognized processor identifies a previously unrecognized processor, the new processor is also added to the INNER_SCREEN #_740. Malatose processors that may have been too slow to join the current regroup incident in Stage 1 can thus still join in Stage 2.
When KNOWN_STAGE_2 #_750b becomes equal to KNOWN_STAGE_1 #_750a, no further changes to the configuration can occur. FIG. 12 summarizes conditions at the end of Stage 2. Stage 3 now begins.
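The Stage 2 screening and merging can be sketched as below. The function names are hypothetical. Two details are assumptions of the sketch: a newly identified processor is taken from the sender's KNOWN_STAGE_1 field, and it is admitted only if it also appears in OUTER_SCREEN, since the new configuration never contains a processor that was absent from the old one.

```python
def stage_2_receive(sender, outer_screen, inner_screen, known_stage, packet_known_stage):
    """Return the updated (inner_screen, known_stage) after one Stage 2 packet."""
    if not (inner_screen >> sender) & 1:
        # Sender is not a recognized processor: ignore the packet entirely.
        return inner_screen, known_stage
    # Merge the sender's knowledge into the local KNOWN_STAGE_n masks.
    merged = [mine | theirs for mine, theirs in zip(known_stage, packet_known_stage)]
    # Recognize processors that a recognized processor recognizes.
    inner_screen |= packet_known_stage[0] & outer_screen
    return inner_screen, merged


def stage_2_done(known_stage):
    # Stage 2 ends when KNOWN_STAGE_2 equals KNOWN_STAGE_1.
    return known_stage[1] == known_stage[0]


outer, inner = 0b0111, 0b0011
known = [0b0011, 0b0001, 0, 0]

# Processor 2 is not yet recognized, so its packet is ignored.
inner_a, known_a = stage_2_receive(2, outer, inner, known, [0b0100, 0b0100, 0, 0])
# Processor 1 is recognized and vouches for processor 2.
inner_b, known_b = stage_2_receive(1, outer, inner, known, [0b0110, 0b0010, 0, 0])
```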
At the beginning of Stage 3, as shown in FIG. 13, the processor increments the Stage #_720 and copies the new configuration to both the INNER_SCREEN and the OUTER_SCREEN #_740, #_730. A malatose processor can no longer join the new configuration as a healthy processor.
Message-system cleanup, step #_860, is performed as follows: The processors in the new configuration shut off the message system to any processor not in the new configuration. They discard any outstanding transmissions to any excluded processor and discard any incoming transmissions from it. Inter-processor traffic queues are searched for messages queued from requesters/linkers in the excluded processor but not canceled. Any uncanceled messages found are discarded. Inter-processor traffic queues are searched for messages queued from servers/listeners in the excluded processor but not canceled. Any uncanceled messages found are attached to a deferred cancellation queue for processing during Stage 4.
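The queue search described above can be illustrated with a small filter. The `QueuedMessage` record, its fields and the function name are hypothetical, and the sketch assumes that already-canceled messages are simply left in place, a case the text does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedMessage:
    cpu: int               # processor the message was queued from
    origin: str            # "requester" (requester/linker) or "server" (server/listener)
    canceled: bool = False


def clean_queue(queue, excluded):
    """Split a traffic queue into (kept, deferred) for the excluded processors."""
    kept, deferred = [], []
    for msg in queue:
        if msg.cpu not in excluded or msg.canceled:
            kept.append(msg)        # unaffected by this cleanup pass
        elif msg.origin == "requester":
            continue                # uncanceled requester message: discard
        else:
            deferred.append(msg)    # uncanceled server message: cancel in Stage 4
    return kept, deferred


queue = [
    QueuedMessage(cpu=1, origin="requester"),
    QueuedMessage(cpu=3, origin="requester"),
    QueuedMessage(cpu=3, origin="server"),
    QueuedMessage(cpu=3, origin="server", canceled=True),
]
kept, deferred = clean_queue(queue, excluded={3})
```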
This cleanup ensures that no message exchanges begun by a server/listener application in a processor in the new configuration remain unresolved because of exclusion of the other processor from the new configuration. All messages that could be sent to the excluded processor have been sent; and all messages that could be received from it have been received.
Most processor functions occur as bus or timer interrupt handler actions. Because some cleanup activities take a long time, they cannot be done with interrupts disabled. Instead, those activities are separated from others for the same stage and deferred.
The deferred cleanup is done through a message-system SEND_QUEUED_MESSAGES procedure that is invoked by the dispatcher (the process scheduler). The deferred activities are then performed with interrupts other than the dispatcher interrupt enabled most of the time.
Periodic checking for input and the broadcasting of status continues. When the deferred cleanup mentioned earlier finishes, the processor records its status in KNOWN_STAGE_3 #_750c.
Packets that make it past the INNER_SCREEN and the OUTER_SCREEN #_740, #_730 are merged into the KNOWN_STAGE_n variables #_750. When KNOWN_STAGE_3 #_750c equals KNOWN_STAGE_2 #_750b, all processors in the new configuration have completed similar cleanup and are all in Stage 3. FIG. 14 summarizes conditions at the end of Stage 3.
In Stage 4, the processor completes the cleanup actions of Stage 3 and notifies processes that one or more processor failures have occurred, step #_870. The processor increments the Stage #_720 to 4 and does the following: sets processor-status variables to show excluded processors in the down state; changes the locker processor, if necessary, for use in the GLUP protocol as described herein; processes messages deferred from Stage 3; manipulates I/O controller tables when necessary to acquire ownership; and notifies requesters/linkers.
Stage 4 is the first point at which failure of another processor can be known by message-system users in the current processor. This delay prevents other processes from beginning activities that might produce incorrect results because of uncanceled message exchanges with the failed processor.
The regrouping processor continues to check for input and to broadcast status, step #_870. When the deferred cleanup finishes, the processor records its status in KNOWN_STAGE_4 #_750d. FIG. 15 shows this action.
Packets that make it past the INNER_SCREEN and the OUTER_SCREEN #_740, #_730 are merged into the KNOWN_STAGE_n variables #_750. When KNOWN_STAGE_4 #_750d equals KNOWN_STAGE_3 #_750c, all processors in the new configuration have completed similar cleanup and are all in Stage 4. FIG. 16 summarizes conditions at the end of Stage 4.
At the beginning of Stage 5, the Stage #_720 becomes 5. One final broadcast and update occur. The OUTER_SCREEN #_730 contains what has now become the old configuration for the next regrouping incident. FIG. 17 shows this situation.
Finally, higher-level operating system cleanup can now begin. Global update recovery starts in the locker processor.
The processor does its own cleanup processing. Attempts to restart the failed processor can now begin.
Stopping and Restarting an Incident
A processor must complete Stages 2 through 4 within a predetermined time, 3 seconds in one embodiment. If it does not complete those stages within that time, some other processor has probably failed during the regrouping. Therefore, the incident stops and a new incident starts with the processor returning to the beginning of Stage 1. Any cleanup that remains incomplete at the restart completes during the stages of the new incident. Cleanup actions either have no sequencing requirements or have explicitly controlled sequences so that they are unaffected by a restart of the algorithm.
During the restart, the INNER_SCREEN and the OUTER_SCREEN #_740, #_730 are not reinitialized. By not changing these variables, the processor continues to exclude from the new configuration any processors that have already been diagnosed as not healthy. Processors known to be dead are excluded by the OUTER_SCREEN #_730. Processors previously recognized as healthy are the only ones with which the INNER_SCREEN #_740 permits the processor to communicate.
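A restart on a missed deadline might look like the following sketch. The record and function names are hypothetical; the 3-second limit is the one-embodiment value from the text. The sketch assumes that the restarted incident takes a new sequence number in the same way as the Stage 5 to Stage 1 transition, and it shows the point made above: both screens survive the restart.

```python
from dataclasses import dataclass, field

STAGES_2_TO_4_LIMIT = 3.0  # seconds; the one-embodiment value from the text


@dataclass
class Regroup:
    """Hypothetical per-processor regroup state."""

    stage: int
    sequence_number: int
    outer_screen: int
    inner_screen: int
    stage_2_started_at: float
    known_stage: list = field(default_factory=lambda: [0, 0, 0, 0])


def restart_if_overdue(r: Regroup, my_cpu: int, now: float) -> bool:
    """Restart the incident if Stages 2 through 4 ran too long; report whether it did."""
    in_timed_stages = 2 <= r.stage <= 4
    if not (in_timed_stages and now - r.stage_2_started_at > STAGES_2_TO_4_LIMIT):
        return False
    r.sequence_number += 1           # assumption: a restart counts as a new incident
    r.stage = 1
    r.known_stage = [0, 0, 0, 0]
    r.known_stage[0] = 1 << my_cpu
    # outer_screen and inner_screen are intentionally left unchanged.
    return True


r = Regroup(stage=3, sequence_number=4, outer_screen=0b0111, inner_screen=0b0011,
            stage_2_started_at=100.0, known_stage=[0b0011, 0b0011, 0b0001, 0])
on_time = restart_if_overdue(r, my_cpu=0, now=102.0)
too_late = restart_if_overdue(r, my_cpu=0, now=103.5)
```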
The processor accepts status only from recognized processors. Therefore, only a recognized processor can add another processor to the configuration before the end of Stage 2. As Stage 2 ends and Stage 3 begins, the regrouping processors exclude the failing processor that caused the restart from the new configuration when the KNOWN_STAGE_2 #_750b is copied to the OUTER_SCREEN and INNER_SCREEN #_730, #_740. After Stage 2 ends, the configuration does not change until a new incident starts.
Power Failure and Recovery Regrouping
When a processor is powered up, it causes a new incident to start. A word in a broadcast status packet indicates that a power failure occurred so that receiving processors can clear bus error counters and refrain from shutting down the repowered processor's access to the busses or fabric. Depending on the characteristics of the inter-processor communications hardware (busses or fabrics), errors are more likely just after a power outage when components are powering on at slightly different times.
Effects of Inter-Processor Communications Path Failures
The effect on regrouping of a failure of inter-processor communications paths (IPCPs) depends on whether the failure is transient or permanent. A transient failure is one that allows occasional use of the IPCPs to transmit packets. A permanent failure is one that prevents any packet from passing through that component until the component is replaced.
Transient IPCP failures during Stage 1 normally do not affect regrouping. More than one attempt is made to transmit a status packet, and redundant communications paths are used for each packet. Transmission is almost always successful. If transmission on the redundant paths does fail, either the algorithm restarts or the processor stops.
A successfully transmitted packet can be received as one of three types: unique, because a transient IPCP failure occurred and the other copy of the packet could not be sent; duplicated, because it was received over redundant IPCPs; or obsolete, because a processor transmitted a status packet, had its status change, and then transmitted a new status packet, but one or more paths delivered the status packets out of order.
The regroup control template variables are updated by setting bits to 1 but never by setting them to 0. Duplicated, obsolete, or lost packets do not change the accuracy of the new configuration because a bit is not cleared by subsequent updates until a new incident starts. No harm follows from receiving packets out of order.
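The reason duplicates and reordering are harmless is that a bitwise OR is idempotent, commutative and associative. The short check below illustrates this with made-up packet masks; it is a demonstration of the property, not part of the algorithm.

```python
from functools import reduce
from itertools import permutations


def merge(mask: int, update: int) -> int:
    # Bits are only ever set, never cleared, within one incident.
    return mask | update


packets = (0b0001, 0b0011, 0b0100)  # illustrative KNOWN_STAGE_n updates
expected = reduce(merge, packets, 0)

# Deliver the packets in every possible order, each time repeating the first
# packet at the end to simulate a duplicate arriving over a redundant path.
results = {
    reduce(merge, order + order[:1], 0)
    for order in permutations(packets)
}
```

Every delivery order produces the same final mask, so `results` contains a single value equal to `expected`.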
The handling of permanent IPCP failures differs. When a processor cannot communicate with itself over at least one path, that processor halts with an error. This action means that when all redundant IPCPs fail, the system halts all processors automatically. Regrouping becomes irrelevant.
Failure of an IPCP element or IPCP-access element does not affect regrouping as long as one two-way communication path remains between two processors. A processor that cannot communicate with at least one other processor halts itself through the monitoring function of the regrouping processor.
A processor that can communicate with at least one other processor is included in the new configuration because the new configuration is achieved by consensus. When each processor receives a status packet, it adds the reported configuration to update its own status records. This combined configuration is automatically forwarded to the next processor to receive a status packet from the updating processor.
For example, consider the following situation: Given redundant IPCPs X and Y, processors 0 and 2 can send only on IPCP X and receive only on IPCP Y. Processor 1, on the other hand, can receive only on IPCP X and send only on IPCP Y. Thus, processors 0 and 2 have a communication path with processor 1. Eventually, all three processors will have the same new configuration. The processor status information from both processors 0 and 2 will have been relayed through processor 1.
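The relay effect in this example can be simulated directly. The send and receive tables below encode the failure pattern from the text; the round-based exchange loop is a simplification chosen for the sketch, not the actual broadcast timing.

```python
# Which IPCPs each processor can still send and receive on (from the example).
SEND = {0: {"X"}, 1: {"Y"}, 2: {"X"}}
RECV = {0: {"Y"}, 1: {"X"}, 2: {"Y"}}


def can_deliver(src: int, dst: int) -> bool:
    # A packet gets through if some path is usable at both ends.
    return bool(SEND[src] & RECV[dst])


# Each processor starts out knowing only itself (one bit per processor).
known = {cpu: 1 << cpu for cpu in SEND}

changed = True
while changed:
    changed = False
    snapshot = dict(known)  # every processor broadcasts its current view
    for src in SEND:
        for dst in SEND:
            if src != dst and can_deliver(src, dst):
                merged = known[dst] | snapshot[src]
                if merged != known[dst]:
                    known[dst] = merged
                    changed = True
```

Processors 0 and 2 never exchange a packet directly, yet all three end up with the same mask because processor 1 forwards what it has learned.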
Unresolved Failure Scenarios
The pre-existing regroup algorithm works well for processor failures and malatose processors. There are, however, certain communications failure scenarios for which it does not work well. In understanding these scenarios, conceive of a working multi-processing system (such as a NONSTOP® KERNEL operating system) logically as a connected graph in which a vertex represents a functioning processor and an edge represents the ability for two processors to communicate directly with each other. For a system to operate normally, the graph must be fully connected, i.e., all processors can communicate directly with all other processors. A logical connection must exist between every pair of processors.
(The graph is a logical interconnection model. The physical interconnect can be a variety of different topologies, including a shared bus in which different physical interconnections do not exist between every pair of processors.)
In the first scenario, two processors in the system come to have inconsistent views of the processors operating in the system. They disagree about the set of vertices composing the graph of the system. A "split brain" situation is said to have occurred. This split-brain situation can lead each of the primary and backup of an I/O process pair that resides across the split brain to believe that it is the primary process, with data corruption as a result.
Generally, split-brain situations can occur if communication failures break up a system into two or more distinct clusters of processors, which are cut off from one another. The connectivity graph of the system then breaks into two or more disjoint connected graphs.
In the second scenario, communication failures result in the connectivity graph becoming only partially connected. This happens when communication between a pair of processors fails completely in spite of redundant paths. When one of the processors notices that it has not received IamAlive messages from the other for a certain period, it activates a regroup operation. If, however, there is a third processor with which the two can communicate, the pre-existing regroup operation decides that all processors are healthy and terminates without taking any action. A message originating on either of the processors and destined to the other processor hangs forever: Both processors are healthy, and a fault-tolerant message system guarantees that messages will be delivered unless the destination processor or process is down. Until a regroup operation declares the destination processor down, the message system keeps retrying the message but makes no progress since there is no communication path between the processors.
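The difference between the first two scenarios can be made concrete with the connectivity graph. The sketch below classifies a graph as fully connected, partially connected, or split into disjoint clusters; the function names and the three labels are illustrative choices, not terms defined by the regroup algorithm.

```python
from itertools import combinations


def components(nodes, links):
    """Group processors into clusters that can reach one another."""
    remaining = set(nodes)
    groups = []
    while remaining:
        stack = [remaining.pop()]
        group = set(stack)
        while stack:
            cur = stack.pop()
            for other in list(remaining):
                if frozenset((cur, other)) in links:
                    remaining.discard(other)
                    group.add(other)
                    stack.append(other)
        groups.append(group)
    return groups


def classify(nodes, edges):
    links = {frozenset(e) for e in edges}
    if all(frozenset(pair) in links for pair in combinations(nodes, 2)):
        return "fully connected"      # normal operation
    if len(components(nodes, links)) == 1:
        return "partially connected"  # second scenario: some pair cannot talk
    return "split"                    # first scenario: disjoint clusters


normal = classify([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
partial = classify([0, 1, 2], [(0, 1), (1, 2)])
split = classify([0, 1, 2, 3], [(0, 1), (2, 3)])
```

In the partially connected case, processors 0 and 2 can each reach processor 1, which is exactly the situation in which the pre-existing algorithm finds every processor healthy and takes no action.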
In this second scenario, the whole system can hang due to one or more of the following circumstances: The global update (GLUP) protocol (described in U.S. Pat. No. 4,718,002 (1988), incorporated herein by reference) that is used for updating the replicated kernel tables assumes that a processor can communicate with all healthy processors in the system. If GLUP starts on a processor that cannot communicate with one of the healthy processors, the GLUP protocol hangs in the whole system, preventing the completion of activities such as named process creation and deletion. A system may also hang if a critical system process hangs waiting for the completion of a hung message.
Such system hangs could lead to processors halting due to the message system running out of resources.
Where the inter-processor communication path is fault-tolerant (e.g., dual buses) while the processors are fail-fast (e.g., single fault-detecting processors or lock-stepped processors running the same code stream, where a processor halts immediately upon detecting a self-fault), a communication breakdown between a pair of processors is far less likely than the failure of a processor. However, a software policy of downing single paths due to errors increases the probability of this scenario.
Further, with the introduction of complex cluster multi-processor topologies, connectivity failure scenarios seem more likely. These could be the result of failures of routers, defects in the system software, operator errors, etc.
In the third scenario, a processor becomes unable to send the periodic IamAlive messages but nonetheless can receive and send inter-processor communication messages. (Such a situation results from, for example, corruption of the time list preventing the reporting of timer expirations to the operating system.) One of the other processors readily detects this failure of the processor and starts a regroup incident. However, since the apparently malatose processor can receive the regroup packets and can broadcast regroup packets, the faulty processor fully participates in the regroup incident. This participation is sufficient to convince the other processors that the apparently malatose processor is in fact healthy. The processors quickly dub the regroup incident a false start and declare no processors down. A new regroup incident nonetheless starts the next time a processor detects the missing IamAlives. Thus, the system goes through periodic regroup events at the IamAlive-checking frequency (e.g., once per 2.4 seconds), which terminate almost immediately without detecting the failure.
Accordingly, there is a need for a multi-processor regroup operation that avoids these split-brain, partial-connection and timer-failure scenarios.
A goal of the present invention is a multi-processor computer system wherein the constituent processors maintain a consistent image of the processors composing the system.
Yet another goal of the present invention is a multiprocessor computer system wherein the constituent processors are fully connected when the system is stable.
Yet another object of the present invention is a multiprocessor computer system wherein the failure of a processor to receive timer expirations is detected and the processor is declared down.
Another goal of the present invention is such a multi-processor system, where said processors are maximally fully connected when the system is stable.
An object of the invention is such a multi-processor system, where the system resources (particularly, processors) that may be needed for meeting integrity and connectivity requirements are minimally excluded.
Another object of the invention is such a multiprocessor system where, when regrouping, the system takes into account any momentarily unresponsive processor.
These and other goals of the invention will be readily apparent to one of ordinary skill in the art on the reading of the background above and the description following.