This invention relates generally to fault-tolerant distributed or clustered multiprocessor systems. More particularly, the invention relates to methods, and apparatus implementing those methods, for improving the resilience of a multiprocessor system in partial and total communication failure scenarios, and in the face of failure of periodic or timed events on a constituent processor. Thereby, system fault tolerance is improved.
One inventive method of achieving fault-tolerant capability in distributed multiprocessor architectures is to detect processor failures quickly with an "I'mAlive" protocol (described in U.S. Pat. No. 4,817,091). Briefly, the I'mAlive protocol involves each processor of the system periodically sending or otherwise broadcasting I'mAlive message packets to each of the other processors in the system. Each processor determines whether another processor is operational by timing I'mAlive packets from it. When a processor sees that the time interval has passed without receipt of an I'mAlive packet from a given processor, the first processor decides that the silent processor might have failed.
The I'mAlive protocol message scheme is often combined with a "process pair" mechanism, comprising a primary process installed and running on one processor of the multiple processor system with a copy of that process operating as a backup process on another processor of the system. Periodic updates are sent from the primary process to the backup process so that should the primary process, or the processor it runs on, fail, the backup can take over for the primary process with minimal interruption. An example of the use of process pairs can be found in the above-identified '091 patent.
Unfortunately, there are situations in which a processor will be late in broadcasting its required I'mAlive message, in turn causing one or more of the other processors to assume that the tardy processor has failed. These situations, in turn, can give rise to such actions (to name a few) as: (1) both of the processes of a process pair (running on different processors) regarding themselves as the primary, destroying the ability to perform backup functions and possibly corrupting files; or (2) all system processors becoming trapped in infinite loops, contending for common resources; or (3) corrupting various system tables. Although such situations are rare, they are possible and have been observed in systems developed prior to implementation of a "Regroup" technique (described below). For fault tolerant systems, such situations must be made practically non-existent.
To supplement the I'mAlive protocol, and to avoid the problems referred to above, a technique referred to as "Regroup" was developed. Triggered when a processor fails to see an I'mAlive message within a prescribed check period, Regroup begins with messages being exchanged between all processors of the system in order to enlist them in the Regroup operation. Regroup then employs a consensus algorithm to determine the true state of each processor in the system by having each volunteer its record of the state of all other processors. Each processor compares its own record with records from other processors and updates its record accordingly. When the consensus is complete, all processors have the same record of the system's state. The processors will have coordinated among themselves to reintegrate functional but previously isolated processors and to correctly identify and isolate nonfunctional processors.
Later developments have refined the Regroup technique, allowing its use to determine membership even when physical communication among processors is lost. See, for example, U.S. Pat. No. 5,884,018 for "Method and Apparatus for Distributed Agreement on Processor Membership in a Multi-Processor System." To the extent necessary, said U.S. Pat. No. 5,884,018 is incorporated by reference as if fully set forth herein.
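The receiver-side timing check described above can be sketched as follows. This is a minimal illustration, not the implementation of the '091 patent: the class name `ImAliveMonitor`, its methods, and the use of plain floating-point timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ImAliveMonitor:
    """Receiver-side bookkeeping for I'mAlive packets (hypothetical sketch)."""

    check_period: float  # assumed length of the check interval, in seconds
    last_seen: dict = field(default_factory=dict)  # sender id -> arrival time

    def record_packet(self, sender: int, now: float) -> None:
        # Remember when the most recent I'mAlive packet from `sender` arrived.
        self.last_seen[sender] = now

    def suspects(self, peers, now: float) -> list:
        # A peer is suspect when no packet arrived within the last check
        # period. "Suspect" means "might have failed", not "has failed".
        return sorted(
            p for p in peers
            if now - self.last_seen.get(p, float("-inf")) > self.check_period
        )


monitor = ImAliveMonitor(check_period=1.2)
monitor.record_packet(sender=1, now=0.0)
monitor.record_packet(sender=2, now=1.0)
print(monitor.suspects(peers=[1, 2, 3], now=1.5))  # prints [1, 3]
```

Processor 1 is flagged because its last packet is older than the check period, and processor 3 because no packet was ever seen from it; in both cases the monitor only raises a suspicion that a later step would have to confirm.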
As indicated above, a missing or tardy I'mAlive message can cause each of the two processes of a process pair to operate as if it is the primary process. This, in turn, gives rise to the possibility of data corruption caused by both processors of the pair trying to write (and/or overwrite) portions of a disk drive (or other I/O controller) managed by the two processes. To avoid this situation, Regroup is structured to require each processor, at the start of a Regroup event, to invoke a "Hold I/O" state in which all input/output transmissions, except those necessary to the Regroup operation, are suspended. The Hold I/O state is turned off for those processors determined to be fit to continue operation. Any processor(s) determined to be faulty (either on its own, or by the Regroup operation) will continue the Hold I/O state until it halts. In addition, to preclude premature takeover of a process's operations by its backup process, the first stage of Regroup is "stalled" (i.e., extended) long enough to ensure that all processors have entered the Regroup operation and invoked the Hold I/O state. It has been determined that the stall period is at least two I'mAlive checking periods (2Tcheck), plus a safety margin to cover message transit time and to account for small differences in the processor clocks in the multiple processors of the system; one extra Ttick more than covers these small effects. (Ttick, approximately 0.3 seconds, is a basic time unit in the I'mAlive messaging operation as well as the Regroup algorithm. In addition to its use in "pacing" the sending of I'mAlive messages every two Ttick intervals and checking every four Ttick intervals, the Regroup operation, when activated, proceeds in multiple rounds, the interval between which is Ttick. That is, when Regroup is active, on every Ttick interval, every processor sends its Regroup messages to all other processors.)
The stall period described in the previous paragraph is known as stage 1 of the Regroup operation. Under certain special circumstances, a processor participating in a Regroup operation may determine that the duration of the stall period (or stage 1) must be extended beyond 2Tcheck+Ttick. This is referred to as a "cautious mode" Regroup operation. While in cautious mode, the processor invokes Hold I/O and waits longer for stage 1 to complete. The situations leading to a processor entering cautious mode are:
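The arithmetic behind the minimum stall period can be worked through in a few lines. The sketch below assumes that Tcheck equals four Ttick intervals (the checking cadence quoted above) and uses the approximate Ttick of 0.3 seconds; the function names are hypothetical.

```python
T_TICK_SECONDS = 0.3  # approximate value of Ttick given in the text


def stall_ticks(check_ticks: int = 4) -> int:
    # Minimum stage-1 stall, 2*Tcheck + Ttick, expressed in Ttick units.
    # check_ticks=4 assumes that Tcheck spans four Ttick intervals.
    return 2 * check_ticks + 1


def stall_seconds(t_tick: float = T_TICK_SECONDS, check_ticks: int = 4) -> float:
    # The same minimum stall converted to seconds.
    return stall_ticks(check_ticks) * t_tick


print(stall_ticks())  # prints 9
print(f"{stall_seconds():.1f} s")
```

Under these assumptions the minimum stall is nine Ttick intervals, or roughly 2.7 seconds, during which every processor that has entered Regroup keeps its I/O on hold.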
1. The processor detects that two or more peer processors are missing in the current Regroup operation. Note that this situation is possible if the processor is isolated, or if the system is subject to multiple processor and/or communication path failures.
2. The Regroup operation is restarted due to processor or communication failures that occur during the Regroup operation itself.
3. The Regroup operation is started because the multiprocessor system is recovering from a power outage. Because the recovery speed of individual processor units may vary, processor units must wait longer in stage 1 of the Regroup operation.
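The three conditions listed above amount to a simple predicate. The following sketch shows one way to express it; the function and parameter names are assumptions introduced for illustration, and how a processor actually learns each input is outside the scope of the example.

```python
def should_enter_cautious_mode(
    missing_peers: int,
    restarted_during_regroup: bool,
    recovering_from_power_outage: bool,
) -> bool:
    """Return True when stage 1 should be extended beyond 2Tcheck+Ttick."""
    return (
        missing_peers >= 2               # condition 1: two or more peers missing
        or restarted_during_regroup      # condition 2: Regroup itself was restarted
        or recovering_from_power_outage  # condition 3: power-outage recovery
    )


print(should_enter_cautious_mode(1, False, False))  # prints False
print(should_enter_cautious_mode(2, False, False))  # prints True
```

A single missing peer does not trigger cautious mode on its own; any one of the three conditions is sufficient.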
The I'mAlive protocol is most often employed in a prioritized interrupt scheme in which the work done by the processor in order to construct and send I'mAlive messages is assigned a low priority level. Message construction allows the I'mAlive message to provide a fair indication of the state of well-being or health of the sending processor. It informs the other processors that the sending processor is capable of doing more than just sending I'mAlive messages. However, higher priority tasks can delay the I'mAlive interrupt, and the sending of an I'mAlive message, long enough so that another of the processors of the system, failing to see the I'mAlive message within the requisite time, will initiate the Regroup operation. Regroup operations that are triggered by late sending of I'mAlive messages (as opposed to a processor failure or communications path failure) can be considered a "false alarm." Regroup will run, and discover no real problem. This may happen repeatedly, even frequently, under certain types of heavy processor loads such as a high rate of I/O interrupts requiring substantial processing. These false alarm Regroup operations not only use resources unnecessarily, they are somewhat risky in possibly halting a tardy but otherwise healthy processor, and they obscure diagnosis of what is really happening in the system. A candidate solution to this problem is to increase the frequency with which the I'mAlive messages are sent. However, there is an upper bound to increasing the frequency of I'mAlive message broadcasts where they begin to significantly impact upon the efficiency of processor operations.
Another candidate solution would be to decrease the frequency of I'mAlive checking (but not necessarily the frequency of sending). This would almost certainly decrease the probability of false alarm Regroup operations, but at the significant cost of increasing the failure detection latency and also extending the duration of the Hold I/O time period. An important goal of the present invention is both to decrease the failure detection latency and to reduce the duration of the Hold I/O (described below) period.
Thus, it can be seen that there exists a need to reduce both the time for detecting a problem and entering Regroup and the time needed for the Hold I/O state.
Broadly, the present invention supplements the I'mAlive method with periodic path probe messages that function to check all communicative paths (and, to some extent, the operability of the destination processor unit) between every processor unit of a distributed system and all other processor units of that system. Path probe messages are structured to be sent at greater frequencies than I'mAlive messages, resulting in quicker detection of problems and a much faster initiation of the Regroup operation. In addition, the present invention produces a significant reduction of the "stall" period for ensuring that all processor units have entered the Regroup operation and invoked the Hold I/O state.
The invention is preferably used in a multiprocessor system in which multiple processor units are communicatively intercoupled to form a clustered system. According to the invention, each of the processor units will periodically send to all other processor units, on all available interconnecting paths, a path probe message. All processor units will respond to receipt of all received path probe messages with an acknowledgment (ACK) or a negative acknowledgment (NACK). If a sending processor unit fails to receive an ACK from any of the processor units to which a path probe message was sent within a prescribed amount of time, or receives a NACK response, a Regroup operation will be initiated to determine processor unit membership in the system, and the connectivity of those processor units. According to one embodiment of the invention, every processor unit entering the Regroup operation will invoke a "Hold I/O" to suspend input/output activity (other than Regroup messages).
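The sender-side decision described above (missing ACK within the prescribed time, or any NACK, leads to Regroup) can be sketched as follows. The data layout, with probes keyed by a `(peer, path)` tuple, and the names `Reply` and `probe_round_needs_regroup` are assumptions made for this illustration; the text does not specify how outstanding probes are tracked.

```python
from enum import Enum


class Reply(Enum):
    ACK = "ack"
    NACK = "nack"


def probe_round_needs_regroup(sent_at, replies, now, timeout):
    """Decide whether the sender should initiate a Regroup operation.

    sent_at: dict mapping (peer, path) -> time the path probe was sent
    replies: dict mapping (peer, path) -> Reply received so far
    timeout: the prescribed time allowed for an ACK (hypothetical parameter)
    """
    for key, t_sent in sent_at.items():
        reply = replies.get(key)
        if reply is Reply.NACK:
            return True  # explicit negative acknowledgment
        if reply is None and now - t_sent > timeout:
            return True  # no ACK within the prescribed time
    return False


sent = {(1, "X"): 0.0, (1, "Y"): 0.0, (2, "X"): 0.0}
got = {(1, "X"): Reply.ACK, (1, "Y"): Reply.ACK}
print(probe_round_needs_regroup(sent, got, now=0.05, timeout=0.1))  # prints False
print(probe_round_needs_regroup(sent, got, now=0.20, timeout=0.1))  # prints True
```

In the second call the probe to processor 2 on path X has gone unanswered past the timeout, so the sender itself concludes that Regroup is needed, without waiting for any receiver-side check.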
The Hold I/O ensures, in the case of a full disconnect between two different processors, that a takeover won't happen until it is guaranteed that both primary and backup processes of an I/O process pair won't both act as primary (owner) of a particular disk drive (or other I/O controller). All processor units will have the I/O transfers on hold, and will wait until the Regroup operation completes before resuming such transfers.
In another embodiment of the invention, processor units will not invoke Hold I/O upon entering the Regroup operation. Rather, when Regroup establishes that certain of the processor units, if any, must halt (e.g., because they no longer have the required connectivity for communicating with the other processor units of the system), those processor units will then invoke Hold I/O. Again, the Regroup operation is extended ("stalled") a predetermined amount of time to ensure that all halting processor units have invoked Hold I/O.
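The difference between the two embodiments comes down to which processor units suspend I/O. The sketch below contrasts them under simplifying assumptions: the function name, the `hold_on_entry` flag, and the representation of processor units as integers are hypothetical, and the Regroup decision about which units must halt is taken as a given input.

```python
def units_holding_io(all_units, halting_units, hold_on_entry: bool) -> set:
    """Return the processor units that have invoked Hold I/O.

    hold_on_entry=True  models the first embodiment: every unit that
                        enters Regroup suspends its I/O.
    hold_on_entry=False models the second embodiment: only units that
                        Regroup has determined must halt suspend I/O.
    """
    if hold_on_entry:
        return set(all_units)
    return set(halting_units) & set(all_units)


units = {0, 1, 2, 3}
must_halt = {3}  # assumed outcome of the Regroup connectivity decision
print(sorted(units_holding_io(units, must_halt, hold_on_entry=True)))   # prints [0, 1, 2, 3]
print(sorted(units_holding_io(units, must_halt, hold_on_entry=False)))  # prints [3]
```

In the second embodiment the surviving units never pause their I/O, which is why the stall only needs to cover the time for the halting units to invoke Hold I/O.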
The present invention offers a number of advantages, which are achieved by both of its embodiments. First, since the path probe messages of the invention are used primarily to check the health of the communication between the sending processor unit and all other processor units of the distributed system, a higher priority interrupt handler can be used (assuming the processor unit has a prioritized interrupt structure), permitting path probe messages to be sent with greater frequency so that communication problems are detected much more quickly. In fact, the sending of path probe messages would preferably use an interrupt handler with a priority higher than that of other interrupt handlers in order to avoid having path probe transmissions potentially starved by other (higher priority) interrupt handlers.
Further, path probes are meant primarily to check interprocessor communication by eliciting a response from the target processor unit. As such, path probe message construction can be minimal and, therefore, does not have the overhead of I'mAlive message transmission. Thus, path probe operation does not have more than a minimal impact on the performance of a processor unit.
The use of path probe transmissions can check the connectivity to a destination processor unit and the destination processor unit will not even know that it is being "probed." This is possible if proper hardware support is provided. The destination processor unit will not be interrupted by a path probe message. An ACK or a NACK will be returned by the hardware to the sender processor unit, depending upon the outcome of certain permission checks (described below) accomplished on the path probe message by the destination processor unit.
Further, in contrast to the I'mAlive method used without the supplemental path probing taught by the present invention, in which it is the receiver that checks for a lack of I'mAlive messages, the present invention employs the sending processor unit to check for a timely response to transmitted path probe messages. There are expected latencies attendant with having a receiver of a message initiate time-based checking. The present invention can therefore significantly accelerate detection of a failed processor or communications path by using the message sender as the checking entity of the paths.