Loosely coupled, distributed multiprocessor computing systems were known in the prior art and have been used in a wide variety of applications and environments. Control information in such systems has been kept by each of the multiple processors in order to ensure that the processors have been operable as a single, unified multiprocessing system. Changes (i.e. updates) to the control information in one processor have required updates to the control information in every other processor, so that the control information kept by the processors has been consistent throughout the system.
In order for the multiple processors to remain coordinated and informed of current system status, global update messages have been broadcast by any one of the processors, acting as a sender processor, to all of the other currently operational processors throughout the system. Herein, a "global update" means an operation carried out in a distributed multiprocessor computing system which makes a consistent change to the control information in all of the operational processors of the system.
For example, an input/output (I/O) device of such a system might have been operated from more than one processor, but the system (all processors) must have agreed as to which processor was to be controlling a given I/O device at any particular time. When control of an I/O device was passed from one processor to another processor, each of the processors of the system was informed of this fact by, e.g., a global update message broadcast from a sender processor to each of the other processors. The update was then made in each processor, so that all processors had accurate, current and consistent control information relating to the system.
In order for control information in each processor to remain consistent, accesses and updates to control information are typically carried out as "atomic operations". An atomic operation is indivisible: one which is begun and completed before it is treated as valid. In order for an operation to be considered to be atomic, each access to control information should be consistent and should not obtain partially updated data; each global update should be carried out successfully on all currently operational processors or on none of them; successive global updates should occur in the same order on all processors; and, each global update procedure should be carried out within some maximum time limit. While atomic operations are desirable for accessing and updating control information, in the prior art, atomic operations provided tolerance for single faults only. Also, it is to be understood that individual processor accesses to control information occur much more frequently than global updates of such information.
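By way of illustration only, two of the conditions listed above (all-or-none application and identical ordering) can be expressed as a single check over per-processor update logs. The following Python sketch is not taken from any prior system; the function name `updates_are_consistent` and the log representation are hypothetical, and the conditions concerning partially updated reads and the maximum time limit are not modelled.

```python
def updates_are_consistent(logs):
    """Return True if every operational processor applied the same
    global updates in the same order.

    `logs` maps a processor number to the ordered list of update
    identifiers that processor has applied (hypothetical representation).
    Identical logs imply both all-or-none application and same ordering.
    """
    sequences = list(logs.values())
    if not sequences:
        return True  # no operational processors, nothing to compare
    return all(seq == sequences[0] for seq in sequences)


# All three up processors applied updates 1 and 2 in the same order.
good = {0: [1, 2], 1: [1, 2], 2: [1, 2]}
# Processor 2 never received update 2; processor 1 applied them reversed.
missing = {0: [1, 2], 1: [1, 2], 2: [1]}
reordered = {0: [1, 2], 1: [2, 1], 2: [1, 2]}

print(updates_are_consistent(good))
print(updates_are_consistent(missing))
print(updates_are_consistent(reordered))
```

Comparing whole logs is a deliberately coarse test: it reports only whether the operational processors agree, not which condition was violated.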
A processor failure during a global update broadcast has been a primary cause of inconsistency in control information in prior systems. If a failure of a sending processor occurred during an update broadcast, some receiving processors might have been updated and other processors might not have been updated. The failure of a sender processor destroyed the ability of the prior system to make the sender's update global, even though it had reached some of the other processors before the sender failed. Failure of a dedicated update message supervisor processor created the situation where inconsistent updates were likely to be made to control information in various processors throughout the system, leading to inconsistency and potentially to system breakdown.
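The failure mode just described can be illustrated with a small simulation. The Python sketch below is hypothetical throughout: the function `broadcast_update`, the `fail_after` parameter, and the `disk0_owner` entry are invented for illustration, and the broadcast is assumed to proceed one processor at a time in processor-number order.

```python
def broadcast_update(tables, sender, key, value, fail_after=None):
    """Apply an update to each processor's control table in turn.

    If `fail_after` is given, the sender is assumed to fail after that
    many deliveries, leaving the remaining processors un-updated.
    Returns True only if the broadcast reached every processor.
    """
    tables[sender][key] = value  # the sender applies its own update first
    delivered = 0
    for pid in sorted(tables):
        if pid == sender:
            continue
        if fail_after is not None and delivered >= fail_after:
            return False  # sender failed in mid-broadcast
        tables[pid][key] = value
        delivered += 1
    return True


# Four processors initially agree that processor 0 controls the device.
tables = {pid: {"disk0_owner": 0} for pid in range(4)}

# Processor 1 announces that it now controls the device, but fails
# after a single delivery.
completed = broadcast_update(tables, sender=1, key="disk0_owner",
                             value=1, fail_after=1)

# Views held by the surviving processors (processor 1 is down).
survivors = {pid: t["disk0_owner"] for pid, t in tables.items() if pid != 1}
print(completed, survivors)  # prints: False {0: 1, 2: 0, 3: 0}
```

After the simulated failure, processor 0 believes that the failed processor 1 controls the device, while processors 2 and 3 still believe that processor 0 does: the surviving processors hold inconsistent control information.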
Schemes are known in the prior art for updating of control information when multiple processors of a distributed multiprocessing system have failed. These schemes typically have been very complicated and have required the passing of an exponential number of messages between the remaining up processors, wherein the exponent is related to the number of surviving processors. Such schemes could quickly become computationally intractable. They required too much agreement between the remaining up processors, as well as requiring unacceptable levels of system resources. For example, extensive polling and voting procedures were invoked, and the consequences of majority decision-making led to further complications and delays.
One prior art system which has provided an improved level of single-fault tolerance is described in detail in U.S. Pat. No. 4,228,496; and, improvements and variations of the system described in this prior patent are currently offered by the assignee of the present invention in its NonStop™ family of computers. Such systems typically comprise from two to sixteen distributed processors that are connected by a pair of high-speed interprocessor buses. A bus controller enables any processor to send a message directly to any other processor. Interprocessor communications are carried out by messages, rather than by shared memory. Although shared memory is not used in the system described in the prior patent, shared memory may be employed for data storage in a multiprocessor system with some degradation of tolerance to faults: i.e. failure of the shared memory disables the entire system. Providing redundancy in the shared memory to increase fault tolerance significantly increases the number of messages that must be passed and degrades system throughput rate.
The system described in the referenced U.S. Pat. No. 4,228,496 is single-fault tolerant. That is to say, such system continues to function correctly despite the failure of any single component. Each of the two interprocessor buses is a separate component, so that up processors may continue passing message packets back and forth even if one bus fails. If both buses fail, then the processors cannot communicate with each other, and the prior patented system ceases to operate as intended.
In the prior patented system at least two processors can alternatively control every I/O device and other system function (resource). If any one processor fails, then the other processors act to provide the functions formerly provided by the failed processor. If more than one processor fails in the patented system, then the system may cease to provide all functions.
Each processor in the distributed multiprocessing system described in the prior patent is either in an "up" condition or state or is in a "down" condition or state. An "I'm alive" protocol is employed by each of the up processors of the prior patented system in order to detect the failure of a processor. This protocol is used to keep the control information of the up processors of the system current as to system resources presently available.
In executing the "I'm alive" protocol, about every n seconds, each processor sends an unsequenced acknowledgement message packet over each bus to every other processor. The message packet has two purposes: to recover from lost acknowledgements and to tell the other processors that the sender processor is up. About every 2n seconds, each processor checks whether it has received an unsequenced message packet from every other processor. If a message packet has not been received from a processor thought to be up, the receiver processor considers the sender processor to be down and adjusts its control information to remove the sender processor as an available processor.
The prior patented system used a fixed pair of processors which coordinated global updates. The prior system required a duplicate update message packet to be distributed to each of the coordinating processors. This duplication increased the number of messages required to carry out a global update and provided only single-fault tolerance: if both coordinating processors failed, the system failed to operate properly, since it was then not possible to maintain currency and consistency of control information in the remaining up processors.
Although the single-fault tolerance provided by the prior patented system was a substantial improvement in the art, certain situations have arisen wherein tolerance to multiple faults (failures) of processors is needed. A key requirement in providing tolerance to multiple faults is to maintain consistent control information in all processors remaining up after a processor failure, irrespective of the number of processors that have failed, and irrespective of the failure of the update coordination processor. When multiple failures occurred in the prior art systems, changes in the control information could not be broadcast to every up processor as an atomic operation.
Thus, a hitherto unsolved need has arisen for a simplified and more reliable communications method for a multiprocessor system in which global updates of control information are carried out successfully in the face of multiple faults: failures of multiple processors.