The present invention relates to a fault-tolerant distributed computer system, and, in particular, to a distributed computer system that relies upon timeouts to detect whether a component of the system has failed. That is, when the system waits more than a certain period of time for an action to occur, it declares that something is wrong. The present invention makes use of statistical methods to improve the system's timeliness by reducing the duration of the waiting period prior to the system issuing a timeout.
Distributed computer systems consist of individual computers interconnected by a network. The individual computers cooperate in executing a common program by exchanging messages over the network. The individual computers are called "the computing nodes" or simply "the nodes". The distributed computer system itself is called "a distributed system". In a fault-tolerant distributed system the distributed program continues to execute its instructions even if some nodes fail.
A node may fail by simply ceasing to execute instructions. Because such a node sends no spurious or corrupted information over the network, other nodes cannot sense the failure directly. The failed node merely remains silent, not responding to messages sent to it by other nodes. The absence of a response to a message is the sending node's only indication that the receiving node may have failed. However, even nodes that have not failed do not respond instantaneously. Transmission delays in the network account for some lack of responsiveness, and delays are further compounded when messages are service requests that the receiver must satisfy with its response.
Since absence of a response is insufficient for deciding that the receiving node has failed, the sending node has to set a limit on the time it will wait. This wait is called a timeout period. If the timeout period passes without the anticipated action, then a timeout has occurred.
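The waiting behavior just described can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `await_response` and the use of a queue to stand in for the network are hypothetical:

```python
import queue
import threading

def await_response(inbox, timeout_period):
    """Wait for a response from a remote node.  If the timeout period
    passes without the anticipated action, a timeout has occurred."""
    try:
        # Block until a message arrives or the timeout period elapses.
        return inbox.get(timeout=timeout_period)
    except queue.Empty:
        # Timeout: the anticipated response never arrived.
        return None

# Hypothetical usage: a timer thread stands in for a responsive node.
inbox = queue.Queue()
threading.Timer(0.05, inbox.put, args=("pong",)).start()
print(await_response(inbox, timeout_period=1.0))  # → pong
print(await_response(inbox, timeout_period=0.2))  # → None (silent node)
```

Note that `None` here only means no response was observed within the timeout period; as the passage explains, it does not by itself prove the remote node has failed.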
Deciding how long the timeout period should be is critical to the overall operation of the distributed system. Too short a timeout period can cause the system to regard an operating node as failed when it has merely prolonged its response because of a heavier-than-normal workload. Too long a timeout period allows a failed node to suspend system operation until the timeout occurs.
Watchdog timers, as described by Johnson (Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley, 1989, pages 68-69) and Lee and Anderson (Fault Tolerance: Principles and Practice, 2nd Edition, Springer-Verlag, 1990, pages 104-105), enable a timeout. To detect a lack of response, timing checks are imposed on tasks at mandatory intervals during execution. Prior to the end of each interval, receipt of an "I am alive" message is expected. The watchdog timer is set to a value that corresponds to the expected execution time until the next "I am alive" message. There must be leeway to compensate for slight variations in execution time within an interval. But it is easier to estimate the expected variation over several small intervals during the task than to estimate when the entire task should be completed. The more time that passes before a timing check, the longer a processor executing a task is exposed to factors that cause it to drift from what might be "expected." (Note that expectation here is not the mathematical expectation that corresponds to the mean of a statistical distribution. Instead its connotation is what the system designer believes is reasonable, based on the task's demands for algorithms and resources.)
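The interval-based check described above can be sketched as follows; a minimal illustration, assuming a hypothetical `WatchdogTimer` class in which a timely "I am alive" report resets the timer for the next interval:

```python
import threading

class WatchdogTimer:
    """Sketch of a watchdog timer: it must be kicked by an
    "I am alive" report before each interval expires, or it fires."""

    def __init__(self, interval, on_timeout):
        self.interval = interval        # expected time to the next check
        self.on_timeout = on_timeout    # handler invoked on a missed check
        self._timer = None

    def start(self):
        self._timer = threading.Timer(self.interval, self.on_timeout)
        self._timer.start()

    def kick(self):
        # An "I am alive" report arrived in time: reset for the next interval.
        self._timer.cancel()
        self.start()

    def stop(self):
        self._timer.cancel()

# Hypothetical usage: the task reports in before the interval expires.
missed = []
wd = WatchdogTimer(interval=0.2, on_timeout=lambda: missed.append(True))
wd.start()
wd.kick()   # "I am alive" received in time
wd.stop()   # task finished normally; no timeout was recorded
```

This also shows the drawback noted below: the task must interrupt its own execution to call `kick`, which is why "I am alive" traffic reduces throughput.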
Watchdog timers ease the constraints on limiting a waiting period. They do so by substituting several small problems, each of which is easier to solve, for the whole problem. But watchdog timers have drawbacks. Sending an "I am alive" message may be difficult unless the software was specifically written to support watchdog timers. Even if the software does support watchdog timers, sending a message requires that a task suspend execution, thereby reducing system throughput. In distributed systems, frequent "I am alive" messages create traffic that causes network congestion. System performance suffers, thereby compounding the problem of estimating the shortest practical timeout period.
The prior-art method of watchdog timers thus does not succeed in limiting the timeout period in fault-tolerant distributed systems. One alternative is to permit unbounded message delays, forego automating the calculation of an optimal timeout, and let a human operator detect the lack of system response. However, the comparatively slow reaction of a human eliminates this alternative from consideration for all applications but those few where an operator is present and a rapid response is not required.
Unfortunately the cases are rare where one can accept unbounded message delays. So we require a fault-tolerant distributed system that ensures the existence of an upper bound d on the timeout period for nodes that work properly. See Cristian, xe2x80x9cUnderstanding Fault-Tolerant Distributed Systems,xe2x80x9d 34 Communications of the ACM (February 1991), the disclosure of which is incorporated herein by reference.
To mitigate some unwanted side effects of watchdog timers, designers can reduce the number of timing checks by increasing the intervals between them. Considering only the completion time for the entire task can eliminate intermediate checks altogether. In either case, designers must choose a limit on the length of a timeout period. A timeout's duration is thus based on a designer's assumptions about the real-time behavior of the operating system, the maximum system load, and the application of massive redundancy to the communication hardware. The designer tries to ensure a negligible chance that network partitions will occur, as described by Cristian (ibid.). To avoid making a timeout period too short, the designer must base it on worst-case scenarios. Even though the worst case may be most unlikely, the prior art treats a conservative approach based on the worst case as superior to risking the inadvertent loss of operating nodes through premature timeout.
Timeout is important for working nodes to detect a node that fails by omitting a response. However, a failed node may send a timely response whose data is corrupted. To cope with such a failure, it is necessary to replicate a particular task on several nodes at the same time. By creating more than a single instance of the task, discrepancies can be detected by comparing the outputs from every node that executes the task. When three or more nodes execute the same task, the correct output is presumed to be the result of a "vote" among them. That is, each node offers its own solution to the task, and the system brings all the solutions together to decide which commands a majority. Each node that executes a redundant task communicates its results to a voter that collates all the results. Making the voter's output represent the majority result from the nodes masks an erroneous result. More than half the results must agree to form a majority.
The voter may be a specially designated node, or it may be distributed among the nodes. At the start of the task, nodes are synchronized so that voting takes place when the task is completed. A centralized voter that has independent connections to the nodes can collect their results in parallel. When the voter is distributed, each node broadcasts its results to the other nodes, so that each node determines the majority.
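The majority rule described above can be sketched as follows; a minimal illustration, assuming a hypothetical `vote` function applied to the collated results, not a definitive voter implementation:

```python
from collections import Counter

def vote(results):
    """Collate results from nodes that redundantly executed the same
    task.  Return the majority result, or None when no result commands
    more than half the votes, since only a true majority can safely
    mask an erroneous result."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # More than half the results must agree to form a majority.
    return value if count > len(results) / 2 else None

# Hypothetical usage: three nodes, one of which returned corrupted data.
print(vote([42, 42, 17]))   # → 42 (the erroneous result is masked)
print(vote([42, 17, 99]))   # → None (no majority; the vote is inconclusive)
```

The same function serves either arrangement described above: a centralized voter calls it once on the results it collected, while in a distributed voter each node calls it on the results broadcast to it.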
Timing is critical in voting. If results arrive at the voter at slightly different times, incorrect results can be generated temporarily. In many applications, an incorrect result cannot be allowed for even a very small period. Furthermore, if some of the initial results arrive at a voter simultaneously, a remaining node may be incorrectly declared faulty because its results arrived at the voter after voting took place. For these reasons, it is important to synchronize voting.
Some voting schemes permit unsynchronized nodes. The unsynchronized inputs are first marshaled and then fed simultaneously to the voter so that they appear synchronized. For example, Fletcher et al. (U.S. Pat. No. 3,783,250, issued Jan. 1, 1974, col. 5, line 58 ff.) teaches the use of buffer shift registers that allow the nodes supplying the voter inputs to be out of synchronization by as much as one-half word. Clearly there must be an assumed amount of permissible drift among the nodes, and a limit placed on how much the voter inputs may be unsynchronized. The problem of a failed node that may not respond is addressed by limiting the time spent on marshaling voter inputs. Eventually, through this implicit timeout period, the vote takes place. Hence a fault-tolerant computer system may employ a timeout to prevent having the voter wait indefinitely.
Avoiding indefinite waiting was a problem from the very beginning of distributed computer systems. A sender and a receiver had to cooperate for a distributed program to run, yet there had to be a way to detect a lack of response. Because the problem arises from a single event (i.e., a response), statistical techniques could not be applied: they require a minimum of two sample values from which to compute a mean and a standard deviation. Timeouts were therefore based on the assumptions of the designer. Later it became necessary in fault-tolerant distributed systems to tolerate nodes that sent corrupted data. Nodes could still fail by stopping, but to mask erroneous data it was necessary to use redundant nodes. The earlier structure of timeouts was simply carried over.
An early innovation in fault-tolerant distributed computing systems was to enforce a time limit on how long a node should wait for another node to respond. This limit was based on anticipating the longest possible delay. Otherwise, too short a limit could result in assuming that an operating node has failed when, in fact, its response is legitimately prolonged. Waiting for a response is clearly not productive; yet it is far better to be temporarily non-productive than to falsely declare a node as failed and stop using it altogether. As a result, reduced performance by incorporating timeout periods is presently accepted in the art as a necessary evil.
The present invention provides, in a fault-tolerant distributed computer system, an improved means for implementing timeouts among computing nodes that process the results of a redundant task. Until the present invention, the plurality of nodes in a redundant computation had not been used to sample response times statistically. This oversight is remedied in the apparatus and method of the present invention.
One object of the present invention is to provide apparatus and method for using the present behavior of the nodes executing a redundant task to forecast future behavior, thereby abandoning the prior art's reliance on worst-case scenarios.
Another object of the present invention is to provide apparatus and method that reduces the timeout period.
Still a further object of the present invention is to provide apparatus and method that has low computational overhead for recalculating a timeout period.
Briefly stated, the present invention teaches apparatus and method to reduce the duration of timeout periods in fault-tolerant distributed computer systems. When nodes execute a task redundantly and communicate their results over a network for further processing, it is customary to calculate timeouts on a worst-case basis, thereby prolonging their duration unnecessarily. The present invention instead applies Tchebychev's inequality, which holds for any statistical distribution, to adaptively determine the distribution of the arrival times of the results at the point where further processing of those results takes place, thereby reducing the duration of timeouts. Successively refining the statistical distribution of the arrival times leads to an improved forecast of future arrivals. Thus timeouts are kept to a minimum without compromising the reliability of the system.
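The adaptive calculation can be sketched as follows. Tchebychev's inequality guarantees, for any distribution, that P(|X - mu| >= k*sigma) <= 1/k**2; choosing k = sqrt(1/p) therefore bounds by p the chance that a working node's result arrives after a timeout set at mu + k*sigma. This is a minimal illustration under that reading, not the patent's implementation; the function name, sample values, and chosen probability are hypothetical:

```python
import math

def chebyshev_timeout(arrival_times, max_false_timeout_prob):
    """Forecast a timeout from observed arrival times of redundant
    results.  Tchebychev's inequality, which holds for any statistical
    distribution, gives P(|X - mu| >= k*sigma) <= 1/k**2, so a timeout
    of mu + k*sigma with k = sqrt(1/p) limits to at most p the chance
    of prematurely timing out a working node."""
    n = len(arrival_times)
    mu = sum(arrival_times) / n
    sigma = math.sqrt(sum((t - mu) ** 2 for t in arrival_times) / n)
    k = math.sqrt(1.0 / max_false_timeout_prob)
    return mu + k * sigma

# Hypothetical usage: arrival times (in ms) of redundant results at the
# collector, refined each time a new round of results arrives.
samples = [10.2, 11.0, 9.8, 10.5, 10.1]
print(round(chebyshev_timeout(samples, max_false_timeout_prob=0.01), 1))  # → 14.4
```

Because the bound holds for any distribution, no assumption about the shape of the arrival-time distribution is needed; as more samples accumulate, mu and sigma track actual behavior, so the timeout shrinks toward what the nodes really require rather than a worst-case figure.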
According to an embodiment of the invention, in a fault-tolerant, distributed computer system, apparatus to minimize duration of waiting prior to timeout comprises: a plurality of nodes; each of the nodes being capable of sending at least one message; and a collector effective for processing at least two of the messages received from the plurality of nodes to determine if any of the plurality of nodes is faulty.
According to a feature of the invention, in a fault-tolerant, distributed computer system, a method of minimizing duration of waiting prior to timeout comprises the steps of: attempting to send at least one message from each of a plurality of nodes to a collector; determining at the collector what information is contained in each said at least one message from at least two of the nodes; determining whether any of the plurality of nodes has failed to complete the step of attempting; and acting on the information to determine if any of the plurality of nodes is faulty.
According to another feature of the invention, in a fault-tolerant, distributed computer system, apparatus for minimizing duration of waiting prior to timeout comprises: means for sending at least one message from each of a plurality of nodes to a collector; first determining means for determining at the collector what information is contained in each said at least one message from at least two of the nodes; second determining means for determining whether any of the plurality of nodes has failed to send the at least one message; and means for acting on the information to determine if any of the plurality of nodes is faulty.
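The steps recited above can be sketched end to end as follows; a minimal illustration combining a bounded wait with a majority comparison, in which all names (`collect_and_judge`, the queue standing in for the network, the node labels) are hypothetical and not the patent's own terms:

```python
import queue

def collect_and_judge(inbox, node_ids, timeout_period):
    """Sketch of the claimed collector: gather at least one message
    from each of a plurality of nodes, note which nodes failed to send
    within the timeout period, and act on the received information by
    majority comparison to decide which nodes are faulty."""
    received = {}
    for _ in node_ids:
        try:
            node, result = inbox.get(timeout=timeout_period)
            received[node] = result
        except queue.Empty:
            break  # the remaining nodes failed the step of sending
    silent = set(node_ids) - set(received)
    # More than half the received results must agree to form a majority.
    values = list(received.values())
    majority = None
    for v in set(values):
        if values.count(v) > len(values) / 2:
            majority = v
    # A node is judged faulty if it stayed silent or disagreed
    # with the majority result.
    faulty = silent | {n for n, v in received.items()
                       if majority is not None and v != majority}
    return majority, sorted(faulty)

# Hypothetical usage: node D stays silent; node C sends corrupted data.
inbox = queue.Queue()
for msg in [("A", 7), ("B", 7), ("C", 9)]:
    inbox.put(msg)
print(collect_and_judge(inbox, ["A", "B", "C", "D"], timeout_period=0.2))
```

In this sketch the fixed `timeout_period` argument is the quantity that the statistical forecast described earlier would supply adaptively.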
These and many other objects and advantages of the present invention will be readily apparent to one skilled in the pertinent art from the following detailed description of a preferred embodiment of the invention and the related drawings.