This invention relates to a method of determining a uniform global view of the system status of a distributed computer network comprising at least three computers. The invention further relates to a distributed computer network for carrying out the method.
In distributed computer networks, changes in system status occasionally occur as a result of intended events (e.g., addition of a new computer) or unintended events (e.g., failure of a computer). On the occurrence of such a change, it must be ensured that the computers in the computer network get a uniform global view of the new system status as quickly as possible. The problem of how to bring about a uniform global view of the system status is frequently also referred to as a “membership problem”.
This membership problem is particularly important in distributed computer networks which are used to monitor and control processes critical with regard to safety, such as in railway signaling or in power plant technology. In such computer networks, the individual computers compare their results. Results are output to the process only if they were determined independently of each other by a majority of the computers. If, in a network of three computers, for example, one of the computers fails, the other two computers can continue to deliver results to the process. This requires, however, that these two computers have come to a uniform global view of the system status, i.e., there must be agreement upon which of the computers has failed and which of the computers are free from faults.
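The majority rule described above can be illustrated by the following minimal sketch. It is not taken from the specification: the function name `voted_output`, the use of `None` to mark a failed computer, and the choice to require a strict majority of all computers in the network are assumptions made only for illustration.

```python
from collections import Counter

def voted_output(results):
    """Return the result agreed on by a strict majority of all computers.

    `results` maps a computer name to the result it computed, or to None
    if that computer delivered nothing (hypothetical failure marker).
    Returns None when no result reaches a majority, in which case nothing
    would be output to the process.
    """
    total = len(results)
    counts = Counter(r for r in results.values() if r is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    # Assumption: the majority is counted against all computers of the
    # network, not only against those that are still responding.
    return value if votes > total / 2 else None

# Three computers, one of which has failed: the remaining two agree.
print(voted_output({"A": 42, "B": 42, "C": None}))
```

In the three-computer case, two matching results still form a majority, which is why the two fault-free computers must agree on which computer has failed.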
From a publication by L. E. Moser et al. entitled “Membership Algorithms for Asynchronous Distributed Systems”, 11th Int. Conf. on Distributed Computing Systems, Arlington, Tex., USA, May 1991, pages 480-488, different algorithms for solving the membership problem in an uncoupled distributed computer network are known. These algorithms are based on a failure hypothesis according to which the computers send either no messages or correct messages. The case where a computer sends erroneous messages is not assumed. The algorithms described use messages whose transmission is repeated if a receiver has not received the message. In addition, there are messages whose transmission is not repeated in such a case. This latter group includes, for example, the request messages, by which a computer notifies the other computers that it wants to become a member again. Such a request is granted by the other computers via specific grant messages. The algorithms described there are limited to uncoupled computer networks and cannot readily be applied to synchronous or virtually synchronous distributed networks.
It is therefore an object of the invention to provide a method of determining a uniform global view of the system status of a synchronous or virtually synchronous distributed computer network comprising at least three computers. Another object of the invention is to provide a distributed computer network for carrying out the method.
These objects are attained, according to the invention, by a system wherein communication among the computers is implemented in the form of transmission rounds. A transmission round is characterized in that in such a round, each of the computers receives a message from each of the other computers in the absence of an error. Each of the computers evaluates the messages received from the other computers and, based on the result of the evaluation, assigns one of at least three differently defined computer states to each of the other computers. In this manner, each computer determines its own local view of the system status. The computers exchange these local views. Each computer then determines a global view of the system status from the received local views, for example by subjecting the local views to a majority decision. As all of the computers have the same set of local views available, they all come to the same global view of the system status.
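The sequence described above (evaluate received messages, form a local view, exchange local views, derive a global view) may be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the three state names `OK`, `FAULTY` and `SILENT`, the `valid` flag in a message, the function names, and the tie-breaking rule (prefer the more severe state) are all hypothetical and do not appear in the specification.

```python
from collections import Counter
from enum import Enum

class State(Enum):
    # Hypothetical state names; the text only requires at least three states.
    OK = "ok"
    FAULTY = "faulty"
    SILENT = "silent"

# Assumed tie-break order: on equal votes the more severe state wins.
SEVERITY = {State.OK: 0, State.FAULTY: 1, State.SILENT: 2}

def local_view(me, received, members):
    """Assign a state to every other computer from one transmission round.

    `received` maps a sender to its message (a dict with a hypothetical
    'valid' flag); a missing sender means no message arrived in the round.
    """
    view = {}
    for other in members:
        if other == me:
            continue
        msg = received.get(other)
        if msg is None:
            view[other] = State.SILENT
        elif msg.get("valid", False):
            view[other] = State.OK
        else:
            view[other] = State.FAULTY
    return view

def global_view(local_views, members):
    """Derive one state per computer by majority over the exchanged views."""
    result = {}
    for target in members:
        votes = Counter(v[target] for v in local_views.values() if target in v)
        if not votes:
            continue  # nobody reported on this computer
        # Highest vote count wins; ties are resolved by the assumed severity.
        result[target] = max(votes.items(),
                             key=lambda kv: (kv[1], SEVERITY[kv[0]]))[0]
    return result

members = ["A", "B", "C"]
# Computer C has failed and sends nothing in this round.
views = {
    "A": local_view("A", {"B": {"valid": True}}, members),
    "B": local_view("B", {"A": {"valid": True}}, members),
}
print(global_view(views, members))
```

Because `global_view` is a deterministic function of the exchanged local views, every computer that evaluates the same set of local views arrives at the same result, which is the property the paragraph above relies on.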
This method places no exacting requirements on the synchrony of the transmission. It only requires that within a period of time which need not be fixed but must be finite and limited, each of the computers has received a message from each of the other computers. The method can thus be used with communications protocols according to which computers may send only during permanently assigned time slots, but also with a few communications protocols where such a fixed assignment does not exist.
Furthermore, use of the method according to the invention requires no specific sequences of operations to start up the distributed computer network, whereby the complexity of the computer network is reduced significantly.
Further advantageous features of the invention will be apparent from the description below and the appended claims.