1. Field of the Invention
This invention relates to a loosely coupled distributed computer system for real time applications consisting of a communication medium for serial communication and a number of node computers with a local real time clock in each node computer.
2. Description of the Prior Art
A distributed computer system consists of a number of autonomous computers (nodes) which are loosely connected by a communication subsystem (Local Area Network = LAN). Such a loose connection is characterized by the fact that only serial messages are exchanged between the nodes. Such a message is a sequence of bytes (typically in the range of ten to many hundreds of bytes) treated as an atomic unit. A message contains enough redundancy (error detecting codes, e.g. a CRC) that a change of its content resulting from a transmission error can be detected with a sufficiently high probability. It is therefore justified to assume that a message arrives either with correct content or not at all (an erroneous message is simply discarded).
According to the state of the art, every node of a distributed real time system contains its own local real time clock. The accuracy of these clocks is predominantly determined by the accuracy of the local quartz crystal, i.e. the relative error is on the order of 1 ppm. A real time application, i.e. a real time process which is controlled by a distributed real time system, requires the synchronization of the local real time clocks of each node. The synchronized time will be called the approximate global time, or global time for short. The accuracy of this synchronization determines the smallest units of time which can be measured by this system. In a distributed real time system this synchronization can be realized by the exchange of messages (it is then unnecessary to implement separate channels for the synchronization of the clocks). The synchronization procedure should be fault tolerant, i.e. a faulty clock or a missing message should be tolerated.
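To make the 1 ppm figure concrete, the following minimal sketch computes how far two unsynchronized clocks can drift apart. It assumes each clock's rate error is bounded by ±ρ and that neither clock is ever corrected; the function name `max_divergence` is hypothetical.

```python
def max_divergence(rho: float, interval_s: float) -> float:
    """Worst-case offset (seconds) that builds up between two free-running
    clocks after interval_s seconds, if each clock's rate error is bounded
    by +/- rho. One clock may run fast by rho while the other runs slow by
    rho, so their rates can differ by 2 * rho."""
    return 2.0 * rho * interval_s


RHO = 1e-6  # relative error on the order of 1 ppm, as for a quartz crystal

for label, seconds in [("1 s", 1.0), ("1 min", 60.0), ("1 h", 3600.0), ("1 day", 86400.0)]:
    print(f"{label:>6}: {max_divergence(RHO, seconds) * 1e6:10.1f} us")
```

Under this bound the clocks can separate by up to 2 µs per second, i.e. about 7.2 ms per hour; an (illustrative) agreement of 10 µs would be lost after 5 s unless the clocks are resynchronized.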
Algorithms useful for carrying out the process of the invention for the fault tolerant synchronization of real time clocks have been published. For example, in the research project SIFT a prototype of a highly reliable real time system utilizing one such known algorithm has been built and described (J. H. Wensley, et al., SIFT (Software Implemented Fault Tolerance): The Design and Analysis of a Fault Tolerant System for Aircraft Control, Proceedings of the IEEE, Vol. 66, No. 10, pp. 1240-1255, October 1978). A commercially available minicomputer was chosen for the node of the system. The synchronization algorithm disclosed therein is executed in the (single) CPU of this node, in parallel with the application software.
For further details, reference should be made to the SIFT publication, the disclosure of which is incorporated by reference. A copy of the SIFT publication has been filed and is of record in the parent U.S. patent application No. 747 014.
The algorithm is carried out in two parts. In the first part, each clock computes a vector of clock values, called the interactive consistency vector, having an entry for every clock. In the second part, each clock uses the interactive consistency vector to compute its new value. A clock p computes its interactive consistency vector as follows. The entry of the vector corresponding to p itself is set equal to p's own clock value. The value for the entry corresponding to another processor q is obtained by p as follows.
(1) Read q's value from q.
(2) Obtain from each other clock r the value of q that r read from q.
(3) If a majority of these values agree, then the majority value is used. Otherwise, the default value NIL (indicating that q is faulty) is used.
One can show that if in a set of four clocks at most one of the clocks is faulty, then (1) each nonfaulty clock computes exactly the same interactive consistency vector; and (2) the component of this vector corresponding to any nonfaulty clock q is q's actual value.
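The three steps and the four-clock result can be illustrated with a small simulation. This is a sketch, not the SIFT code: the callbacks `read` and `relay` stand in for the bus reads, the majority is assumed to be taken over p's direct reading plus the n - 2 relayed readings, and the faulty clock's behaviour (showing different values to different readers and lying when relaying) is invented for illustration. All names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

NIL = None  # stands in for the default value NIL ("q is faulty")


def interactive_consistency_vector(
    p: int,
    n: int,
    own_value: int,
    read: Callable[[int, int], int],
    relay: Callable[[int, int, int], int],
) -> List[Optional[int]]:
    """Compute clock p's interactive consistency vector for n clocks.

    read(p, q)     -> value p obtains when reading q's clock directly
    relay(r, p, q) -> value r reports to p as the value r read from q
    """
    vector: List[Optional[int]] = []
    for q in range(n):
        if q == p:
            vector.append(own_value)  # p's own entry is its own clock value
            continue
        values = [read(p, q)]  # step (1)
        values += [relay(r, p, q) for r in range(n) if r not in (p, q)]  # step (2)
        value, count = Counter(values).most_common(1)[0]
        vector.append(value if 2 * count > len(values) else NIL)  # step (3)
    return vector


# Hypothetical scenario: four clocks, clock 3 is faulty.
N, FAULTY = 4, 3
CLOCK = [1000, 1002, 999, 0]  # values of clocks 0..2; entry 3 is a placeholder


def read(reader: int, target: int) -> int:
    if target == FAULTY:
        # The faulty clock presents different values to different readers.
        return {0: 5000, 1: 6000, 2: 5000}.get(reader, 0)
    return CLOCK[target]


def relay(relayer: int, asker: int, target: int) -> int:
    if relayer == FAULTY:
        return 7777 + asker  # the faulty clock also lies when relaying
    return read(relayer, target)  # a nonfaulty clock reports what it read


vectors: Dict[int, List[Optional[int]]] = {
    p: interactive_consistency_vector(p, N, CLOCK[p], read, relay)
    for p in range(N)
    if p != FAULTY
}
for p, vec in vectors.items():
    print(p, vec)
```

Each nonfaulty clock obtains `[1000, 1002, 999, 5000]`: the vectors are identical, and the entries for the nonfaulty clocks 0 to 2 are their actual values. The entry for the faulty clock is agreed upon but carries no meaningful time.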
Having computed the interactive consistency vector, each clock computes its new value as follows. Let δ be the maximum amount by which the values of nonfaulty processors may disagree. (The value of δ is known in advance, and depends upon the synchronization interval and the rate of clock drift.) Any component that is not within δ of at least two other components is ignored, and any NIL component is ignored. The clock then takes the median value of the remaining components as its new value.
The difference between this median value and the value of the local real time clock gives the state correction term for this clock. This state correction term is written into the appropriate register of the synchronization unit, which then performs the synchronization as explained in more detail below. Each SIFT processor reads the value of its own clock directly, and reads the value of another processor's clock over a bus. It obtains the value that processor r reads for processor q's clock by reading from processor r's memory over a bus.
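The rule for the new clock value and the resulting state correction term can be sketched as follows. The function names and the sample vector are hypothetical; NIL is represented as `None`, "within δ" is taken as a difference of at most δ, and for an even number of surviving components `statistics.median` returns the mean of the two middle values, which is one possible reading of "median".

```python
from statistics import median
from typing import List, Optional


def new_clock_value(vector: List[Optional[float]], delta: float) -> float:
    """Median of the components that survive the two filtering rules."""
    present = [v for v in vector if v is not None]  # ignore NIL components
    kept = [
        v
        for i, v in enumerate(present)
        # keep v only if at least two *other* components lie within delta of it
        if sum(1 for j, w in enumerate(present) if j != i and abs(v - w) <= delta) >= 2
    ]
    if not kept:
        raise ValueError("no component is within delta of two others")
    return median(kept)


def state_correction(vector: List[Optional[float]], delta: float, local: float) -> float:
    """Median value minus the value of the local real time clock."""
    return new_clock_value(vector, delta) - local


icv = [1000, 1002, 999, 5000]  # hypothetical vector; entry 3 from a faulty clock
print(new_clock_value(icv, delta=5))               # 1000
print(state_correction(icv, delta=5, local=1002))  # -2
```

The outlying component 5000 is not within δ = 5 of any other component and is discarded; the median of the remaining three values is 1000, so a clock reading 1002 receives a correction term of -2.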
Since the publication in 1978 of the described known algorithm, a number of other algorithms have been developed which may be used for the purpose of the present invention, notably those disclosed in U.S. Pat. Nos. 4,531,185 and 4,584,643, the disclosures of which are incorporated by reference.
A survey of useful algorithms is also given in an article by F. Schneider entitled "A Paradigm for Reliable Clock Synchronization", Proceedings of the Advanced Seminar on Real Time Local Area Network (INRIA, Rocquencourt, France, 1986). The algorithms disclosed in this publication can be computed in the CPU, in the synchronization unit, or in both.
Leading to the present invention was the realization that:
(1) The processor load of the fault tolerant synchronization algorithm increases significantly with the number of nodes and the number of tolerated faults. It has been shown (Shin, K. G., Krishna, C. M., Synchronization and Fault Masking in Redundant Real-time Systems, Proc. FTCS 14, Kissimmee, Fla., pp. 152-157) that this processing load approaches the processing capacity of modern microcomputers.
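As a rough illustration of the growth with the number of nodes, the following sketch counts the clock-value transfers needed by the single-fault procedure of steps (1) to (3) above, assuming every direct read and every relayed value is a separate transfer. It does not model the additional relaying rounds needed to tolerate more faults, nor the cost of the median computation; the function names are hypothetical.

```python
def reads_per_clock(n: int) -> int:
    """Clock-value transfers one clock performs to build its vector."""
    direct = n - 1               # step (1): read each other clock q directly
    relayed = (n - 1) * (n - 2)  # step (2): for each q, ask each r not in {p, q}
    return direct + relayed      # simplifies to (n - 1) ** 2


def reads_per_round(n: int) -> int:
    """Transfers summed over all n clocks for one synchronization round."""
    return n * reads_per_clock(n)


for n in (4, 7, 10, 20):
    print(n, reads_per_clock(n), reads_per_round(n))
```

For four clocks this gives 9 transfers per clock and 36 per round; for 20 clocks it is already 361 per clock and 7220 per round, all of which the CPU of each node must handle in addition to the application software.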
(2) In a distributed system which is synchronized by the exchange of messages, the inaccuracy with which the points in time of sending and receiving a message are measured is the determining factor for the achievable accuracy of synchronization.
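A minimal derivation shows why the timestamp measurements dominate. It assumes a single message from node A to node B whose transmission delay d is known exactly, and neglects clock drift during transmission; the symbols θ (offset of B's clock relative to A's), T_s, T_r and ε are introduced here for illustration only.

```latex
% A sends at A-clock time T_s; B receives at B-clock time T_r; d = known delay
T_r = T_s + d + \theta \quad\Longrightarrow\quad \theta = T_r - T_s - d

% the measured timestamps carry errors \varepsilon_s and \varepsilon_r
\tilde{T}_s = T_s + \varepsilon_s, \qquad \tilde{T}_r = T_r + \varepsilon_r

% offset estimated from the measured timestamps
\hat{\theta} = \tilde{T}_r - \tilde{T}_s - d = \theta + (\varepsilon_r - \varepsilon_s)

\lvert \hat{\theta} - \theta \rvert \le \lvert \varepsilon_s \rvert + \lvert \varepsilon_r \rvert
```

Under these assumptions the error of the estimated offset is exactly the difference of the two timestamp errors, so computation at the receiver cannot recover accuracy lost when the sending and receiving instants are measured; any uncertainty in d would add to this error.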