In a multi-processor computer network, some applications are executed using a master-slave configuration. In such a system, one computer acts as a master computer, collecting information and/or performing important computational tasks. The slave computers may compute and send information to the master computer, or merely remain available to take over the master tasks, if the master is unable to perform them.
For example, the programs timed and TEMPO are local area network clock synchronizers, which are executed using a master-slave configuration. In both of these programs, each slave executes a time daemon (i.e., a task or application program that runs continuously), which periodically sends a message to the master. The time messages include each slave's concept of the network time. Another time daemon, executed by the master, computes the network time as an average of the times provided by non-faulty clocks, and sends to each slave time daemon a correction that the slave time daemon should perform on the clock of its machine.
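By way of illustration only, the averaging step performed by the master time daemon can be sketched as follows. The sketch is not taken from timed or TEMPO. The function name `compute_corrections`, the `max_skew` threshold, and the rule that a clock is faulty when it lies too far from the median are all assumptions made for the example, since the text above does not state how faulty clocks are identified.

```python
from statistics import mean, median


def compute_corrections(reported, max_skew=5.0):
    """Return (network_time, corrections) for a dict of host -> reported time.

    Assumption: a clock is treated as faulty when it differs from the
    median of all reported times by more than max_skew seconds.
    """
    mid = median(reported.values())
    # Keep only the clocks considered non-faulty under the assumed rule.
    good = [t for t in reported.values() if abs(t - mid) <= max_skew]
    # The network time is the average of the non-faulty clocks.
    network_time = mean(good)
    # Every slave, faulty or not, receives the adjustment it should apply.
    corrections = {host: network_time - t for host, t in reported.items()}
    return network_time, corrections
```

For instance, given reported times of 100.0, 102.0, 101.0 and 500.0 seconds, the last value is excluded from the average under the assumed rule, and each slave receives the difference between the resulting network time and its own reported time.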
The “Time Synchronization Protocol” (TSP) is used by the programs timed and TEMPO to support clock synchronization messages, as is described in detail in “The Berkeley UNIX Time Synchronization Protocol,” Gusella, et al. (1986). In general, all the communication occurring among the time daemons uses the TSP protocol. The message format in TSP is 8-bit-byte oriented, and is the same for all message types. The structure of each TSP message is as follows:
1) A one byte message type;
2) A one byte version number, specifying the protocol version which the message uses;
3) A two byte sequence number to be used for recognizing duplicate messages that occur when messages are retransmitted;
4) Eight bytes of packet specific data. This field contains two 4 byte time values, a one byte hop count, or may be unused depending on the type of the packet; and
5) A zero-terminated string of up to 256 ASCII characters with the identity of the machine sending the message.
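The five-field layout listed above can be sketched, for illustration, with Python's standard `struct` module. This is a minimal sketch and not the timed implementation. Network (big-endian) byte order, the helper names `pack_message` and `unpack_message`, and the message type value used in the usage note are assumptions that do not come from the TSP description.

```python
import struct

# Fixed-size part: 1-byte type, 1-byte version, 2-byte sequence number,
# 8 bytes of packet specific data. "!" (network byte order, no padding)
# is an assumption made for this sketch.
HEADER = struct.Struct("!BBH8s")


def pack_message(msg_type, version, seq, data, hostname):
    """Build a message: fixed header followed by a zero-terminated name."""
    if len(data) > 8:
        raise ValueError("packet specific data is limited to 8 bytes")
    name = hostname.encode("ascii")
    if len(name) > 256:
        raise ValueError("machine name is limited to 256 characters")
    # Unused data bytes are padded with zeros.
    return HEADER.pack(msg_type, version, seq, data.ljust(8, b"\0")) + name + b"\0"


def unpack_message(buf):
    """Split a message back into its five fields."""
    msg_type, version, seq, data = HEADER.unpack_from(buf)
    end = buf.index(b"\0", HEADER.size)  # terminator of the name string
    name = buf[HEADER.size:end].decode("ascii")
    return msg_type, version, seq, data, name
```

As a usage example, a message carrying two 4 byte time values could be built with `pack_message(1, 1, 42, struct.pack("!ii", 10, 20), "hostA")`, where the type value 1 is a placeholder. The fixed part occupies 12 bytes, so every field falls on a byte boundary rather than on a 32-bit word boundary.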
TSP also supports messages for the election that occurs among slave time daemons when, for any reason, the master disappears, as is described in detail in “An Election Algorithm for a Distributed Clock Synchronization Program,” Gusella et al. (December 1985). Basically, the election process chooses a new master from among the available slaves when the original master ceases to function properly.
When started up, each slave time daemon randomly selects a value for an “election timer” from a predefined range. When the master time daemon is working, it periodically resets each slave time daemon's election timer by sending out a synchronization message. If a slave does not receive a synchronization message before its election timer expires, improper functioning of the master is assumed. Accordingly, the slave whose election timer expires first will become a candidate to become the new master. If the candidate slave is elected, it will become the new master and will assume responsibility for synchronizing the network's remaining clocks.
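The election timer behavior described above can be sketched as a small simulation. The class name `SlaveDaemon`, the timer range of 2 to 4 time units, and the use of an explicit `now` argument in place of a real clock are assumptions made for illustration. The sketch also assumes that a reset restarts the originally selected interval, which the description above does not specify.

```python
import random


class SlaveDaemon:
    """Hypothetical slave time daemon holding only its election timer."""

    def __init__(self, name, now, timer_range=(2.0, 4.0), rng=random):
        self.name = name
        # The timer value is selected at random from a predefined range.
        self.timeout = rng.uniform(*timer_range)
        self.deadline = now + self.timeout

    def on_sync_message(self, now):
        # A synchronization message from the master resets the timer.
        self.deadline = now + self.timeout

    def expired(self, now):
        return now >= self.deadline


def first_candidate(slaves, now):
    """Return the slave whose timer expired first, or None if none expired."""
    expired = [s for s in slaves if s.expired(now)]
    return min(expired, key=lambda s: s.deadline, default=None)
```

While synchronization messages keep arriving, `first_candidate` returns `None`. Once the messages stop, the slave with the earliest deadline is returned and would become the candidate for new master.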
The rate at which synchronization messages are sent out using TSP is very slow (e.g., on the order of once every several minutes). Therefore, if the master fails, several minutes may pass before a slave detects the failure and conducts an election. This characteristic of TSP can result in an unacceptably slow assumption of master tasks.
The TSP protocol functions well in the context of supporting messages and elections that occur as part of a clock synchronization program. However, its portability to other types of applications is limited, and it is not well adapted to inclusion in modern systems for several reasons.
In modern networked computer systems, each machine may be capable of simultaneously running multiple tasks and applications, each of which is executed using a master-slave configuration. In addition, modern CPUs more efficiently process messages that have 32-bit message formats. The primary limitation of TSP's application to modern systems is that TSP is capable of supporting message transfer for only a single daemon (e.g., a time daemon) per machine. In addition, TSP is not adapted to work in an environment where multiple, redundant networks are available to interconnect nodes. Finally, the TSP message format is byte oriented, and is less efficient with modern CPUs. Therefore, TSP is not an acceptable protocol for providing messaging and election capabilities for modern networked computer systems.
What is needed is a protocol and method that can rapidly respond to a master failure (e.g., within less than a second). Also needed is a protocol and method that can support messaging and elections for multiple tasks and applications running on each node of a networked computer system. In addition, what is needed is a protocol and method that support 32-bit message formats, which better utilize the advanced capabilities of modern CPUs. Further needed is a protocol and method that can be used in the context of redundant networks.