This invention relates generally to timekeeping within computer networks. More particularly, this invention relates to a system and method for synchronizing the real time clocks within the nodes of a computer cluster.
Computer clusters are an increasingly popular alternative to more traditional computer architectures. A computer cluster is a collection of individual computers (known as nodes) that are interconnected to provide a single computing system. The use of a collection of nodes has a number of advantages over more traditional computer architectures. One easily appreciated advantage is the fact that nodes within a computer cluster tend to fail independently. As a result, in the event of a node failure, the majority of nodes within a computer cluster may survive in an operational state. This has made the use of computer clusters especially popular in environments where continuous availability is required.
A fundamental problem with clusters is that the computer clock of each cluster node generally drifts away from the correct time at a different rate. The rate at which a clock drifts is typically measured in parts-per-million (ppm). For example, the clocks used within the Tandem NonStop_UX S4000 computer series are specified to have a drift of less than 25 ppm. This makes the clock of these systems accurate to approximately 2 seconds per day. Without a correction mechanism, the clocks within a computer cluster will eventually drift far enough that applications that expect synchronized time may begin to work incorrectly.
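The drift figure quoted above follows from simple arithmetic, sketched below. The function names are illustrative and do not come from any Tandem software; the second helper assumes the worst case in which two clocks drift in opposite directions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_seconds_per_day(ppm: float) -> float:
    """Worst-case drift accumulated in one day by a clock with the given ppm error."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def worst_case_divergence(ppm: float, days: float) -> float:
    """Divergence between two clocks that drift in opposite directions.

    Assumption: both clocks have the same ppm bound, so they separate at
    twice the per-clock rate.
    """
    return 2 * drift_seconds_per_day(ppm) * days


if __name__ == "__main__":
    # A 25 ppm clock drifts a little over 2 seconds per day.
    print(round(drift_seconds_per_day(25), 2))
```

Under these assumptions, two uncorrected 25 ppm clocks could differ by several seconds after a single day, which is why periodic correction is needed.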
Several methods have been developed to reduce node-to-node clock differences in computer networks and clusters. One simple method is to set the clock of each node at boot time. This method is useful for reducing large node-to-node time differences. Setting clocks at boot time does little, however, to reduce inaccuracies due to clock drift. Thus, each clock may start at the correct time, but time across the cluster will become increasingly inaccurate as the clocks drift. A second method for reducing node-to-node clock differences is to periodically synchronize the time of each node against a master clock. If the time between synchronizations is small, each clock will experience only a limited drift between synchronizations. As a result, total node-to-node differences between clocks can be reduced to tolerable limits.
Protocols for synchronizing time against a master clock must account for the propagation delays that exist between the node where the master clock is located (the master node) and the nodes that are to be synchronized (the slave nodes). Otherwise, the clock of each slave node will lag behind the clock of the master node by an amount that is approximately equal to the propagation delay to that slave node. In cases where computers are connected using Ethernet-type networks, a relatively simple mechanism exists for accurately calculating propagation delays. To use this mechanism, the master node sends a message to a slave node. The slave node then responds with an acknowledgment message. The master node measures the round trip time between sending the message and receiving the acknowledgment, and derives the propagation delay from it. The master node then synchronizes time by sending the slave node a message that includes the sum of the propagation delay and its current clock time.
The simple mechanism used to calculate propagation delays in Ethernet-type networks works because nodes in these networks use a single connection for sending and receiving messages. The use of a single connection means that the propagation times to and from a node are approximately equal. This allows the propagation delay to a node to be computed as round trip time divided by two. Unfortunately, there are highly desirable network types that do not provide the same uniformity of sending and receiving propagation delays. Networks of this type include Tandem Computer's Servernet products. Each node in a Servernet network has separate network connections: a first for sending and a second for receiving. Separate connections mean that the propagation delays to and from a node may not be the same. This makes the mechanism used in Ethernet-type networks unsuitable for use in networks like Tandem's Servernet.
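The round-trip calculation, and the error it introduces when the two directions are not equal, can be illustrated with a short sketch. All names and delay values here are hypothetical; the sketch assumes the slave sets its clock to the transmitted value the moment the message arrives.

```python
def estimate_one_way_delay(t_sent: float, t_ack_received: float) -> float:
    """Ethernet-style estimate: one-way delay is half the round trip time."""
    return (t_ack_received - t_sent) / 2.0


def time_to_send(master_now: float, estimated_delay: float) -> float:
    """Value the master transmits: its current time plus the estimated delay."""
    return master_now + estimated_delay


def slave_error_after_sync(forward_delay: float, return_delay: float) -> float:
    """Residual slave clock error when the true delays differ.

    The estimate is (forward + return) / 2, but the message actually takes
    `forward_delay` to arrive, so the slave ends up off by the difference.
    A positive result means the slave clock is set ahead of the master.
    """
    estimated = (forward_delay + return_delay) / 2.0
    return estimated - forward_delay


if __name__ == "__main__":
    # Symmetric path: the estimate is exact.
    print(slave_error_after_sync(0.001, 0.001))
    # Asymmetric path: the slave is left with a residual error.
    print(slave_error_after_sync(0.001, 0.003))
```

With equal delays the residual error is zero; with unequal delays the error is half the difference between the two directions, which is the weakness described above.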
Based on the preceding discussion, it is not hard to appreciate that a need exists for time synchronization systems that are suitable for use in networks where the Ethernet simplification does not apply. There is also a need for new or extended time synchronization systems that fulfill a range of other currently unmet needs. For example, currently available time synchronization systems often fail when faced with significant clock frequency errors. Currently available time synchronization systems may also fail when faced with heavily loaded or congested networks. Both of these failures indicate that currently available time synchronization systems lack the ability to provide fault-tolerant operation. Currently available time synchronization systems may also require the network to process large numbers of synchronization messages. Large numbers of synchronization messages steal network bandwidth from other computing tasks.
Thus, there is a need for fault-tolerant techniques that synchronize system clocks across the nodes of a cluster, that have minimal effect on communication traffic throughout the cluster, and that are minimally affected by that traffic.
An embodiment of the present invention includes a system for synchronizing time in a computer cluster and for scheduling time changes across the cluster. The system of the present invention uses a repeating update cycle. During the first part of this cycle, a master node within the SSI cluster contacts each of the slave nodes within the SSI cluster by sending it a SYNC message. The SYNC message includes a first time stamp indicating the time at which the message was sent. The slave node adds a second time stamp and returns the SYNC message to the master node. The master node then adds a third time stamp to the SYNC message. Using the three time stamps, the master node determines if the time clock within the slave node leads or follows the time clock in the master node. The calculation does not depend on the assumption that transmission delays to the slave node are the same as the transmission delays from the slave node. If the time clocks do not match within the specified tolerance, the master node sends an INFO message to the slave node. The INFO message specifies a time adjustment for the time clock within the slave node. If a cluster-wide time change is required, the master node will send an INFO message to all slave nodes that includes each specific master/slave difference and the adjustment for the entire cluster. The INFO message also includes a time at which the specified time adjustment is to be applied.
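One way to see how three time stamps can reveal a lead or lag without assuming symmetric delays is sketched below. This is an illustrative interpretation, not the calculation stated by the invention: it relies only on the fact that the slave stamped the message at some moment between the master's send and receive times. The names `SyncStamps` and `classify`, the `tolerance` parameter, and the choice of the smallest provable correction are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SyncStamps:
    t1_master_send: float   # first time stamp, master clock
    t2_slave: float         # second time stamp, slave clock
    t3_master_recv: float   # third time stamp, master clock


def classify(stamps: SyncStamps, tolerance: float = 0.0) -> Tuple[str, float]:
    """Return (relation, adjustment) for the slave clock.

    The slave stamped the message at a master-clock time somewhere in
    [t1, t3], so the slave's offset lies in [t2 - t3, t2 - t1].
    If the whole interval is negative the slave follows the master; if it
    is wholly positive the slave leads. Otherwise nothing can be proven.
    The adjustment is the smallest correction that is certainly needed.
    """
    upper = stamps.t2_slave - stamps.t1_master_send
    lower = stamps.t2_slave - stamps.t3_master_recv
    if upper < -tolerance:
        return ("follows", -upper)      # advance the slave clock
    if lower > tolerance:
        return ("leads", -lower)        # retard the slave clock
    return ("within_bounds", 0.0)


if __name__ == "__main__":
    print(classify(SyncStamps(100.0, 99.5, 100.2)))
    print(classify(SyncStamps(100.0, 100.7, 100.2)))
    print(classify(SyncStamps(100.0, 100.1, 100.2)))
```

In this sketch the result would populate the time adjustment carried by the INFO message; a positive adjustment advances the slave clock and a negative one retards it.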
During the second portion of the update cycle, each slave node applies the time adjustment specified by the master node (if the master node specified a time adjustment for that slave node). Large adjustments gradually advance or retard the time clocks within the slave nodes. Each adjustment begins at the same time (i.e., the time specified by the master node). Small adjustments are applied immediately. The update cycle then repeats with another sequence of SYNC and INFO messages followed by scheduled time adjustments.
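The slave-side behavior described above can be sketched as a small planning routine. The threshold separating small from large adjustments and the slew rate are hypothetical values chosen for illustration; the text does not specify either, and the names `Plan`, `plan_adjustment`, and `applied_so_far` are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    start_time: float   # when the adjustment begins
    total: float        # signed adjustment in seconds
    duration: float     # seconds over which it is spread; 0.0 means a step


def plan_adjustment(adjustment: float, now: float, scheduled_start: float,
                    small_threshold: float = 0.001,
                    slew_rate: float = 0.0005) -> Plan:
    """Decide how a slave applies an adjustment from an INFO message.

    Assumptions: adjustments up to `small_threshold` seconds are stepped
    immediately; larger ones start at the master-specified time and are
    spread out at `slew_rate` seconds of correction per second.
    """
    if abs(adjustment) <= small_threshold:
        return Plan(now, adjustment, 0.0)
    return Plan(scheduled_start, adjustment, abs(adjustment) / slew_rate)


def applied_so_far(plan: Plan, now: float) -> float:
    """Portion of the adjustment that has been applied by time `now`."""
    if now < plan.start_time:
        return 0.0
    if plan.duration == 0.0 or now >= plan.start_time + plan.duration:
        return plan.total
    return plan.total * (now - plan.start_time) / plan.duration


if __name__ == "__main__":
    large = plan_adjustment(0.5, now=10.0, scheduled_start=20.0)
    print(applied_so_far(large, 15.0), applied_so_far(large, 520.0))
```

Spreading a large correction over time keeps the clock monotonic from the point of view of applications, while starting every large correction at the same scheduled instant keeps the slave nodes aligned with one another during the change.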
Advantages of the invention will be set forth, in part, in the description that follows and, in part, will be understood by those skilled in the art from the description herein. The advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.