High-performance computing (HPC) has seen a substantial increase in usage and interest in recent years. Historically, HPC was generally associated with so-called “supercomputers.” Supercomputers were introduced in the 1960s, made initially and, for decades, primarily by Seymour Cray at Control Data Corporation (CDC), Cray Research, and subsequent companies bearing Cray's name or monogram. While the supercomputers of the 1970s used only a few processors, in the 1990s machines with thousands of processors began to appear, and more recently massively parallel supercomputers with hundreds of thousands of “off-the-shelf” processors have been implemented.
There are many types of HPC architectures, both implemented and research-oriented, along with various levels of scale and performance. However, a common thread is the interconnection of a large number of compute units, such as processors and/or processor cores, to cooperatively perform tasks in a parallel manner. Under recent System on a Chip (SoC) designs and proposals, dozens of processor cores or the like are implemented on a single SoC, using a 2-dimensional (2D) array, torus, ring, or other configuration. Additionally, researchers have proposed 3D SoCs under which hundreds or even thousands of processor cores are interconnected in a 3D array. Separate multicore processors and SoCs may also be closely spaced on server boards, which, in turn, are interconnected in communication via a backplane or the like. Another common approach is to interconnect compute units in racks of servers (e.g., blade servers and modules) that are typically configured in a 2D array as a cluster of compute nodes.
There are various types of processing tasks that require precise synchronization across various sets of servers and/or compute nodes. For example, when deployed in a cluster, the compute nodes typically send messages between themselves, and the order in which the messages are received is very important. For this reason, there are various ordering models that may be employed to ensure messages are processed in the proper order, including FIFO (First-in, First-out), Total, and Causal ordering. Each of these ordering schemes requires additional overhead that results in reduced performance. For example, FIFO ordering may typically require use of FIFO routers, Total ordering requires messages to be sent through a central entity, and Causal ordering is typically implemented using vector clocks.
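To illustrate the overhead that Causal ordering carries, the following is a minimal sketch of a vector clock, assuming each node is identified by a string and that counters absent from a clock are treated as zero. The names `tick`, `merge`, `happened_before`, and `concurrent` are hypothetical and chosen only for this example; note that every message would have to carry one counter per participating node.

```python
from typing import Dict

# A vector clock maps a node name to that node's event counter.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: VectorClock, received: VectorClock, node: str) -> VectorClock:
    """On receipt: take the element-wise maximum, then count the receive event."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, node)


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` is causally before `b` (a <= b element-wise, and a != b)."""
    keys = set(a) | set(b)
    no_greater = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    some_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return no_greater and some_less


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither clock is causally before the other."""
    return not happened_before(a, b) and not happened_before(b, a)


# Node A sends a message; node B receives it; node C acts independently.
a_send = tick({}, "A")
b_recv = merge({}, a_send, "B")
c_send = tick({}, "C")
print(happened_before(a_send, b_recv))  # the send precedes the receive
print(concurrent(a_send, c_send))       # no causal relationship
```

In this sketch the send on node A is ordered before the receive on node B, while the independent event on node C cannot be ordered relative to either, which is the gap that a shared, synchronized timestamp would close.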
Ideally, the most effective ordering scheme would simply involve timestamping each message with an absolute time. This would support Absolute ordering, which is the preferred ordering scheme for many HPC and other processes. However, this is inherently difficult to implement, because there is no such thing as absolute time that is shared across an HPC environment. More accurately, it is not so much that the time needs to be absolute, but rather that the clocks running on each server need to be synchronized with one another.
One scheme for synchronizing clocks is defined by the IEEE 1588 standards. IEEE 1588 provides a standard protocol for synchronizing clocks connected via a multicast-capable network, such as Ethernet. IEEE 1588 was designed to provide fault-tolerant synchronization among heterogeneous networked clocks while requiring little network bandwidth, processing power, and administrative setup. IEEE 1588 provides this by defining a protocol known as the precision time protocol, or PTP.
A heterogeneous network of clocks is a network containing clocks of varying characteristics, such as the origin of a clock's time source and the stability of the clock's frequency. PTP provides a fault-tolerant method of synchronizing all participating clocks to the highest quality clock in the network. IEEE 1588 defines a standard set of clock characteristics and defines value ranges for each. By running a distributed algorithm, called the best master clock (BMC) algorithm, each clock in the network identifies the highest quality clock; that is, the clock with the best set of characteristics.
The highest ranking clock is called the ‘grandmaster’ clock, and synchronizes all other ‘slave’ clocks. If the ‘grandmaster’ clock is removed from the network, or if its characteristics change such that it is no longer the ‘best’ clock, the BMC algorithm provides a way for the participating clocks to automatically determine the current ‘best’ clock, which becomes the new grandmaster. The BMC algorithm thus provides a fault-tolerant, administration-free way of determining the clock used as the time source for the entire network.
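The selection and failover behavior described above can be sketched as follows. This is a simplified illustration, not the BMC algorithm as specified in IEEE 1588: the `ClockInfo` fields, the comparison order, and the convention that lower values are better are all assumptions made for the example, and the sketch runs in one place rather than as a distributed exchange between clocks.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ClockInfo:
    """Hypothetical clock characteristics; lower values are assumed better."""
    identity: str     # unique name, used only as the final tie-breaker
    priority: int     # administratively assigned preference
    clock_class: int  # quality of the clock's time source
    accuracy: int     # coarse accuracy rating
    variance: int     # frequency stability rating


def rank(clock: ClockInfo) -> Tuple[int, int, int, int, str]:
    """Build a sort key; tuples compare field by field, left to right."""
    return (clock.priority, clock.clock_class, clock.accuracy,
            clock.variance, clock.identity)


def select_grandmaster(clocks: Iterable[ClockInfo]) -> ClockInfo:
    """Return the best-ranked clock. Raises ValueError if `clocks` is empty."""
    return min(clocks, key=rank)


clocks = [
    ClockInfo("node-a", priority=128, clock_class=248, accuracy=50, variance=100),
    ClockInfo("node-b", priority=128, clock_class=6, accuracy=33, variance=80),
    ClockInfo("node-c", priority=128, clock_class=248, accuracy=50, variance=100),
]
grandmaster = select_grandmaster(clocks)

# If the grandmaster leaves the network, re-running the selection over the
# remaining clocks yields its replacement without administrative action.
remaining = [c for c in clocks if c != grandmaster]
successor = select_grandmaster(remaining)
print(grandmaster.identity, successor.identity)
```

Because every participant would apply the same deterministic ranking to the same set of advertised characteristics, all of them arrive at the same grandmaster, and the identity tie-breaker keeps the result unambiguous when two clocks are otherwise equal.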
Slave clocks synchronize to the 1588 grandmaster by using bidirectional multicast communication. The grandmaster clock periodically issues a packet called a ‘sync’ packet containing a timestamp of the time when the packet left the grandmaster clock. The grandmaster may also, optionally, issue a ‘follow up’ packet containing the timestamp for the ‘sync’ packet. The use of a separate ‘follow up’ packet allows the grandmaster to accurately timestamp the ‘sync’ packet on networks where the departure time of a packet cannot be known accurately beforehand. For example, the collision detection and random backoff mechanism of Ethernet communication prevents the exact transmission time of a packet from being known until the packet has been completely sent without a collision being detected, at which point it is impossible to alter the packet's content.
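The following is a minimal sketch of how a slave clock might use the ‘sync’ and ‘follow up’ packets described above to estimate its offset from the grandmaster. The `Sync`, `FollowUp`, and `SlaveClock` types are hypothetical, times are integer nanoseconds, and the one-way path delay is assumed to be known already and is passed in as a parameter, since its measurement is not described here. As a simplification, a ‘sync’ packet whose timestamp will arrive in a later ‘follow up’ is modeled with `departure_ns=None`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Sync:
    sequence: int
    departure_ns: Optional[int]  # None when a follow-up will carry the timestamp


@dataclass
class FollowUp:
    sequence: int      # matches the sequence number of the earlier sync
    departure_ns: int  # precise time the sync left the grandmaster


class SlaveClock:
    def __init__(self, path_delay_ns: int = 0) -> None:
        self.path_delay_ns = path_delay_ns      # assumed known (hypothetical input)
        self.offset_ns: Optional[int] = None    # slave time minus grandmaster time
        self._pending: Dict[int, int] = {}      # sequence -> local receive time

    def on_sync(self, msg: Sync, local_rx_ns: int) -> None:
        """Handle a sync packet received at local time `local_rx_ns`."""
        if msg.departure_ns is not None:
            self._update(msg.departure_ns, local_rx_ns)
        else:
            # Remember when the sync arrived until its follow-up shows up.
            self._pending[msg.sequence] = local_rx_ns

    def on_follow_up(self, msg: FollowUp) -> None:
        """Pair a follow-up with its sync; ignore it if no sync is pending."""
        local_rx_ns = self._pending.pop(msg.sequence, None)
        if local_rx_ns is not None:
            self._update(msg.departure_ns, local_rx_ns)

    def _update(self, departure_ns: int, local_rx_ns: int) -> None:
        # The sync arrived at (departure + path delay) in grandmaster time.
        self.offset_ns = local_rx_ns - departure_ns - self.path_delay_ns


slave = SlaveClock(path_delay_ns=500)
slave.on_sync(Sync(sequence=1, departure_ns=None), local_rx_ns=1_000_002_500)
slave.on_follow_up(FollowUp(sequence=1, departure_ns=1_000_000_000))
print(slave.offset_ns)  # slave runs 2000 ns ahead under these assumed inputs
```

A slave would subtract the computed offset from its local time to align with the grandmaster. Any error in the local receive timestamp or in the assumed path delay feeds directly into the offset, which is one reason the achievable precision is limited.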
While IEEE 1588 and PTP provide an adequate level of clock synchronization for some applications (on the order of tens of microseconds), this is not precise enough to meet the needs of many HPC environments. Accordingly, it would be advantageous to implement a mechanism that maintains clock synchronization that is several orders of magnitude better than IEEE 1588.