The present invention is concerned with multiprocessor systems in which the processors operate together to perform coordinated calculations; more specifically, the present invention is concerned with methods and apparatus for clock synchronization among the processors.
A system called the Supercomputer Toolkit is known in the art and in the literature. Reference is made to H. Abelson, et al., “The Supercomputer Toolkit: A general framework for special-purpose computing,” International Journal of High Speed Electronics, Vol. 3, Nos. 3 and 4, pp. 337-361, 1992, and J. Katzenelson, et al., “The Supercomputer Toolkit and Its Applications,” EE Memo 1165, Department of Electrical Engineering, Technion, June 1998.
The Supercomputer Toolkit is a family of hardware and software modules from which high-performance special-purpose computers for scientific/engineering use can be easily constructed and programmed. The hardware modules include processors, memory, I/O and communication devices; the software modules include an operating system, compilers, debuggers, simulators, scientific libraries, and high-level front ends.
The following example illustrates the use of the Supercomputer Toolkit. When faced with a suitable problem, the engineer/scientist connects the modules by means of static-interconnect technology (ribbon cables) and constructs a parallel computation network. The network is loaded from a workstation that serves as a host. The program is run; the results are collected and displayed by the host. The host handles files, does compilation, etc. The computation network, the hardware portion of the Toolkit, does the heavy computation.
An interesting and important characteristic of the Toolkit's parallel computation network is its high-speed communication links. In general, the Toolkit comprises a plurality of processor boards, often very many. Each of the links connects two or more processor boards. The computation network can be viewed as a graph where the nodes (vertices) are processors (each processor is a board) and the branches (arcs) are the high-speed links. This structure is described, for example, in the second reference cited above. This reference reports a maximum communication rate of 0.5 Gigabit per second per link (or per processor port) and claims that with better design the maximum rate per link is the rate at which data can be retrieved from or stored in memory.
The high speed of the communication is based on the synchronization of all processors in the computation network. This synchronization means the following:
a. The clocks of all processors have the same frequency. This is referred to herein as “frequency synchronization”.
b. The phase difference between the clocks of any two neighbors is less than a certain amount called δ, which is the tolerance on the phase difference that allows bidirectional synchronous communication between adjacent neighbors. This condition is referred to herein as “phase synchronization”.
c. The instructions of all neighbors have to be synchronized in the sense that if a processor sends data, by a ‘write’ instruction, to a second processor to which it is connected (and phase synchronized), a ‘read’ instruction has to appear in the second processor's program at exactly the right place to enable reading the information sent to it. This condition is referred to herein as “instruction synchronization.”
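By way of illustration only, conditions (a) and (b) can be expressed as a simple check over the network graph. The following Python sketch is not taken from either Toolkit; the function names, the representation of each clock phase as a time offset, and the wrapping of differences into half a clock period are assumptions made for the example.

```python
def wrapped_phase_diff(phase_a, phase_b, period):
    """Signed difference phase_a - phase_b, wrapped into [-period/2, period/2).

    Assumption: a phase is a time offset, and two clocks of equal frequency
    can only be compared modulo one clock period.
    """
    d = (phase_a - phase_b) % period
    return d - period if d >= period / 2 else d


def is_frequency_synchronized(periods):
    """Condition (a): every processor clock has the same period."""
    return len(set(periods)) <= 1


def is_phase_synchronized(phases, links, period, delta):
    """Condition (b): every linked pair of neighbors differs by less than delta."""
    return all(
        abs(wrapped_phase_diff(phases[u], phases[v], period)) < delta
        for u, v in links
    )
```

For instance, with a period of 10 time units, three processors connected in a loop with phases 0.0, 0.5 and 9.8 satisfy condition (b) for δ = 1 but not for δ = 0.6, since the second and third clocks differ by 0.7 once the difference is wrapped.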
Two Toolkit systems have been implemented and published. One of these, referred to herein as the “MIT Toolkit”, is described in the above-referenced Abelson, et al. reference; the other, referred to herein as the “Technion Toolkit”, is described in the Katzenelson, et al. reference.
The two Toolkit systems implemented frequency synchronization by having one clock generator whose signal is distributed to all processor boards. Phase synchronization was achieved by hand trimming the length of the wire that distributes the clock. Instruction synchronization was achieved somewhat differently in the MIT Toolkit and the Technion Toolkit. The MIT Toolkit has data-independent instructions (i.e., the time required to carry out an instruction is independent of the data) and therefore, if all processors are started together, they remain synchronized. The Technion Toolkit relied on a synchronization line, as described in the Abelson, et al., reference. The MIT Toolkit also had a synchronization line that could be used for instruction synchronization. Note that both Toolkit systems are meant to run one program at a time; that program is parallelized and put on all processors. Thus, the support of instruction synchronization is relatively simple for a phase-synchronized Toolkit system, while for a general-purpose distributed computing system that supports several independent programs at a time, the support of instruction synchronization is not straightforward, to say the least.
In the Toolkit systems the processors are, inter alia, arranged in a general network, i.e., one which may include a series of ring or loop configurations. In this respect these systems differ from synchronized systems of the prior art, in which a tree structure is utilized. In such tree structure systems, the synchronization between one processor and a second adjacent processor is relatively independent of the synchronization between the adjacent processor and the other processors to which it is connected. For loop-connected processor systems, the phase synchronization between adjacent processors must be preserved all around the loop. For large or complicated loop systems, phase synchronization is difficult to achieve by hand.
One major drawback of the existing synchronous systems is thus seen to be the requirement to hand-trim the clocks to achieve the phase synchronization. Furthermore, this requirement limits the use of the Toolkit to systems in which there is no change in the timing of the clocks or of the transit time between the processors.
Communication engineers are well aware that, given the maximum speed of the clocks, synchronous communication is the best type of connection between processors. In most common communication environments, however, frequency, phase and instruction synchronization cannot always all be satisfied. Therefore, sophisticated asynchronous algorithms have been developed that approach the performance of synchronous communication as the length of the messages is increased. These asynchronous methods require instrumentation and/or introduce latency.
One aspect of some preferred embodiments of the present invention is concerned with a systematic method for providing phase synchronization between clocks of neighbors in a network containing loops. Preferably, the phase synchronization is achieved automatically.
One aspect of some preferred embodiments of the invention is concerned with reducing the phase difference between adjacent processors. If the phase difference between any two processors in the network is within ±T/4 (where T is the period of the clock), a preferred method of the invention reduces the phase difference between any pair of neighbors to less than a desired amount δ. In a preferred embodiment of the invention, this starting point is achieved by increasing T until the condition is met. Preferably, the phase difference is then decreased, and T is then decreased to the period corresponding to a desired clock rate.
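The role of the ±T/4 starting condition can be illustrated with a short Python sketch. It assumes, for the purpose of the example only, that the clock offsets between processors are fixed time delays that do not change when the period T is changed, and that these offsets are known to the caller; the function name and the doubling factor are likewise hypothetical.

```python
def starting_period(skews, period, factor=2.0):
    """Lengthen the clock period until every pair of clocks is within T/4.

    skews  -- clock arrival-time offsets at the processors (assumed fixed)
    period -- the desired operating period T
    factor -- multiplier applied to T on each attempt (an assumption)
    """
    # The largest pairwise offset is the spread between the extreme skews.
    spread = max(skews) - min(skews)
    while spread > period / 4:
        period *= factor
    return period
```

With offsets of 0, 3 and 7 time units and a desired period of 10, the sketch returns a starting period of 40, at which the largest pairwise offset (7) is within a quarter period (10). Once the neighbor phase differences have been reduced, the period can be brought back to the desired value.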
When implementable, synchronous communication is both fast and simple. Note that implementing the method depends on the ability to measure phase differences between neighbor processors; that ability requires transmission lines of substantially time-invariant delay. For a Toolkit-like supercomputer, where a few dozen (or a few hundred or more) processors occupy a room, such transmission lines are available and the conditions can be satisfied. Synchronous transmission may be achieved with stable or with slowly varying phase differences. When the phase between processors varies slowly, the system can track the changes and continuously adjust the phase between the clocks. Alternatively, at predetermined intervals or in response to an indication of increased phase difference, the system can be halted and a new synchronization cycle performed.
The phase difference is preferably reduced by changing the phase of at least some of the processors “locally”, defined herein as “at the processor”. In a preferred embodiment of the invention, the processor itself performs this change in accordance with a predetermined protocol. The processor may receive commands for making the phase change either from a central computer (which receives measurement data from each processor and sends the processors commands based on the protocol), or from another processor. Alternatively, each processor may decide locally on the required phase change.
There is thus provided, in accordance with a preferred embodiment of the invention, a method of synchronizing a general network of processors, which network may contain loops, the method comprising:
(a) providing a clock signal at each of the processors, said clock signals having a common frequency and different phases;
(b) determining the phase between clock signals of different processors at the processors; and
(c) adjusting the phase of the clock signals to produce local clock signals at each processor by locally varying the phases of the individual clock signals, responsive to the measurements, such that phase differences between the clock signals of adjacent processors are less than a predetermined value δ.
Preferably, the determination of the phase differences is made by a measurement at the processors. Alternatively it may be determined in other ways.
Preferably, the phases of the individual clock signals for the processors are varied by their associated processors.
Preferably, the clock signal is provided to the processors from a remote clock source via transmission lines of different lengths, such that the phase of the provided clock signals at an input to the processors is different for at least some of the different processors.
Preferably, the phase measurements are transmitted to a computer, and the computer determines required phase changes for the processors in accordance with a predetermined protocol.
Preferably, the protocol comprises:
A) determining, for a first processor, whether the phase differences between its local clock and the clocks of all its neighboring processors are greater than δ in a same sense when measured at the first processor;
B) stepwise varying the phase at the first processor by an amount less than δ, in a direction to reduce the phase differences, until at least one phase difference between the first processor and a neighbor, measured at the first processor, is less than δ; and
C) repeating A) and B) for all the processors.
Preferably, the method includes, after C):
D) determining the processor (i) for which measurements of all the neighboring clocks differ from the local clock in a particular same sense and (ii) which has the most deviations in that direction that are greater than δ;
E) changing the phase by an amount smaller than δ, in a direction to reduce the maximum measured phase difference; and
F) repeating D) and E) until all measured phase differences for all the pairs of neighboring processors are less than δ.
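Steps A) to F) can be sketched in Python as follows. This is a simplified model under stated assumptions rather than a definitive implementation of the protocol: phases are plain numbers with no wrap-around (that is, any cycle jumps are assumed to have been removed already), `neighbors` is a hypothetical adjacency dictionary, a difference exactly equal to δ is treated as still needing correction, and the round limit and all function names are inventions of the example.

```python
def link_diffs(phases, neighbors, p):
    """Phase of each neighbor minus the phase of p, as measured at p."""
    return [phases[q] - phases[p] for q in neighbors[p]]


def all_links_within(phases, neighbors, delta):
    """True when every pair of neighbors differs by less than delta."""
    return all(
        abs(d) < delta for p in phases for d in link_diffs(phases, neighbors, p)
    )


def first_pass(phases, neighbors, delta, step):
    """Steps A) to C): pull each processor toward its nearest neighbor."""
    assert 0 < step < delta  # step B) requires a step smaller than delta
    for p in phases:
        while neighbors[p]:  # a processor without neighbors is left alone
            diffs = link_diffs(phases, neighbors, p)
            if all(d >= delta for d in diffs):     # all neighbors lead p
                phases[p] += step
            elif all(d <= -delta for d in diffs):  # all neighbors lag p
                phases[p] -= step
            else:  # at least one difference is now below delta
                break


def second_pass(phases, neighbors, delta, step, max_rounds=10_000):
    """Steps D) to F). Returns True if every link ends up within delta."""
    for _ in range(max_rounds):
        if all_links_within(phases, neighbors, delta):
            return True
        best, best_count, best_dir = None, 0, 0
        for p in phases:
            diffs = link_diffs(phases, neighbors, p)
            if diffs and all(d > 0 for d in diffs):
                direction = 1
            elif diffs and all(d < 0 for d in diffs):
                direction = -1
            else:
                continue  # neighbors differ in mixed senses: not a candidate
            count = sum(abs(d) >= delta for d in diffs)
            if count > best_count:  # step D): most out-of-tolerance links
                best, best_count, best_dir = p, count, direction
        if best is None:
            return False  # no candidate under this simplified rule
        phases[best] += best_dir * step  # step E)
    return False
```

As a usage example, a loop of four processors with phases 0, 30, 5 and 35 (in arbitrary integer time units), δ = 10 and a step of 4 is brought by `first_pass` followed by `second_pass` to a state in which all four links are within δ. The sketch returns False rather than looping indefinitely when its simplified selection rule finds no candidate processor.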
Preferably, the computer informs the processors of a required phase change in accordance with the protocol, and the processors vary the phase responsive to the informing.
In a preferred embodiment of the invention, the method includes, prior to (a), removing cycle jumps in all loops of the system. Preferably, removing cycle jumps comprises:
(i) choosing a spanning tree of the network, arranged in hierarchical levels;
(ii) adjusting the phase of the clock signals in a first descendant level of the processors to be different from the clock of a root processor of the network by less than a given amount, in a first time sense;
(iii) adjusting the phase of the clocks in a second descendant level of the processors to be different from the clock of their immediate ancestor by less than the given amount in a second time sense opposite from the first sense;
(iv) repeating (ii) and (iii) for succeeding descendant levels of the processors until the end of the tree is reached.
Preferably, the method includes:
prior to (i), reducing the frequency of the clock signal; and
subsequent to (iv), returning the frequency of the clock signal to its original value.
Preferably, the first time sense is a lag and the second time sense is a lead. Alternatively, the first time sense is a lead and the second time sense is a lag.
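Steps (i) to (iv) can be illustrated by the following Python sketch. It is an idealized model only: the spanning tree is chosen by breadth-first search (the text does not prescribe how the tree is chosen), each adjustment is modeled as setting a processor's phase to its parent's phase plus or minus a fixed `offset` that is assumed to be smaller than the given amount, and the first time sense is taken to be a lag. All names are hypothetical.

```python
from collections import deque


def spanning_tree_levels(neighbors, root):
    """Step (i): breadth-first spanning tree with hierarchical levels."""
    parent = {root: None}
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in parent:  # first visit defines the tree edge u -> v
                parent[v] = u
                level[v] = level[u] + 1
                queue.append(v)
    return parent, level


def alternate_lag_lead(neighbors, root, offset):
    """Steps (ii) to (iv): alternate lag and lead down the tree levels.

    Odd levels lag their parent by `offset`; even levels lead it.
    """
    parent, level = spanning_tree_levels(neighbors, root)
    phases = {root: 0}
    # Visit processors level by level so each parent is set before its children.
    for node in sorted(level, key=level.get):
        if node == root:
            continue
        sign = -1 if level[node] % 2 == 1 else 1
        phases[node] = phases[parent[node]] + sign * offset
    return phases, level
```

In this idealized model every processor ends up either at the root's phase or one `offset` behind it, so the phase does not accumulate from level to level however deep the tree is. For a loop of four processors rooted at processor 0 with an offset of 2, the two level-one processors are set to -2 and the level-two processor returns to 0.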
There is further provided, in accordance with a preferred embodiment of the invention, a method of eliminating phase cycle jumps in a loop-connected network, the network comprising processors each having a clock with at least a variable phase, the method comprising:
(i) choosing a spanning tree of the network, arranged in hierarchical levels;
(ii) adjusting the phase of the clock signals in a first descendant level of the processors to be different from the clock of a root processor of the network by less than a given amount, in a first time sense;
(iii) adjusting the phase of the clocks in a second descendant level of the processors to be different from the clock of their immediate ancestor by less than the given amount in a second time sense opposite from the first time sense;
(iv) repeating (ii) and (iii) for succeeding descendant levels of the processors until the end of the tree is reached.
Preferably, the method includes, prior to (i):
reducing the frequency of the clock signals.
Preferably, the method includes, after (iv):
(v) measuring the phase between clocks at the processors; and
(vi) locally varying the phases of the individual clock signals, responsive to the measurements, such that phase differences between the clock signals are less than a predetermined value.
Preferably, the method includes, after (vi):
(vii) returning the clock signal frequency to its original value; and
(viii) performing the method of claim 1.
Preferably, the first time sense is a lag and the second time sense is a lead. Alternatively, the first time sense is a lead and the second time sense is a lag.
There is further provided, in accordance with a preferred embodiment of the invention, a method of synchronizing a general network of processors, which network may contain loops, the method comprising:
(a) providing clock signals at each of the processors, said clock signals having a common frequency;
(b) measuring the phase between clock signals of different processors at the processors; and
(c) adjusting the phase of the clock signals to produce local clock signals at each processor by varying the phases of the clock signals of individual ones of the processors, responsive to the measurements, such that phase differences between the clock signals of adjacent processors are less than a predetermined value δ, wherein such adjustment is controlled by at least one computer.