In a multiprocessor system, the nodes frequently need to be synchronized with one another. Synchronization indicates that the calculation on all nodes has reached a certain point. When the nodes of a multi-node system are synchronized, no node can continue processing until all nodes have reached the synchronization point. This approach is used when, for example, partial results are calculated on all nodes in one phase of a calculation, and all partial results then have to be accumulated into a global result that is needed in the following phase. It is also used when successive phases of a calculation need to proceed in lock step across all nodes.
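By way of illustration only, the partial-result pattern described above may be sketched as follows. The sketch assumes that Python threads stand in for nodes and that the standard-library `threading.Barrier` stands in for the synchronization point; the name `run_phases` is hypothetical and is not part of any system described herein.

```python
import threading

def run_phases(chunks):
    """Phase 1: each worker sums its own chunk. Phase 2: each worker reads the global sum."""
    n = len(chunks)
    partial = [0] * n            # one partial result per worker ("node")
    seen = [None] * n            # global result as observed by each worker
    barrier = threading.Barrier(n)

    def worker(i):
        partial[i] = sum(chunks[i])   # phase 1: local partial result
        barrier.wait()                # no worker continues until all have arrived
        seen[i] = sum(partial)        # phase 2: every partial result is now valid

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

print(run_phases([[1, 2], [3, 4], [5, 6]]))  # prints [21, 21, 21]
```

Without the barrier, a worker could read `partial` before another worker had written its entry, so the global result observed in the second phase could be incomplete.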
Each node in a multi-node system can have one or more processor cores. One or more processor cores can be located on the same chip (i.e., integrated circuit die). The organization of cores into nodes varies across machine architectures. Also, one or more processing threads can be active on a single processor core. Sometimes a communication task (often an MPI task) is mapped to a single core, other times it is mapped to multiple cores on a node, and still other times it is mapped to the whole node. The scope of the present disclosure includes mechanisms that work regardless of the number of cores per node or the mapping of communication tasks to cores.
One way to synchronize all nodes in a multi-node system proceeds in two steps:
1) all cores within a chip are synchronized to ensure that all processing threads/cores on the chip have reached the synchronization point;
2) all chips within the system are synchronized.
Prior work implements this two-step synchronization process. In the first step, the cores on a single chip are synchronized, and one core is designated the “winning” core. In the second step, an inter-chip synchronization barrier is formed by synchronizing the “winning” cores of all chips.
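By way of illustration only, the two-step process may be sketched in software as follows. The sketch assumes that Python threads stand in for cores, that one `threading.Barrier` per chip stands in for on-chip synchronization, and that a further barrier with one party per chip stands in for inter-chip synchronization. The name `two_step_barrier` and the rule that the core receiving index 0 from `wait()` becomes the winning core are assumptions of this sketch, not features of any prior system.

```python
import threading

def two_step_barrier(n_chips, cores_per_chip):
    chip_barriers = [threading.Barrier(cores_per_chip) for _ in range(n_chips)]
    system_barrier = threading.Barrier(n_chips)  # one party per winning core
    lock = threading.Lock()
    arrived = []            # (chip, core) pairs logged before the barrier
    arrived_at_exit = []    # number of arrivals each core observes after the barrier
    winners = []            # chips whose winning core entered step 2

    def core(chip, core_id):
        with lock:
            arrived.append((chip, core_id))
        # Step 1: on-chip synchronization. wait() returns a distinct index
        # per thread; the thread that receives 0 acts as the winning core.
        if chip_barriers[chip].wait() == 0:
            with lock:
                winners.append(chip)
            system_barrier.wait()   # Step 2: winning cores of all chips synchronize
        # The other cores wait here until their winning core returns from step 2.
        chip_barriers[chip].wait()
        with lock:
            arrived_at_exit.append(len(arrived))

    threads = [threading.Thread(target=core, args=(c, k))
               for c in range(n_chips) for k in range(cores_per_chip)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(winners), arrived_at_exit

print(two_step_barrier(2, 3))  # prints ([0, 1], [6, 6, 6, 6, 6, 6])
```

Each chip contributes exactly one winning core to the second step, and no core leaves the barrier until every core in the system has arrived.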
An example of such a system is the Blue Gene/P system, where lock box synchronization primitives are used to determine the winning core on a chip, and inter-chip synchronization is then achieved using a dedicated one-bit network. The Blue Gene/Q system uses an improved and scalable mechanism to synchronize all cores on a chip, and synchronization between the chips is performed by sending packets between the chips over the system network.
For a Cell chip, barrier synchronization between the one master processor core, the PPE (power processing element), and the eight accelerating processor cores, the SPEs (Synergistic Processing Elements), is implemented as a software program without using any dedicated hardware features to support synchronization. To achieve on-chip synchronization, all SPEs can add to and write into the same memory location. The master processor on the chip, the PPE, can poll that memory location to determine when on-chip synchronization has been achieved.
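By way of illustration only, this software approach may be sketched as follows. The sketch assumes that Python threads stand in for the SPEs and the PPE, that a one-element list stands in for the shared memory location, and that a lock stands in for an atomic add. The name `software_counter_barrier` is hypothetical; this is not the actual Cell software.

```python
import threading
import time

def software_counter_barrier(n_workers=8):
    counter = [0]             # stands in for the shared memory location
    lock = threading.Lock()   # stands in for an atomic add (assumption)
    polls = [0]               # how many times the master polled

    def spe():
        # ... the worker's share of the computation would run here ...
        with lock:
            counter[0] += 1   # announce arrival at the barrier

    def ppe():
        while True:
            polls[0] += 1
            with lock:
                if counter[0] == n_workers:
                    break     # all workers have arrived
            time.sleep(0)     # yield so that workers can make progress

    master = threading.Thread(target=ppe)
    workers = [threading.Thread(target=spe) for _ in range(n_workers)]
    master.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    master.join()
    return counter[0], polls[0]

count, polls = software_counter_barrier()
print(count, polls >= 1)
```

The number of polls varies from run to run, and every poll is an instruction sequence that the master executes only in order to wait.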
Other multi-node systems use a BSR (barrier synchronization register), where each processor has a one-bit barrier write register. Logically, all of these write bits form a single BSR register. Each processor writes into its own bit, and all processors can read all the bits of the register. When a processor reaches the barrier, it writes its barrier bit. Either all processors or only one processor polls all the bits of the BSR register to determine whether the other processors have reached the synchronization point. Using a BSR register for on-chip and off-chip barrier synchronization introduces overhead on at least one processor, which needs to poll the BSR register until all processors have reached the barrier. To poll a register, a number of instructions have to be executed to determine that synchronization is achieved and to communicate this status on-chip and/or off-chip, resulting in a power-consuming, energy-inefficient system and causing long synchronization latency. In addition, this approach requires an asymmetric software implementation to be executed on the various processors on the chip, even if all processors on the chip are identical.
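By way of illustration only, the polling overhead of a BSR may be sketched with a simplified discrete-step model. The class `BSR`, the function `simulate_polling`, and the assumption that the polling processor polls exactly once per time step are all hypothetical and serve only to make the cost visible. Arrival steps are assumed to be non-negative integers.

```python
class BSR:
    """Logical barrier synchronization register: one bit per processor."""

    def __init__(self, n_processors):
        self.n = n_processors
        self.bits = 0

    def write_bit(self, cpu):
        self.bits |= 1 << cpu                 # a processor sets only its own bit

    def all_set(self):
        return self.bits == (1 << self.n) - 1  # every processor has arrived


def simulate_polling(arrival_steps):
    """arrival_steps[i] is the time step at which processor i reaches the barrier.

    Returns (step at which the barrier completes, number of polls executed).
    """
    bsr = BSR(len(arrival_steps))
    polls = 0
    step = 0
    while True:
        for cpu, t in enumerate(arrival_steps):
            if t == step:
                bsr.write_bit(cpu)
        polls += 1                             # the polling processor reads the register
        if bsr.all_set():
            return step, polls
        step += 1


print(simulate_polling([0, 2, 5, 1]))  # prints (5, 6)
```

In this model the polling processor executes six polls although only four bits are ever written, and the polling code runs on one processor only, which reflects the asymmetric software noted above.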