In general, synchronization happens within a group of communicating threads where all threads need to reach the same synchronization point before any thread can proceed beyond that point. Any given thread is allowed to be in multiple groups, but it can issue only one synchronization for one group at a time. Existing solutions allocate separate state space for each group. This is a straightforward approach, but it has limitations when used with Barrier Synchronization Register (BSR) hardware, which offers only a limited number of bits, especially when the number of groups to which a thread belongs is large. The present invention is directed to a solution of this problem.
Parallel processing that distributes work among multiple concurrent processes requires synchronization between processes. One common method of providing this synchronization is via so-called barrier synchronization. By definition, a barrier involves a group of threads. Once a thread enters the barrier, it waits for all other members of the same group to enter the barrier before it exits from the barrier. Threads can also have different organizational structures. For example, a group of threads may have a single central parent or root thread that is responsible for polling to see if all of the children threads have entered the barrier. The threads may also be organized in a tree structure with each thread in the tree being responsible for its children threads. Such trees are not restricted to being binary trees.
Barrier synchronization is typically used in parallel computing environments in which complex numerical applications are being executed. As is known, when an application is processed in a parallel fashion, various jobs for the application are processed in parallel. Barrier synchronization provides a checkpointing mechanism that ensures that each job reaches a particular point before proceeding.
On a cluster of SMP (Symmetric Multi-Processing) data processing systems, barrier synchronization processes are typically divided into two steps. The first step synchronizes participants on the same node and the second step synchronizes all participants across multiple nodes. This division is based on the fact that there are usually faster methods like shared memory and the use of a BSR (barrier synchronization register) to speed up synchronization within a node rather than going through cluster interconnections for off-node synchronization. The present invention focuses on improving the first step of barrier synchronization processes.
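The two-step division described above can be sketched as follows. This is only an illustrative model, not the method of the invention: nodes are simulated with Python threads, and the names `run_two_step_barrier`, `node_barriers`, and `cross_node_barrier` are hypothetical. The sketch also assumes that one representative per node performs the cross-node step and that the remaining participants wait on the node until it returns, a detail the text does not specify.

```python
import threading


def run_two_step_barrier(num_nodes=2, threads_per_node=3):
    """Simulate a barrier split into an on-node step and a cross-node step."""
    # One fast barrier per node (stands in for shared memory or a BSR).
    node_barriers = [threading.Barrier(threads_per_node) for _ in range(num_nodes)]
    # One slower barrier across nodes (stands in for the cluster interconnect).
    cross_node_barrier = threading.Barrier(num_nodes)
    events = []
    lock = threading.Lock()

    def participant(node, rank):
        with lock:
            events.append("arrive")
        # Step 1: synchronize with the participants on the same node.
        node_barriers[node].wait()
        # Step 2: one representative per node synchronizes across nodes
        # (assumption of this sketch).
        if rank == 0:
            cross_node_barrier.wait()
        # Release (assumption): the node waits for its representative to return.
        node_barriers[node].wait()
        with lock:
            events.append("leave")

    threads = [
        threading.Thread(target=participant, args=(node, rank))
        for node in range(num_nodes)
        for rank in range(threads_per_node)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

In this model no participant records "leave" until every participant on every node has recorded "arrive", which is the property a barrier is meant to guarantee.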
Other approaches to on-node synchronization rely on shared memory. However, the performance of shared memory is subject to the overhead of cache coherence. In part to avoid this problem, the present invention takes advantage of a special-purpose register (the BSR, or Barrier Synchronization Register) built into the hardware to speed up barrier operations. It is faster than shared memory but, by the very nature of registers, it has a limited size, typically in the range of tens of bytes.
A BSR is best viewed as a distributed register that is accessible by all of the CPUs (Central Processing Units) on a node. Logically, there is only one BSR having a certain number of bytes. Physically, each CPU has a local copy of the BSR. All loads from the BSR are local to the CPU issuing the loads. All stores to the BSR by any CPU are broadcast to all other CPUs. The software is responsible for the correctness of concurrent stores to the same BSR byte. All loads and stores are cache inhibited to avoid cache coherence cost, so as to provide fast synchronization by using the BSR.
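The load and store behavior described above can be pictured with a small software model. The class below is purely a sketch: `SimulatedBSR` and its methods are hypothetical names, the broadcast is modeled as instantaneous, and nothing here reflects the timing or ordering behavior of real hardware.

```python
class SimulatedBSR:
    """Toy model: one logical register, one physical copy per CPU."""

    def __init__(self, num_cpus, num_bytes):
        # Each CPU holds its own local copy of the register contents.
        self.copies = [bytearray(num_bytes) for _ in range(num_cpus)]

    def load(self, cpu, index):
        # A load is served from the issuing CPU's local copy only.
        return self.copies[cpu][index]

    def store(self, cpu, index, value):
        # A store is broadcast, so every CPU's copy is updated.
        # `cpu` identifies the issuer; arbitration between concurrent
        # stores to the same byte is left to software and not modeled.
        for copy in self.copies:
            copy[index] = value


bsr = SimulatedBSR(num_cpus=4, num_bytes=8)
bsr.store(cpu=0, index=2, value=7)
values = [bsr.load(cpu, 2) for cpu in range(4)]
```

After the single store issued by CPU 0, every CPU reads the new value from its own copy without touching any other CPU's storage.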
One possible way to use the BSR is to assign one BSR byte to each thread participating in the synchronization. Each thread writes its own phase number into its BSR byte and polls the other BSR bytes until their values are no less than that phase number; the phase number can then be incremented for the next synchronization. It is noted that while the present invention is described in terms of the use of a special register referred to as a Barrier Synchronization Register, the methods herein are capable of employing any conveniently available and allocatable region of memory.
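The one-byte-per-thread scheme can be sketched in software as follows. This is a minimal illustration under stated assumptions: a `bytearray` stands in for the BSR, the function and variable names are hypothetical, and phase numbers are kept well below 256 so that wraparound of a byte-sized counter, which a real implementation would have to handle, does not arise.

```python
import threading
import time


def run_phase_barrier(num_threads=4, num_phases=3):
    """Run a phase-number barrier with one byte of state per thread."""
    # Stand-in for the BSR: one byte per participating thread.
    bsr = bytearray(num_threads)
    log = []
    log_lock = threading.Lock()

    def barrier(my_slot, phase):
        # Publish this thread's phase number in its own byte.
        bsr[my_slot] = phase
        # Poll every byte until its value is no less than the phase number.
        for other in range(num_threads):
            while bsr[other] < phase:
                time.sleep(0)  # yield so that other threads can run

    def worker(my_slot):
        # The phase number is incremented for each successive synchronization.
        for phase in range(1, num_phases + 1):
            with log_lock:
                log.append((phase, my_slot))
            barrier(my_slot, phase)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because a thread records its work for a phase before publishing that phase number, and no thread passes the barrier until every byte has reached the phase, the log returned by `run_phase_barrier` lists all entries of one phase before any entry of the next. The comparison "no less than" rather than "equal to" matters here: a fast thread may already have published the next phase number while a slower thread is still polling.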
In general, when a communication protocol such as MPI (Message Passing Interface) is involved, one thread may belong to multiple groups, and barrier synchronization may happen within multiple disjoint groups concurrently. A barrier method based on shared memory can allocate a separate state for each group. However, the challenge in using the BSR effectively arises from the limited size of the BSR. For a BSR-based barrier method to be most efficient, it should be able to handle multiple concurrent barriers and should allow most barrier operations to use the BSR instead of a slower backup method, while requiring only a few bits of state information for each participant.
Synchronization in general happens within a group of communicating threads where all threads need to reach the same synchronization point before any thread can proceed beyond that point. One thread is allowed to be in multiple groups, but it can issue only one synchronization for one group at a time. Existing solutions allocate separate state space for each group; this is straightforward but does not work in a limited state space, such as a barrier synchronization register (BSR), when the number of groups to which a thread belongs is large. The present invention solves the synchronization problem using limited state space, yet supports an almost unlimited number of synchronization groups.