In a parallel program, a plurality of processes or threads (hereinafter referred to as processes) work concurrently and in coordination to complete a processing task. During operation, the processes must use synchronization primitives to synchronize with one another. Synchronization primitives are key to ensuring the correctness of parallel programs. The synchronization primitives used most frequently in parallel programs are Lock and Barrier. FIG. 1 is a schematic diagram illustrating an example of enforcing a read/write order by using the Barrier synchronization primitive in a parallel program. As shown in FIG. 1, the Barrier ensures that when the process P2 reads the value of a variable A, the value read is the value written by the process P1. The Lock primitive is generally used in scientific computing to ensure exclusive access to a certain resource among a plurality of processes. Its implementation generally relies on special instructions provided by a processor, such as the typical LL/SC instructions.
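The ordering guarantee described for FIG. 1 can be illustrated with a minimal sketch. It assumes two threads standing in for the processes P1 and P2 and uses the `threading.Barrier` class of the Python standard library; the function name `run_barrier_example` and the written value 42 are hypothetical and chosen only for illustration.

```python
import threading


def run_barrier_example():
    """P1 writes A, both threads meet at the barrier, then P2 reads A."""
    shared = {"A": None}             # the shared variable A (assumed layout)
    observed = []                    # what P2 saw when it read A
    barrier = threading.Barrier(2)   # two participants: P1 and P2

    def p1():
        shared["A"] = 42             # the write happens before the barrier
        barrier.wait()

    def p2():
        barrier.wait()               # P2 cannot pass until P1 has arrived
        observed.append(shared["A"])  # the read happens after the barrier

    # Start P2 first to show that the barrier, not start order, enforces ordering.
    threads = [threading.Thread(target=p2), threading.Thread(target=p1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed[0]
```

Because P1 writes A before it reaches the barrier and P2 reads A only after leaving it, P2 cannot observe the initial value of A in this sketch.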
Besides primitives such as Barrier and Lock, which carry only a synchronization meaning, parallel programs also frequently use implicit synchronization primitives such as Reduce and All-Reduce. The primitive Reduce can be expressed simply as Reduce (Root, Ai, Op, Com), where Root denotes the root process of the present Reduce calculation, Ai denotes the source data of the process i that participates in the Reduce, Op denotes the calculation mode of the Reduce, commonly “accumulation”, “decrease”, “maximizing”, “minimization”, etc., and Com denotes the set of processes participating in this Reduce. Reduce (Root, Ai, Op, Com) has the following meaning: the data Ai of every process i in the set Com are combined using the Op mode, and the result of the operation is returned to Root. This Reduce calculation implies that all processes in the set Com are synchronized with the process Root, i.e. the process Root cannot obtain the final operation result until all processes in the set Com have reached a certain point in time, and when the synchronization is achieved, a data transfer between the processes is achieved at the same time. All-Reduce differs from Reduce only in that the final calculation result is broadcast to all processes in the set Com rather than to the process Root only. Unless otherwise specified in the following text, both All-Reduce and Reduce are referred to as Reduce.
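The data semantics of Reduce (Root, Ai, Op, Com) and of All-Reduce can be sketched as follows. This is a sequential model only: it shows which processes receive the result and does not model the implied synchronization. The function names `reduce_sketch` and `all_reduce_sketch`, and the representation of the data Ai as a dictionary keyed by process number, are assumptions made for illustration.

```python
from functools import reduce as fold
import operator


def reduce_sketch(root, data, op, com):
    """Combine data[i] for every process i in com using op.

    Only the root process receives the result.
    """
    result = fold(op, (data[i] for i in com))
    return {root: result}


def all_reduce_sketch(data, op, com):
    """Same combination, but every process in com receives the result."""
    result = fold(op, (data[i] for i in com))
    return {i: result for i in com}


# Hypothetical source data Ai for three processes.
example_data = {0: 3, 1: 7, 2: 5}
max_at_root = reduce_sketch(0, example_data, max, [0, 1, 2])
sum_everywhere = all_reduce_sketch(example_data, operator.add, [0, 1, 2])
```

Here `max` plays the role of the “maximizing” mode and `operator.add` the role of “accumulation”; any associative two-argument function could be passed as `op` in this sketch.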
Implementing the above-mentioned synchronization primitives in software, as in the prior art, offers good flexibility but low execution efficiency, which manifests mainly as a large start-up cost, slow execution speed, and excessive inter-process communication. For example, a Barrier implemented in software may use a shared counter A. The process Root initializes A to 0, and each process participating in the Barrier then executes the operation A=A+1 and reads the value of A repeatedly in a loop. When the value of A equals the total number of processes participating in the Barrier, all processes have reached the synchronization point. However, this software implementation has two drawbacks. First, because A is shared, several processes may operate on the counter A simultaneously when executing A=A+1, so each process must ensure that its own operation is atomic. Either a lock or a memory bus locking method must therefore be applied, which is time-consuming and degrades the processor's performance. Second, A is allocated in the memory of a certain processor, so in a multiprocessor system each process reading the value of A in a loop generates heavy traffic: if Cache coherence is maintained among the multiple memories, a large amount of Cache coherence control information is exchanged among the processors, whereas if Cache coherence is not guaranteed, the repeated reads of A produce a large number of remote Load operations. In either case, a large amount of the multiprocessor's communication bandwidth is occupied, which degrades system performance.
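A counter-based software Barrier of the kind described above might look like the following minimal sketch. It assumes threads in place of processes, uses a lock to make A=A+1 atomic, and can be used only once; the class name `CounterBarrier` and the helper `run_counter_barrier` are hypothetical.

```python
import threading
import time


class CounterBarrier:
    """Single-use barrier built on a shared counter A (illustrative only)."""

    def __init__(self, n):
        self.n = n                    # total number of participants
        self.count = 0                # shared counter A, initialized to 0
        self._lock = threading.Lock()

    def wait(self):
        with self._lock:              # the lock makes A = A + 1 atomic
            self.count += 1
        while self.count < self.n:    # read A repeatedly in a loop
            time.sleep(0)             # yield so the other threads can run


def run_counter_barrier(n=4):
    """Each worker records how many workers had arrived once it left the barrier."""
    barrier = CounterBarrier(n)
    arrived = []
    seen_after = []

    def worker(i):
        arrived.append(i)             # mark arrival before the barrier
        barrier.wait()
        seen_after.append(len(arrived))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after
```

The lock in `wait` and the polling loop correspond to the two costs named above: the atomic update of the shared counter and the repeated reads of A by every participant.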
A software-based Reduce algorithm is similar to the above-mentioned Barrier. Besides determining whether all processes have reached the synchronization point, the software-based Reduce algorithm must also combine the data of each process and place the result into a variable Value in a shared memory. The data of the process 0 is assumed to be Value0, the data of the process 1 is assumed to be Value1, . . . , and the data of the process N is assumed to be ValueN. The process Root initializes Value according to the operation type of the Reduce, e.g. when the operation type of the Reduce is “maximizing”, Value is initialized to the minimum value that the computer can represent, and each process n, whose data is Valuen, then performs the following operation:
if (Valuen > Value) Value = Valuen;
Similarly, each process is required to ensure the atomicity of the above-mentioned operation. When the counter A described above for Barrier indicates that all processes have finished their calculations, the final value of Value is the maximum of the data of all processes, and each process can then read Value, i.e. the Reduce with the operation type “maximizing” is finished.
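The software-based “maximizing” Reduce described above can be sketched as follows, again assuming threads in place of processes. The class name `SharedMaxReduce` and the helper `run_max_reduce` are hypothetical, and negative infinity is used as an assumed stand-in for the minimum representable value.

```python
import threading
import time


class SharedMaxReduce:
    """Single-use software Reduce with the "maximizing" operation (illustrative)."""

    def __init__(self, n):
        self.n = n                     # number of participating processes
        self.value = float("-inf")     # Value, initialized below any input
        self.count = 0                 # shared counter A from the Barrier
        self._lock = threading.Lock()

    def contribute(self, value_n):
        with self._lock:               # compare-and-update must be atomic
            if value_n > self.value:
                self.value = value_n
            self.count += 1            # signal arrival in the same critical section
        while self.count < self.n:     # wait until every process has contributed
            time.sleep(0)
        return self.value              # final Value, readable by every process


def run_max_reduce(values):
    """Run one thread per input value and collect what each one reads back."""
    reducer = SharedMaxReduce(len(values))
    results = []

    def worker(v):
        results.append(reducer.contribute(v))

    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Updating Value and incrementing the counter inside the same critical section is a design choice of this sketch: once the counter reaches the number of participants, every contribution has already been applied, so each reader sees the final maximum.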
As with Barrier, implementing the operations Reduce and Lock in software among a plurality of processors suffers from the same problems. Although improved software algorithms may mitigate the above-mentioned shortcomings, they cannot eliminate them completely. Problems such as slow execution speed and the consumption of the processor's execution resources still exist.