In known computer systems, a message passing interface barrier (MPI barrier) is an important collective synchronization operation used in parallel applications or parallel computing. Generally, MPI is a specification for an application programming interface which enables communications between multiple computers. In a blocking barrier, the progress of the process or a thread calling the operation is blocked until all the participating processes invoke the operation. Thus, the barrier ensures that a group of threads or processes, for example in the source code, stop progress until all of the concurrently running threads (or processes) progress to reach the barrier.
A non-blocking barrier can split a blocking barrier into two phases: an initiation phase, and a waiting phase, for waiting for the barrier completion. A process can do other work in-between the phases while the barrier progresses in the background.
The collection of the processes invoking the barrier operation is embodied in MPI using a communicator. The communicator stores the necessary state information for a barrier algorithm. An application can create as many communicators as needed depending on the availability of the resources. For a given number of processes, there could be exponiential number of communicators resulting in exponential space requirements to store the state. In this context, it is important to have an efficient space bounded algorithm to ensure scalable implementations.
For example, on an exemplary supercomputer system, a barrier operation within a node can be designed via the fetch-and-increment atomic operations. To support an arbitrary communicator, an atomic data entity needs to be associated with the communicator. As discussed above, making every communicator contain this data item leads to storage space waste. In one approach to this problem, a single global data structure element is used for all the communicators. However, as discussed in further detail below, this is inefficient as concurrent operations are serialized when a single resource is available.
In one embodiment of a supercomputer, a node can have several processes and each process can have up to four hardware threads per core. MPI allows for concurrent operations initiated by different threads. However, each of these operations needs to use different communicators. The operations are serialized because there is only a single resource. For all the operations to progress concurrently it is imperative that separate resources need to be allocated to each of the communicators. This results in undesirable use of storage space.
One way of allocating counters is to allocate one counter for each communicator as different threads can only call collectives on different communicators as per the MPI standard. Then, the counter can be immediately located based on a communicator ID. However, a drawback of the above approach results in inferior utilization of memory space.
There is therefore a need for a method and system to allocate counters for communicators while enhancing efficiency of utilization of memory space. Further, there is a need for a method and system to use less memory space when allocating counters. It would also be desirable for a method and system to allocate counters for each communicator using the MPI standard, while reducing memory allocation usage.