The present application relates to optimizing collective communication in Message Passing Interface (MPI) applications in which multiple processes run on a compute node and in which, for example, all the compute nodes may be connected by a fast interconnection network.
Large message collectives such as MPI_Broadcast (MPI_Bcast) and MPI_Allreduce, in an application running more than one process per node over an interconnection network, for example on massive supercomputers, use an intermediate shared buffer for these operations. The drawbacks of using intermediate shared buffers may be additional copy costs and the complexity of managing the intermediate buffers. For example, in MPI_Bcast, the root of the operation first copies the data into a shared memory segment. The network and the other processes local to the root node read the data from this shared memory segment. At each destination node, the data is received into a shared memory buffer, and the processes on that node then read the data from their respective local buffers. This incurs copy overheads at both the sending and the receiving nodes. Also, the buffer employed may be smaller than the application buffer. To avoid buffer overruns, additional mechanisms may be needed to effectively control the injection flow.
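The copy overhead of the conventional shared-buffer approach described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not an MPI implementation and not the method of the present application: it is written in Python, the shared memory segment is modeled as a `bytearray`, and the names `staged_broadcast`, `num_local_readers`, and `shared_buf_size` are hypothetical. It shows that when the shared buffer is smaller than the application buffer, the root must move the message chunk by chunk, and every byte is copied once into the shared buffer and once more by each local reader.

```python
def staged_broadcast(message: bytes, num_local_readers: int, shared_buf_size: int):
    """Simulate a root pushing a message through a bounded shared buffer.

    Hypothetical illustration only: 'shared' stands in for a shared memory
    segment, and each entry of 'outputs' for a local process's buffer.
    Returns the readers' buffers and the total number of bytes copied.
    """
    if shared_buf_size <= 0:
        raise ValueError("shared_buf_size must be positive")

    shared = bytearray(shared_buf_size)
    outputs = [bytearray() for _ in range(num_local_readers)]
    bytes_copied = 0

    for offset in range(0, len(message), shared_buf_size):
        chunk = message[offset:offset + shared_buf_size]
        n = len(chunk)

        # Copy 1: the root copies from its application buffer into the
        # intermediate shared buffer.
        shared[:n] = chunk
        bytes_copied += n

        # Copy 2: each local reader copies the chunk out of the shared
        # buffer into its own buffer.
        for out in outputs:
            out += shared[:n]
            bytes_copied += n

        # Flow control (implicit here because the loop is sequential):
        # the root must not overwrite 'shared' until every reader has
        # finished with the current chunk.

    return [bytes(o) for o in outputs], bytes_copied
```

For a 10-byte message, three local readers, and a 4-byte shared buffer, the sketch performs three rounds and copies 40 bytes in total: 10 into the shared buffer and 30 out of it.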
Obtaining good throughput, for example for medium to large message sizes, entails effective pipelining between the different phases of the operation, such as network to shared memory and shared memory to shared memory. Most current techniques use explicit synchronization in the form of flags or locks to verify whether data has been read or written. Apart from the overheads they add, these techniques make it difficult to achieve fine-grained pipelining. Moreover, on torus networks such as IBM™ Blue Gene™, data arrives over more than one link, so that a collective comprises multiple streams of data flowing into and out of a given node.
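The flag-based synchronization mentioned above can be sketched as a two-phase pipeline. Again, this is a hedged illustration of the conventional technique, not of the present application: it is written in Python, threads stand in for the network phase and the shared memory phase, `threading.Event` objects play the role of the "written" and "read" flags, and the names `pipelined_copy`, `chunk_size`, and `num_slots` are hypothetical. Each chunk requires two flag handshakes, so shrinking the chunk size to obtain finer pipelining increases the synchronization overhead proportionally.

```python
import threading


def pipelined_copy(message: bytes, chunk_size: int, num_slots: int = 2) -> bytes:
    """Move a message through a ring of slots guarded by explicit flags.

    Hypothetical illustration only: the producer models the phase that
    fills the shared buffer, the consumer models the phase that drains it.
    """
    if chunk_size <= 0 or num_slots <= 0:
        raise ValueError("chunk_size and num_slots must be positive")

    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    slots = [b""] * num_slots
    full = [threading.Event() for _ in range(num_slots)]   # "data written" flags
    empty = [threading.Event() for _ in range(num_slots)]  # "data read" flags
    for flag in empty:
        flag.set()  # every slot starts out free

    received = bytearray()

    def producer():
        for k, chunk in enumerate(chunks):
            s = k % num_slots
            empty[s].wait()   # wait until the consumer has read this slot
            empty[s].clear()
            slots[s] = chunk
            full[s].set()     # signal that the slot holds valid data

    def consumer():
        for k in range(len(chunks)):
            s = k % num_slots
            full[s].wait()    # wait until the producer has written this slot
            full[s].clear()
            received.extend(slots[s])
            empty[s].set()    # signal that the slot may be reused

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(received)
```

With `num_slots` set to 2, the producer can fill one slot while the consumer drains the other, which is the overlap that pipelining aims for; the cost is one wait and one signal per chunk on each side.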