Collective communication operations involve several processes at a time, if not all of them. Operations such as MPI (Message Passing Interface) broadcast, which sends data from one process to all the processes in the communicator, and MPI allreduce, which performs a reduction operation and makes the result available to all the processes, are important communication patterns that often limit the performance and scalability of applications. It is therefore desirable to get the best possible performance from such operations.
BlueGene/L systems, which are massively parallel computers, break up a long broadcast into several shorter broadcasts. The message is broken up into disjoint submessages, called colors, and the submessages are sent in such a way that different colors use different links on the three-dimensional (3D) torus. In this way, a single broadcast in one dimension of a torus could theoretically achieve two links' worth of bandwidth (with two colors), a two-dimensional broadcast could achieve four links' worth of bandwidth, and a three-dimensional broadcast could achieve six links' worth of bandwidth. On those systems, however, there is no DMA (direct memory access) engine; instead, the processors are responsible for injecting and receiving each packet. Accordingly, what is desirable is a method and system that can utilize features of a DMA engine and the network so as to achieve high-throughput large-message collectives. It is also desirable to have a method and system that utilizes those features to realize low-latency small-message collectives.
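The splitting of a broadcast message into colors, as described above, can be illustrated with a minimal sketch. It assumes two colors per torus dimension (one per direction), which matches the 2, 4, and 6 link figures given above. The function names `split_into_colors` and `color_directions`, the choice of contiguous near-equal chunks, and the labeling of links as "+x", "-x", and so on are hypothetical choices made for illustration only; the sketch does not model the actual routing schedule used on the torus.

```python
from typing import List


def split_into_colors(message: bytes, dims: int) -> List[bytes]:
    """Split a message into 2 * dims disjoint, contiguous submessages (colors).

    Assumption for illustration: one color per direction of each torus
    dimension, with chunk sizes differing by at most one byte.
    """
    if dims not in (1, 2, 3):
        raise ValueError("dims must be 1, 2, or 3 for a 3D torus")
    num_colors = 2 * dims
    base, extra = divmod(len(message), num_colors)
    colors = []
    start = 0
    for c in range(num_colors):
        # The first `extra` colors carry one additional byte.
        size = base + (1 if c < extra else 0)
        colors.append(message[start:start + size])
        start += size
    return colors


def color_directions(dims: int) -> List[str]:
    """Return a hypothetical label for the outgoing link used by each color."""
    if dims not in (1, 2, 3):
        raise ValueError("dims must be 1, 2, or 3 for a 3D torus")
    axes = "xyz"[:dims]
    return [sign + axis for axis in axes for sign in ("+", "-")]


if __name__ == "__main__":
    payload = bytes(range(10))
    for link, chunk in zip(color_directions(2), split_into_colors(payload, 2)):
        print(link, len(chunk), "bytes")
```

Because the colors are disjoint and each is assigned to a different link, the submessages can in principle travel concurrently, which is where the theoretical multiple of single-link bandwidth comes from.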