Large and complex computing tasks are sometimes carried out by distributed computer systems, which comprise multiple interconnected compute nodes. Parallelizing a computing task among multiple compute nodes helps to reduce task execution time, and enables execution of large computing tasks in reasonable run times.
Some known communication protocols and Application Programming Interfaces (APIs) have been designed specifically to facilitate task execution over multiple distributed compute nodes. An example protocol is the Message Passing Interface (MPI), which is specified by the MPI Forum in “MPI: A Message-Passing Interface Standard,” version 2.2, Sep. 4, 2009, which is incorporated herein by reference. In particular, chapter 5 of the MPI standard specifies collective communication functions involving a group or groups of software processes. Several techniques for performing collective functions are known in the art. Example methods are described in U.S. Patent Application Publication 2009/0063816, whose disclosure is incorporated herein by reference.
Coti et al. propose a programming method and a modification of the MPI runtime environment that take Wide Area Network (WAN) topology into consideration, in “MPI Applications on Grids: a Topology Aware Approach,” Proceedings of the 15th International European Conference on Parallel and Distributed Computing (EuroPar'09), Delft, the Netherlands, August 2009, which is incorporated herein by reference.
Petrini et al. describe a network interconnection scheme called the Quadrics Network (QsNet), in “The Quadrics Network (QsNet): High-Performance Clustering Technology,” Proceedings of the 9th IEEE Symposium on Hot Interconnects (HotI'01), Palo Alto, Calif., August 2001, which is incorporated herein by reference. QsNet integrates a programmable processor in the network interface, which allows the implementation of intelligent communication protocols.
Some known techniques attempt to offload the processing of collective functions from the compute nodes. For example, Sancho et al. describe a method that dedicates some compute node processors in the system to perform only collective functions, in “Efficient Offloading of Collective Communications in Large-Scale Systems,” Proceedings of the 2007 IEEE International Conference on Cluster Computing, Austin, Tex., Sep. 17-20, 2007, which is incorporated herein by reference. As another example, U.S. Patent Application Publication 2005/0097300, whose disclosure is incorporated herein by reference, describes a dedicated hardware-based collective offload engine that provides collective processing of data from processing nodes in a distributed computing environment. The dedicated collective offload hardware engine is coupled to a switch fabric. A result is produced by the collective offload hardware engine based on the collective processing of the data, and is forwarded to at least one processing node. Collective processing is facilitated by communication among a plurality of dedicated collective offload engines via the switch fabric or via a private channel disposed between the collective offload engines.