Parallel computing systems such as computer clusters, Symmetric Multiprocessor (SMP) computers and other architectures are becoming increasingly affordable and cost-effective for a wide range of applications. Many of these systems are built from commodity and relatively inexpensive computer parts connected with high speed networks (such as LANs, proprietary interconnects, etc.) or System Area Networks (SANs). These types of computing systems compete with the historically expensive custom-built supercomputers.
Most programming models provide for writing parallel applications, and many parallel applications take advantage of collective/distributive communication operations. Moreover, many scientific applications rely exclusively on collective/distributive operations for their communication. Thus, providing a high performance and scalable collective/distributive communication method for broadcasting information is critical for the operation and success of commodity based parallel computing systems.
A key issue in providing a high performance and scalable collective/distributive communication method is to reliably and quickly broadcast a message from a root process to multiple other processes in a group. Current methods need at least O(log n) steps to deliver a message reliably to n nodes. This means the effort and time to reliably broadcast data increases logarithmically with the number of nodes.
FIGS. 1A to 1C are block diagrams for illustrating in more detail the conventional approaches to broadcasting data in a parallel computing scheme.
As shown in FIG. 1A, the simplest method to broadcast data is to send it one by one to each recipient in the group. With n nodes in total, this method takes n sends and n receives, so the number of steps is 2n, i.e., O(n).
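The cost of this one-by-one approach can be illustrated with a minimal sketch. The function name and the in-memory "inbox" are hypothetical stand-ins for real point-to-point sends and receives; following the text, n counts the recipients and each send and each receive counts as one step.

```python
def sequential_broadcast(n):
    """Simulate a root sending one message at a time to n recipients.

    Returns (sends, receives, steps). The inbox dict is only a stand-in
    for the recipients' receive buffers.
    """
    sends = 0
    receives = 0
    inbox = {}
    for recipient in range(1, n + 1):
        sends += 1                  # root issues one point-to-point send
        inbox[recipient] = "data"   # recipient performs the matching receive
        receives += 1
    return sends, receives, sends + receives


print(sequential_broadcast(8))  # (8, 8, 16)
```

Doubling the number of recipients doubles all three counts, which is the linear growth described above.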
A straightforward improvement to the method in FIG. 1A is to use the broadcast/multicast capability of certain interconnects (such as Ethernet, Infiniband, etc.). However, such broadcast and multicast protocols support only unreliable datagram services, which do not guarantee data delivery. Therefore, to take advantage of hardware-supported broadcast/multicast capability, an additional reliability protocol needs to be in place. All known approaches and prior art implement this reliability protocol as either (1) a simple method where every recipient sends an acknowledgement directly back to the root or (2) a reverse tree-based confirmation method where the leaves send their confirmations to co-roots, which finally inform the main root that the message was delivered.
For example, in connection with the first improvement as shown in FIG. 1B, the root node sends data using multicast, while the other nodes are waiting for it. Each node that receives the message sends an acknowledgement (ACK) back to the root node. The root blocks and waits for all ACKs to be received. If not all ACKs arrive within a certain period of time, it times out and re-transmits the message. This method takes n+1 steps, i.e., O(n), where sending is O(1) and returning the confirmations is O(n).
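The multicast-plus-ACK scheme of FIG. 1B can be sketched as a small deterministic simulation. Everything here is illustrative: the function name, the `lost` set of (attempt, node) pairs that models dropped datagrams, and the `max_attempts` limit are assumptions, and nodes that have already acknowledged are assumed to ignore re-transmissions.

```python
def multicast_with_acks(nodes, lost=frozenset(), max_attempts=5):
    """Root multicasts, then waits for an ACK from every node.

    `lost` is a hypothetical set of (attempt, node) pairs whose datagram
    is dropped. Returns (attempts, acks_handled_by_root).
    """
    pending = set(nodes)
    acks_at_root = 0
    for attempt in range(1, max_attempts + 1):
        # One multicast send reaches every node whose datagram is not lost.
        delivered = {node for node in pending if (attempt, node) not in lost}
        # Each node that received the message ACKs directly to the root.
        acks_at_root += len(delivered)
        pending -= delivered
        if not pending:
            return attempt, acks_at_root
        # Timeout: some ACKs are missing, so the root re-transmits.
    raise TimeoutError(f"no ACK from {sorted(pending)}")


print(multicast_with_acks(range(1, 5)))                 # (1, 4)
print(multicast_with_acks(range(1, 5), lost={(1, 3)}))  # (2, 4)
```

In both runs the root alone handles one ACK per recipient, which is the O(n) term in the step count.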
The second improvement takes advantage of tree-based algorithms, which are based on point-to-point communication operations. Similar to a snowball principle, the root sends to multiple nodes, each of which in turn sends to multiple nodes, until the leaves are reached. In a tree-based method, the number of steps to reach the leaf nodes typically increases logarithmically with the total number of nodes n, i.e., O(log n).
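The logarithmic growth can be checked with a short sketch that counts tree levels. It assumes a complete tree in which every node forwards to the same number of children (`fanout`, a hypothetical parameter); real implementations may use other tree shapes.

```python
def tree_broadcast_levels(n, fanout=2):
    """Count the levels needed for a message to reach all n nodes.

    n includes the root. Each level, every node reached in the previous
    level forwards the message to `fanout` new nodes.
    """
    reached = 1    # the root already holds the message
    frontier = 1   # nodes that received the message in the last level
    levels = 0
    while reached < n:
        frontier *= fanout
        reached += frontier
        levels += 1
    return levels


print(tree_broadcast_levels(7))               # 2
print(tree_broadcast_levels(1000))            # 9
print(tree_broadcast_levels(1000, fanout=10)) # 3
```

Growing the group from 7 to 1000 nodes adds only seven levels with a fanout of 2, in contrast to the linear growth of the one-by-one method.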
More particularly, as shown in FIG. 1C, and in contrast to the scheme in FIG. 1B, a hierarchical structure is used for ACK collection after the multicast by the root sender. This avoids the ACKs all hitting the sender at the same time (a condition known as “ACK implosion”) and distributes the load across a number of nodes. In a tree-based structure for collecting ACKs, all nodes form a tree with the root node at the root of the tree. Intermediate nodes are responsible for collecting ACKs from their children. A variation is the co-root scheme where, in addition to the root node, a subset of other nodes is selected as co-roots, which receive the data in a reliable manner. The remaining nodes are called leaf nodes. Each of the root and the co-roots is responsible for a group of leaf nodes and performs as described above. These types of methods use O(log n) steps for n nodes.
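How a hierarchy spreads the ACK load can be sketched as follows. The heap-style node numbering (node 0 is the root, the children of node i are fanout*i+1 through fanout*i+fanout), the `missed` set of nodes that did not receive the multicast, and the function name are all assumptions made for illustration.

```python
def collect_acks(n, fanout, missed=frozenset()):
    """Aggregate ACKs up a fanout-ary tree of n nodes rooted at node 0.

    A node confirms to its parent only if it received the data and all
    of its children confirmed. Returns
    (root_confirmed, max_acks_handled_by_any_node).
    """
    children = {
        i: [c for c in range(fanout * i + 1, fanout * i + fanout + 1) if c < n]
        for i in range(n)
    }
    confirmed = [False] * n
    acks_handled = [0] * n
    # Children have higher indices than parents, so walk from the last node up.
    for node in reversed(range(n)):
        child_acks = [confirmed[c] for c in children[node]]
        acks_handled[node] = sum(child_acks)   # ACKs this node had to process
        confirmed[node] = node not in missed and all(child_acks)
    return confirmed[0], max(acks_handled)


print(collect_acks(16, fanout=15))             # (True, 15)  flat: root takes every ACK
print(collect_acks(16, fanout=2))              # (True, 2)   tree: at most 2 per node
print(collect_acks(16, fanout=2, missed={5}))  # (False, 2)  a loss withholds confirmation
```

With a fanout of n-1 the tree degenerates into the flat scheme of FIG. 1B, where the root processes every ACK itself; a small fanout caps the per-node load at the cost of more levels.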
Although the tree-based scheme improves throughput, delays still increase as the number of nodes increases (although only by the slower logarithmic factor of log n).
There are also sources of additional delay even in a tree-based implementation. For example, broadcasting is typically implemented as a blocking operation to ensure that the operation does not return at the root node until the communication buffer can be reused. Accordingly, if a message gets lost in the communication network, the reliability protocol will re-send the message, and the communication buffer has to keep the original message until the last node has confirmed receipt. For a receiving node, the operation returns only after the broadcast data has been delivered to the respective receive buffer. This blocking requirement can therefore introduce significant delays, especially as the number of nodes increases and the chance of communication interruptions increases correspondingly.
Another disadvantage of tree-based methods is that if intermediate nodes that are expected to forward broadcast data are busy, they delay the forwarding, which has an adverse impact on the execution time of an application.
Accordingly, it would be desirable if there were a reliable method that needed only a fixed number of steps for distributive/collective communication functions in parallel computing systems and would allow the use of simple commodity based cluster computers to achieve similar performance compared to a custom-built supercomputer.