1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method for performing collective send operations on a system area network.
2. Description of Related Art
In a System Area Network (SAN), such as an InfiniBand™ (IB) network or iWARP network, hardware provides a message passing mechanism that can be used for Input/Output (I/O) devices and inter-processor communications (IPC) between general computing nodes. Processes executing on devices access SAN message passing hardware by posting send/receive messages to send/receive work queues on a SAN channel adapter (CA). These processes are also referred to as “consumers.”
The send/receive work queues (WQs) are assigned to a consumer as a queue pair (QP). The messages can be sent over different transport types. For an iWARP network, the transport type is Transmission Control Protocol/Internet Protocol (TCP/IP). For an IB network, the transport types are Reliable Connection (RC), Reliable Datagram (RD), Unreliable Connection (UC), Unreliable Datagram (UD), and Raw Datagram (RawD). Consumers retrieve the results of these messages from a completion queue (CQ) through SAN send and receive work completions (WCs). The source channel adapter takes care of segmenting outbound messages and sending them to the destination. The destination channel adapter takes care of reassembling inbound messages and placing them in the memory space designated by the destination's consumer.
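The flow described above, from a posted send through segmentation, reassembly, and work completion, can be illustrated with a minimal sketch. It is a toy in-memory model only, not a channel adapter interface: the names `QueuePair`, `segment`, `reassemble`, and `transfer`, as well as the 4-byte `MTU`, are hypothetical and chosen purely for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

MTU = 4  # hypothetical segment size in bytes, kept small for illustration


@dataclass
class QueuePair:
    """Toy queue pair: one send work queue and one receive work queue."""
    send_wq: deque = field(default_factory=deque)
    recv_wq: deque = field(default_factory=deque)


def segment(message: bytes, mtu: int = MTU) -> list:
    """Split an outbound message into packets of at most `mtu` bytes."""
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


def reassemble(packets: list) -> bytes:
    """Rebuild the original message from packets received in order."""
    return b"".join(packets)


def transfer(src: QueuePair, dst: QueuePair, src_cq: deque, dst_cq: deque) -> None:
    """Carry one posted send from src to dst and record work completions."""
    message = src.send_wq.popleft()      # next posted send work request
    buffer = dst.recv_wq.popleft()       # buffer designated by the destination consumer
    packets = segment(message)           # role of the source channel adapter
    buffer.extend(reassemble(packets))   # role of the destination channel adapter
    src_cq.append(("send", len(message)))
    dst_cq.append(("recv", len(message)))


# Usage: the destination posts a receive buffer, the source posts a send.
src, dst = QueuePair(), QueuePair()
src_cq, dst_cq = deque(), deque()
inbox = bytearray()
dst.recv_wq.append(inbox)
src.send_wq.append(b"hello, SAN")
transfer(src, dst, src_cq, dst_cq)
```

After `transfer` returns, each consumer finds one entry on its completion queue, and `inbox` holds the reassembled message.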
For IB networks, two channel adapter types are present in nodes of the SAN fabric: a host channel adapter (HCA) and a target channel adapter (TCA). The HCA is used by general purpose computing nodes to access the SAN fabric. For iWARP, RDMA-enabled NICs (RNICs) are present in the nodes of the SAN fabric and are used to access the SAN fabric. Consumers use SAN “verbs” to access host channel adapter functions. The software that interprets verbs and directly accesses the channel adapter is known as the channel interface (CI).
Target channel adapters (TCAs) are used by nodes that are the subject of messages sent from host channel adapters. The TCAs serve a function similar to that of the HCAs in providing the target node an access point to the SAN fabric. More information about SANs and the InfiniBand™ architecture may be obtained from the specification documents available from the InfiniBand™ Trade Association at www.infinibandta.org/specs/. More information about iWARP may be found at the RDMA Consortium's home page at http://www.rdmaconsortium.org/home and at the IETF Transport Area's Remote Direct Data Placement Working Group home page at http://www.ietf.org/html.charters/rddp-charter.html.
To satisfy the requirement of high performance computing (HPC) applications to be able to perform send operations to a collection of end-points, known SAN architectures, such as the InfiniBand™ architecture, provide three different transport services to handle send operations. The first transport service is the Reliable Connection (RC). This transport service requires the creation of a queue pair (QP) for each end-point, and the consumer process must post a write work request (WR) for each end-point, i.e. on each QP. This method, while reliable, is not efficient because one WR and one QP have to be generated for each end-point.
Another transport service available under known SAN architectures, such as InfiniBand™, is the Reliable Datagram (RD). In the RD transport service, a single QP may communicate with multiple end-points. This is made possible through the end-to-end context (EEC) mechanism provided by the RD transport service. A consumer may post work requests to an RD QP that target any of the EECs on the same host channel adapter (HCA) as the RD QP. While this transport service may eliminate the requirement that there be a separate QP for each end-point, it is inefficient because the consumer process must post a WR for each end-point. Additionally, the RD service allows only one message to be outstanding on a link. This inability to pipeline work causes poor performance. Moreover, if the end-point has multiple QPs that need to receive the message that is the subject of the WR, then a Send WR must be posted to the HCA's RD QP for each QP of each end-point that is to receive the message.
The third transport service is the Unreliable Datagram (UD). The UD transport service is connectionless and unacknowledged. It allows the QP to communicate with any UD QP on any node. Using this service, consumer processes may multicast a message by posting one WR. However, there is no guarantee that the message will reach any or all of the end-points.
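The trade-off among the three transport services can be summarized by counting the QPs and WRs a sender needs in order to deliver one message to a collection of end-points. The following is a minimal sketch that encodes only the counts stated above. The `SendCost` type and the `collective_send_cost` function are hypothetical names introduced for illustration and are not part of any SAN interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SendCost:
    queue_pairs: int    # QPs the sender must create
    work_requests: int  # WRs the consumer process must post
    reliable: bool      # whether delivery is guaranteed


def collective_send_cost(transport: str, qps_per_endpoint: list) -> SendCost:
    """Cost of sending one message to every receiving QP listed.

    qps_per_endpoint[i] is the number of receiving QPs on end-point i.
    The counts follow the description in the text, not a measurement.
    """
    endpoints = len(qps_per_endpoint)
    if transport == "RC":
        # One QP and one WR per end-point.
        return SendCost(endpoints, endpoints, True)
    if transport == "RD":
        # A single sending QP, but one WR per receiving QP of each end-point.
        return SendCost(1, sum(qps_per_endpoint), True)
    if transport == "UD":
        # A single QP and a single multicast WR, with no delivery guarantee.
        return SendCost(1, 1, False)
    raise ValueError(f"unknown transport: {transport}")


# Usage: three end-points, the second of which has two receiving QPs.
targets = [1, 2, 1]
costs = {t: collective_send_cost(t, targets) for t in ("RC", "RD", "UD")}
```

For this example, RC needs three QPs and three WRs, RD needs one QP and four WRs, and UD needs one QP and one WR but offers no guarantee of delivery.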