The computer industry is moving toward fast, packetized, serial input/output (I/O) bus architectures, in which computing hosts and peripherals are linked by a switching network, commonly referred to as a switching fabric. A number of architectures of this type have been proposed, culminating in the “InfiniBand™” (IB) architecture, which has been advanced by a consortium led by a group of industry leaders (including Intel, Sun Microsystems, Hewlett Packard, IBM, Compaq, Dell and Microsoft). The IB architecture is described in detail in the InfiniBand Architecture Specification, Release 1.0 (October, 2000), which is incorporated herein by reference. This document is available from the InfiniBand Trade Association at www.infinibandta.org.
Computing devices (hosts or peripherals) connect to the IB fabric via a network interface adapter, which is referred to in IB parlance as a channel adapter. The IB specification defines both a host channel adapter (HCA) for connecting a host processor to the fabric, and a target channel adapter (TCA), intended mainly for connecting peripheral devices to the fabric. Typically, the channel adapter is implemented as a single chip, with connections to the computing device and to the network. Client processes (referred to hereinafter as clients) running on a host processor communicate with the transport layer of the IB fabric by manipulating a transport service instance, known as a “queue pair” (QP) made up of a send work queue and a receive work queue. The IB specification permits the HCA to allocate as many as 16 million (2^24) QPs, each with a distinct queue pair number (QPN). A given client may open and use multiple QPs simultaneously.
To send and receive messages over the network using an HCA, the client initiates work requests (WRs), which cause work items, called work queue elements (WQEs), to be placed onto the appropriate work queues. Normally, each WR has a data buffer associated with it, to be used for holding the data that is sent or received in executing the WQE. Each QP has its own WQE chain and associated data buffers. Each WQE in the chain and the buffer associated with it are passed to the control of the HCA when the WQE is posted. The HCA then executes the WQEs, so as to communicate with the corresponding QP of the channel adapter at the other end of the link. After it has finished servicing a WQE, the HCA typically writes a completion queue element (CQE) to a completion queue, to be read by the client. The buffer associated with the WQE is freed for use by the client only after the CQE is generated.
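The life cycle described above (work request, WQE, execution, CQE) can be pictured with a minimal sketch. The Python below is an illustrative model only, not HCA code; the names `WQE`, `QueuePair`, `post_send` and `hca_service_send`, as well as the dictionary used for the CQE, are hypothetical and do not come from the IB specification.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    """Work queue element with its associated data buffer (hypothetical layout)."""
    wr_id: int
    buffer: bytearray
    owned_by_hca: bool = False


@dataclass
class QueuePair:
    """A send work queue and a receive work queue sharing one QPN."""
    qpn: int
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)


def post_send(qp: QueuePair, wr_id: int, data: bytes) -> WQE:
    """Client side: a work request places a WQE on the send queue."""
    wqe = WQE(wr_id, bytearray(data))
    wqe.owned_by_hca = True  # control passes to the HCA when the WQE is posted
    qp.send_queue.append(wqe)
    return wqe


def hca_service_send(qp: QueuePair, completion_queue: deque) -> None:
    """HCA side: execute the oldest WQE, then write a CQE for the client."""
    wqe = qp.send_queue.popleft()
    # Transmission over the fabric is elided in this sketch.
    completion_queue.append({"qpn": qp.qpn, "wr_id": wqe.wr_id})
    wqe.owned_by_hca = False  # buffer is freed only after the CQE is generated
```

In this model the client may reuse a buffer only once `owned_by_hca` is false again, which mirrors the rule that the buffer is released after the CQE is written.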
The QP that initiates a particular operation, i.e. injects a message into the fabric, is referred to as the requester, while the QP that receives the message is referred to as the responder. An IB operation is defined to include a request message generated by the requester and, as appropriate, its corresponding response generated by the responder. (Not all request messages have responses.) Each message consists of one or more IB packets. Typically, a given HCA will serve simultaneously both as a requester, transmitting requests and receiving responses on behalf of local clients, and as a responder, receiving requests from other channel adapters and returning responses accordingly. Each QP is configured for a certain transport service type, which determines how the requesting and responding QPs interact. Both the source and destination QPs must be configured for the same service type. The IB specification defines four service types: reliable connection, unreliable connection, reliable datagram and unreliable datagram.
IB request messages include, inter alia, remote direct memory access (RDMA) write and send requests, RDMA read requests, and atomic read-modify-write requests. Both RDMA write and send requests carry data sent by the requester and cause the responder to write the data to a memory address at its own end of the link. Whereas RDMA write requests specify the address in the remote responder's memory to which the data are to be written, send requests rely on the responder to determine the memory location at the request destination. The send operation is sometimes referred to as a “push” operation, since the initiator of the data transfer pushes data to the remote QP. The receiving node's channel adapter places the data into the next available receive buffer for that QP. The send operation is also referred to as having channel semantics, because it moves data much like a mainframe I/O channel: each packet of data is tagged with a discriminator, and the destination processor chooses where to place the data based on the discriminator. In the case of IB send packets, the discriminator is the destination address (i.e., the local identifier, or LID) of the receiving channel adapter and the destination QP number.
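The contrast between the two write-type requests can be sketched as follows. This is a simplified illustration under stated assumptions: responder memory is modeled as a flat `bytearray`, receive buffers are assumed large enough for the payload, and the function names and the `(dlid, dest_qpn)` dictionary key are hypothetical choices for this example rather than anything defined by the IB specification.

```python
from collections import deque


def rdma_write(responder_memory: bytearray, remote_addr: int, data: bytes) -> None:
    """Memory semantics: the requester names the destination address."""
    responder_memory[remote_addr:remote_addr + len(data)] = data


def send(recv_queues: dict, dlid: int, dest_qpn: int, data: bytes) -> bytearray:
    """Channel semantics: the responder picks the buffer from the discriminator."""
    buf = recv_queues[(dlid, dest_qpn)].popleft()  # next available receive buffer
    buf[:len(data)] = data
    return buf
```

In the first function the caller supplies the address; in the second the caller supplies only the LID and QP number, and the placement decision is made at the receiving end.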
To specify the receive buffers to use for incoming send requests received by a channel adapter, a client on the host computing device must generate receive WQEs and place them in the receive queues of the appropriate QPs. Each time a valid send request is received, the destination channel adapter takes the next WQE from the receive queue and places the received data in the memory location specified in that WQE. Thus, every valid incoming send request engenders a receive queue operation by the responder.
It follows from this paradigm of send message handling that the destination channel adapter can receive and process incoming send packets on a given QP only when there is an appropriate WQE waiting to be read from the receive queue of the QP. To meet this requirement, the host computing device must prepare and hold in memory at least one receive WQE for every QP that is configured to receive send messages. When an incoming send packet arrives at the destination channel adapter on a given QP, and there is no receive WQE available, the channel adapter cannot process the packet and must therefore discard it. In the case of reliable services, when there is no receive WQE on hand, the channel adapter returns a “Receiver Not Ready” (RNR) NACK packet to the requester. The requester may then retry the send request after a suitable waiting period has passed.
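The responder behavior just described can be summarized in a short sketch. It is a simplified model, not adapter logic: the function name `on_send_packet` and the returned status strings are hypothetical, and each receive WQE is reduced to the buffer it points to, which is assumed to be large enough for the payload.

```python
from collections import deque


def on_send_packet(recv_queue: deque, payload: bytes, reliable: bool) -> str:
    """Responder handling of one incoming send packet (simplified)."""
    if not recv_queue:
        # No receive WQE is waiting, so the packet cannot be processed.
        # Reliable services answer with an RNR NACK; otherwise it is dropped.
        return "RNR_NAK" if reliable else "DISCARDED"
    buf = recv_queue.popleft()     # consume the next receive WQE
    buf[:len(payload)] = payload   # place the data where the WQE points
    return "ACCEPTED"
```

Every accepted packet removes one entry from the receive queue, which is why the host must keep the queue replenished.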
To avoid this situation, the IB specification provides a flow control mechanism for send messages using reliable connection services, based on end-to-end credits. As a rule, a requester cannot send a request message unless it has the appropriate credits to do so. These credits are passed to the requester by the responder, wherein each credit represents the resources needed by the responder to receive one inbound request message. Specifically, each credit represents one WQE posted to the receive queue of the responding QP.
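The credit rule can be illustrated with a minimal requester-side counter. This sketch deliberately ignores how credits are encoded and carried in IB packets; the class name `Requester` and its methods are hypothetical and serve only to show the accounting.

```python
class Requester:
    """Tracks end-to-end credits granted by the responder (simplified)."""

    def __init__(self) -> None:
        self.credits = 0

    def grant(self, n: int) -> None:
        # Each credit stands for one WQE posted to the responder's receive queue.
        self.credits += n

    def try_send(self) -> bool:
        if self.credits == 0:
            return False  # hold the request rather than risk an RNR NACK
        self.credits -= 1
        return True
```

A requester that has been granted two credits can thus issue two send requests and must then wait for the responder to post further receive WQEs and advertise new credits.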
Given the large number of QPs (up to 16 million) that can be in use at any one time, the need to keep a WQE available in every receive queue can consume a great deal of memory. Practically speaking, it is much more efficient for both the host computing device and the channel adapter to create and maintain several WQEs in the receive queue at any given time, thus increasing even further the memory and computing resources needed for each QP. It can be seen that a prohibitive amount of memory is thus required if a large complement of QPs is to be supported, as provided by the IB specification.
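A back-of-the-envelope calculation shows the scale of the problem. The WQE size used below (64 bytes) is purely an assumption for illustration, since no size is stated here, and the figure excludes the data buffers that the WQEs point to.

```python
MAX_QPS = 2 ** 24  # upper bound on QPs per HCA cited above


def receive_queue_bytes(num_qps: int, wqes_per_qp: int, wqe_bytes: int) -> int:
    """Memory needed to keep receive WQEs posted, excluding data buffers."""
    return num_qps * wqes_per_qp * wqe_bytes


# Hypothetical 64-byte WQE: one per QP already costs 1 GiB at full scale.
one_each = receive_queue_bytes(MAX_QPS, 1, 64)
four_each = receive_queue_bytes(MAX_QPS, 4, 64)
```

Under this assumption, holding a single receive WQE for each of 2^24 QPs takes 2^30 bytes, and holding four per QP takes four times as much.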