Communication services in packet networks are commonly classified as “reliable” or “unreliable.” (In a more positive light, “unreliable services” are also referred to as “best effort” services.) Reliable services guarantee that data will be accepted at the receiving end of a link in the order in which they were transmitted from the transmitting end, without loss and without duplication. For a communication protocol to support reliability, it typically must identify each individual datagram that is transmitted, and it must provide an acknowledgment mechanism that enables the receiver to inform the transmitter which datagrams it has or has not received.
In the well-known Open Systems Interconnection (OSI) communications model, the transport layer (Layer 4) is responsible for ensuring the reliable arrival of messages, as well as providing error checking mechanisms and data flow controls. The transport layer provides services for both “connection-mode” transmissions and for “connectionless-mode” transmissions. A message or other datagram generated by the transport layer may typically be broken up into multiple packets for transmission over a network. The packet payloads must then be reassembled at the receiving end to recover the original message or datagram.
The Transmission Control Protocol (TCP) is the transport layer protocol used in Internet Protocol (IP) networks for reliable, connection-mode communications. TCP is described by Postel in RFC 793 of the U.S. Defense Advanced Research Projects Agency (DARPA), entitled “Transmission Control Protocol: DARPA Internet Program Protocol Specification” (1981), which is incorporated herein by reference. TCP provides for reliable inter-process communication between pairs of processes in host computers. The information exchanged between TCP peers is packed into datagrams known as segments, each comprising a TCP header followed by payload data. The segments are transported over the network in IP packets. There is typically no relation between the boundaries of TCP segments and actual messages generated by host application protocols. Rather, the application process simply provides a stream of data to a TCP socket, typically including both message headers and payload data, and the TCP transmitter divides the stream into segments according to its own rules.
When data segments arrive at the receiver, TCP requires that the receiver send back an acknowledgment (ACK) of the data. When the sender does not receive the ACK within a certain period of time, it retransmits the data. TCP specifies that the bytes of transmitted data be sequentially numbered, so that the receiver can acknowledge the data by naming the highest-numbered byte it has received, thus also acknowledging the previous bytes. RFC 793 contains only a general assertion that data should be acknowledged promptly, but gives no more specific indication as to how quickly an acknowledgment must be sent, or how much data should be acknowledged in each separate acknowledgment. The decision as to when to send an acknowledgment is in the hands of the receiver.
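The cumulative acknowledgment scheme described above can be illustrated with a minimal sketch. The function name and the representation of segments as (start, length) pairs are assumptions made for illustration only and are not taken from RFC 793. Note also that the ACK field of an actual TCP header carries the next sequence number the receiver expects, i.e., one more than the value computed here.

```python
def highest_contiguous_byte(segments, first_byte=0):
    """Return the highest byte number that can be cumulatively acknowledged.

    segments: iterable of (start, length) byte ranges that have arrived,
              in any order (hypothetical representation).
    Returns None if the byte numbered first_byte has not arrived yet.
    """
    next_expected = first_byte
    for start, length in sorted(segments):
        if start > next_expected:
            break  # gap in the byte stream: later data cannot be acknowledged
        next_expected = max(next_expected, start + length)
    if next_expected == first_byte:
        return None
    return next_expected - 1


# Bytes 0-149 arrived in order; bytes 150-199 are missing, so the segment
# starting at byte 200 is held but not yet acknowledged.
print(highest_contiguous_byte([(0, 100), (100, 50), (200, 50)]))
```

Acknowledging byte 149 in this example implicitly acknowledges bytes 0 through 148 as well, which is why a single ACK can cover several segments.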
Packet network communications are a central element in new high-speed, serial input/output (I/O) bus architectures that are gaining acceptance in the computer industry. In these systems, computing hosts and peripherals are linked together by a switching network, commonly referred to as a switching fabric, taking the place of parallel buses that are used in traditional systems. A number of architectures of this type have been proposed, culminating in the “InfiniBand™” (IB) architecture, which is described in detail in the InfiniBand Architecture Specification, Release 1.0 (October, 2000), which is incorporated herein by reference. This document is available from the InfiniBand Trade Association at www.infinibandta.org.
As described in Chapter 9 of the specification, IB supports both unreliable and reliable transport services. The reliable services include both reliable connection service and reliable (connectionless) datagram service. The reliable transport services use a combination of sequence numbers and acknowledgment messages (ACK/NACK) to verify packet delivery order, prevent duplicate packets and out-of-sequence packets from being processed, and to detect missing packets. (A “NACK” is a negative acknowledgment message, used to indicate a flaw in the received data.) For reliable transport services, an IB operation is defined as including a request message and its corresponding response. The request message consists of one or more request packets, while the response, except for remote direct memory access (RDMA) read responses, consists of exactly one packet. The response packets may either convey data requested in the request message (thus implicitly acknowledging the request) or may be acknowledgment packets. In the event of a NACK or failure to receive an ACK within a prescribed timeout period, the packets are retransmitted, beginning with the next packet sequence number (PSN) after the last one that was positively acknowledged. The sender maintains a retry counter in order to determine the number of times the operation will be retried before reporting back to the host that the request could not be carried out.
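The retransmission behavior described above may be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure defined in the IB specification: packets are identified by a zero-based index rather than by an actual PSN, the callback send_from is a hypothetical stand-in for the transmit path (it sends all packets from the given index onward and returns the index of the last packet positively acknowledged, or one less than the starting index if none was), and the retry counter is a plain count of retransmission attempts.

```python
def run_operation(num_packets, send_from, retry_limit):
    """Send packets 0..num_packets-1, retransmitting after a NACK or timeout.

    send_from(start): hypothetical callback that transmits packets from
        index `start` to the end and returns the index of the last packet
        that was positively acknowledged (start - 1 if none was).
    Returns True when every packet has been acknowledged, or False when the
    retry limit is exhausted and the failure must be reported to the host.
    """
    last_acked = -1
    retries = 0
    while True:
        # Resume from the packet after the last positive acknowledgment.
        last_acked = max(last_acked, send_from(last_acked + 1))
        if last_acked == num_packets - 1:
            return True
        if retries == retry_limit:
            return False
        retries += 1


# Example channel that loses packet 2 on the first attempt only.
calls = []

def lossy_channel(start):
    calls.append(start)
    return 1 if len(calls) == 1 else 4

print(run_operation(5, lossy_channel, retry_limit=2), calls)
```

In this example the second transmission starts at index 2, the packet following the last one that was positively acknowledged, rather than at the beginning of the message.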
To indicate that a particular request packet should be explicitly acknowledged, the requester can set a designated flag, known as the AckReq bit, in the packet transport header. The IB specification does not place any limitations on when the AckReq bit should or should not be used, and suggests only that the bit should be set in certain cases in the last packet of a given message. When the AckReq bit is not set, it is up to the receiver to determine how often it should send acknowledgments.
A host processor (or host) connects to the IB fabric via a network interface adapter, which is referred to in IB parlance as a host channel adapter (HCA). Client processes running on the host communicate with the transport layer of the IB fabric by manipulating transport service instances, known as “queue pairs” (QPs), each made up of a send work queue and a receive work queue. Communications take place between a local QP maintained by the HCA and a remote QP maintained by a channel adapter at the other side of the fabric. To send and receive messages over the network, the client initiates work requests (WRs), which cause work items, called work queue elements (WQEs), to be placed in the appropriate queues. For each work request, the client prepares a descriptor defining the operation to be performed by the HCA.
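The relationship between queue pairs, work requests and work queue elements may be pictured with the following minimal data-structure sketch. All class, method and field names here are hypothetical and chosen for illustration; they do not correspond to any interface defined in the IB specification or to any particular HCA driver.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkQueueElement:
    opcode: str       # e.g. "SEND" or "RDMA_WRITE" (illustrative labels)
    descriptor: dict  # operation details prepared by the client


@dataclass
class QueuePair:
    send_queue: deque = field(default_factory=deque)
    receive_queue: deque = field(default_factory=deque)

    def post_send(self, opcode, descriptor):
        """A work request from the client places a WQE in the send queue."""
        self.send_queue.append(WorkQueueElement(opcode, descriptor))

    def post_receive(self, descriptor):
        """A receive work request places a WQE in the receive queue."""
        self.receive_queue.append(WorkQueueElement("RECEIVE", descriptor))

    def next_send(self):
        """The adapter services send WQEs in the order they were posted."""
        return self.send_queue.popleft() if self.send_queue else None


qp = QueuePair()
qp.post_send("SEND", {"gather_list": [(0x1000, 64)]})
qp.post_send("RDMA_WRITE", {"gather_list": [(0x2000, 4096)]})
print(qp.next_send().opcode)
```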
For RDMA write and send operations, the descriptor typically contains a gather list pointing to data that are to be read out of memory and transmitted as part of the message. To execute RDMA write and send operations, the HCA reads the corresponding descriptors, fetches the data specified in the gather list from the host memory, and loads the data into packets for transmission over the fabric to the remote QP. Since the gather list in a single WR may specify as much as 2^31 bytes (2 GB) of data to be transmitted, while the IB fabric does not support packets larger than 4 KB, some WQEs can require the HCA to generate a large number of packets. (Each QP has its own maximum transfer unit (MTU), or maximum packet size, which may be 256, 512, 1024, 2048 or 4096 bytes.) Unlike TCP/IP, however, in which there is no fixed relation between message boundaries and packet boundaries, the IB transport layer protocol specifies that each WR and WQE corresponds to a single message. The boundaries of the first and last packet for a given WQE thus correspond to the boundaries of the message. The size of the first and subsequent packets, except for the last packet, is equal to the MTU. The last packet takes up the remainder of the message, of length less than or equal to the MTU.
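The packet-size rule described above can be made concrete with a short sketch. The function name is hypothetical, and the treatment of a zero-length message as a single empty packet is an assumption made for this illustration.

```python
VALID_MTUS = (256, 512, 1024, 2048, 4096)


def packet_sizes(message_len, mtu):
    """Return the payload size of each packet for one message.

    Every packet except the last carries exactly `mtu` bytes; the last
    packet carries the remainder, which is at most `mtu` bytes.
    """
    if mtu not in VALID_MTUS:
        raise ValueError("unsupported MTU")
    if message_len < 0:
        raise ValueError("message length must be non-negative")
    full_packets, remainder = divmod(message_len, mtu)
    sizes = [mtu] * full_packets
    if remainder or not sizes:
        # Assumption: a zero-length message is sent as one empty packet.
        sizes.append(remainder)
    return sizes


print(packet_sizes(10000, 4096))
print(len(packet_sizes(2**31, 4096)))
```

For a 10000-byte message with an MTU of 4096 bytes, this yields two full packets followed by a final packet of 1808 bytes. A maximum-size message of 2^31 bytes with the same MTU requires 2^31 / 2^12 = 2^19 = 524288 packets, which illustrates how a single WQE can generate a large number of packets.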
In generating an outgoing message or servicing an incoming message on any given QP, the HCA uses context information pertaining to the QP. The QP context is created in a memory accessible to the HCA by the host process that sets up the QP. The host configures the QP context with fixed information such as the destination address, negotiated operating limits, service level and keys for access control. Typically, a variable part of the context, such as the current packet sequence number (PSN) and information regarding the WQE being serviced by the QP, is subsequently updated by the HCA as it sends and receives messages. For example, to service an incoming packet on a reliable connection, the HCA reads the packet transport header, which identifies the target QP, and uses the context of that QP to verify that the packet came from the correct source and that the PSN is valid (no missed packets). Based on this information, the HCA generates the appropriate acknowledgment (ACK or NACK) or other response. As another example, to generate an RDMA write request on a reliable connection, the HCA reads the WQE and retrieves necessary data from the QP context, such as the destination address, target QP and next PSN. It then accesses the host memory to fetch the required data, and sends the packet to the destination.
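The use of the QP context in servicing an incoming packet on a reliable connection may be sketched as follows. This is a simplified illustration, not the responder behavior defined in the IB specification. The class and function names, the string return values, and the reduction of the source check to a single remote QP number are all assumptions. The sketch relies on the PSN being a 24-bit value that wraps around, and it classifies a PSN as "ahead" or "behind" by modular distance, which is likewise a simplification.

```python
from dataclasses import dataclass

PSN_MASK = 0xFFFFFF   # PSNs are 24-bit values and wrap around
HALF_RANGE = 1 << 23  # assumption: split the PSN space into ahead/behind


@dataclass
class QPContext:
    remote_qp: int     # fixed part, configured by the host
    expected_psn: int  # variable part, updated by the adapter


def service_packet(ctx, src_qp, psn):
    """Classify an incoming request packet against the QP context."""
    if src_qp != ctx.remote_qp:
        return "DROP"  # packet did not come from the expected source
    delta = (psn - ctx.expected_psn) & PSN_MASK
    if delta == 0:
        ctx.expected_psn = (ctx.expected_psn + 1) & PSN_MASK
        return "ACK"
    if delta < HALF_RANGE:
        return "NACK"  # PSN is ahead: one or more packets were missed
    return "DUPLICATE"  # PSN is behind: already received, not reprocessed


ctx = QPContext(remote_qp=7, expected_psn=0xFFFFFF)
print(service_packet(ctx, 7, 0xFFFFFF), ctx.expected_psn)
```

In the example, the expected PSN wraps from 0xFFFFFF to 0 after the in-order packet is accepted, showing why the comparison is done modulo 2^24 rather than with ordinary integer ordering.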