In computer networking, data packets are frequently delivered to their destination out of order, i.e., in a different order from that in which they were sent. Out-of-order delivery is most commonly caused by packets following multiple different paths through a network with different transmission latencies.
Out-of-order delivery is a common phenomenon in Internet Protocol (IP) networks. In the well-known Transmission Control Protocol (TCP), TCP frames are divided into multiple segments, which are encapsulated in corresponding IP data packets. Each segment carries a sequence number in the TCP header, and the length of the data payload in the segment can be derived from the length field in the IP header of the packet. Thus, even when the IP packets carrying the segments of a TCP frame arrive at their destination out of order, the receiver is able to reorder the packets and write the payloads to its local memory in the proper sequence.
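The reordering described above can be illustrated with a minimal sketch. It assumes, for simplicity, that segments neither overlap nor repeat, that none are missing, and that sequence numbers do not wrap around. The function name `reassemble_tcp` and the tuple layout are hypothetical and are not taken from any TCP implementation.

```python
def reassemble_tcp(segments, initial_seq):
    """Write each payload at the offset implied by its sequence number.

    segments: iterable of (sequence_number, payload_bytes) in arrival order.
    initial_seq: sequence number of the first byte of the frame.
    Assumes no gaps, overlaps, duplicates, or sequence-number wraparound.
    """
    segments = list(segments)
    total_length = sum(len(payload) for _, payload in segments)
    buffer = bytearray(total_length)
    for seq, payload in segments:
        # The sequence number gives the byte offset within the frame;
        # the payload length is known from the packet itself.
        offset = seq - initial_seq
        buffer[offset:offset + len(payload)] = payload
    return bytes(buffer)


# Segments arrive out of order but land in the proper positions.
arrived = [(1006, b"world"), (1000, b"hello "), (1011, b"!")]
message = reassemble_tcp(arrived, initial_seq=1000)
```

Because each payload is placed by offset rather than by arrival order, no sorting step is needed in this sketch.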
To relieve the host processor of the burden of TCP processing in software, some network interface controllers (NICs) offer TCP offload capabilities. NICs of this sort are capable both of processing the TCP headers and of writing and reading data directly to and from the host memory. For example, U.S. Pat. No. 7,760,741 describes a network acceleration architecture for use with TCP. The architecture includes a hardware acceleration engine adapted for communication with and processing data from a consumer application, a software protocol processor adapted for carrying out TCP implementation, and an asynchronous dual-queue interface for exchanging information between the hardware acceleration engine and the software protocol processor. A virtually-contiguous reassembly buffer is used to handle out-of-order segments.
InfiniBand™ (IB) is a switched-fabric communications architecture that is widely used in high-performance computing. Computing devices (host processors and peripherals) connect to the IB fabric via a NIC that is referred to in IB parlance as a channel adapter. Host processors (or hosts) use a host channel adapter (HCA), while peripheral devices use a target channel adapter (TCA). The IB architecture defines both a layered hardware protocol (Physical, Link, Network, Transport Layers) and a software layer, which manages initialization and communication between devices.
Processes executing on nodes of an IB network communicate with one another using a queue-based model. Sending and receiving processes establish a queue pair (QP), which consists of a send queue (SQ) and a receive queue (RQ). Send and receive work requests (WR) by a process running on a host cause corresponding commands, known as work queue elements (WQEs), to be loaded into these queues for processing by the HCA. The WQE causes the HCA to execute a transaction, in which a message containing data is transmitted over the network. The message data may be spread over the payloads of multiple, successive packets. The transaction may comprise, for example, a remote direct memory access (RDMA) read or write transaction or a SEND transaction. (To receive a SEND message on a given QP, a receive WQE indicating the receive buffer address is posted to that QP.) Upon completion of a WQE, the HCA posts a completion queue element (CQE) to a completion queue, to be read by the initiating process as an indication that the WR has been fulfilled.
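The queue-based model above can be sketched as a toy simulation of a SEND transaction. This is only an illustration under stated assumptions: the classes `QueuePair` and `ToyHCA` and their methods are hypothetical names, they do not correspond to any real verbs API, and the "network" is a direct in-process copy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    """A QP: one send queue and one receive queue holding WQEs."""
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)


class ToyHCA:
    """Hypothetical stand-in for an HCA with a single completion queue."""

    def __init__(self):
        self.completion_queue = deque()

    def post_send(self, qp, wr_id, data):
        # A send work request becomes a WQE on the send queue.
        qp.send_queue.append((wr_id, data))

    def post_recv(self, qp, wr_id, buffer):
        # A receive WQE names the buffer that will hold an incoming SEND.
        qp.recv_queue.append((wr_id, buffer))

    def process_send(self, local_qp, remote_qp, remote_hca):
        # Execute the oldest send WQE against the oldest remote receive WQE.
        wr_id, data = local_qp.send_queue.popleft()
        recv_id, buffer = remote_qp.recv_queue.popleft()
        if len(data) > len(buffer):
            raise ValueError("receive buffer too small")
        buffer[:len(data)] = data
        # Both sides learn of completion through a CQE.
        remote_hca.completion_queue.append(("recv", recv_id, len(data)))
        self.completion_queue.append(("send", wr_id, len(data)))


sender, receiver = ToyHCA(), ToyHCA()
qp_a, qp_b = QueuePair(), QueuePair()
rx_buffer = bytearray(16)
receiver.post_recv(qp_b, wr_id=7, buffer=rx_buffer)
sender.post_send(qp_a, wr_id=1, data=b"ping")
sender.process_send(qp_a, qp_b, receiver)
```

The receive WQE is posted before the SEND is processed, mirroring the requirement that a receive buffer be available on the QP when a SEND message arrives.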
Each QP is treated by the IB transport layer as a unique transport service instance. The transport layer is responsible for in-order packet delivery, partitioning, channel multiplexing and transport services. The transport layer also handles transaction data segmentation when sending and reassembly when receiving. Based on the Maximum Transfer Unit (MTU) of the path, the transport layer divides the data into packets of the proper size. A receiver reassembles the packets based on the Base Transport Header (BTH), which contains the destination queue pair and packet sequence number (PSN). The receiving HCA acknowledges the packets, and the sending HCA receives these acknowledgements and updates the completion queue with the status of the operation.
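The segmentation and reassembly steps above can be sketched as follows. This is a simplified illustration, not an HCA implementation: the helper names `segment_message` and `reassemble_by_psn` are hypothetical, each packet is reduced to a (PSN, payload) pair, PSN wraparound is ignored, and a real HCA performs this work in hardware and may handle out-of-sequence packets differently.

```python
def segment_message(message, mtu, first_psn=0):
    """Split a message into (psn, payload) packets of at most mtu bytes."""
    return [
        (first_psn + index, message[offset:offset + mtu])
        for index, offset in enumerate(range(0, len(message), mtu))
    ]


def reassemble_by_psn(packets):
    """Rebuild the message by ordering payloads by packet sequence number."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(payload for _, payload in ordered)


packets = segment_message(b"abcdefghij", mtu=4, first_psn=100)
# Simulate a different arrival order at the receiver.
shuffled = [packets[2], packets[0], packets[1]]
restored = reassemble_by_psn(shuffled)
```

In this sketch the last packet carries a shorter payload whenever the message length is not a multiple of the MTU.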
InfiniBand specifies a number of different transport services, including Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), and Unreliable Datagram (UD). RC provides a reliable transfer of data between two entities, which supports RDMA operations and SEND operations, as well as atomic operations, with reliable channel semantics. As a connection-oriented transport, RC requires a dedicated QP for each pair of requester and responder processes. Recently developed alternatives to the original RC model include the Extended Reliable Connected (XRC) transport service, in which a single receive QP can be shared by multiple shared receive queues (SRQs) across one or more processes running on a given host; and reliable connections provided by the Dynamically-Connected (DC) transport service, as described, for example, in U.S. Pat. No. 8,213,315.