Switched-fabric communication architectures are widely used in high-performance computing. Examples of such architectures include InfiniBand™ and high-speed Ethernet™. Computing devices (host processors and peripherals) connect to the switched fabric via a network interface controller (NIC), which is referred to in InfiniBand (IB) parlance as a channel adapter. Host processors (or hosts) use a host channel adapter (HCA), while peripheral devices use a target channel adapter (TCA).
Client processes, such as software application processes, running on a host processor communicate with the transport layer of the fabric by manipulating a transport service instance, known as a “queue pair” (QP), which is made up of a send queue (SQ) and a receive queue (RQ). To send and receive messages over the network using an HCA, the client submits work requests (WRs), which cause work items, known as work queue elements (WQEs), to be placed in the appropriate work queues in the host memory for execution by the HCA. After it has finished servicing a WQE, the HCA typically writes a completion report, in the form of a completion queue element (CQE), to a completion queue in the host memory, to be read by the client process as an indication that the work request has been executed.
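By way of illustration only, the work-request flow described above can be sketched as a toy software model. The class and method names below (WQE, CQE, QueuePair, ToyHCA, post_send, service, poll_cq) are hypothetical and chosen for this sketch; they do not correspond to any actual adapter interface, and the model covers only the send queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    """Work queue element: one work item waiting in a work queue."""
    wr_id: int
    payload: bytes


@dataclass
class CQE:
    """Completion queue element: report that a WQE has been serviced."""
    wr_id: int
    status: str = "OK"


@dataclass
class QueuePair:
    """Queue pair: a send queue (SQ) and a receive queue (RQ)."""
    sq: deque = field(default_factory=deque)
    rq: deque = field(default_factory=deque)  # unused in this sketch


class ToyHCA:
    """Hypothetical stand-in for the adapter (an assumption, not a real API)."""

    def __init__(self):
        # Completion queue, modeled as residing in host memory.
        self.cq = deque()

    def post_send(self, qp, wr_id, payload):
        # The client's work request becomes a WQE in the send queue.
        qp.sq.append(WQE(wr_id, payload))

    def service(self, qp):
        # The adapter executes each queued WQE in order, then writes a CQE.
        while qp.sq:
            wqe = qp.sq.popleft()
            self.cq.append(CQE(wqe.wr_id))

    def poll_cq(self):
        # The client reads one completion report, or None if none is pending.
        return self.cq.popleft() if self.cq else None


hca = ToyHCA()
qp = QueuePair()
hca.post_send(qp, wr_id=1, payload=b"hello")
hca.post_send(qp, wr_id=2, payload=b"world")
hca.service(qp)
first = hca.poll_cq()
second = hca.poll_cq()
```

In this sketch the client learns that a work request has been executed only by reading the completion queue, which mirrors the role of the CQE in the description above.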
InfiniBand specifies a number of transport services, which support process-to-process communications between hosts over a network. In general, reliable IB transport services require a dedicated QP for each pair of requester and responder processes. In some cases, however, a single receive QP may be shared by multiple processes running on a given host. For example, the Extended Reliable Connected (XRC) transport service enables each process to maintain a single send QP for each host, rather than for each remote process, while a receive QP is established per remote send QP and can be shared among all the processes on the host.
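The effect on the number of send QPs can be illustrated with a back-of-the-envelope count. The sketch below assumes, purely for illustration, that every host runs the same number of processes and that each process communicates with every process on every other host; the function name and parameters are hypothetical.

```python
def send_qps_per_process(num_hosts, procs_per_host, per_host_send_qp):
    """Count the send QPs one process maintains under the stated assumptions.

    per_host_send_qp=False: one dedicated QP per remote process.
    per_host_send_qp=True:  one send QP per remote host (XRC-style).
    """
    remote_hosts = num_hosts - 1
    if per_host_send_qp:
        return remote_hosts
    return remote_hosts * procs_per_host


# Example figures (assumed): 4 hosts, 8 processes on each host.
dedicated = send_qps_per_process(4, 8, per_host_send_qp=False)  # 3 * 8 = 24
shared = send_qps_per_process(4, 8, per_host_send_qp=True)      # 3
```

Under these assumptions the per-process send QP count no longer grows with the number of processes per host, only with the number of hosts.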
Although the above terminology and some of the embodiments in the description that follows are based on features of the IB architecture and use vocabulary taken from IB specifications, similar mechanisms exist in networks and I/O devices that operate in accordance with other protocols, such as Ethernet, OmniPath, iWARP and Fibre Channel. The IB terminology and features are used herein by way of example, for the sake of convenience and clarity, and not by way of limitation.
In some communication networks, a network node processes data received over the network using a local co-processor, also referred to as an accelerator or peer device. Various methods for delivering data to the accelerator are known in the art. For example, PCT International Publication WO 2013/180691, whose disclosure is incorporated herein by reference, describes devices coupled via one or more interconnects. In one embodiment, a Network Interface Card (NIC), such as a Remote Direct Memory Access (RDMA) capable NIC, transfers data directly into or out of the memory of a peer device that is coupled to the NIC via one or more interconnects, bypassing a host computing and processing unit, a main system memory or both.
PCT International Publication WO 2013/136355, whose disclosure is incorporated herein by reference, describes a network node that performs parallel calculations on a multi-core GPU. The node comprises a host and a host memory on which a calculation application can be installed, a GPU with a GPU memory, a bus and a Network Interface Card (NIC). The NIC comprises means for receiving data from the GPU memory and metadata from the host over the bus, and for routing the data and metadata towards the network. The NIC further comprises means for receiving data from the network and for providing the data to the GPU memory over the bus. The NIC thus realizes a direct data path between the GPU memory and the network, without passing the data through the host memory.