Recent technologies allow networking protocols, such as Ethernet or InfiniBand, to be implemented between multiple hosts over a variety of physical interfaces which are not necessarily the protocol's native physical layer. For example, Ethernet networking within a blade or a rack enclosure may use PCIe or some vendor-proprietary physical interconnect as the physical layer (PHY) instead of using native 802.3 media access control (MAC) and PHY interconnects. The communication model in such systems is typically based on the well-known logical operations send( ), i.e. the logical operation of sending information (e.g. messages or data packets), and receive( ), i.e. the logical operation of receiving information (e.g. messages or data packets). These operations are used for sourcing and sinking the logical units of communication (messages or data packets) and are performed by the transmitting and receiving sides, respectively. The data flow begins with a send( ) operation in which the application posts a data buffer to the host kernel to be transmitted over the fabric. The buffer is passed through the transport and routing stacks of the kernel and will eventually reach the device driver of a network device (e.g. a network interface card (NIC) or host bus adapter (HBA)) that serves as an entry point into the fabric. At this point the actual packet data may span multiple physical buffers, for example because the transport and routing stacks may add their headers as part of the process of turning the buffer into a transmittable network packet. The device driver will provide the device with pointers to the locations of the different buffers that compose the network packet. The device, equipped with its own direct memory access (DMA) engine, will then read the different buffers from host memory, compose a network packet and send it across the network fabric. The packet will traverse the fabric and will eventually reach the device of the destination host.
The destination device will use its DMA engine to write the packet into a host buffer that was pre-allocated by the software stack by means of a receive( ) operation. When the DMA transfer is completed, the device driver will receive a corresponding indication and will forward the packet up the relevant networking stack.
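The transmit side of the data flow described above, in which the device driver hands the device a list of buffer locations and the device gathers them into one network packet, can be illustrated with a minimal sketch. The flat byte string standing in for host memory, the (offset, length) list format and the name gather_packet are assumptions made for illustration only and do not describe any particular device interface.

```python
from typing import List, Tuple

# Hypothetical model: host memory is one flat byte string, and the driver
# describes a packet as an ordered list of (offset, length) buffer entries.
def gather_packet(host_memory: bytes, sg_list: List[Tuple[int, int]]) -> bytes:
    """Read each listed buffer in order and concatenate them into one packet."""
    return b"".join(host_memory[off:off + length] for off, length in sg_list)

# The application payload sits at offset 0. The headers were added later by
# the routing and transport stacks, so they live in separate buffers.
host_memory = b"payload" + b"--" + b"L3hdr" + b"--" + b"L4hdr"

# The driver lists the buffers in wire order: headers first, then payload.
sg_list = [(9, 5), (16, 5), (0, 7)]
packet = gather_packet(host_memory, sg_list)
print(packet)
```

In this sketch the packet is only complete once every listed buffer has been read, which is the point at which a conventional device would begin transmission.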
While modern fabric interconnects attempt to support cut-through delivery of packets across the fabric, the support for real end-to-end cut-through is constrained by the host's behavior. The present disclosure allows the packet to be sent from the source buffers at the source host into the destination buffer at the destination host using a “pure” cut-through method that allows immediate injection of data into the fabric even if the data has arrived from the memory of the source host out-of-order or interleaved with data of other packets.
A data packet is to be sent from one or more host memory locations of a source host into a host memory location at a destination host. Each host has an attached “device” and all devices are connected to an interconnect fabric. Each device is responsible for reading a packet from host memory and sending it into the network fabric, and for receiving packets from the network fabric and placing them in the host's memory. The device is connected to the memory subsystem of its host through a PCIe interface.
The device at the source reads the packet by submitting PCIe read requests for the multiple buffers that make up the packet, building the packet as the payload is received in the corresponding read responses and sending the packet over the interconnect fabric. It is common practice for high-speed devices to have multiple outstanding read requests submitted on the PCIe fabric in order to allow for saturation of the downstream PCIe link. The PCIe specification allows the completion data of multiple outstanding read requests from the same source to arrive out-of-order with respect to the original read request submission order. Furthermore, the PCIe specification allows the completer of a PCIe read request to split the read response over multiple completion packets. These two relaxations create a completion stream that may be both out-of-order and interleaved, as can be seen from FIG. 5, which illustrates the case of two read requests “Read A” and “Read B” 501.
For example, if read requests A, B 501 are submitted one after the other on the PCIe bus, then the corresponding read completions 503 may arrive in the following order (starting from the left): B1, A1, A2, B2, B3, A3. A standard device interface would need to store and forward the completion data 503 before composing it into packets 505 and submitting the packets 505 into the fabric. This buffering would be needed by typical devices for two main reasons: a) if A, B are read requests for buffers that account for a single network packet, then data re-ordering is needed since the buffers should be composed into a packet in an order which preserves the original payload order of the packet; b) if A, B are read requests where each request represents a separate network packet, then re-ordering is needed since the different packets cannot be sent in an interleaved way over the fabric (note, however, that in this example the read responses for these two packets were interleaved by the host). In FIG. 5, reference sign 501 denotes a read request message, reference sign 503 denotes a read completion message and reference sign 505 denotes packets of a fabric message.
The store-and-forward buffering that was just described introduces a delay Δt that contributes to the end-to-end latency of a packet's route through the fabric.
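The store-and-forward behavior and the resulting delay can be made concrete with a minimal sketch that replays the arrival order B1, A1, A2, B2, B3, A3 for case b), where each read request represents a separate network packet. The function name store_and_forward, the tuple layout of a completion and the use of the arrival step as a stand-in for time are assumptions made for illustration; they are not taken from the PCIe specification or from any device implementation.

```python
from collections import defaultdict

def store_and_forward(completions, chunks_per_request):
    """Buffer interleaved completions and release a packet only when all
    chunks of its read request have arrived (hypothetical model)."""
    pending = defaultdict(dict)   # request id -> {chunk index: payload}
    released = []                 # (request id, ordered payload, arrival step)
    for step, (req, idx, payload) in enumerate(completions, start=1):
        pending[req][idx] = payload
        if len(pending[req]) == chunks_per_request[req]:
            chunks = pending.pop(req)
            ordered = [chunks[i] for i in sorted(chunks)]
            released.append((req, ordered, step))
    return released

# Arrival order of the read completions in the example: B1, A1, A2, B2, B3, A3.
arrival = [("B", 1, "b1"), ("A", 1, "a1"), ("A", 2, "a2"),
           ("B", 2, "b2"), ("B", 3, "b3"), ("A", 3, "a3")]
packets = store_and_forward(arrival, {"A": 3, "B": 3})
for req, payload, step in packets:
    print(req, payload, "released after completion", step)
```

In this model nothing enters the fabric until the fifth completion has arrived, although B1 was already available at the first; this wait is what the delay Δt represents.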