Software applications residing on separate computers or devices communicate with each other over networks. Traditional network protocols, such as Ethernet and Asynchronous Transfer Mode (ATM), are not reliable for application-to-application communication and provide only machine-to-machine datagram delivery service. Transport protocol software operating on host machines can provide more direct and reliable application-to-application communication.
Typically, protocol software for network communication is implemented as a combination of a kernel-mode driver and a user-mode library. All application communication passes through these components. As a result, application communication consumes a significant amount of the resources of its host processor and incurs unnecessary latency. Both of these effects degrade application communication performance. This degradation significantly limits the overall performance of communication-intensive applications, such as distributed databases.
Recently, a new class of connectivity called System Area Networks (SANs) has emerged to address the performance requirements of communication-intensive distributed applications. SANs provide very high bandwidth communication with relatively low latency. SANs differ from existing technologies, such as Gigabit Ethernet and ATM, because they implement reliable transport functionality directly in hardware. Each SAN network interface controller (NIC) exposes individual transport endpoint contexts and demultiplexes incoming packets accordingly. Each endpoint is usually represented by a set of memory-based queues and registers that are shared by the host processor and the NIC. Many SAN NICs permit these endpoint resources to be mapped directly into the address space of a user-mode process, which allows application processes to post messaging requests directly to the hardware. This design consumes very little of the resources of a host processor and adds little latency to the communication. As a result, SANs can deliver relatively fast communication performance to applications.
In general, SAN hardware does not perform any end-to-end flow control. Most distributed applications are designed to communicate using a specific transport protocol and a specific application programming interface (API). A large number of existing distributed applications are designed to use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite and some variant of the Berkeley Sockets API, such as Windows Sockets.
Some existing applications are designed to use a primary transport protocol and API, such as TCP/IP or a Sockets-based API. In order to enable data transfer between machines in a SAN without relying on an existing transport protocol such as TCP/IP on each machine, a new protocol must be implemented that controls the transfer of data from source memory buffers supplied by a first software application into destination memory buffers supplied by a second software application. This aspect of data transfer is known as flow control.
In SANs, the Sockets Direct Protocol (SDP) and the Windows Sockets Direct (WSD) protocol give network applications written using a sockets API a direct path to system hardware. SDP provides several data transfer mechanisms. Broadly, there are two ways to transfer data in a SAN: as small messages or via remote direct memory access (RDMA) transfers.
Small messages are transferred from a private and pre-registered set of buffers of a source or send application to a private and pre-registered set of buffers of a sink or receive application. This mechanism is referred to as a buffer copy or BCopy. Each application operating on peer computers selects its own size and number of buffers. The source application is responsible for ensuring that the message fits into the buffers of the receiving application.
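The BCopy mechanism can be sketched as follows. This is an illustrative model only, not an actual SDP implementation; the class and function names are hypothetical, and the pre-registered buffers are simulated as ordinary byte arrays.

```python
# Illustrative sketch of BCopy small-message transfer (hypothetical API).
# Each peer pre-registers a private pool of fixed-size receive buffers;
# the sender must ensure each message fits the receiver's buffer size.

class BCopyEndpoint:
    def __init__(self, buf_size, buf_count):
        # Pre-registered private receive buffers, chosen by this peer.
        self.buf_size = buf_size
        self.buffers = [bytearray(buf_size) for _ in range(buf_count)]
        self.posted = list(range(buf_count))  # indices of free posted buffers

    def receive(self, data):
        # Copy incoming bytes into the next posted buffer and return its index.
        idx = self.posted.pop(0)
        self.buffers[idx][:len(data)] = data
        return idx

def bcopy_send(sink, message):
    # The source is responsible for ensuring the message fits.
    if len(message) > sink.buf_size:
        raise ValueError("message exceeds receiver's buffer size")
    return sink.receive(message)

sink = BCopyEndpoint(buf_size=256, buf_count=4)
idx = bcopy_send(sink, b"hello")
```

Note that each peer selects its own buffer size and count independently; only the sender's fit check couples the two.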
For large data transfers or RDMA transfers, the (memory) buffers are dynamically registered prior to copying data. RDMA transfers are zero-copy transfers and bypass the kernel. Kernel bypass allows applications to issue commands to a NIC without having to execute a kernel call. RDMA requests are issued from local user space to the local NIC and over the network to the remote NIC without requiring any kernel involvement. This reduces the number of context switches between kernel space and user space while handling network traffic.
One type of RDMA transfer is a read zero-copy or Read ZCopy transfer. A Read ZCopy transfer is illustrated in FIG. 1. With reference to FIG. 1, a transfer application (not shown) operating on a data source device 102 sends a source available message 112 to a transfer application (not shown) operating on a data sink device 104. Next, the transfer application of the data sink device 104 performs a Read ZCopy transfer or RDMA read 114 by copying data directly from memory buffers of a user application operating on the data source device 102 to one or more memory buffers of another user application operating on the data sink device 104. Finally, the transfer application of the data sink device 104 sends an acknowledgement or RDMA Read Complete message 116 to the transfer application of the data source device 102.
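The Read ZCopy exchange of FIG. 1 can be modeled as a simple event trace. This is a sketch under stated assumptions: the `Device` class is hypothetical, the data movement is simulated with an in-process copy, and the message names follow the description above.

```python
# Minimal event-trace model of the Read ZCopy exchange of FIG. 1.
# Each device records the protocol messages it receives or performs.

class Device:
    def __init__(self, name, memory=b""):
        self.name = name
        self.memory = bytearray(memory)
        self.trace = []

def read_zcopy(source, sink):
    # 1. Source advertises its buffer with a source available message (112).
    sink.trace.append("SrcAvail")
    # 2. Sink pulls the data directly with an RDMA read (114); no kernel copy.
    sink.memory = bytearray(source.memory)
    sink.trace.append("RDMA Read")
    # 3. Sink acknowledges with an RDMA Read Complete message (116).
    source.trace.append("RdmaRdCompl")

src = Device("source", memory=b"payload")
snk = Device("sink")
read_zcopy(src, snk)
```

The key property of the flow is that the sink, not the source, drives the data movement once it learns the data is available.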
Another type of RDMA transfer is a write zero-copy or Write ZCopy transfer. A Write ZCopy transfer is illustrated in FIG. 2. With reference to FIG. 2, a transfer application (not shown) of a data source device 102 sends a source available message 112 to a transfer application (not shown) of a data sink device 104; this message is optional. The data sink application sends a sink available message 212 to the application of the data source device 102 indicating that one or more memory buffers are ready to receive data. The data source transfer application responds by performing a Write ZCopy transfer or RDMA write 214 directly from user buffers of the data source device 102 to user buffers of the data sink device 104. The data source transfer application then sends a write complete message 216 to the application of the data sink device 104 indicating that the transfer is complete.
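The Write ZCopy exchange of FIG. 2 can be sketched the same way. Again, this is an illustrative model with hypothetical names, and the optional source available message is controlled by a flag.

```python
# Event-trace model of the Write ZCopy exchange of FIG. 2 (hypothetical API).

class Peer:
    def __init__(self, name, memory=b""):
        self.name = name
        self.memory = bytearray(memory)
        self.trace = []

def write_zcopy(source, sink, announce=False):
    if announce:
        sink.trace.append("SrcAvail")       # optional source available (112)
    source.trace.append("SinkAvail")        # sink advertises ready buffers (212)
    sink.memory = bytearray(source.memory)  # RDMA write (214): source pushes data
    source.trace.append("RDMA Write")
    sink.trace.append("RdmaWrCompl")        # write complete message (216)

src = Peer("source", memory=b"bulk data")
snk = Peer("sink")
write_zcopy(src, snk)
```

In contrast to Read ZCopy, here the source drives the data movement, but only after the sink has advertised ready buffers.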
A third type of RDMA transfer is called a transaction mechanism and is similar to a Write ZCopy transfer; this mechanism is illustrated in FIG. 3. This type of transfer is optimal for transaction-oriented data traffic in which one transfer application sends relatively small commands and expects medium to large responses. With reference to FIG. 3, a sink available message 312 is sent from a transfer application on the data sink device 104 to a transfer application on the data source device 102. However, in this transfer, extra information or “data” is appended to the message 312 indicating where to transfer data, via Write ZCopy, from the data source device 102 to the data sink device 104. Once the message is received, the transfer application or program operating on the data source device 102 performs an RDMA Write or Write ZCopy transfer 214 and sends a write complete message 216 to the transfer application of the data sink device 104 indicating that the transfer is complete.
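The distinguishing feature of the transaction mechanism is that the sink available message carries the command itself along with the buffer advertisement, so one round trip covers both the request and the response. A minimal sketch, with a hypothetical `transaction` helper and command handler:

```python
# Sketch of the transaction mechanism of FIG. 3 (hypothetical API): the sink
# available message (312) carries a small command plus the advertisement of
# the reply buffer, so the source can answer with a single RDMA write (214).

def transaction(handle_command, command, reply_buffer):
    # Sink sends "sink available" with the command data appended; the source
    # executes the command and writes the response directly into the sink's
    # advertised buffer via Write ZCopy, then signals write complete (216).
    response = handle_command(command)
    reply_buffer[:len(response)] = response  # simulated RDMA write
    return "RdmaWrCompl"

buf = bytearray(64)
status = transaction(lambda cmd: cmd.upper(), b"get x", buf)
```

Compared with the plain Write ZCopy flow, no separate request message is needed before the sink available message, which saves a message exchange for small-command, large-response traffic.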
Existing transfer applications using SDP and WSD manage both small and large data transfers through flow control modes. For example, SDP provides at least three modes: Pipelined Mode, Combined Mode, and Buffered Mode. Each transfer application is ordinarily in a single mode at any given time; the mode typically refers to the transfer application that is receiving data. Mode change messages may cause the receiving application to switch to a different mode.
Buffered Mode corresponds to always transferring data in small messages through BCopy.
Combined Mode corresponds to the receiving application waiting to receive an indication that data is available before it posts large receive buffers for RDMA transfers. Transfers in Combined Mode occur through BCopy, if the data size is smaller than an RDMA threshold, or through Read ZCopy. Since the sink user application expects a source available message before posting large RDMA buffers, the message typically contains a beginning portion of the send data.
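The Combined Mode send decision can be sketched as a simple threshold check. The threshold value and the size of the data portion carried in the source available message are assumptions for illustration; real values are implementation-specific.

```python
# Hypothetical sketch of the Combined Mode send decision: data below the
# RDMA threshold travels as a BCopy message; larger transfers advertise the
# data with a source available message that carries the beginning of the
# send data, then complete via Read ZCopy.

RDMA_THRESHOLD = 8 * 1024     # assumed threshold, not a value from SDP
SRCAVAIL_PAYLOAD = 1024       # assumed size of data carried in SrcAvail

def combined_mode_send(data):
    if len(data) < RDMA_THRESHOLD:
        return ("BCopy", data)
    # The sink waits for a source available message before posting large
    # RDMA buffers, so the message carries an initial portion of the data.
    head, tail = data[:SRCAVAIL_PAYLOAD], data[SRCAVAIL_PAYLOAD:]
    return ("SrcAvail+ReadZCopy", head, tail)

small = combined_mode_send(b"x" * 100)
large = combined_mode_send(b"y" * 10000)
```

Carrying a leading data portion in the source available message avoids wasting the message round trip while the sink decides whether to post RDMA buffers.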
Pipelined Mode corresponds to an application which always posts large receive buffers. In this mode, all types of transfers (e.g. BCopy, Read ZCopy, Write ZCopy) may be made. Since the application in this mode always pre-posts receive buffers, and is not waiting for any data receive information, the source available message does not carry data.
FIG. 4 shows the transitions between the various SDP flow control modes. Each mode has a master transfer application and a slave transfer application; the master initiates a mode change by sending a mode change message and then immediately changes to the new mode. The master and slave applications must be careful not to send messages that are not allowed in a particular mode, which implies that a master application must finish sending any data that would be disallowed in the new mode before sending a mode change message.
With reference to FIG. 4, once a connection is initialized between transfer applications operating on separate computers or devices, each transfer application is set to the Combined Mode 404. In this mode, if a transfer application (having source data) decides to change to either a Pipelined Mode 402 or a Buffered Mode 406, the source application initiates the change since it is the master application. The same is true if the source transfer application is in the Pipelined Mode 402. However, the sink transfer application is the master when the transfer applications are in the Buffered Mode 406.
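The mode transitions of FIG. 4 can be modeled as a small state machine. The class and mapping below are hypothetical; they encode only what the description states: connections start in Combined Mode, the source is the master in Combined and Pipelined Modes, and the sink is the master in Buffered Mode.

```python
# Sketch of the FIG. 4 flow control mode state machine (hypothetical names).
# Which side is master depends on the current mode, per the description above.
MASTER = {"Combined": "source", "Pipelined": "source", "Buffered": "sink"}

class FlowControlState:
    def __init__(self):
        self.mode = "Combined"  # initial mode once the connection is set up

    def change_mode(self, requester, new_mode):
        # Only the master for the current mode may initiate a change; the
        # master switches immediately after sending the mode change message.
        if requester != MASTER[self.mode]:
            raise PermissionError(
                f"{requester} is not the master in {self.mode} Mode")
        self.mode = new_mode
```

This captures why the protocol is stateful and asymmetric: which peer may send a mode change message depends on the current mode, and both peers must track it consistently.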
Switching between modes and managing mode changing messages makes SDP and WSD excessively complex, especially since these protocols are designed for low-latency, high-throughput environments.