Despite recognized performance inefficiencies, Ethernet currently accounts for more than half of the interconnection networks in the top five hundred supercomputers, owing to its easy deployment and low cost of ownership. Ethernet is ubiquitously used in commercial and research clusters serving high performance computing (HPC) and datacenter systems. Unfortunately, the large overhead that Gigabit and 10-Gigabit Ethernet network protocol processing places on central processing unit (CPU) cores has led to critical CPU availability and performance issues. Consequently, a wide range of efforts have been made to boost Ethernet efficiency, especially targeting its latency for HPC. The first major attempt was offloading transmission control protocol/internet protocol (TCP/IP) processing, using both stateless offload (e.g. offloading checksum computation, segmentation and reassembly) and stateful TCP Offload Engines (TOE).
Another major approach, built on top of TOE, has been equipping Ethernet with techniques such as Remote Direct Memory Access (RDMA) and zero-copy communication that have traditionally been associated with other high performance interconnects such as InfiniBand. iWARP (Internet Wide Area RDMA Protocol) was the first standardized protocol to integrate such features into Ethernet, effectively reducing Ethernet latency and increasing host CPU availability by taking advantage of RDMA, kernel-bypass capabilities, zero copy and non-interrupt-based asynchronous communication. Rather than the traditional kernel-level socket application program interface (API), iWARP provides a user-level interface that can be used in both local area network (LAN) and wide area network (WAN) environments, thereby efficiently avoiding kernel overheads such as data copies, synchronization and context switching.
Despite its contribution to Ethernet efficiency, the current specification of iWARP lacks the functionality to support the whole spectrum of Ethernet-based applications. The current iWARP standard is defined only on reliable, connection-oriented transports. Such a protocol suffers from scalability issues in large-scale applications due to the memory requirements associated with maintaining multiple inter-process connections. In addition, some applications and data services do not require the reliability overhead, implementation complexity and cost associated with connection-oriented transports such as TCP.
It would therefore be desirable to have a method and system for implementing unreliable RDMA over datagrams.
iWARP Standard
Proposed by the RDMA Consortium to the Internet Engineering Task Force (IETF) in 2002, the iWARP specification defines a multi-level processing stack on top of standard TCP/IP over Ethernet. The stack is designed to decouple the processing of Upper Layer Protocol (ULP) data from the operating system (OS) and to reduce host CPU utilization by avoiding intermediate copies during data transfer (zero copy). To achieve these goals, iWARP needs to be fully offloaded, for example on top of stateless offload and a stateful TOE.
As illustrated in FIG. 1, at the top layer, iWARP provides a set of descriptive user-level interfaces called iWARP verbs. The verbs interface bypasses the OS kernel and is defined on top of an RDMA-enabled stack. A network interface card (NIC) that supports the RDMA stack as described in the iWARP standard is called an RDMA-enabled NIC, or RNIC. An RNIC implements both the iWARP stack and TOE functionality in hardware.
The RDMA protocol (RDMAP) layer supplies communication primitives for the verbs layer. Examples of data transfer primitives include Send, Receive, RDMA Write and RDMA Read that are passed as work requests (WR) to a Queue Pair (QP) data structure. The WRs are processed asynchronously by the RNIC, and completion is notified either by polled Completion Queue (CQ) entries or by event notification.
Verbs layer WRs are delivered in order from RDMAP to the lower layers. The Send and RDMA Write operations require a single message for data transfer, while the RDMA Read needs a request by the consumer (data sink), followed by a response from the supplier (data source). RDMAP is designed as a stream-based layer. Operations in the same stream are processed in the order of their submission.
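The verbs model described above can be illustrated with a small toy sketch. The class and field names below are hypothetical stand-ins (a real RNIC processes work requests in hardware, behind a C verbs API); the sketch only shows the asynchronous flow: work requests are posted to a Queue Pair, processed in submission order, and their completions are discovered by polling a Completion Queue.

```python
from collections import deque

class QueuePair:
    """Toy model of a Queue Pair (QP) and its Completion Queue (CQ).

    Hypothetical sketch: in real iWARP the RNIC drains the send queue
    asynchronously in hardware; here process() stands in for the RNIC.
    """
    def __init__(self):
        self.send_queue = deque()        # posted work requests (WRs)
        self.completion_queue = deque()  # completion entries

    def post_send(self, opcode, payload):
        # The consumer posts a WR and returns immediately; processing
        # is asynchronous with respect to the application.
        self.send_queue.append({"opcode": opcode, "payload": payload})

    def process(self):
        # Stand-in for the RNIC: WRs on the same stream are processed
        # in the order of their submission, each yielding a completion.
        while self.send_queue:
            wr = self.send_queue.popleft()
            self.completion_queue.append({"opcode": wr["opcode"],
                                          "status": "success"})

    def poll_cq(self):
        # Completion is discovered by polling the CQ (alternatively, a
        # real RNIC can raise an event notification).
        return self.completion_queue.popleft() if self.completion_queue else None

qp = QueuePair()
qp.post_send("SEND", b"hello")
qp.post_send("RDMA_WRITE", b"payload")
qp.process()
first = qp.poll_cq()
print(first["opcode"])  # completions appear in submission order: SEND first
```

Note how the one-sided RDMA Write is posted exactly like a Send; the difference in a real stack lies in the placement model at the DDP layer, not in the posting/polling flow shown here.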
The Direct Data Placement (DDP) layer is designed to directly transfer data from the user buffer to the network interface card (NIC) without intermediate buffering. The packet-based DDP layer matches the data sink at the RDMAP layer with the incoming data segments based on two types of data placement models: tagged and untagged. The tagged model, used for one-sided RDMA Write and Read operations, has a sender-based buffer management in which the initiator provides a pre-advertised reference to the data buffer address at the remote side. The untagged model uses a two-sided Send/Receive semantic, where the receiver both handles buffer management and specifies the buffer address.
Because DDP is a message-based protocol, out-of-order placement of message segments is possible; DDP therefore assures delivery of a complete message only upon the arrival of all of its segments. In the current iWARP specification, DDP assumes that the lower layer provides in-order and correct delivery of messages.
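The distinction between out-of-order placement and in-order delivery can be sketched as follows. This is a hypothetical toy model of tagged placement: each segment carries an offset into a pre-advertised buffer, so segments can be written directly to their final location in any order, while the complete message is signaled only once every segment has arrived.

```python
class TaggedBuffer:
    """Toy model of DDP tagged placement (hypothetical sketch).

    Segments are placed directly at their target offset in the user
    buffer (zero copy, out-of-order placement allowed); the message is
    delivered only when all segments have arrived.
    """
    def __init__(self, length, total_segments):
        self.data = bytearray(length)   # pre-advertised user buffer
        self.remaining = total_segments

    def place(self, offset, segment):
        # Direct placement: no intermediate buffering or reassembly copy.
        self.data[offset:offset + len(segment)] = segment
        self.remaining -= 1
        return self.remaining == 0      # True => complete message delivered

buf = TaggedBuffer(10, total_segments=2)
done = buf.place(5, b"world")           # second half arrives first
assert not done                         # placed, but not yet delivered
done = buf.place(0, b"hello")
assert done                             # all segments in: message delivered
print(bytes(buf.data))                  # b'helloworld'
```

The untagged (Send/Receive) model differs only in who names the buffer: the receiver would pre-post `TaggedBuffer`-like receive buffers and match incoming Sends against them, rather than the sender supplying the offset.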
The lower layer protocol (LLP) on which the iWARP stack runs can be either TCP or the stream control transmission protocol (SCTP). Due to the message-oriented nature of DDP, the iWARP protocol requires an adaptation layer to place boundaries on DDP messages transferred over the stream-oriented TCP protocol. The Marker PDU Alignment (MPA) protocol inserts markers into DDP data units prior to passing them to the TCP layer. It also re-assembles marked data units from the TCP stream and removes the markers before passing them up to the DDP layer. The MPA layer is not needed on top of message-oriented transports such as SCTP or UDP, because intermediate devices do not fragment message-based packets as they do stream-based ones, removing the middle-box fragmentation issue that the MPA layer solves.
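A much-simplified sketch of the marker mechanism follows. Real MPA places 4-byte markers at fixed 512-byte intervals of the TCP sequence space (each marker pointing back to the start of the enclosing framed PDU, which also carries a length field and a CRC); the toy version below abstracts all of that away and only shows the core idea that fixed-interval markers let a receiver recover record boundaries from an otherwise unstructured byte stream.

```python
MARKER_INTERVAL = 512          # real MPA: one marker per 512 bytes of stream
MARKER = b"\x00\x00\x00\x00"   # simplified 4-byte placeholder marker

def insert_markers(payload):
    """Sender side (simplified): prefix each MARKER_INTERVAL-sized chunk
    of the DDP data unit with a marker before handing it to TCP."""
    out = bytearray()
    for i in range(0, len(payload), MARKER_INTERVAL):
        out += MARKER                           # boundary hint for receiver
        out += payload[i:i + MARKER_INTERVAL]
    return bytes(out)

def strip_markers(stream):
    """Receiver side (simplified): skip the markers at their known
    positions and re-assemble the original DDP data unit."""
    out = bytearray()
    i = 0
    while i < len(stream):
        i += len(MARKER)                        # discard the marker
        out += stream[i:i + MARKER_INTERVAL]
        i += MARKER_INTERVAL
    return bytes(out)

msg = b"x" * 1000
assert strip_markers(insert_markers(msg)) == msg   # round-trip preserved
```

Even this toy version makes the cost visible: every message pays per-interval processing and extra bytes on the wire, overhead that disappears entirely on a message-oriented transport, which is why no MPA-equivalent is needed over SCTP or UDP.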
Shortcomings of the Current iWARP Standard
The current iWARP standard offers a range of capabilities that increase the efficiency of Ethernet in modern HPC and datacenter clusters. Taking advantage of the well-known reliable transports in the TCP/IP protocol suite is one of its key advantages. Reliability has in fact been a major motivation for designing the current iWARP standard on top of connection-oriented transports. The LLP in iWARP is assumed to be a point-to-point reliable stream, established prior to iWARP communication. This requirement makes it easy for the upper layer protocol (ULP) to assume reliable communication of user data. In addition, the independence of individual streams enables iWARP to enforce error management on a per-stream basis.
Such a standard is a good fit for applications that require strict reliability at the lower layer, including data validation, flow control and in-order delivery. Examples of such applications are reliable datacenter services such as database services, file servers, financial applications and policy enforcement systems (e.g. security applications).
On the other hand, there is a growing demand for applications that find the strict connection-based semantics of iWARP unnecessary. For such cases, the current iWARP standard imposes barriers to application scalability in large systems. The following subsections point to the shortcomings of the current standard and their relevant implications. As such, there is a strong need for RDMA with datagram transport.
Memory Usage
The pervasiveness of Ethernet in modern clusters places a huge demand on the scalability of iWARP. The scale of high performance clusters is increasing rapidly and may soon reach a million cores. A similar trend can be observed for datacenters. An obvious drawback of connection-oriented iWARP is its connection memory usage, which can grow quadratically with the number of processes in fully connected jobs. This dramatically increases the application's memory footprint, revealing serious scalability issues for large-scale applications.
As the number of required connections increases, memory usage grows proportionately at different layers of the network stack. In a software implementation of iWARP at the TCP/IP layer, each connection requires a set of allocated socket buffers, in addition to the data structures required to maintain connection state information. Although socket buffers are not required in a hardware implementation of iWARP, due to zero-copy on-the-fly processing of data, establishing a large number of connections has other adverse effects. Because the RNIC cache for connection state information is limited, maintaining out-of-cache connections requires extra memory requests by the RNIC. The other major source of memory usage is the application layer. Specifically, communication libraries such as the message passing interface (MPI) pre-allocate memory buffers per connection to be used for fast buffering and communication management.
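The scaling argument above can be made concrete with back-of-the-envelope arithmetic. The 32 KiB per-connection figure below is purely a hypothetical placeholder for the combined connection state and pre-allocated library buffers; the point is the quadratic shape of the curve, not the constant.

```python
def connection_memory_bytes(num_processes, per_connection_state=32 * 1024):
    """Rough estimate of aggregate connection memory in a fully
    connected job. per_connection_state (32 KiB here) is a hypothetical
    stand-in for RNIC state plus per-connection library buffers."""
    # Each of N processes holds a connection to the other N-1 processes,
    # so the total number of connection endpoints is N * (N - 1).
    connections = num_processes * (num_processes - 1)
    return connections * per_connection_state

small = connection_memory_bytes(1_000)    # ~33 GB aggregate at 1,000 processes
large = connection_memory_bytes(10_000)
print(large / small)                      # 10x the processes -> ~100x the memory
```

Under these (illustrative) assumptions, growing a job from one thousand to ten thousand processes multiplies aggregate connection memory by roughly one hundred, which is exactly the footprint explosion a connectionless, datagram-based transport avoids.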
Performance
In addition to memory usage problems, connection-oriented protocols such as TCP, with their inherent flow control and congestion management, limit performance. HPC applications running on a local cluster do not require the complexities of TCP flow and congestion management. The user datagram protocol (UDP) offers a much lighter-weight alternative that can significantly reduce the latency of individual messages, closing the latency gap between iWARP and other high-speed interconnects. In addition, many datacenter applications, such as those using media streaming protocols over the WAN, currently run on top of unreliable datagram transports such as UDP. Due to these semantic discrepancies, the current connection-oriented specification of iWARP makes it impossible for such applications to take advantage of iWARP's major benefits, such as zero copy and kernel bypass.
Fabrication Cost
The complexities associated with stream-based lower layer protocols (LLPs) such as TCP and SCTP translate into expensive and non-scalable hardware implementations. This becomes especially important in modern multi-core systems, where multiple processes may utilize the offloaded stack concurrently. Due to the implementation costs associated with hardware-level parallelism, a heavyweight protocol such as SCTP, or even TCP, can only partially support multiple parallel on-node requests.
Hardware Level Operations
iWARP lacks useful operations such as hardware-level multicast and broadcast. These operations, if supported, could be utilized by applications using MPI collectives as well as by media streaming services. iWARP does not support such operations primarily because the underlying TCP protocol is not able to handle multicast and broadcast operations.
Datagram-Based Applications
Despite the benefits offered by iWARP, many datacenter and web-based applications, such as stock-market trading and media-streaming applications, that rely on datagram-based semantics (mostly through UDP/IP) cannot take advantage of it, because the iWARP standard is defined only over reliable, connection-oriented transports. Moreover, one-sided RDMA operations (such as RDMA Write) are currently defined only on reliable, connected transports. This effectively limits the number of applications that could utilize iWARP and its RDMA capabilities by excluding the User Datagram Protocol (UDP), which, according to Cisco Systems, could comprise more than 90% of all Internet consumer traffic by 2014.
Connection-based iWARP has a number of limitations that make it inappropriate for large-scale systems. First, the scalability of the current iWARP is limited because the hardware needs to keep state for each and every connection in hardware or host memory. This limits its effectiveness for applications that are required to service a very large number of clients at any one time. In addition, the current iWARP standard suffers from numerous overheads associated with connection-based transports such as the Transmission Control Protocol (TCP) and the Stream Control Transmission Protocol (SCTP) [21]. The high-overhead reliability and flow-control measures in the TCP and SCTP protocols impose the burden of unnecessary communication processing on applications running on low-error-rate networks (such as High Performance Computing (HPC) and data center clusters), as well as on applications that do not require reliability.
Moreover, the complexities and overhead associated with packet marking, which is required to adapt the message-oriented iWARP stack over the stream-oriented TCP protocol, further reduce the overall message rate that can be achieved with the current TCP-based iWARP standard.
Clearly, there is a need for a method and system for implementing unreliable RDMA over datagrams.