1. Field of the Invention
This invention relates in general to the field of computer communications and more specifically to an apparatus and method for effectively and efficiently inserting MPA markers into, and removing MPA markers from, a TCP byte stream for communicating via an RDMA-over-Ethernet fabric.
2. Description of the Related Art
The first computers were stand-alone machines, that is, they loaded and executed application programs one-at-a-time in an order typically prescribed through a sequence of instructions provided by keypunched batch cards or magnetic tape. All of the data required to execute a loaded application program was provided by the application program as input data and execution results were typically output to a line printer. Even though the interface to early computers was cumbersome at best, the sheer power to rapidly perform computations made these devices very attractive to those in the scientific and engineering fields.
The development of remote terminal capabilities allowed computer technologies to be more widely distributed. Access to computational equipment in real time fostered the introduction of computers into the business world. Businesses that processed large amounts of data, such as the insurance industry and government agencies, began to store, retrieve, and process their data on computers. Special applications were developed to perform operations on shared data within a single computer system.
During the mid 1970's, a number of successful attempts were made to interconnect computers for purposes of sharing data and/or processing capabilities. These interconnection attempts, however, employed special purpose protocols that were intimately tied to the architecture of these computers. As such, the computers were expensive to procure and maintain and their applications were limited to those areas of the industry that heavily relied upon shared data processing capabilities.
The U.S. government, however, realized the power that could be harnessed by allowing computers to interconnect, and thus funded research that resulted in what we now know as the Internet. More specifically, this research resulted in a series of standards that specify the details of how interconnected computers are to communicate, how to interconnect networks of computers, and how to route traffic over these interconnected networks. This set of standards is known as the TCP/IP Internet Protocol Suite, named after its two predominant protocol standards, Transmission Control Protocol (TCP) and Internet Protocol (IP). TCP is a protocol that allows for a reliable byte stream connection between two computers. IP is a protocol that provides an addressing and routing mechanism for unreliable transmission of datagrams across a network of computers. The use of TCP/IP allows a computer to communicate across any set of interconnected networks, regardless of the underlying native network protocols that are employed by these networks. Once the interconnection problem was solved by TCP/IP, networks of interconnected computers began to crop up in all areas of business.
The ability to easily interconnect computer networks for communication purposes provided the motivation for the development of distributed application programs, that is, application programs that perform certain tasks on one computer connected to a network and certain other tasks on another computer connected to the network. The sophistication of distributed application programs has steadily evolved over more recent years into what we today call the client-server model. According to the model, “client” applications on a network make requests for service to “server” applications on the network. The “server” applications perform the service and return the results of the service to the “client” over the network. In an exact sense, a client and a server may reside on the same computer, but the more common employment of the model finds clients executing on smaller, less powerful, less costly computers connected to a network and servers executing on more powerful, more expensive computers. In fact, the proliferation of client-server applications has resulted in a class of high-end computers being known as “servers” because they are primarily used to execute server applications. Similarly, the term “client machine” is often used to describe a single-user desktop system that executes client applications.
Client-server application technology has enabled computer usage to be phased into the business mainstream. Companies began employing interconnected client-server networks to centralize the storage of files, company data, manufacturing data, etc., on servers and allowed employees to access this data via clients. Servers today are sometimes known by the type of services that they perform. For example, a file server provides client access to centralized files, a mail server provides access to a company's electronic mail, a database server provides client access to a central database, and so on.
The development of other technologies such as hypertext markup language (HTML) and extensible markup language (XML) now allows user-friendly representations of data to be transmitted between computers. The advent of HTML/XML-based developments has resulted in an exponential increase in the number of computers that are interconnected because, now, even home-based businesses can develop server applications that provide services accessible over the Internet from any computer equipped with a web browser application (i.e., a web “client”). Furthermore, virtually every computer produced today is sold with web client software. In 1988, only 5,000 computers were interconnected via the Internet. In 1995, under five million computers were interconnected via the Internet. But with the maturation of client-server and HTML technologies, presently, over 50 million computers access the Internet. And the growth continues.
The number of servers in a present day data center may range from a single server to hundreds of interconnected servers. And the interconnection schemes chosen for those applications that consist of more than one server depend upon the type of services that interconnection of the servers enables. Today, there are three distinct interconnection fabrics that characterize a multi-server configuration. Virtually all multi-server configurations have a local area network (LAN) fabric that is used to interconnect any number of client machines to the servers within the data center. The LAN fabric interconnects the client machines and allows the client machines access to the servers and perhaps also allows client and server access to network attached storage (NAS), if provided. One skilled in the art will appreciate that TCP/IP over Ethernet is the most commonly employed protocol in use today for a LAN fabric, with 100 Megabit (Mb) Ethernet being the most common transmission speed and 1 Gigabit (Gb) Ethernet gaining prevalence in use. In addition, 10 Gb Ethernet links and associated equipment are currently being fielded.
The second type of interconnection fabric, if required within a data center, is a storage area network (SAN) fabric. The SAN fabric provides for high speed access of block storage devices by the servers. Again, one skilled in the art will appreciate that Fibre Channel is the most commonly employed protocol for use today for a SAN fabric, transmitting data at speeds up to 2 Gb per second, with 4 Gb per second components that are now in the early stages of adoption.
The third type of interconnection fabric, if required within a data center, is a clustering network fabric. The clustering network fabric is provided to interconnect multiple servers to support such applications as high-performance computing, distributed databases, distributed data store, grid computing, and server redundancy. A clustering network fabric is characterized by very high transmission speed and low latency. There is no prevalent clustering protocol in use today, so a typical clustering network will employ networking devices developed by a given manufacturer. Thus, the networking devices (i.e., the clustering network fabric) operate according to a networking protocol that is proprietary to the given manufacturer. Clustering network devices are available from manufacturers such as Quadrics Inc. and Myricom. These network devices transmit data at speeds greater than 1 Gb per second (Gb/sec) with latencies on the order of microseconds. It is interesting, however, that although low latency has been noted as a desirable attribute for a clustering network, more than 50 percent of the clusters in the top 500 fastest computers today use TCP/IP over Ethernet as their interconnection fabric.
It has been observed by many in the art that a significant performance bottleneck associated with networking in the near term will not be the network fabric itself, as has been the case in more recent years. Rather, the bottleneck is now shifting to the processor. More specifically, network transmissions will be limited by the amount of processing required of a central processing unit (CPU) to accomplish TCP/IP operations at 1 Gb/sec (and greater) speeds. In fact, the present inventors have noted that approximately 40 percent of the CPU overhead associated with TCP/IP operations is due to transport processing, that is, the processing operations that are required to allocate buffers to applications, to manage TCP/IP linked lists, etc. Another 20 percent of the CPU overhead associated with TCP/IP operations is due to the processing operations which are required to make intermediate buffer copies, that is, moving data from a network adapter buffer, then to a device driver buffer, then to an operating system buffer, and finally to an application buffer. And the final 40 percent of the CPU overhead associated with TCP/IP operations is the processing required to perform context switches between an application and its underlying operating system which provides the TCP/IP services. Presently, it is estimated that it takes roughly 1 GHz of processor bandwidth to provide for a typical 1 Gb/second TCP/IP network. Extrapolating this estimate up to that required to support a 10 Gb/second TCP/IP network provides a sufficient basis for the consideration of alternative configurations beyond the TCP/IP stack architecture of today, most of the operations of which are provided by an underlying operating system.
As alluded to above, it is readily apparent that TCP/IP processing overhead requirements must be offloaded from the processors and operating systems within a server configuration in order to alleviate the performance bottleneck associated with current and future networking fabrics. This can be accomplished in principle by 1) moving the transport processing requirements from the CPU down to a network adapter; 2) providing a mechanism for remote direct memory access (RDMA) operations, thus giving the network adapter the ability to transfer data directly to/from application memory; and 3) providing a user-level direct access technique that allows an application to directly command the network adapter to send/receive data, thereby bypassing the underlying operating system.
The INFINIBAND™ protocol was an ill-fated attempt to accomplish these three “offload” objectives, while at the same time attempting to increase data transfer speeds within a data center. In addition, INFINIBAND attempted to merge the three disparate fabrics (i.e. LAN, SAN, and cluster) by providing a unified point-to-point fabric that, among other things, completely replaces Ethernet, Fibre Channel, and vendor-specific clustering networks. On paper and in simulation, the INFINIBAND protocol was extremely attractive from a performance perspective because it enabled all three of the above objectives and increased networking throughput overall. Unfortunately, the architects of INFINIBAND overestimated the community's willingness to abandon their tremendous investment in existing networking infrastructure, particularly that associated with Ethernet fabrics. And as a result, INFINIBAND has not become a viable option for the marketplace.
INFINIBAND did, however, provide a very attractive mechanism for offloading reliable connection network transport processing from a CPU and corresponding operating system. One aspect of this mechanism is the use of “verbs”. Verbs is an abstract architected programming interface between a network input/output (I/O) adapter and a host operating system (OS) or application software, which 1) enables moving reliable connection transport processing from a host CPU to the I/O adapter; 2) provides for the I/O adapter to perform direct data placement (DDP) through the use of RDMA read messages and RDMA write messages, as will be described in greater detail below; and 3) enables bypass of the OS. INFINIBAND defined a new type of reliable connection transport for use with verbs, but as one skilled in the art will appreciate, a verbs interface mechanism will work equally well with the TCP reliable connection transport. At a very high level, this mechanism consists of providing a set of commands (“verbs”) which can be executed by an application program, without operating system intervention, that direct an appropriately configured network adapter (not part of the CPU) to directly transfer data to/from server (or “host”) memory, across a network fabric, where commensurate direct data transfer operations are performed in host memory of a counterpart server. This type of operation, as noted above, is referred to as RDMA, and a network adapter that is configured to perform such operations is referred to as an RDMA-enabled network adapter. In essence, an application executes a verb to transfer data and the RDMA-enabled network adapter moves the data over the network fabric to/from host memory.
Many in the art have attempted to preserve the attractive attributes of INFINIBAND (e.g., reliable connection network transport offload, verbs, RDMA) as part of a networking protocol that utilizes Ethernet as an underlying network fabric. In fact, over 50 member companies are now part of what is known as the RDMA Consortium (www.rdmaconsortium.org), an organization founded to foster industry standards and specifications that support RDMA over TCP. RDMA over TCP/IP defines the interoperable protocols to support RDMA operations over standard TCP/IP networks. To date, the RDMA Consortium has released four specifications that provide for RDMA over TCP, as follows, each of which is incorporated by reference in its entirety for all intents and purposes:
    Hilland et al. “RDMA Protocol Verbs Specification (Version 1.0).” April 2003. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-rdmac.pdf).
    Recio et al. “An RDMA Protocol Specification (Version 1.0).” October 2002. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-recio-iwarp-rdmap-v1.0.pdf).
    Shah et al. “Direct Data Placement Over Reliable Transports (Version 1.0).” October 2002. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-shah-iwarp-ddp-v1.0.pdf).
    Culley et al. “Marker PDU Aligned Framing for TCP Specification (Version 1.0).” Oct. 25, 2002. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-culley-iwarp-mpa-v1.0.pdf).
The RDMA Verbs specification and the suite of three specifications that describe the RDMA over TCP protocol have been completed. RDMA over TCP/IP specifies an RDMA layer that will interoperate over a standard TCP/IP transport layer. RDMA over TCP does not specify a physical layer, but will work over Ethernet, wide area networks (WAN), or any other network where TCP/IP is used. The RDMA Verbs specification is substantially similar to that provided for by INFINIBAND. In addition, the aforementioned specifications have been adopted as the basis for work on RDMA by the Internet Engineering Task Force (IETF). The IETF versions of the RDMA over TCP specifications follow.
    “Marker PDU Aligned Framing for TCP Specification (Sep. 27, 2005)” http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-03.pdf
    “Direct Data Placement over Reliable Transports (July 2005)” http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-05.txt
    “An RDMA Protocol Specification (Jul. 17, 2005)” http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-05.txt
    Remote Direct Data Placement (rddp) Working Group http://www.ietf.org/html.charters/rddp-charter.html
In view of the above developments in the art, it is anticipated that RDMA over TCP/IP, with Ethernet as the underlying network fabric, will over the near term become as ubiquitous within data centers as are currently fielded TCP/IP-based fabrics. The present inventors contemplate that as RDMA over TCP/IP gains prevalence for use as a LAN fabric, data center managers will recognize that increased overall cost of ownership benefits can be had by moving existing SAN and clustering fabrics over to RDMA over TCP/IP as well.
But, as one skilled in the art will appreciate, TCP is a reliable connection transport protocol that provides a stream of bytes, with no inherent capability to demarcate message boundaries for an upper layer protocol (ULP). The RDMA Consortium specifications “Direct Data Placement Over Reliable Transports (Version 1.0)” and “Marker PDU Aligned Framing for TCP Specification (Version 1.0),” among other things specifically define techniques for demarcating RDMA message boundaries and for inserting “markers” into a message, or “protocol data unit” (PDU), that is to be transmitted over a TCP transport byte stream so that an RDMA-enabled network adapter on the receiving end can determine if and when a complete message has been received over the fabric. A framed PDU (FPDU) can contain 0 or more markers. An FPDU is not a message per se. Rather, an FPDU is a portion of a ULP payload that is framed with a marker PDU aligned (MPA) header, that has optional MPA markers inserted at regular intervals in TCP sequence space, and which additionally is padded with up to three octets of zeros (to make the size of the FPDU an integral multiple of four) and has a 32-bit cyclic redundancy check (CRC) appended thereto. The MPA markers are 32 bits wide and are inserted at 512 octet intervals in the TCP sequence number space. A given MPA marker provides a relative pointer that indicates the number of octets in the TCP sequence stream from the beginning of a corresponding FPDU to the first octet of the given MPA marker. An MPA header provides the length of its corresponding PDU and thus, each MPA marker facilitates location of a corresponding MPA header, from which a receiver can determine message boundaries for purposes that include computation of the 32-bit CRC. A message consists of one or more direct data placement (DDP) segments, and has the following general types: Send Message, RDMA Read Request Message, RDMA Read Response Message, and RDMA Write Message.
These techniques are required to overcome the byte-stream limitation of TCP and must be implemented by any RDMA-enabled network adapter.
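The FPDU framing described above can be illustrated with a minimal Python sketch. It is not taken from the referenced specifications or from the present disclosure. It assumes a 2-octet MPA header (a 16-bit length field) and assumes that marker positions are the multiples of 512 counted from a stream origin of offset 0. The names fpdu_length, pad_length, and marker_pointers are hypothetical.

```python
MARKER_INTERVAL = 512  # octets of TCP sequence space between markers (from the text)
MARKER_SIZE = 4        # each MPA marker is 32 bits
HEADER_SIZE = 2        # assumption: the MPA header is a 16-bit length field
CRC_SIZE = 4           # 32-bit CRC appended to every FPDU


def pad_length(payload_len):
    """Zero octets (0 to 3) needed so header + payload + pad is a multiple of four."""
    return (-(HEADER_SIZE + payload_len)) % 4


def fpdu_length(payload_len):
    """FPDU size in octets, not counting any markers: header + payload + pad + CRC."""
    return HEADER_SIZE + payload_len + pad_length(payload_len) + CRC_SIZE


def marker_pointers(fpdu_start, payload_len):
    """Return (stream_offset, pointer) for every marker that falls inside the FPDU.

    fpdu_start is the stream offset of the first octet of the FPDU. The pointer
    is the distance from the start of the FPDU to the first octet of the marker.
    """
    remaining = fpdu_length(payload_len)  # FPDU octets still to be emitted
    pos = fpdu_start
    markers = []
    while remaining > 0:
        if pos % MARKER_INTERVAL == 0:
            markers.append((pos, pos - fpdu_start))
            pos += MARKER_SIZE  # the marker itself consumes sequence space
        next_marker = (pos // MARKER_INTERVAL + 1) * MARKER_INTERVAL
        chunk = min(remaining, next_marker - pos)
        pos += chunk
        remaining -= chunk
    return markers


if __name__ == "__main__":
    print(fpdu_length(1000), pad_length(1000))
    print(marker_pointers(0, 1000))
    print(marker_pointers(100, 10))
```

Under these assumptions a short FPDU that begins and ends between two marker positions carries no markers at all, which corresponds to the "0 or more markers" case noted above.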
The present inventors have noted that there are several problems associated with implementing an RDMA-enabled network adapter so that PDUs are reliably handled with acceptable latency over a TCP/IP Ethernet fabric. First and foremost, as one skilled in the art will appreciate, TCP does not provide for acknowledgement of messages. Rather, TCP provides for acknowledgement of TCP segments (or partial TCP segments), many of which may be employed to transmit a message under RDMA over TCP/IP. Yet, the RDMAC Verbs Specification requires that an RDMA-enabled adapter provide message completion information to the verbs user in the form of Completion Queue Elements (CQEs). And the CQEs are typically generated using inbound TCP acknowledgements. Thus, it is required that an RDMA-enabled network adapter be capable of rapidly determining if and when a complete message has been received. In addition, the present inventors have noted a requirement for an efficient mechanism to allow for reconstruction and retransmission of TCP segments under normal network error conditions such as dropped packets, timeouts, etc. It is furthermore required that a technique be provided that allows an RDMA-enabled network adapter to efficiently rebuild an FPDU (including correct placement of markers therein) under conditions where the maximum segment size (MSS) for transmission over the network fabric is dynamically changed. The present inventors have also observed that it is desirable to provide a technique for efficiently inserting message markers into TCP segments that are being constructed for transmission and a corresponding technique for removal of markers from received TCP segments.
There are additional requirements specified in the above noted RDMAC and IETF specifications that are provided to minimize the number of intermediate buffer copies associated with TCP/IP operations. Direct placement of data that is received out of order (e.g., partial message data) is allowed, but delivery (e.g., “completion”) of messages must be performed in order. More specifically, a receiver may perform placement of received DDP Segments out of order and it furthermore may perform placement of a DDP Segment more than once. But the receiver must deliver complete messages only once and the completed messages must be delivered in the order they were sent. A message is considered completely received if and only if the last DDP segment of the message has its last flag set (i.e., a bit indicating that the corresponding DDP segment is the last DDP segment of the message), all of the DDP segments of the message have been previously placed, and all preceding messages have been placed and delivered.
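The placement and delivery rules above can be sketched in Python as follows. This is an illustration only, not the mechanism of the present disclosure. It assumes that every DDP segment carries a message sequence number, an offset within its message, a length, and a last flag. The class DeliveryTracker and the names msn, offset, and last are hypothetical.

```python
class DeliveryTracker:
    """Tracks out-of-order placement and reports in-order, once-only delivery."""

    def __init__(self):
        self.next_msn = 1   # assumed numbering: the first message is number 1
        self.placed = {}    # msn -> set of (offset, length) ranges already placed
        self.total = {}     # msn -> message length, known once the last segment is seen

    def place(self, msn, offset, length, last):
        """Record one placed segment; return the messages that become deliverable."""
        if msn < self.next_msn:
            return []       # repeated placement of a delivered message: no second delivery
        self.placed.setdefault(msn, set()).add((offset, length))
        if last:
            self.total[msn] = offset + length
        return self._deliver_ready()

    def _complete(self, msn):
        """True if the last segment was seen and the placed ranges leave no gap."""
        if msn not in self.total:
            return False
        covered = 0
        for off, ln in sorted(self.placed[msn]):
            if off > covered:
                return False
            covered = max(covered, off + ln)
        return covered >= self.total[msn]

    def _deliver_ready(self):
        """Deliver consecutive complete messages starting at next_msn."""
        delivered = []
        while self._complete(self.next_msn):
            delivered.append(self.next_msn)
            del self.placed[self.next_msn]
            del self.total[self.next_msn]
            self.next_msn += 1
        return delivered


if __name__ == "__main__":
    t = DeliveryTracker()
    print(t.place(2, 0, 100, True))   # message 2 is placed, but message 1 is pending
    print(t.place(1, 50, 50, True))   # second half of message 1
    print(t.place(1, 0, 50, False))   # first half arrives: both messages deliverable
```

The sketch depends on each segment header identifying its message and its position within that message, which is the kind of per-segment information assumed here for illustration.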
An RDMA-enabled network adapter can implement these requirements for some types of RDMA messages by using information that is provided directly within the headers of received DDP segments. But the present inventors have observed that other types of RDMA messages (e.g., RDMA Read Response, RDMA Write) do not provide the same type of information within the headers of their respective DDP segments. Consequently, data (i.e., payloads) corresponding to these DDP segments can be directly placed in host memory, yet the information provided within their respective headers cannot be directly employed to uniquely track or report message completions in order as required.
Accordingly, the present inventors have noted that it is desirable to provide apparatus and methods that enable an RDMA-enabled network adapter to effectively and efficiently track and report completions of RDMA messages within a protocol suite that allows for out-of-order placement of data.
And, as alluded to above, the techniques for demarcating RDMA message boundaries by providing MPA headers, inserting MPA markers into a PDU, and appending the 32-bit MPA CRC allow a receiver to place data that is received out of order, thereby saving a significant amount of intermediate storage, and additionally overcome the known limitations of TCP checksums, which have been shown to indicate errors at a much higher rate than underlying link characteristics would suggest.
But, given that TCP is a stream-oriented transport protocol, it is highly probable that a packet that is received may have anywhere from zero to approximately 20 MPA markers (depending upon network capabilities) embedded therein, and may comprise a partial PDU, a complete PDU, or a combination of partial and complete PDUs, thereby rendering calculation of the 32-bit MPA CRC difficult at best, particularly at 10 Gb/sec line speeds.
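The range of marker counts mentioned above follows from simple arithmetic, sketched here in Python. The sketch assumes that marker positions are the multiples of 512 counted from a stream origin of offset 0 and counts the markers that begin within a segment. The function name markers_in_segment and the payload sizes of 1460 and 8960 octets are illustrative assumptions, not figures from the text.

```python
MARKER_INTERVAL = 512  # octets of TCP sequence space between markers


def markers_in_segment(seg_start, seg_len):
    """Count marker positions in the half-open range [seg_start, seg_start + seg_len)."""
    first = -(-seg_start // MARKER_INTERVAL)                # ceiling division
    limit = -(-(seg_start + seg_len) // MARKER_INTERVAL)    # ceiling division
    return limit - first


if __name__ == "__main__":
    print(markers_in_segment(1, 100))    # short segment between two marker positions
    print(markers_in_segment(0, 1460))   # assumed payload of a standard Ethernet frame
    print(markers_in_segment(0, 8960))   # assumed payload of a jumbo frame
```

With these assumed payload sizes the count ranges from zero for a short segment to 18 for a jumbo frame, consistent with the estimate of approximately 20 given above.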
Consequently, the present inventors have noted that it is highly desirable to provide apparatus and methods that enable an RDMA-enabled network adapter to effectively and efficiently perform speculative MPA CRC calculations on arriving packets to preclude latencies that would otherwise be incurred due to the streaming nature of TCP.
In addition to performing these speculative MPA CRC calculations, it is also highly desirable to be able to rapidly locate and remove the MPA markers from a received packet prior to placing the data in user memory. It is also desirable that location and removal of the MPA markers be accomplished without requiring the use of additional buffers. It is furthermore advantageous to quickly locate and insert MPA markers into user data being provided over a host interface as a packet is being constructed for transmission. Insertion of MPA markers into the packet should also be accomplished without requiring the use of additional buffers.
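As a rough illustration of marker location and removal on the receive side, the following Python sketch computes which runs of a received segment hold data rather than marker octets. It is not the apparatus of the present disclosure. It assumes that marker positions are the multiples of 512 counted from a stream origin of offset 0, and the names data_ranges and strip_markers are hypothetical.

```python
MARKER_INTERVAL = 512  # octets of TCP sequence space between markers
MARKER_SIZE = 4        # each MPA marker is 32 bits


def data_ranges(seg_start, seg_len):
    """Return (offset_in_segment, length) for each run of non-marker octets.

    seg_start is the stream offset of the first octet of the segment. A marker
    that began in the previous segment and spills into this one is skipped too.
    """
    ranges = []
    pos = seg_start
    end = seg_start + seg_len
    while pos < end:
        phase = pos % MARKER_INTERVAL
        if phase < MARKER_SIZE:
            pos += MARKER_SIZE - phase   # step over the rest of the marker
            continue
        run_end = min(end, pos - phase + MARKER_INTERVAL)
        ranges.append((pos - seg_start, run_end - pos))
        pos = run_end
    return ranges


def strip_markers(payload, seg_start):
    """Concatenate the data runs of a segment payload, leaving the markers out."""
    return b"".join(payload[o:o + n] for o, n in data_ranges(seg_start, len(payload)))


if __name__ == "__main__":
    segment = b"AAAA" + b"MMMM" + b"BBBB"   # the marker sits at stream offset 512
    print(data_ranges(508, len(segment)))
    print(strip_markers(segment, 508))
```

Because data_ranges yields offsets and lengths rather than a copy of the data, each run could in principle be moved from the packet directly to its destination. The sketch shows only the position arithmetic. On the transmit side the same positions would receive markers, whose pointer values additionally depend on where the enclosing FPDU starts.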