1. Field of the Invention
This invention relates in general to the field of computer communications and more specifically to an apparatus and method for accelerating TCP/IP connections over an Ethernet fabric that is enabled to accomplish remote direct memory access (RDMA) operations.
2. Description of the Related Art
The first computers were stand-alone machines; that is, they loaded and executed application programs one at a time in an order typically prescribed through a sequence of instructions provided by keypunched batch cards or magnetic tape. All of the data required to execute a loaded application program was provided by the application program as input data, and execution results were typically output to a line printer. Even though the interface to early computers was cumbersome at best, the sheer power to rapidly perform computations made these devices very attractive to those in the scientific and engineering fields.
The development of remote terminal capabilities allowed computer technologies to be more widely distributed. Access to computational equipment in real-time fostered the introduction of computers into the business world. Businesses that processed large amounts of data, such as the insurance industry and government agencies, began to store, retrieve, and process their data on computers. Special applications were developed to perform operations on shared data within a single computer system.
During the mid-1970s, a number of successful attempts were made to interconnect computers for purposes of sharing data and/or processing capabilities. These interconnection attempts, however, employed special purpose protocols that were intimately tied to the architecture of these computers. As such, the computers were expensive to procure and maintain and their applications were limited to those areas of the industry that heavily relied upon shared data processing capabilities.
The U.S. government, however, realized the power that could be harnessed by allowing computers to interconnect and thus funded research that resulted in what we now know as the Internet. More specifically, this research resulted in a series of standards that specify the details of how interconnected computers are to communicate, how to interconnect networks of computers, and how to route traffic over these interconnected networks. This set of standards is known as the TCP/IP Internet Protocol Suite, named after its two predominant protocol standards, Transmission Control Protocol (TCP) and Internet Protocol (IP). TCP is a protocol that allows for a reliable byte stream connection between two computers. IP is a protocol that provides an addressing and routing mechanism for unreliable transmission of datagrams across a network of computers. The use of TCP/IP allows a computer to communicate across any set of interconnected networks, regardless of the underlying native network protocols that are employed by these networks. Once the interconnection problem was solved by TCP/IP, networks of interconnected computers began to crop up in all areas of business.
The ability to easily interconnect computer networks for communication purposes provided the motivation for the development of distributed application programs, that is, application programs that perform certain tasks on one computer connected to a network and certain other tasks on another computer connected to the network. The sophistication of distributed application programs has steadily evolved over more recent years into what we today call the client-server model. According to the model, “client” applications on a network make requests for service to “server” applications on the network. The “server” applications perform the service and return the results of the service to the “client” over the network. In an exact sense, a client and a server may reside on the same computer, but the more common employment of the model finds clients executing on smaller, less powerful, less costly computers connected to a network and servers executing on more powerful, more expensive computers. In fact, the proliferation of client-server applications has resulted in a class of high-end computers being known as “servers” because they are primarily used to execute server applications. Similarly, the term “client machine” is often used to describe a single-user desktop system that executes client applications. Client-server application technology has enabled computer usage to be phased into the business mainstream. Companies began employing interconnected client-server networks to centralize the storage of files, company data, manufacturing data, etc., on servers and allowed employees to access this data via clients. Servers today are sometimes known by the type of services that they perform. For example, a file server provides client access to centralized files, a mail server provides access to a company's electronic mail, a database server provides client access to a central database, and so on.
The development of other technologies such as hypertext markup language (HTML) and extensible markup language (XML) now allows user-friendly representations of data to be transmitted between computers. The advent of HTML/XML-based developments has resulted in an exponential increase in the number of computers that are interconnected because, now, even home-based businesses can develop server applications that provide services accessible over the Internet from any computer equipped with a web browser application (i.e., a web “client”). Furthermore, virtually every computer produced today is sold with web client software. In 1988, only 5,000 computers were interconnected via the Internet. In 1995, under 5 million computers were interconnected via the Internet. But with the maturation of client-server and HTML technologies, presently, over 50 million computers access the Internet. And the growth continues.
The number of servers in a present day data center may range from a single server to hundreds of interconnected servers. And the interconnection schemes chosen for those applications that consist of more than one server depend upon the type of services that interconnection of the servers enables. Today, there are three distinct interconnection fabrics that characterize a multi-server configuration. Virtually all multi-server configurations have a local area network (LAN) fabric that is used to interconnect any number of client machines to the servers within the data center. The LAN fabric interconnects the client machines and allows the client machines access to the servers and perhaps also allows client and server access to network attached storage (NAS), if provided. One skilled in the art will appreciate that TCP/IP over Ethernet is the most commonly employed protocol in use today for a LAN fabric, with 100 Megabit (Mb) Ethernet being the most common transmission speed and 1 Gigabit (Gb) Ethernet gaining prevalence in use. In addition, 10 Gb Ethernet links and associated equipment are currently being fielded.
The second type of interconnection fabric, if required within a data center, is a storage area network (SAN) fabric. The SAN fabric provides for high speed access of block storage devices by the servers. Again, one skilled in the art will appreciate that Fibre Channel is the most commonly employed protocol for use today for a SAN fabric, transmitting data at speeds up to 2 Gb per second, with 4 Gb per second components that are now in the early stages of adoption.
The third type of interconnection fabric, if required within a data center, is a clustering network fabric. The clustering network fabric is provided to interconnect multiple servers to support such applications as high-performance computing, distributed databases, distributed data store, grid computing, and server redundancy. A clustering network fabric is characterized by very high transmission speed and low latency. There is no prevalent clustering protocol in use today, so a typical clustering network will employ networking devices developed by a given manufacturer. Thus, the networking devices (i.e., the clustering network fabric) operate according to a networking protocol that is proprietary to the given manufacturer. Clustering network devices are available from such manufacturers as Quadrics Inc. and Myricom. These network devices transmit data at speeds greater than 1 Gb per second with latencies on the order of microseconds. It is interesting, however, that although low latency has been noted as a desirable attribute for a clustering network, more than 50 percent of the clusters in the top 500 fastest computers today use TCP/IP over Ethernet as their interconnection fabric.
It has been noted by many in the art that a significant performance bottleneck associated with networking in the near term will not be the network fabric itself, as has been the case in more recent years. Rather, the bottleneck is now shifting to the processor. More specifically, network transmissions will be limited by the amount of processing required of a central processing unit (CPU) to accomplish TCP/IP operations at 1 Gb (and greater) speeds. In fact, the present inventors have noted that approximately 40 percent of the CPU overhead associated with TCP/IP operations is due to transport processing, that is, the processing operations that are required to allocate buffers to applications, to manage TCP/IP link lists, etc. Another 20 percent of the CPU overhead associated with TCP/IP operations is due to the processing operations which are required to make intermediate buffer copies, that is, moving data from a network adapter buffer, then to a device driver buffer, then to an operating system buffer, and finally to an application buffer. And the final 40 percent of the CPU overhead associated with TCP/IP operations is the processing required to perform context switches between an application and its underlying operating system which provides the TCP/IP services. Presently, it is estimated that it takes roughly 1 GHz of processor bandwidth to provide for a typical 1 Gb/second TCP/IP network. Extrapolating this estimate up to that required to support a 10 Gb/second TCP/IP network provides a sufficient basis for the consideration of alternative configurations beyond the TCP/IP stack architecture today, most of the operations of which are provided by an underlying operating system.
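The overhead figures above can be worked through as simple arithmetic. The following Python sketch is illustrative only: it assumes that processor demand scales linearly with link speed (the "extrapolation" mentioned above), and the names `OVERHEAD_SHARE`, `cpu_ghz_required`, and `overhead_breakdown` are hypothetical, introduced here for the example.

```python
# Shares of TCP/IP CPU overhead as stated in the text above.
OVERHEAD_SHARE = {
    "transport processing": 0.40,
    "intermediate buffer copies": 0.20,
    "context switches": 0.40,
}

# Estimate from the text: roughly 1 GHz of processor per 1 Gb/second of TCP/IP traffic.
GHZ_PER_GBPS = 1.0


def cpu_ghz_required(link_gbps: float) -> float:
    """Processor bandwidth needed for a link, ASSUMING linear scaling."""
    return link_gbps * GHZ_PER_GBPS


def overhead_breakdown(link_gbps: float) -> dict:
    """Split the required processor bandwidth across the three overhead sources."""
    total = cpu_ghz_required(link_gbps)
    return {source: total * share for source, share in OVERHEAD_SHARE.items()}


if __name__ == "__main__":
    for source, ghz in overhead_breakdown(10.0).items():
        print(f"{source}: {ghz:.1f} GHz")
```

Under these assumptions, a 10 Gb/second link would call for roughly 10 GHz of processor bandwidth, of which the buffer copies alone would account for about 2 GHz.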
As alluded to above, it is readily apparent that TCP/IP processing overhead requirements must be offloaded from the processors and operating systems within a server configuration in order to alleviate the performance bottleneck associated with current and future networking fabrics. This can be accomplished in principle by 1) moving the transport processing requirements from the CPU down to a network adapter; 2) providing a mechanism for remote direct memory access (RDMA) operations, thus giving the network adapter the ability to transfer data directly to/from application memory; and 3) providing a user-level direct access technique that allows an application to directly command the network adapter to send/receive data, thereby bypassing the underlying operating system.
The INFINIBAND™ protocol was an ill-fated attempt to accomplish these three “offload” objectives, while at the same time attempting to increase data transfer speeds within a data center. In addition, INFINIBAND attempted to merge the three disparate fabrics (i.e., LAN, SAN, and cluster) by providing a unified point-to-point fabric that, among other things, completely replaced Ethernet, Fibre Channel, and vendor-specific clustering networks. On paper and in simulation, the INFINIBAND protocol was extremely attractive from a performance perspective because it enabled all three of the above objectives and increased networking throughput overall. Unfortunately, the architects of INFINIBAND overestimated the community's willingness to abandon their tremendous investment in existing networking infrastructure, particularly that associated with Ethernet fabrics. And as a result, INFINIBAND has not become a viable option for the marketplace.
INFINIBAND did, however, provide a very attractive mechanism for offloading reliable connection network transport processing from a CPU and corresponding operating system. One aspect of this mechanism is the use of “verbs.” Verbs is an architected programming interface between a network input/output (I/O) adapter and a host operating system (OS) or application software, which enables 1) moving reliable connection transport processing from a host CPU to the I/O adapter; 2) direct data placement (DDP) by the I/O adapter through the use of RDMA read messages and RDMA write messages, as will be described in greater detail below; and 3) bypass of the OS. INFINIBAND defined a new type of reliable connection transport for use with verbs, but one skilled in the art will appreciate that a verbs interface mechanism will work equally well with the TCP reliable connection transport. At a very high level, this mechanism consists of providing a set of commands (“verbs”) which can be executed by an application program, without operating system intervention, that direct an appropriately configured network adapter (not part of the CPU) to directly transfer data to/from server (or “host”) memory, across a network fabric, where commensurate direct data transfer operations are performed in host memory of a counterpart server. This type of operation, as noted above, is referred to as RDMA, and a network adapter that is configured to perform such operations is referred to as an RDMA-enabled network adapter. In essence, an application executes a verb to transfer data and the RDMA-enabled network adapter moves the data over the network fabric to/from host memory.
Many in the art have attempted to preserve the attractive attributes of INFINIBAND (e.g., reliable connection network transport offload, verbs, RDMA) as part of a networking protocol that utilizes Ethernet as an underlying network fabric. In fact, over 50 member companies are now part of what is known as the RDMA Consortium (www.rdmaconsortium.org), an organization founded to foster industry standards and specifications that support RDMA over TCP. RDMA over TCP/IP defines the interoperable protocols to support RDMA operations over standard TCP/IP networks. To date, the RDMA Consortium has released four specifications that provide for RDMA over TCP, as follows, each of which is incorporated by reference in its entirety for all intents and purposes:
        Hilland et al. “RDMA Protocol Verbs Specification (Version 1.0).” April 2003. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-rdmac.pdf).
        Recio et al. “An RDMA Protocol Specification (Version 1.0).” October 2002. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-recio-iwarp-rdmap-v1.0.pdf).
        Shah et al. “Direct Data Placement Over Reliable Transports (Version 1.0).” October 2002. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-shah-iwarp-ddp-v1.0.pdf).
        Culley et al. “Marker PDU Aligned Framing for TCP Specification (Version 1.0).” Oct. 25, 2002. RDMA Consortium. Portland, Oreg. (http://www.rdmaconsortium.org/home/draft-culley-iwarp-mpa-v1.0.pdf).
The RDMA Verbs specification and the suite of three specifications that describe the RDMA over TCP protocol have been completed. RDMA over TCP/IP specifies an RDMA layer that will interoperate over a standard TCP/IP transport layer. RDMA over TCP does not specify a physical layer, but will work over Ethernet, wide area networks (WAN), or any other network where TCP/IP is used. The RDMA Verbs specification is substantially similar to that provided for by INFINIBAND. In addition, the aforementioned specifications have been adopted as the basis for work on RDMA by the Internet Engineering Task Force (IETF). The IETF versions of the RDMA over TCP specifications follow:
        “Marker PDU Aligned Framing for TCP Specification (Sep. 27, 2005)” http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-03.pdf
        “Direct Data Placement over Reliable Transports (July 2005)” http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-05.txt
        “An RDMA Protocol Specification (Jul. 17, 2005)” http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-05.txt
        Remote Direct Data Placement (rddp) Working Group http://www.ietf.org/html.charters/rddp-charter.html
In view of the above developments in the art, it is anticipated that RDMA over TCP/IP, with Ethernet as the underlying network fabric, will over the near term become as ubiquitous within data centers as are currently fielded TCP/IP-based fabrics. The present inventors contemplate that as RDMA over TCP/IP gains prevalence for use as a LAN fabric, data center managers will recognize that greater overall cost-of-ownership benefits can be had by moving existing SAN and clustering fabrics over to RDMA over TCP/IP as well.
But, as one skilled in the art will appreciate, TCP is a reliable connection transport protocol that provides a stream of bytes, with no inherent capability to demarcate message boundaries for an upper layer protocol (ULP). The RDMA Consortium specifications “Direct Data Placement Over Reliable Transports (Version 1.0)” and “Marker PDU Aligned Framing for TCP Specification (Version 1.0),” among other things, specifically define techniques for demarcating RDMA message boundaries and for inserting “markers” into a message, or “protocol data unit” (PDU), that is to be transmitted over a TCP transport byte stream so that an RDMA-enabled network adapter on the receiving end can determine if and when a complete message has been received over the fabric. A marked PDU is referred to as a framed PDU (FPDU). An FPDU, however, is not a message per se. Rather, an FPDU is a portion of a ULP payload that is framed with a marker PDU aligned (MPA) header, and that has MPA markers inserted at regular intervals in TCP sequence space. The MPA markers are inserted to facilitate location of the MPA header. A message consists of one or more direct data placement (DDP) segments, and is of one of the following general types: Send Message, RDMA Read Request Message, RDMA Read Response Message, and RDMA Write Message. These techniques are required to overcome the lack of message boundaries in the TCP byte stream and must be implemented by any RDMA-enabled network adapter.
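By way of illustration only, the placement of markers at regular intervals in TCP sequence space can be sketched in Python as follows. This is a simplified model under stated assumptions, not the framing defined by the cited specifications: it assumes a 512-byte marker interval and a 4-byte marker consisting of a 16-bit reserved field followed by a 16-bit back-pointer to the start of the FPDU, treats the FPDU bytes as opaque, and ignores sequence number wraparound. The function names `insert_markers` and `fpdu_start_from_marker` are hypothetical.

```python
import struct

MARKER_INTERVAL = 512  # ASSUMED spacing of markers in TCP sequence space
MARKER_SIZE = 4        # ASSUMED marker size: 16-bit reserved + 16-bit pointer


def insert_markers(fpdu: bytes, start_seq: int,
                   interval: int = MARKER_INTERVAL) -> bytes:
    """Return the bytes placed on the stream for one FPDU starting at start_seq.

    A marker is emitted at every stream position that is a multiple of
    `interval`; it records how far back the FPDU began.
    """
    out = bytearray()
    pos = start_seq  # stream position of the next byte to emit
    i = 0            # index of the next FPDU byte to copy
    while i < len(fpdu):
        if pos % interval == 0:
            back_ptr = pos - start_seq  # distance from FPDU start to this marker
            out += struct.pack("!HH", 0, back_ptr)
            pos += MARKER_SIZE
        # Copy FPDU bytes up to the next marker position (or the end of the FPDU).
        next_marker = (pos // interval + 1) * interval
        chunk = min(len(fpdu) - i, next_marker - pos)
        out += fpdu[i:i + chunk]
        i += chunk
        pos += chunk
    return bytes(out)


def fpdu_start_from_marker(marker_seq: int, marker: bytes) -> int:
    """Receiver side: recover the stream position at which the FPDU began."""
    _reserved, back_ptr = struct.unpack("!HH", marker)
    return marker_seq - back_ptr


if __name__ == "__main__":
    framed = insert_markers(bytes(1000), start_seq=500)
    # The first marker sits at stream position 512, i.e. 12 bytes into the output.
    print(fpdu_start_from_marker(512, framed[12:16]))
```

Because each marker sits at a position the receiver can compute from the sequence number alone, a receiver that picks up the stream mid-FPDU can read the nearest marker and step back to the start of the FPDU, which is the purpose of the markers described above.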
The present inventors have noted that there are several problems associated with implementing an RDMA-enabled network adapter so that PDUs are reliably handled with acceptable latency over a TCP/IP Ethernet fabric. First and foremost, as one skilled in the art will appreciate, TCP does not provide for acknowledgement of messages. Rather, TCP provides for acknowledgement of TCP segments (or partial TCP segments), many of which may be employed to transmit a message under RDMA over TCP/IP. Yet, the RDMAC Verbs Specification requires that an RDMA-enabled adapter provide message completion information to the verbs user in the form of Completion Queue Elements (CQEs). And the CQEs are typically generated using inbound TCP acknowledgements. Thus, it is required that an RDMA-enabled network adapter be capable of rapidly determining if and when a complete message has been received. In addition, the present inventors have noted a requirement for an efficient mechanism to allow for reconstruction and retransmission of TCP segments under normal network error conditions such as dropped packets, timeouts, etc. It is furthermore required that a technique be provided that allows an RDMA-enabled network adapter to efficiently rebuild an FPDU (including correct placement of markers therein) under conditions where the maximum segment size (MSS) for transmission over the network fabric is dynamically changed.
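The mismatch between segment-level acknowledgements and message-level completions can be illustrated with a minimal Python sketch. It is not the mechanism of the present invention or of any cited specification: it assumes cumulative acknowledgements, in-order completion, and no sequence number wraparound, and the names `PendingMessage` and `CompletionTracker` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingMessage:
    msg_id: int
    end_seq: int  # stream position one past the last byte of the message


class CompletionTracker:
    """Maps cumulative TCP acknowledgements onto whole-message completions."""

    def __init__(self, initial_seq: int = 0) -> None:
        self.next_seq = initial_seq  # where the next posted message will begin
        self.pending = deque()       # unacknowledged messages, oldest first
        self.completions = []        # stand-in for a completion queue

    def post_send(self, msg_id: int, length: int) -> None:
        """Record a message that occupies `length` bytes of the byte stream."""
        self.next_seq += length
        self.pending.append(PendingMessage(msg_id, self.next_seq))

    def on_ack(self, ack_seq: int) -> list:
        """Handle a cumulative ACK; return ids of messages it completes."""
        done = []
        while self.pending and self.pending[0].end_seq <= ack_seq:
            done.append(self.pending.popleft().msg_id)
        self.completions.extend(done)
        return done


if __name__ == "__main__":
    tracker = CompletionTracker()
    tracker.post_send(msg_id=1, length=3000)  # spans several TCP segments
    for ack in (1460, 2920, 3000):
        print(ack, tracker.on_ack(ack))
```

In this model the acknowledgements for the first two segments complete nothing; only the acknowledgement that covers the final byte of the message yields a completion entry, which is why the adapter must track message boundaries in sequence space rather than rely on per-segment acknowledgements.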