1. Technical Field:
The present invention generally relates to communication protocols between a host computer and an input/output (I/O) device. More specifically, the present invention provides a method by which a Remote Direct Memory Access (RDMA) enabled Network Interface Controller (NIC) can support a redundant configuration consisting of a primary and an alternate RDMA enabled NIC (RNIC).
2. Description of Related Art:
In an Internet Protocol (IP) network, the software provides a message passing mechanism that can be used to communicate with Input/Output devices, general purpose computers (hosts), and special purpose computers. The message passing mechanism consists of a transport protocol, an upper level protocol, and an application programming interface. The key standard transport protocols used on IP networks today are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP provides a reliable service and UDP provides an unreliable service. In the future the Stream Control Transmission Protocol (SCTP) will also be used to provide a reliable service. Processes executing on devices or computers access the IP network through Upper Level Protocols, such as Sockets, iSCSI, and Direct Access File System (DAFS).
Unfortunately, the TCP/IP software consumes a considerable amount of processor and memory resources. This problem has been covered extensively in the literature (see J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996; and D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Vol. 27, No. 6, pp. 23-29, June 1989). In the future the network stack will continue to consume excessive resources for several reasons, including: increased use of networking by applications; use of network security protocols; and underlying fabric bandwidths that are increasing at a higher rate than microprocessor and memory bandwidths. To address this problem, the industry is offloading the network stack processing to an RDMA enabled NIC (RNIC).
There are two offload approaches being taken in the industry. The first approach uses the existing TCP/IP network stack, without adding any additional protocols. This approach can offload TCP/IP to hardware, but unfortunately does not remove the need for receive side copies. As noted in the papers above, copies are one of the largest contributors to CPU utilization. To remove the need for copies, the industry is pursuing the second approach that consists of adding Framing, Direct Data Placement (DDP), and Remote Direct Memory Access (RDMA) over the TCP and SCTP protocols. The RDMA enabled NIC (RNIC) required to support these two approaches is similar, the key difference being that in the second approach the hardware must support the additional protocols.
The RNIC provides a message passing mechanism that can be used by sockets, iSCSI, and DAFS to communicate between nodes. Processes executing on host computers, or devices, access the IP network by posting send/receive messages to send/receive work queues on an RNIC. These processes also are referred to as "consumers".
The send/receive work queues (WQ) are assigned to a consumer as a queue pair (QP). The messages can be sent over several different transport types: traditional TCP, RDMA TCP, UDP, or SCTP. Consumers retrieve the results of these messages from a completion queue (CQ) through RNIC send and receive work completion (WC) queues. The source RNIC takes care of segmenting outbound messages and sending them to the destination. The destination RNIC takes care of reassembling inbound messages and placing them in the memory space designated by the destination's consumer. These consumers use RNIC verbs to access the functions supported by the RNIC. The software that interprets verbs and directly accesses the RNIC is known as the RNIC Interface (RI).
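The relationship between work queues, queue pairs, and the completion queue described above can be illustrated with a small model. This is only a sketch under stated assumptions: the names QueuePair, WorkCompletion, and ToyRnic are hypothetical, no real RNIC verbs interface is used, and the "network" is reduced to a direct copy from a posted send into a posted receive buffer.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkCompletion:
    """Hypothetical completion record a consumer retrieves from the CQ."""
    qp_id: int
    kind: str    # "send" or "recv"
    length: int  # number of bytes transferred


@dataclass
class QueuePair:
    """A send work queue and a receive work queue assigned to one consumer."""
    qp_id: int
    send_wq: deque = field(default_factory=deque)
    recv_wq: deque = field(default_factory=deque)

    def post_send(self, payload: bytes) -> None:
        self.send_wq.append(payload)

    def post_recv(self, buffer: bytearray) -> None:
        self.recv_wq.append(buffer)


class ToyRnic:
    """Illustrative stand-in for an RNIC; not a real device interface."""

    def __init__(self) -> None:
        self.cq: deque = deque()

    def process(self, src: QueuePair, dst: QueuePair) -> None:
        """Move one posted send into the destination's posted receive buffer."""
        payload = src.send_wq.popleft()
        buffer = dst.recv_wq.popleft()
        if len(buffer) < len(payload):
            raise ValueError("receive buffer too small for message")
        # Place the data in the memory the destination consumer designated.
        buffer[:len(payload)] = payload
        self.cq.append(WorkCompletion(src.qp_id, "send", len(payload)))
        self.cq.append(WorkCompletion(dst.qp_id, "recv", len(payload)))

    def poll_cq(self) -> Optional[WorkCompletion]:
        """Return the oldest completion, or None if the CQ is empty."""
        return self.cq.popleft() if self.cq else None
```

In this model a consumer posts work to its queue pair and later polls the completion queue for results, which mirrors the division of labor between consumer and RNIC described in the text.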
Today, software in the host CPU performs most of the transport (e.g., TCP) and network layer (e.g., IP) processing. The NIC typically performs the link layer (e.g., Ethernet) processing and possibly a modest amount of transport or network layer offload (e.g., checksum offload). The host software maintains all the state information associated with TCP/IP connections in host local memory, which allows it to support switchover, and switchback, between a primary NIC and an alternate NIC. That is, if the primary NIC fails, the host software moves all the connections to the alternate NIC and continues communication processing.
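The reason host-resident state makes switchover straightforward can be shown with a brief sketch. It assumes a hypothetical HostStack class that keeps per-connection sequence numbers in host memory and treats each NIC as nothing more than a name; none of these identifiers come from an actual network stack.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConnectionState:
    """Transport state kept in host local memory (illustrative fields only)."""
    next_send_seq: int
    next_recv_seq: int
    nic: str  # name of the NIC currently carrying this connection


class HostStack:
    """Hypothetical host software that owns all connection state."""

    def __init__(self, primary: str, alternate: str) -> None:
        self.primary = primary
        self.alternate = alternate
        self.connections: Dict[int, ConnectionState] = {}

    def open(self, conn_id: int) -> None:
        self.connections[conn_id] = ConnectionState(0, 0, self.primary)

    def send(self, conn_id: int, nbytes: int) -> str:
        """Advance the send sequence number; return the NIC that was used."""
        state = self.connections[conn_id]
        state.next_send_seq += nbytes
        return state.nic

    def _move(self, src: str, dst: str) -> int:
        """Rebind every connection on src to dst; state itself is untouched."""
        moved = 0
        for state in self.connections.values():
            if state.nic == src:
                state.nic = dst
                moved += 1
        return moved

    def switchover(self) -> int:
        return self._move(self.primary, self.alternate)

    def switchback(self) -> int:
        return self._move(self.alternate, self.primary)
```

Because the sequence numbers never leave host memory in this model, moving a connection only changes which NIC is named in its record. Once the transport state lives inside the adapter, that simple rebinding is no longer available to the host.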
RDMA enabled NICs offer a higher performance interface for communicating with other general purpose computers and I/O devices. RNICs offload the transport (e.g., TCP) and network (e.g., IP) layers into the RNIC. Because these layers are migrated into the RNIC, the host software is no longer able to support switchover and switchback using today's mechanisms. Therefore, a simple mechanism is needed to allow RNICs to support switchover and switchback of reliable transport (e.g., TCP) connections and allow communications to continue following a planned or unplanned RNIC outage.
The present invention provides a method, computer program product, and distributed data processing system for supporting RNIC switchover and switchback. The distributed data processing system comprises end nodes, switches, routers, and links interconnecting the components. The end nodes use send and receive queue pairs to transmit and receive messages. The end nodes segment the message into segments and transmit the segments over the links. The switches and routers interconnect the end nodes and route the segments to the appropriate end nodes. The end nodes reassemble the segments into a message at the destination.
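The segmenting and reassembly performed by the end nodes can be sketched as follows. The function names and the max_segment_size parameter are assumptions made for illustration, and each segment is tagged with its byte offset so that the destination can place it regardless of arrival order; the text does not specify a segment format.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[int, bytes]  # (byte offset within the message, payload)


def segment_message(message: bytes, max_segment_size: int) -> List[Segment]:
    """Split a message into offset-tagged segments of bounded size."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    return [
        (offset, message[offset:offset + max_segment_size])
        for offset in range(0, len(message), max_segment_size)
    ]


def reassemble_message(segments: Iterable[Segment], total_length: int) -> bytes:
    """Place each segment at its offset in a destination buffer."""
    buffer = bytearray(total_length)
    for offset, payload in segments:
        if offset + len(payload) > total_length:
            raise ValueError("segment extends past end of message")
        buffer[offset:offset + len(payload)] = payload
    return bytes(buffer)
```

Tagging segments with offsets is one simple way to make reassembly independent of delivery order; a reliable transport such as TCP would instead rely on its own sequence numbers.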
The present invention provides a mechanism for supporting RNIC (RDMA enabled NIC) switchover and switchback. Using the mechanism provided in the present invention, when a planned or unplanned outage occurs on a primary RNIC, all outstanding connections are switched over to an alternate RNIC, and the alternate RNIC continues communication processing. Additionally, using the mechanism provided in the present invention, connections can also be switched back.