Computers and other devices are commonly interconnected to facilitate communication among one another using any one of a number of available standard network architectures and any one of several corresponding and compatible network protocols. The physical nature of standard architectures and their topologies is typically dictated at the first two layers of the OSI (Open Systems Interconnection) Basic Reference Model for networks; they are known as the physical layer (layer 1) and the data link layer (layer 2). One of the most commonly deployed of such standard architectures is the Ethernet® network architecture. Other types of network architectures that are less widely implemented include ARCnet, Token Ring and FDDI. Variations of the Ethernet® standard are differentiated from one another based on characteristics such as maximum throughput (i.e. the highest data transmission rate) of devices coupled to the network, the type of medium used for physically interconnecting the devices (e.g. coaxial cable, twisted pair cable, optical fibers, etc.) to the network and the maximum permissible length of the medium.
Packet switched network protocols are often employed over the physical and link layers described above. They dictate the formatting of data into packets by which data can be transmitted over the network using virtual connections established between peer applications running on devices coupled to the network. They also dictate the manner in which these virtual connections are established and torn down. These protocols are defined by layer 3 (network layer) and layer 4 (transport layer) of the OSI model and typically reside in the operating system of the host computer system.
Conventionally, the operating system (O/S) of each network device executes instances of the transport and network protocols. This is sometimes referred to herein as the conventional protocol stack. Executing the transport protocol at the behest of applications local to the O/S of each device facilitates the establishment and management of virtual connections between the local applications and peer applications running on other nodes of the network (layer 4). TCP is a commonly deployed transport protocol involved in the establishment and management of such virtual connections. Executing a network protocol on behalf of applications local to the O/S of each device facilitates the formatting of payload data derived from the local applications for transmission to remote applications running on the other nodes with which the local applications are virtually connected. Executing network protocols on behalf of applications local to the O/S of each device also facilitates the extraction of payload data received from virtually connected applications running on the other nodes of the network (layer 3). IP is a commonly deployed network protocol. TCP/IP is a layer4/layer3 protocol combination commonly used in Internet applications as well as intranet applications such as local area networks (LANs).
For a conventional protocol stack as described above, data to be transmitted by a local application to a remote node with which it is virtually connected is first copied from an application buffer in the host memory to a temporary protocol buffer and it is this copy that is then formatted and transmitted by the protocol stack out over the network. Likewise, data received by the host over the network from the remote application is de-formatted and a copy of the data is then stored in a protocol buffer. An application buffer associated with the target application is then notified of the availability of the deformatted data and the data is eventually copied into the application buffer in the host memory by the O/S at the request of the destination application. As the number of network transactions and the amount of data per transaction increases, the demand on the host processor in performing the foregoing functions increases commensurately.
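The copy steps described above can be modeled in a few lines. The following is a minimal sketch only: the function names and the three-byte placeholder header are hypothetical, and a real protocol stack performs these steps inside the O/S rather than in application code.

```python
def conventional_send(app_buffer: bytes, header: bytes = b"HDR"):
    """Hypothetical transmit path: copy the payload first, then format it."""
    # Copy from the application buffer into a temporary protocol buffer.
    protocol_buffer = bytearray(app_buffer)
    # Format the copy into a packet (placeholder header, not a real TCP/IP header).
    packet = header + bytes(protocol_buffer)
    return packet, len(protocol_buffer)


def conventional_receive(packet: bytes, header_len: int = 3):
    """Hypothetical receive path: de-format, then copy to the application."""
    # De-format the packet and store the payload in a protocol buffer.
    protocol_buffer = bytearray(packet[header_len:])
    # Copy from the protocol buffer into the application buffer.
    app_buffer = bytes(protocol_buffer)
    return app_buffer, len(protocol_buffer)
```

In this model every payload byte is copied once on each side in addition to the formatting work, which is the per-transaction cost that grows with the amount of data transferred.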
Network interface resources are typically required to physically couple computers and other devices to a network. These interface resources are sometimes referred to as network adapters or network interface cards (NICs). Each adapter or NIC has at least one bi-directional port through which a physical link can be provided between the network transmission medium and the processing resources of the network device. Data is communicated (as packets in the case of packet switched networks) between the virtually connected applications running on two or more network devices. The data is electronically transmitted and received through these interface resources and over the media used to physically couple the devices together. The network adapters typically provide the data link and physical layers of the interconnect standard. Adapter cards or NICs are commercially available in various product configurations that are designed to support one or more variations of standard architectures and known topologies.
Each device coupled to a network is identified by one or more “publicly” known addresses by which other devices on the network know to communicate with it. Each address corresponds to one of the layers of the OSI model and is embedded in the packets for both the source device that generated the packet as well as the destination device(s) for which the packet is intended. For Ethernet networks, a network device will use an address at layer 2 (the data link layer) known as a MAC (media access control) address to differentiate between the NICs and/or NIC ports of the other devices on the network with which it communicates. In addition, one or more protocol addresses at layer 3 (the network layer, e.g. IP, IPX, AppleTalk, etc.) are used to identify each of one or more instances of the network layer protocol(s) running on the device (for IP this is often referred to as an “IP address”). This number is also sometimes referred to as a host number when used to identify the endpoints of connections between devices on the network.
Each of the network devices can have one or more NICs/NIC ports. Each NIC/NIC port can be coupled to a different network/sub-network, or they can be teamed to operate as a single virtual NIC port that can have an aggregate throughput greater than any of the NICs/NIC ports operating individually. The teaming of NICs was motivated by the ever increasing desire for throughput between devices on networks. For a description of techniques used in support of NIC teaming to achieve increased throughput and/or fault tolerance, see for example U.S. Pat. No. 6,272,113 entitled “Network Controller System that uses Multicast Heartbeat Packets,” which was issued on Aug. 7, 2001.
When NICs/NIC ports operate independently, an interface is created (i.e. “exposed”) to the protocol stack residing in the O/S for each of them. The interface couples (in a virtual sense) the driver for that NIC/NIC port and the instance of IP (or other network protocol) that is part of the protocol stack. Each IP address is associated with a different NIC/NIC port or team. Each interface is typically identified by a different protocol address, and packets transmitted and received through a particular one of the exposed interfaces carry its IP address as their source and destination IP address, respectively. Because each interface exposed to the protocol stack is associated with its own IP address and its own NIC or team of NICs, a particular network device coupled to different networks or sub-networks (i.e. residing in different domains) will be addressed with a different protocol address on each of them. This IP address will be the one that corresponds to the NIC and interface coupling the device to the network or sub-network.
Two or more NICs/NIC ports may be teamed together to aggregate their resources, balance traffic over the team members and provide redundancy for fault tolerance. This can be accomplished by interposing an intermediate driver between the individual drivers for each NIC of the team and the instance of IP comprising the O/S protocol stack. This intermediate driver makes the multiple drivers of the team members appear as a single driver to the instance of IP, and thus only a single interface need be exposed to the protocol stack for the team. All members of the team therefore share the same IP address within a given domain.
A single NIC/NIC port or team can also be securely shared over two or more networks or sub-networks by interfacing the NIC or NIC team through a VLAN switch. Each VLAN to which the NIC or NIC team is assigned corresponds to a virtual interface to the instance of IP comprising the conventional protocol stack. As above, each virtual interface is assigned a different IP address. The VLAN switch is able to differentiate the destinations among the packets it receives from a NIC or NIC team shared among a plurality of networks through the use of a VLAN tag that is added to each frame transmitted from the shared NIC or NIC team. Packets received from the network by the shared NIC or NIC team all have the same destination MAC address and are therefore differentiated based on their destination IP addresses. Thus, the same NIC or NIC team may be shared across several VLAN subnets through a switch, while maintaining isolation between those VLAN subnets.
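The tagging of transmitted frames can be illustrated as follows. This is a sketch that assumes the tag format of IEEE 802.1Q (a 4-byte tag inserted after the destination and source MAC addresses), which the text above does not name; the function names are hypothetical.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # Tag Protocol Identifier used by IEEE 802.1Q (assumed format)


def add_vlan_tag(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert a 4-byte VLAN tag after the two 6-byte MAC addresses."""
    if not 0 <= vlan_id <= 0x0FFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit in 3 bits")
    # Tag control information: 3-bit priority, 1-bit DEI (left 0), 12-bit VLAN ID.
    tci = (priority << 13) | vlan_id
    tag = struct.pack("!HH", TPID_8021Q, tci)
    return frame[:12] + tag + frame[12:]


def vlan_id_of(frame: bytes) -> Optional[int]:
    """Return the VLAN ID of a tagged frame, or None if the frame is untagged."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return tci & 0x0FFF if tpid == TPID_8021Q else None
```

A switch reading the tag this way can forward each frame from the shared NIC or NIC team to the correct VLAN without inspecting the payload.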
Each NIC or NIC port is associated with its own unique MAC address and because devices on a contiguous layer 2 Ethernet network communicate directly using these MAC addresses, they must first resolve IP addresses to MAC addresses. A network device wishing to establish a virtual connection with another peer device first consults a cache of MAC address/IP address pairs (an ARP table) that it maintains for network devices with which it has previously communicated. If no MAC address for the peer's IP address resides in the requestor's ARP table, the requestor broadcasts an ARP request to the other devices on the network that specifies the IP address of the device with which it wishes to communicate. The device identified by the IP address in the ARP request responds to the requestor by sending back the MAC address for the NIC or NIC team associated with the specified IP address. This process is known as ARP (Address Resolution Protocol), the result of which is then stored in the requestor's cache for future reference.
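The resolution sequence just described (consult the cache, broadcast on a miss, store the reply) can be sketched as below. The Network and Host classes are hypothetical stand-ins for a broadcast domain and a requesting device; no real ARP packets are built or sent.

```python
class Network:
    """Hypothetical broadcast domain mapping each IP address to its owner's MAC."""

    def __init__(self, owners):
        self.owners = dict(owners)
        self.broadcasts = 0  # counts ARP requests sent, for illustration

    def broadcast_arp_request(self, target_ip):
        # Only the device that owns target_ip answers; others stay silent.
        self.broadcasts += 1
        return self.owners.get(target_ip)


class Host:
    """Hypothetical requestor that maintains an ARP table (IP -> MAC cache)."""

    def __init__(self, network):
        self.network = network
        self.arp_table = {}

    def resolve(self, ip):
        mac = self.arp_table.get(ip)
        if mac is None:
            # Cache miss: broadcast a request naming the target IP address.
            mac = self.network.broadcast_arp_request(ip)
            if mac is None:
                raise LookupError(f"no device answered for {ip}")
            # Store the reply for future reference.
            self.arp_table[ip] = mac
        return mac
```

A second call to resolve() for the same IP address is served from the table, so no further broadcast is issued.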
The MAC address can be thought of as uniquely identifying the physical hardware of the network resource (i.e. each NIC or NIC port providing a link to the network has its own unique MAC address) whereas the protocol address (e.g. IP address) identifies an interface exposed to the instance of the network protocol software residing in the O/S of the host device. For a team of two or more NICs/NIC ports, the team's shared IP address is always resolved to a single MAC address (i.e. the response to an ARP on the team IP address is always the same team MAC address) so that it looks like a single physical interface to other devices on the network. This team MAC address can be any one of the MAC addresses uniquely associated with one of the individual team members. On the transmit side, the packets generated by the local applications can be transmitted through any one of the members of the team to achieve load-balancing of outgoing traffic. This is known as transmit load balancing (TLB). On the receive side, all traffic is received by the single NIC/NIC port having the team MAC address as its own.
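The asymmetry between the transmit and receive sides of a team can be sketched as follows. The Team class is hypothetical, and the round-robin choice of transmit port is an assumption made for brevity; the text does not specify how outgoing packets are distributed, and an actual teaming driver might, for example, select a port per connection.

```python
from itertools import cycle


class Team:
    """Hypothetical team of NIC ports sharing one IP address and one team MAC."""

    def __init__(self, ip, member_macs):
        self.ip = ip
        self.members = list(member_macs)
        # Assumption: the first member's MAC address serves as the team MAC.
        self.team_mac = self.members[0]
        self._rotation = cycle(self.members)

    def arp_reply(self, requested_ip):
        # The team IP address always resolves to the same team MAC address.
        return self.team_mac if requested_ip == self.ip else None

    def pick_transmit_port(self):
        # Transmit load balancing: any member may send (round-robin here).
        return next(self._rotation)

    def receive_port(self):
        # All inbound traffic arrives at the port that owns the team MAC.
        return self.team_mac
```

Outgoing traffic is spread over every member, while peers on the network only ever learn, and send to, the single team MAC address.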
Source and destination addresses for packets are derived during the establishment of peer-to-peer virtual connections between applications running on different network devices. The connections are defined by two (e.g. local and remote) endpoints, each endpoint designated by an IP address (or host number) and a port number. The IP address or host number for each endpoint identifies a particular interface to an instance of TCP/IP running on each of the devices between which the connection is established. The port numbers identify the applications, one running on each of the devices, between which the data transferred over the connection is exchanged. The transport address information (i.e. host #, port #) defining each endpoint becomes the source and destination tuple or transport address within each packet transmitted over that connection. As a point of reference, the transport address for the server node is referred to herein as the destination transport address and the transport address for the client node is referred to as the source transport address.
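The endpoint naming used here can be made concrete with a small data structure. This is an illustrative sketch; the type names, field names, and the addresses in the test values are hypothetical and are not taken from any particular Sockets implementation.

```python
from typing import NamedTuple


class TransportAddress(NamedTuple):
    """One endpoint of a connection: (host #, port #)."""
    host: str  # IP address or host number of the interface
    port: int  # identifies the application at that endpoint


class Connection(NamedTuple):
    """A virtual connection defined by its two endpoints."""
    source: TransportAddress       # client node, per the convention above
    destination: TransportAddress  # server node, per the convention above

    def packet_addresses(self):
        # Address fields carried by each packet sent from client to server.
        return {
            "src_ip": self.source.host,
            "src_port": self.source.port,
            "dst_ip": self.destination.host,
            "dst_port": self.destination.port,
        }
```

Because both types are tuples, a connection can be used directly as a lookup key when per-connection state needs to be located for an arriving packet.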
Because there is an ever-increasing demand for maximum network performance and availability, particularly with the advent of applications such as clustered database servers and clustered application servers, more and more data must be handled by the systems acting as servers in these types of applications. This can include the sharing of large amounts of data among the processing nodes of the cluster. The demands of such applications have motivated computer system developers to team or aggregate network interface resources such as NICs/NIC ports both to increase the data throughput rate at the network interface and to provide fault tolerance for improved system availability. Load-balancing the data over the teamed resources has also been employed to optimize the aggregated throughput.
Teaming network resources has led to increased data throughput at the network interface. The ever-increasing level of Central Processor Unit (CPU) performance has further improved network device performance. Notwithstanding, the overall impact of these improvements on network performance has been tempered by the fact that these improvements have significantly outpaced improvements in memory access speed. Memory access speed has become the predominant limiting factor. Additionally, an ever-increasing percentage of CPU processing capacity is now being devoted to processing network I/O transactions. As previously described, this includes both packet formatting/de-formatting operations as well as data copying operations. Thus, as the amount of data to be transferred keeps increasing, the positive impact of improved processor performance and network interface throughput is limited by the commensurate increase in the number and size of copy operations and their requisite demand on memory bandwidth.
One general approach to alleviating the memory bandwidth bottleneck and the ever-increasing demand placed generally on the processing resources of the host CPU is to establish network connections that bypass the conventional protocol stack traditionally residing in the O/S (sometimes referred to herein as the conventional or O/S protocol stack). These connections can be established over a bypass protocol stack residing outside of the host operating system to facilitate direct placement of data between the buffer memories of server and client nodes over the network. Connections that bypass the O/S based protocol stack can eliminate the aforementioned copying operations and also offload from the CPU the processing overhead normally associated with the formatting and de-formatting of such transactions. The processing capacity that can be freed up by offloading these connections can be applied to other tasks such as servicing applications and users.
One set of technologies that has been developed to facilitate offloaded connections is often referred to generally as Remote Direct Memory Access (RDMA) over TCP/IP. Other technologies such as InfiniBand® also have been proposed and implemented to accomplish direct data placement (DDP), but they employ a network infrastructure that is not compatible with the existing (and widely deployed) network infrastructures such as TCP/IP over Ethernet.
Recently, the RDMA Consortium has been overseeing the development of standards by which RDMA may be implemented using TCP/IP as the upper layer protocol over Ethernet as the data link and physical layer. Various specifications for RDMA standards established by the RDMA Consortium are publicly available at www.rdmaconsortium.org. One of these technologies is a transport protocol called Sockets Direct Protocol (SDP) that extends the functionality of Sockets APIs to facilitate the establishment of both conventional TCP/IP connections as well as offloaded DDP connections. SDP emulates the semantics typically used in legacy applications written to use Sockets APIs over TCP in multiple O/S environments and therefore executes its functionality transparently with respect to legacy applications. Another such extension to Sockets API functionality is a precursor to SDP called Windows Sockets Direct (WSD) protocol, which is only available on the Windows Operating System. SDP and WSD enable legacy Sockets applications to use standard Sockets APIs such as listen, connect and accept to transparently establish offloaded connections when such connections are supported by both connecting endpoint processing nodes.
SDP and WSD are essentially libraries that intercept standard Sockets APIs and execute extended processes in response thereto to establish those offloaded connections in a manner transparent to the legacy applications. Thus, such protocol extensions enable legacy applications that speak Sockets to unwittingly set up RDMA connections between those applications when both connecting devices are configured to support them. If RDMA connections are not supported by both of the connecting nodes, the connections established between the applications running on those nodes simply default to the conventional connections established through the O/S protocol stack.
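The fallback rule in the preceding paragraph reduces to a single decision. The sketch below uses a hypothetical function name and does not reproduce the SDP or WSD interfaces; it only illustrates the choice an intercepting library would make on behalf of an unmodified application.

```python
def establish_connection(local_rdma_capable: bool, remote_rdma_capable: bool) -> str:
    """Pick the pathway for a new connection (hypothetical helper)."""
    # An offloaded connection requires support at both connecting nodes.
    if local_rdma_capable and remote_rdma_capable:
        return "offloaded"
    # Otherwise default to a conventional connection through the O/S stack.
    return "conventional"
```

The calling application sees the same result either way, which is what makes the extension transparent to legacy Sockets code.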
Physical connectivity to the network for offloaded connections (e.g. RDMA) is usually accomplished through a specialized network interface card often referred to as an RNIC (RDMA NIC). Each RNIC typically has its own protocol stack that includes its own instantiations of the upper layer protocols (e.g. TCP/IP), as well as the link layer and the physical layer for providing a physical RDMA link to the network. For an RNIC, direct data placement (DDP) protocols typically reside above the conventional upper layer protocols. The DDP protocols add placement information to outgoing packets over an offloaded connection to provide the RNIC at the receiving node with buffer name and location information for direct placement of the data into its buffer memory. In this way, the copy operations conventionally performed by the O/S are avoided because data is taken directly from a defined point in the application buffer for one peer application, is transmitted over the network, and then is directly placed at a defined point into the application buffer of another peer application. Likewise, the DDP protocols at the receiving end of an offloaded connection decode the placement information for direct data placement. Each RNIC also maintains connection state information for each connection established through it that facilitates communication with the user space and coordinates transfer of the data to the application that is the target of the directly placed data.
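Direct data placement can be illustrated with a simplified receiver. This is a sketch under stated assumptions: the ReceivingRNIC class, the make_segment helper, and the dictionary layout of a segment are all hypothetical, and a Python bytearray stands in for a registered application buffer.

```python
class ReceivingRNIC:
    """Hypothetical receiver that places payloads straight into named buffers."""

    def __init__(self):
        self.buffers = {}

    def register_buffer(self, name: str, size: int) -> bytearray:
        # The application advertises a buffer by name before data arrives.
        self.buffers[name] = bytearray(size)
        return self.buffers[name]

    def place(self, segment: dict) -> None:
        # Decode the placement information carried with the payload.
        buf = self.buffers[segment["buffer"]]
        offset, payload = segment["offset"], segment["payload"]
        if offset < 0 or offset + len(payload) > len(buf):
            raise ValueError("placement falls outside the registered buffer")
        # Write directly into the application buffer; no intermediate copy.
        buf[offset:offset + len(payload)] = payload


def make_segment(buffer_name: str, offset: int, payload: bytes) -> dict:
    """Sender side: attach buffer name and location to the outgoing payload."""
    return {"buffer": buffer_name, "offset": offset, "payload": payload}
```

Because each segment names its own destination, segments may be placed in any order of arrival and the application buffer still ends up with the data at the intended positions.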
It is possible that not all devices on a particular network will be capable of supporting offloaded connections. In that case, a server may receive requests from some clients that are not properly configured to support offloaded connections and therefore seek conventional connections through the server's O/S protocol stack. Hybrid RNICs are commercially available that can present a single physical interface to the network, but provide two distinct internal pathways over which data connections may be established and through which packets transmitted over those connections may be processed. One internal pathway supports the establishment of conventional connections as previously described, while the other supports establishment of offloaded connections.
The pathway for traditional connections can provide the conventional components previously described, including NIC drivers, the software by which interfaces to the conventional O/S protocol stack are established and managed, as well as instances of an intermediate driver by which teams of NICs may be established that load-balance transmitted data and perform failover to achieve redundancy. The pathway for offloaded connections through each RNIC can include separate instances of the protocol stack and direct data placement protocols by which the offloaded connections may bypass the O/S stack as previously described. Combining the two types of traffic through the same physical interface presents significant challenges when attempting to implement failover and load-balancing of the conventional traffic, particularly when the RNICs are also teamed for purposes of aggregating their offloaded connection capacities.