Computers and other devices are commonly interconnected to facilitate communication among one another using any one of a number of available standard network architectures and any one of several corresponding and compatible network protocols. The physical nature of standard architectures and their topologies is typically dictated at the first two layers of the OSI (Open Systems Interconnection) Basic Reference Model for networks; they are known as the physical layer (layer 1) and the data link layer (layer 2). One of the most commonly deployed of such standard architectures is the Ethernet® network architecture. Other types of network architectures that are less widely implemented include ARCnet, Token Ring and FDDI. Variations of the Ethernet® standard are differentiated from one another based on characteristics such as maximum throughput (i.e. the highest data transmission rate) of devices coupled to the network, the type of medium used for physically interconnecting the devices (e.g. coaxial cable, twisted pair cable, optical fibers, etc.) to the network and the maximum permissible length of the medium.
Network connection speeds have been increasing at a substantial rate. The 10Base-T and 100Base-T Ethernet® standards, for example, designate a maximum throughput of 10 and 100 Megabits per second respectively, and are coupled to the network over twisted pair cable. The 1000Base-T (or Gigabit) Ethernet® standard designates a maximum throughput of 1000 Mbps (i.e. a Gigabit per second) over twisted pair cable. Continued advancement in the speed of integrated circuits has facilitated the development of even faster variations of the Ethernet® network architecture, such as one operating at 10 Gigabits per second (10 Gbps) and for which the transmission medium is typically optical fibers. Of course, the greater the throughput, the more expensive the network resources required to sustain that throughput. Ethernet® is a registered trademark of Xerox Corporation.
Packet switched network protocols are often employed over the physical and link layers described above. They dictate the formatting of data into packets by which data can be transmitted over the network using virtual connections established between peer applications running on devices coupled to the network. They also dictate the manner in which these virtual connections are established and torn down. These protocols are defined by layer 3 (network layer) and layer 4 (transport layer) of the OSI and typically reside in the operating system of the host computer system. Thus, the operating system traditionally executes instances of the transport protocols to perform the processes required to establish and manage virtual connections between peer applications running on the nodes of the network at the behest of those applications (layer 4). Further, the O/S executes instances of the network protocols to format/deformat payload data derived from the connected applications in preparation for transmitting/receiving the data over the network on behalf of the connected peer applications (layer 3).
Traditionally, data to be transmitted by the local application to the remote node over such a connection is first copied from an application buffer in the host memory to a temporary protocol buffer, and it is this copy that is then formatted and transmitted by the protocol stack out over the network. Likewise, data received by the host over the network from the remote application is de-formatted and a copy of the data is then stored in a protocol buffer. The target application is then notified of the availability of the de-formatted data, which is eventually copied into the application buffer in the host memory by the O/S at the request of the destination application.
The upper layer protocols (i.e. the network and transport layers) are typically independent of the lower layers (i.e. the data link and physical layers) by virtue of the hierarchical nature of the OSI. Examples of network layer protocols include Internet Protocol (IP), Internetwork Packet eXchange (IPX), NetBEUI and the like. NetBEUI is short for NetBIOS Enhanced User Interface, and is an enhanced version of the NetBIOS protocol used by network operating systems such as LAN Manager, LAN Server, Windows® for Workgroups, Windows®95 and Windows NT®. Windows® and Windows NT® are registered trademarks of Microsoft Corporation. NetBEUI was originally designed by IBM for IBM's LAN Manager Server and later extended by Microsoft and Novell. TCP is a commonly deployed transport protocol involved in the establishment and management of virtual connections between peer applications as previously discussed. TCP/IP is a layer4/layer3 combination commonly used in Internet applications, or in intranet applications such as a local area network (LAN).
One of the most basic and widely implemented network types is the Local Area Network (LAN). In its simplest form, a LAN is a number of devices (e.g. computers, printers and other specialized peripherals) connected to one another by some form of signal transmission medium such as coaxial cable to facilitate direct peer-to-peer communication therebetween. A common network paradigm, often employed in LANs as well as other networks, is known as the client/server paradigm. This paradigm involves coupling one or more large computers (typically having very advanced processing and storage capabilities) known as servers to a number of smaller computers (such as desktops or workstations) known as clients, and to other peripheral devices shared by those computers.
Applications running on the client nodes send requests over the network to one or more servers to access service applications running on the server. These service applications facilitate operations such as centralized information storage and retrieval, database management and file transfer functions. Servers may also be used to provide centralized access to other networks and to various other services as are known to those of skill in the art. The applications running on the servers provide responses over the network to the clients in response to their applications' requests. These responses often involve large transfers of data. Clients and/or servers can also share access to peripheral resources, such as printers, scanners, and the like over the network.
More generally, a server can be deemed any processing node on the network that provides service applications to which applications running on other processing nodes may request connections, and a client can be deemed any processing node that is requesting such connections. It therefore follows that any processing node can be both a server and a client, depending upon its behavior at any moment. To establish a consistent point of reference for further discussions, a processing node henceforth will be deemed a server when listening for and accepting connections (i.e. acting as a connectee) and will be deemed a client when its applications are requesting connections to such applications on another node.
Network interface resources are typically required to physically couple computers and other devices to a network. These interface resources are sometimes referred to as network adapters or network interface cards (NICs). Each adapter or NIC has at least one bi-directional port through which a physical link can be provided between the network transmission medium and the processing resources of the network device. Data is communicated (as packets in the case of packet switched networks) between the virtually connected applications running on two or more network devices. The data is electronically transmitted and received through these interface resources and over the media used to physically couple the devices together. The network adapters typically provide the data link and physical layers of the interconnect standard. Adapter cards or NICs are commercially available in various product configurations that are designed to support one or more variations of standard architectures and known topologies.
Each of the network devices typically includes a bus system through which the processing resources of the network devices may be coupled to the NICs. The bus system is usually coupled to the pins of edge connectors defining sockets for expansion slots. The NICs are coupled to the bus system of the network device by plugging the NIC into the edge connector of the expansion slot. In this way, the processing resources of the network devices are in communication with any NICs or network adapter cards that are plugged into the expansion slots of that network device. As previously mentioned, each NIC or network adapter must be designed in accordance with the standards by which the network architecture and topology are defined to provide appropriate signal levels and impedances (i.e. the physical layer) to the network. This of course includes an appropriate physical connector for interfacing the NIC to the physical transmission medium employed for the network (e.g. coaxial cable, twisted-pair cable, fiber optic cable, etc.).
Each device on a network is identified by one or more “publicly” known addresses by which other devices on the network know to communicate with it. Each address corresponds to one of the layers of the OSI model and is embedded in the packets for both the source device that generated the packet as well as the destination device(s) for which the packet is intended. For Ethernet networks, a network device will use an address at layer 2 (the data link layer) known as a MAC (media access control) address to differentiate between the NICs and/or NIC ports included in the expansion slots of the network device. In addition, one or more protocol addresses at layer 3 (the network layer, e.g. IP, IPX, AppleTalk, etc.) known as a host number (for IP this is often referred to as an “IP address”) are used to identify each of one or more instances of the network layer protocol(s) running on the device.
Each of the network devices can have multiple NICs/NIC ports, each of which can operate independently or that may be teamed as a single virtual NIC port. When operating individually, each NIC or NIC port is typically coupled to a separate network or sub-network, and each exposes an interface to the instance of IP (or other network protocol) that is part of the protocol stack residing in the O/S. Each exposed interface to the instance of IP is usually associated with its own IP address. Therefore devices having NICs coupled to different networks or sub-networks (i.e. residing in different domains) typically will be addressed using different host numbers within those different domains. Two or more NICs/NIC ports can be teamed together to aggregate resources, balance traffic over the team members and provide fault tolerance. In this case, an intermediate driver is implemented that makes the individual NIC drivers look like a single driver to a shared instance of IP. Thus, all members of a team share at least one IP address in a given domain. A single NIC/NIC port or a team of NICs/NIC ports can also be shared over two or more networks or sub-networks through a switch. Although this could be accomplished by interfacing multiple instances of IP to the single NIC/NIC port or team, a more secure method of doing this over an Ethernet network is to implement VLANs through a VLAN switch. Each VLAN assigned to the NIC or NIC team is interfaced to the single instance of IP through a virtual interface for that VLAN.
Each NIC or NIC port is associated with its own MAC address and devices on an Ethernet network communicate directly by first resolving IP addresses to MAC addresses. Thus, the MAC address can be thought of as being assigned to uniquely identify the physical hardware of the device (i.e. each adapter or NIC port providing a link to the network has its own MAC address) whereas the host number is assigned to an instance of the network protocol software of the host device. For a team of two or more NICs/NIC ports, the team's shared IP address is always resolved to a single MAC address on the network side so that it looks like a single virtual interface to other devices on the network. This team MAC address can be any one of the MAC addresses associated with one of the individual team members. On the transmit side, the packets generated by the local applications can be resolved to any one of the members of the team to achieve load balancing of outgoing traffic. This is known as transmit load balancing (TLB). On the receive side, the team IP address is always resolved to the team MAC address and thus all traffic is received by the NIC port having the team MAC address as its own.
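The transmit and receive behavior just described can be sketched as follows. This is a minimal illustration only: it assumes that a hash of the destination address selects the transmitting team member, and the MAC values and the names `TEAM_MACS`, `pick_tx_member` and `rx_member` are hypothetical rather than taken from any particular teaming driver.

```python
import zlib

# Hypothetical team: MAC addresses of the member ports. The first entry is
# assumed to serve as the team MAC address.
TEAM_MACS = ["00:11:22:33:44:01", "00:11:22:33:44:02", "00:11:22:33:44:03"]
TEAM_MAC = TEAM_MACS[0]


def pick_tx_member(dest_ip, members=TEAM_MACS):
    """Choose the team member that transmits packets bound for dest_ip.

    Hashing the destination keeps traffic to one destination on one port
    while spreading different destinations across the team.
    """
    index = zlib.crc32(dest_ip.encode("ascii")) % len(members)
    return members[index]


def rx_member():
    """All inbound traffic arrives at the port that owns the team MAC."""
    return TEAM_MAC
```

Outgoing traffic is spread across the members, while `rx_member` always returns the same port, which mirrors the asymmetry between the transmit and receive sides.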
As described above, devices coupled over Ethernet® networks by network adapters communicate (i.e. route packets between them) using their respective MAC (i.e. layer 2) addresses which identify particular NICs or NIC ports. This is true even though the applications running on such network devices initiate communication (i.e. establish a connection) between one another by specifying the public host numbers (or IP addresses) of those nodes rather than MAC addresses associated with particular NICs/NIC ports. This requires that Ethernet® devices first ascertain the MAC address corresponding to the particular IP address identifying the destination device. For the IP protocol, this is accomplished by first consulting a cache of MAC address/host number pairs maintained by each network device. If an entry for a particular host number is not there, a process is initiated whereby the sending device broadcasts a request to all devices on the network for the device identified by the destination host number to send back the MAC address for the NIC port connecting the device to the network or subnet. This process is known as ARP (Address Resolution Protocol), the result of which is then stored in the cache.
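The cache-then-broadcast lookup described above can be sketched as follows. The class name `ArpCache` and the `broadcast_request` callback are assumptions made for illustration; the callback stands in for the actual broadcast and reply exchange on the network.

```python
class ArpCache:
    """Cache of host number (IP address) to MAC address pairs."""

    def __init__(self, broadcast_request):
        # broadcast_request(ip) stands in for broadcasting an ARP request and
        # returning the MAC address from the reply, or None if nobody answers.
        self._broadcast_request = broadcast_request
        self._entries = {}

    def resolve(self, ip):
        mac = self._entries.get(ip)
        if mac is None:
            # Cache miss: ask every device on the network.
            mac = self._broadcast_request(ip)
            if mac is not None:
                # Store the result so later lookups skip the broadcast.
                self._entries[ip] = mac
        return mac
```

A second call to `resolve` for the same host number is answered from the cache, so only the first lookup generates broadcast traffic.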
An ARP request is formed by embedding the source and destination MAC addresses, which are at least 48 bits, as well as the source and destination host numbers in the payload of the packet so that the receiving device knows to which device to respond. Thus, in the example case where a single NIC exposes three interfaces with the instance of the IP protocol residing in the operating system, the ARP process resolves all three IP addresses to the same MAC address. In the case of a team of NICs sharing an IP address, only one MAC address (the team MAC address) is used for a team when responding to an ARP request. Once the packets are received by the one of the NICs designated by the destination MAC address of the packets (either a single independent NIC or the one designated to receive packets on behalf of a team) the packets are provided to the appropriate interface to the instance of IP based on the destination IP address. To load balance received packets, a network switch must be used that implements a load balancing algorithm by which it distributes the received packets to each of the team members even though they all contain the same destination MAC address. This is accomplished when the switch actually changes the destination MAC address for a packet to target a particular NIC of the team, and it can therefore do so for all of the packets destined for the team in a manner which distributes the packet traffic across the entire team. For the IPX protocol, the ARP process is not required because the MAC address is a constituent of the IPX address.
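As an illustration of the ARP request described at the start of this passage, the following sketch builds an Ethernet frame that carries an IPv4 ARP request. The field order follows the standard ARP layout for Ethernet and IPv4; the helper names `mac_bytes` and `build_arp_request` are hypothetical, and the sketch only constructs the bytes without sending them.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6  # destination of an ARP request: every device


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(src_mac, src_ip, target_ip):
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, mac_bytes(src_mac), 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                          # hardware type: Ethernet
        0x0800,                     # protocol type: IPv4
        6,                          # hardware address length in bytes
        4,                          # protocol address length in bytes
        1,                          # operation: request
        mac_bytes(src_mac),         # sender MAC address
        socket.inet_aton(src_ip),   # sender host number
        b"\x00" * 6,                # target MAC: unknown, this is the question
        socket.inet_aton(target_ip),  # target host number
    )
    return eth_header + arp_payload
```

The device that owns `target_ip` replies with its own MAC address in the sender field, which is the value the requester then caches.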
There are three types of layer 3 addresses. A directed or unicast packet includes a specific destination address that corresponds to a single network device. A multicast address corresponds to a plurality of devices on a network, but not all of them. A broadcast address, used in the ARP process for example, corresponds to all of the devices on the network. A broadcast bit is set for broadcast packets, where the destination address is all ones (1's). A multicast bit in the destination address is set for multicast packets. The source and destination addresses are derived from the peer-to-peer virtual connections established between applications running on different network devices, each of which is defined by two (e.g. local and remote) endpoints. For example, each endpoint identifies a particular instance of TCP/IP via the public host number corresponding thereto, and a port number associated with each of the applications between which the connection is made. This transport address information (i.e. host #, port #) defining each endpoint becomes the source and destination tuple or transport address within each packet transmitted over that connection. As a point of reference, the transport address for the server node is referred to herein as the destination transport address and the transport address for the client node is referred to as the source transport address.
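The unicast, multicast and broadcast distinction drawn at the start of this passage has a direct counterpart in Ethernet destination MAC addresses, which is what the sketch below classifies. It assumes the standard Ethernet convention that the least significant bit of the first octet marks a group (multicast) address and that an all-ones address is the broadcast address; the function name `classify_mac` is hypothetical.

```python
def classify_mac(mac):
    """Return 'broadcast', 'multicast' or 'unicast' for an Ethernet MAC."""
    octets = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(octets) != 6:
        raise ValueError("an Ethernet MAC address has exactly 6 octets")
    if octets == b"\xff" * 6:
        return "broadcast"      # all ones: every device on the network
    if octets[0] & 0x01:
        return "multicast"      # group bit set: a subset of devices
    return "unicast"            # a single device
```

The broadcast check comes first because the all-ones address also has the group bit set.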
Typically, a service type application running on a local processing node must first establish the fact that it is running and is ready to accept connections with peer applications running on remote processing nodes of the network. This process is sometimes referred to as establishing a listening socket at the transport layer (e.g. TCP). This listening socket specifies a transport address that includes a host number or public IP address by which the local node is identified on the network and a port number that uniquely distinguishes the listening application from other applications running on the node. A remote node wishing to access this application as a client will typically first establish a connecting socket of its own at its TCP layer. The connecting socket is a transport address that includes a host number or public IP address that identifies the client node on the network and a port number uniquely identifying the peer application seeking the connection. The client node then sends a request over the network to the server node to establish a connection between the requesting peer and the listening application, specifying the connecting and listening sockets as endpoints for the connection. The connection is then established through an acknowledgement process, after which packets may be exchanged between the applications with each packet specifying the client and server transport addresses as source and destination endpoints, respectively.
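The listening and connecting sockets described above can be demonstrated with the standard Sockets API. The sketch below is illustrative only: it assumes both endpoints run in one process over the loopback address, and the function name `connect_over_loopback` is hypothetical.

```python
import socket


def connect_over_loopback():
    """Establish one TCP connection and return both transport addresses."""
    # Server side: the listening socket binds a transport address
    # (host number, port number). Port 0 lets the O/S pick a free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server_addr = listener.getsockname()

    # Client side: the connecting socket gets its own transport address
    # when it requests a connection to the listening socket.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    client.connect(server_addr)
    accepted, peer_addr = listener.accept()
    try:
        return {
            "server": server_addr,
            "client": client.getsockname(),
            "client_as_seen_by_server": peer_addr,
        }
    finally:
        accepted.close()
        client.close()
        listener.close()
```

The two (host number, port number) pairs returned here are the endpoints that every packet on the connection carries as its source and destination transport addresses.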
There is an ever-increasing demand for maximum network performance and availability. The advent of applications such as clustered database servers and clustered applications servers requires more and more data to be handled by the servers, including the sharing of large amounts of data among the processing nodes of the cluster. Such applications have motivated computer system developers to team or aggregate network interface resources such as NICs/NIC ports both to increase the data throughput rate at the network interface as well as to provide fault tolerance for improved system availability. For a description of techniques used in support of NIC teaming to achieve increased throughput and/or fault tolerance, see for example U.S. Pat. No. 6,272,113 entitled “Network Controller System that uses Multicast Heartbeat Packets,” which was issued on Aug. 7, 2001.
Although the teaming of network resources has led to increased data throughput at the network interface, and the ever-increasing level of Central Processor Unit (CPU) performance has improved network device performance, their overall impact on network performance has been tempered by the fact that these improvements have significantly outpaced improvements in memory access speed, which has become the predominant limiting factor. Additionally, an ever-increasing percentage of CPU processing capacity is now being devoted to processing network I/O. As previously mentioned, this processing includes both packet formatting/de-formatting operations and data copying operations. Thus, as the amount of data to be transferred keeps increasing, the positive impact of processor performance and network interface throughput is limited because the number of these copy operations and their requisite demand on memory bandwidth increase commensurately.
One general approach to alleviating the memory bandwidth bottleneck and the ever-increasing demand placed generally on the processing resources of the host CPU is to establish connections that bypass the traditional protocol stack (sometimes referred to herein as the O/S protocol stack) residing in the O/S. Instead, connections are established over a bypass protocol stack residing outside of the host operating system and these offloaded connections facilitate direct placement of data between buffer memory of server and client nodes over the network. Connections that bypass the O/S based protocol stack eliminate the need for the aforementioned copying operations and also offload from the CPU the processing overhead normally associated with the formatting and de-formatting of such transactions. These offloaded connections permit the CPU of the computer system to apply freed up processing capacity to service applications and users.
One example of a set of technologies that has been developed to facilitate this technique of providing offloaded connections is often referred to generally as Remote Direct Memory Access (RDMA) over TCP/IP. Other technologies such as InfiniBand® have been proposed and implemented to accomplish direct data placement (DDP), but they typically use a network infrastructure that is not compatible with the existing (and widely deployed) network infrastructures such as TCP/IP over Ethernet.
Recently, the RDMA Consortium has been overseeing the development of standards by which RDMA may be implemented using TCP/IP as the upper layer protocol over Ethernet as the data link and physical layer. Various specifications for RDMA standards established by the RDMA Consortium are publicly available at www.rdmaconsortium.org. One of these technologies is a transport protocol called Sockets Direct Protocol (SDP) that extends the functionality of Sockets APIs to facilitate the establishment of both conventional TCP/IP connections as well as offloaded DDP connections. SDP emulates the semantics typically used in legacy applications written to use Sockets APIs over TCP in multiple O/S environments and therefore executes its functionality transparently with respect to legacy applications. Another such extension to Sockets API functionality is a precursor to SDP called Windows Sockets Direct (WSD) protocol, which is only available on the Windows Operating System. SDP and WSD permit legacy Sockets applications to use standard Sockets APIs such as listen, connect and accept to transparently establish offloaded connections when such connections are supported by both connecting endpoint processing nodes.
SDP and WSD are essentially libraries that intercept standard Sockets APIs and execute extended processes in response thereto to establish those offloaded connections in a manner transparent to the legacy applications. Thus, such protocol extensions enable legacy applications that speak Sockets to unwittingly set up RDMA connections between those applications when both connecting devices are configured to support them. If RDMA connections are not supported by both of the connecting nodes, the connections established between the applications running on those nodes simply default to the traditional connections established through the O/S protocol stack.
Physical connectivity to the network for offloaded connections (e.g. RDMA) is typically accomplished through a specialized network interface card often referred to as an RNIC. Each RNIC has its own protocol stack that includes its own instantiations of the upper layer protocols (e.g. TCP/IP), as well as the link layer and the physical layer for providing a physical RDMA link to the network. For an RNIC, direct data placement (DDP) protocols reside above the traditional upper layer protocols. The DDP protocols add placement information to outgoing packets over an offloaded connection to provide the RNIC at the receiving node with buffer name and location information for direct placement of the data into its buffer memory. In this way, the copy operations traditionally performed by the O/S are avoided because data is taken directly from a defined point in the application buffer for one peer application, is transmitted over the network, and then is directly placed at a defined point into the application buffer of another peer application. Likewise, the DDP protocols at the receiving end of an offloaded connection decode the placement information for direct data placement. Each RNIC also maintains connection state information for each connection established through it that facilitates communication with the user space and coordinates transfer of the data to the application that is the target of the directly placed data, and also coordinates transfer of data from the source application to be transmitted out over the network as well.
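The direct data placement described above can be sketched as follows. This is a simplified model, not the actual DDP wire format: it assumes each incoming segment carries a buffer name, an offset and a payload, and the class name `ReceiveSide` and its methods are hypothetical.

```python
class ReceiveSide:
    """Models the receiving RNIC placing data straight into app buffers."""

    def __init__(self):
        self.buffers = {}  # buffer name -> registered application buffer

    def register(self, name, size):
        """Register an application buffer that peers may place data into."""
        self.buffers[name] = bytearray(size)
        return self.buffers[name]

    def place(self, buffer_name, offset, payload):
        """Write payload at the named location, with no intermediate copy."""
        buf = self.buffers[buffer_name]
        end = offset + len(payload)
        if offset < 0 or end > len(buf):
            raise ValueError("segment falls outside the registered buffer")
        buf[offset:end] = payload
```

Because every segment names its own destination, segments can be placed in any arrival order; for example, placing `b"WXYZ"` at offset 4 and then `b"abcd"` at offset 0 of an 8-byte buffer yields `b"abcdWXYZ"`.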
When an RNIC is used as a bypass stack through which offloaded connections may be established, the RNIC (i.e. the bypass stack) must somehow be differentiated from the conventional NIC(s) providing the traditional O/S stack (e.g. TCP/IP) connections over the network. Put another way, the lower level protocols of the RNIC must be able to differentiate packets destined for direct data placement from those intended for conventional connections through the O/S. One solution that has been employed to differentiate between packets destined for one of the two stacks is a port mapping technique that associates with each application two port numbers, one for purposes of establishing an endpoint for a connection over the O/S stack and a second one mapped from the first for establishing an endpoint for a connection over the bypass stack. Thus, the transport address used to establish the endpoints for an offloaded connection to a particular application employs one of the public IP addresses along with the second port number to identify each client and server application so connected. Those of skill in the art will recognize that this port mapping will not be required in any situation in which there is only one stack, including where there is only an offloaded stack provided by an RNIC.
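The port mapping technique described above can be sketched as a table that pairs each application's O/S stack port with a second port used for the bypass stack. The class name `PortMapper`, the starting port value and the sequential allocation policy are assumptions made for illustration only.

```python
class PortMapper:
    """Associates each O/S stack port with a second, bypass stack port."""

    def __init__(self, first_bypass_port=50000):
        self._next = first_bypass_port  # assumed range reserved for bypass use
        self._map = {}                  # O/S port -> bypass port

    def map_port(self, os_port):
        """Return the bypass port for os_port, allocating one if needed."""
        if os_port not in self._map:
            self._map[os_port] = self._next
            self._next += 1
        return self._map[os_port]

    def stack_for(self, dest_port):
        """Decide which stack should handle a packet by destination port."""
        return "bypass" if dest_port in self._map.values() else "os"
```

A packet addressed to the mapped port is steered to the bypass stack, while a packet addressed to the original port continues through the O/S stack, even though both carry the same IP address.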
In the past, if more than one RNIC was employed at a processing node having both an O/S and a bypass stack, each RNIC was coupled to a different network or sub-network, and therefore the IP address used to contact the node publicly over each of those networks was different. Under this scenario, the second port number is still sufficient to differentiate between the two types of connections for each network or sub-network because the local endpoints used to define those connections have different IP addresses even though they have the same second port numbers.
Of course, the offload capacity of a single RNIC may be limited to a certain number of RDMA connections based on the available resources of the particular RNIC. Moreover, RNICs can fail just as conventional NICs can. Thus, the same motivations exist for teaming or aggregating the connection capacities for two or more RNIC resources as those for aggregating the resources of standard NICs: the desire to increase the throughput of the computer system at the RDMA network interface and/or to provide fault tolerance to improve system availability. Of course, balancing the connections over the team of RNICs is desirable, just as it is desirable to balance data traffic over teams of standard NICs.
Teaming or aggregating RNICs, however, cannot be accomplished in the manner heretofore used for standard (sometimes referred to as “dumb”) Ethernet NICs. As previously discussed, standard Ethernet NICs that are teamed share the same instance(s) of IP residing in the O/S. In traditional NIC teaming, the shared instance of IP in the O/S handles packets received and/or transmitted through all members of the team; the instance of IP is oblivious as to which of the team members receives or transmits a particular packet (the MAC addresses are not presented to the IP layer of the stack). To the instance(s) of TCP and IP residing in the operating system, the team of NICs looks like one virtual NIC through an interposed teaming driver that makes the individual drivers of the team appear as a single virtual NIC driver to the shared instances of TCP and IP. To the other processing nodes on the network, the team of NICs looks like a single virtual NIC because it is addressed through that shared IP address.
This approach is not applicable for aggregating RNICs because an RNIC must maintain the states of all of the connections it is handling. For conventional connections, by contrast, connection state is maintained within the operating system, which handles the process by which data is transferred from the kernel to the specific applications in the user space. Because each RNIC connection bypasses the O/S, that state information must be maintained locally for each RNIC. This requires that any packets traveling over an established offloaded connection always traverse the same pair of RNICs at the two connecting nodes from establishment to dissolution of the connection. Otherwise, the data received will have no context by which to get it to the right application. Thus, aggregation of a plurality of RNICs as one virtual RNIC requires that the RNICs in the team be differentiated from one another because they do not share instances of TCP/IP and connection state in the manner that traditional (dumb) NICs do.
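The constraint described above can be illustrated with a small model in which each RNIC keeps its own private connection table. The class name `Rnic`, its methods and the example addresses are hypothetical; the sketch only shows why a packet arriving at the wrong RNIC has no context.

```python
class Rnic:
    """Models an RNIC that holds connection state locally."""

    def __init__(self, name):
        self.name = name
        self.state = {}  # (src ip, src port, dst ip, dst port) -> target app

    def establish(self, conn, app):
        """Record state for an offloaded connection set up through this RNIC."""
        self.state[conn] = app

    def receive(self, conn, payload):
        """Deliver payload to the owning application, if this RNIC knows it."""
        app = self.state.get(conn)
        if app is None:
            return None  # no context: the connection lives on another RNIC
        return (app, payload)
```

If a connection is established through one `Rnic` instance, a packet for that connection presented to a second instance returns `None`, because the two do not share state the way teamed conventional NICs share the O/S stack.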