Data communication in a computer network involves the exchange of data traffic between two or more entities interconnected by communication links. These entities are typically software programs executing on hardware computer platforms which, depending on their roles within the network, may serve as end nodes or intermediate network nodes. Examples of end nodes may include client and server computers coupled to the communication links, whereas the intermediate nodes may include routers and network switches that interconnect those links. A server is a special-purpose computer configured to provide specific services; when operating according to a client/server model of information delivery, the server may be configured to allow many clients to access its services. Each client may request those services by issuing protocol messages (in the form of packets) to the server over the communication links, such as a point-to-point or shared local area network (LAN) medium. The server then responds to the client request by returning the requested services in the form of packets transported over the network medium.
The server may include a plurality of ports or physical interfaces coupled to the communication links, wherein each interface is assigned at least one Internet protocol (IP) address and one media access control (MAC) address. A virtual interface or aggregate comprises an aggregation of the physical interfaces and their links. When logically combined as an aggregate, each physical interface responds to at least one IP address and to at least one common MAC address. Aggregation of physical links into a single virtual interface is well known and described in IEEE Standard 802.3ad, which is hereby incorporated by reference as though fully set forth herein.
All network entities, such as clients and switches, view the aggregate as a single network interface that provides a high data transfer rate to and from the server. In other words, the entities view the aggregate as a linear multiple of all the underlying physical interfaces and links. The physical interfaces of an aggregate advertise their common MAC address using, e.g., an address resolution protocol (ARP) to update a MAC table on a switch coupled to the server. The switch uses the table to determine the MAC address to which each physical interface on the server responds. When forwarding client data traffic directed to the server, the switch may utilize any of the aggregated physical links to transport that data traffic. However, to provide the high data transfer rate, the data traffic “load” should be balanced across all the underlying physical links.
Conventional load balancing algorithms are used to uniformly distribute data traffic over all of the underlying physical links of an aggregate, thereby increasing the bandwidth efficiency of those links. Since the server generally responds to requests issued by a client, the switch may determine the type of load balancing policy applied to the aggregate. That is, the server may deliver its response over the same link that was used to receive the request. An example of a load balancing policy is an algorithm based on the MAC addresses of the clients serviced by the server. The switch employs this load balancing algorithm to map a client MAC address to a physical link of the aggregate. This same type of algorithm may be applied to another conventional load balancing policy based on the IP addresses of the clients and the server (aggregate). Here, the switch logically combines the source IP address of each client with the destination IP address of the aggregate to map that client IP address to a physical link of the aggregate.
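By way of illustration only, the two address-based policies described above may be sketched as follows. The particular combining functions are assumptions made for this example: the low-order octet of the client MAC address taken modulo the number of links, and an exclusive-OR of the source and destination IP addresses taken modulo the number of links. The function names link_for_mac and link_for_ip_pair are hypothetical, and an actual switch may use a different mapping.

```python
import ipaddress


def link_for_mac(client_mac: str, num_links: int) -> int:
    """Map a client MAC address to a link index of the aggregate.

    Assumption: the low-order octet modulo the link count is used.
    """
    octets = bytes.fromhex(client_mac.replace(":", ""))
    return octets[-1] % num_links


def link_for_ip_pair(src_ip: str, dst_ip: str, num_links: int) -> int:
    """Map a (client IP, aggregate IP) pair to a link index.

    Assumption: the addresses are "logically combined" with XOR.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % num_links


if __name__ == "__main__":
    # Hypothetical client and aggregate addresses on a 4-link aggregate.
    print(link_for_mac("00:a0:98:00:00:05", 4))
    print(link_for_ip_pair("10.0.0.7", "10.0.0.1", 4))
```

In either sketch a given client always maps to the same link, so the packets of that client stay in order on that link, but the evenness of the distribution depends entirely on the client addresses.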
Each of these conventional algorithms assumes a uniform distribution of clients throughout the network such that the mapping of client addresses to underlying physical links of the aggregate results in a substantially even distribution of traffic across all of the underlying links. However, if the clients are not uniformly distributed throughout the network, it is possible that the data traffic may be unevenly distributed over the links. For example, if a substantial amount of data traffic is forwarded by the switch over only one or two of the underlying links, the remaining links become substantially unused, thereby adversely impacting the bandwidth utilization of the aggregate. The present invention is directed, in part, to solving this problem.
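The effect of a non-uniform client distribution may be illustrated with the following sketch. It assumes, purely for the example, that each client is identified by an integer (e.g., the low-order bits of its address) and that the mapping is a simple modulo of the link count; the names link_for_client and link_load and the client identifiers are hypothetical.

```python
from collections import Counter


def link_for_client(client_id: int, num_links: int) -> int:
    """Assumed address-based mapping: client identifier modulo link count."""
    return client_id % num_links


def link_load(client_ids, num_links):
    """Return the number of clients mapped to each link of the aggregate."""
    counts = Counter(link_for_client(c, num_links) for c in client_ids)
    return [counts.get(i, 0) for i in range(num_links)]


if __name__ == "__main__":
    uniform_clients = list(range(16))                # evenly spread identifiers
    skewed_clients = [0, 4, 8, 12, 16, 20, 24, 1]    # mostly multiples of 4
    print(link_load(uniform_clients, 4))
    print(link_load(skewed_clients, 4))
```

With the evenly spread identifiers every link serves four clients, whereas with the skewed identifiers seven of the eight clients map to the first link and two links carry no traffic at all.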
Another example of a conventional load balancing policy is an algorithm based on “pure” round robin scheduling of data packets among the aggregated links. Round robin scheduling is a desirable load balancing policy for an aggregate because, in its simplest implementation, the policy specifies dividing the number of packets evenly across all the underlying links of the aggregate. That is, a first packet is sent on a first underlying link, a second packet is sent on a second underlying link and so forth, wherein the data packets are continuously cycled among all of the links. However, implementation of the round robin policy may result in retransmissions of data packets due to, e.g., glitches associated with the underlying network links or “out-of-order” delivery of packets over those links. Network glitches may arise due to hardware problems, such as failed links or shortage of memory resources on a receiver.
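A minimal sketch of the pure round robin policy described above follows. The function name round_robin and the representation of each link as a list of packets are assumptions made for illustration; the sketch ignores link failures and reordering.

```python
from itertools import cycle


def round_robin(packets, num_links):
    """Assign packets to links in strict rotation: first packet to the
    first link, second packet to the second link, and so forth."""
    links = [[] for _ in range(num_links)]
    for packet, index in zip(packets, cycle(range(num_links))):
        links[index].append(packet)
    return links


if __name__ == "__main__":
    # Seven packets, numbered 0 through 6, over a 3-link aggregate.
    for index, queue in enumerate(round_robin(list(range(7)), 3)):
        print(f"link {index}: {queue}")
```

The per-link packet counts never differ by more than one, which is the even division the policy is intended to provide, but consecutive packets of the same flow always travel over different links.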
In general, the type of data traffic served by the server may comprise user datagram protocol (UDP) or transmission control protocol (TCP) traffic. The UDP and TCP protocols are well known and described in Network Protocols, Signature Edition, by Matthew G. Naugle, McGraw-Hill, 1999, at pgs. 519-541. In the case of a file system protocol, such as the conventional network file system (NFS) protocol, the size of a typical UDP datagram is 32 kilobytes (KB). For an Ethernet medium coupling an NFS client to the server, the maximum transmission unit (MTU) size of each packet transferred over the medium is 1.5 KB. Therefore, an IP layer of the server apportions each datagram passed by a UDP protocol layer into approximately 23 fragments, wherein each fragment is transmitted over the medium as a single packet. Each fragment of a UDP datagram has the same IP identifier (ID), but has a different fragment offset number.
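The fragment count stated above may be checked with the following sketch. It assumes an MTU of exactly 1500 bytes, a 20-byte IP header without options, an 8-byte UDP header carried ahead of the 32 KB payload, and fragment payloads rounded down to a multiple of 8 bytes as IP fragmentation requires. The function name fragment and the IP ID value 0x1234 are hypothetical.

```python
def fragment(datagram_len, mtu=1500, ip_header=20, ip_id=0x1234):
    """Split a datagram of datagram_len bytes into IP fragments.

    Every fragment carries the same IP ID; the offset is expressed in
    8-byte units. Header sizes are assumptions (no IP options).
    """
    max_payload = (mtu - ip_header) // 8 * 8  # payload must be a multiple of 8
    fragments = []
    offset = 0
    while offset < datagram_len:
        size = min(max_payload, datagram_len - offset)
        fragments.append({
            "ip_id": ip_id,
            "offset": offset // 8,
            "length": size,
            "more_fragments": offset + size < datagram_len,
        })
        offset += size
    return fragments


if __name__ == "__main__":
    frags = fragment(32 * 1024 + 8)  # 32 KB of data plus the UDP header
    print(len(frags), frags[0], frags[-1])
```

Under these assumptions each full fragment carries 1480 bytes, so 22 full fragments and one short final fragment are produced, for 23 fragments in total.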
When employing pure round robin scheduling to balance the UDP datagram load over the aggregate, the fragments/packets constituting the datagram are distributed evenly over the underlying physical links. However, a glitch associated with one of the links may result in loss (“dropping”) of some of the fragments. For example, assume the server executes a round robin policy to uniformly distribute 23 fragments associated with a first UDP datagram over the underlying links of the aggregate followed by 23 fragments associated with a second UDP datagram. Assume further that because of a glitch with one of the links, some fragments of the first datagram are dropped. In accordance with the NFS protocol, all 23 fragments of the first UDP datagram must be retransmitted to the NFS client. If fragments of both the first and second datagrams are dropped because of the glitch, then all 46 fragments of both datagrams must be retransmitted. This results in substantial consumption and inefficient usage of network bandwidth. The present invention is further directed to solving this problem.
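The retransmission cost in the foregoing example may be sketched as follows. The sketch assumes that fragments are numbered consecutively across datagrams, that round robin places fragment number n on link n modulo the link count, and that the loss of any one fragment forces retransmission of its entire datagram. The function names fragments_on_link and fragments_to_retransmit are hypothetical.

```python
def fragments_on_link(total_fragments, num_links, link):
    """Sequence numbers that pure round robin places on the given link."""
    return [seq for seq in range(total_fragments) if seq % num_links == link]


def fragments_to_retransmit(dropped, frags_per_datagram=23):
    """Count fragments resent when each damaged datagram is resent whole."""
    damaged_datagrams = {seq // frags_per_datagram for seq in dropped}
    return len(damaged_datagrams) * frags_per_datagram


if __name__ == "__main__":
    # A single dropped fragment of the first datagram.
    print(fragments_to_retransmit([2]))
    # Two datagrams (46 fragments) over 4 links; every fragment on link 2 is lost.
    print(fragments_to_retransmit(fragments_on_link(46, 4, 2)))
```

Because round robin spreads every datagram over every link, a glitch on a single link damages both datagrams in the second case, and all 46 fragments are resent even though only about a quarter of them were actually lost.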
In contrast, the size of a typical TCP datagram is equal to the MTU size of the physical network medium or, e.g., 1.5 KB when transported over an Ethernet medium. Each typical TCP datagram is transmitted over the network medium as a single fragment/packet having a unique IP ID. Load balancing of TCP fragments/packets over the aggregate results in an even distribution of data packets over the underlying physical links. Yet, in the presence of heavy network traffic, the packets may arrive “out of order” at, e.g., an IP reassembly queue on the client. Specifically, a first packet transported over a first link of the aggregate may not arrive at the client before the second packet transported over a second link of the aggregate if, for example, the first link is “down” or has more pending traffic than the second link. Also, if a switch is interposed between the client and server, then the order of the packets delivered by the server is not guaranteed through the switch and onto the client because of, e.g., differing lengths of the links and pending traffic at the switch. This results in inefficient consumption of memory resources and processing delay of the packets, along with possible retransmissions of the packets and inefficient use of network bandwidth. Accordingly, the present invention is directed to increasing the efficiency of network bandwidth over the underlying links of an aggregate.
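The out-of-order arrival described above may be illustrated with a simplified model. The model assumes that one packet is sent per time tick in round robin order and that each link adds a fixed delay representing its pending traffic; the function name arrival_order and the delay values are hypothetical and do not model an actual switch.

```python
def arrival_order(num_packets, link_delays):
    """Return packet sequence numbers in the order they reach the client.

    Packet n is sent at tick n on link n % len(link_delays) and arrives
    after that link's fixed delay (an assumed stand-in for queued traffic).
    """
    arrivals = []
    for seq in range(num_packets):
        link = seq % len(link_delays)
        arrivals.append((seq + link_delays[link], seq))
    return [seq for _, seq in sorted(arrivals)]


if __name__ == "__main__":
    print(arrival_order(6, [0, 0]))  # equally loaded links
    print(arrival_order(6, [5, 0]))  # first link has more pending traffic
```

With equal delays the packets arrive in the order sent; when the first link lags by five ticks, packets sent on the second link overtake those sent earlier on the first, and the client must buffer them until the missing packets arrive.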