1. Field of the Invention
The present invention relates to networks with bundles of communication links between at least one pair of network devices, as in the core infrastructure of a large enterprise or service provider network; and, in particular, to apportioning a stream of data packets among the bundle of communication links based on physical state of the ports connected to the bundle.
2. Description of the Related Art
Networks of general purpose computer systems connected by external communication links are well known and widely used in commerce. The networks often include one or more network devices that facilitate the passage of information between the computer systems. A network node is a network device or computer system connected by the communication links.
Information is exchanged between network nodes according to one or more of many well known, new or still developing protocols. In this context, a protocol consists of a set of rules defining how the nodes interact with each other based on information sent over the communication links. The protocols are effective at different layers of operation within each node, from generating and receiving physical signals of various types, to selecting a link for transferring those signals, to the format of information indicated by those signals, to identifying which software application executing on a computer system sends or receives the information. The conceptually different layers of protocols for exchanging information over a network are described in the Open Systems Interconnection (OSI) Reference Model. The OSI Reference Model is generally described in more detail in Section 1.1 of the reference book entitled Interconnections Second Edition, by Radia Perlman, published September 1999, which is hereby incorporated by reference as though fully set forth herein.
Communications between nodes are typically effected by exchanging discrete packets of data. Each packet typically comprises 1) header information associated with a particular protocol, and 2) payload information that follows the header information and contains information to be processed, often independently of that particular protocol. In some protocols, the packet includes 3) trailer information following the payload and indicating the end of the payload information. The header includes information such as the source of the packet, its destination, the length of the payload, and other properties used by the protocol. Often, the data in the payload for the particular protocol includes a header and payload for a different protocol associated with a different, higher layer of the OSI Reference Model. The higher layer protocol is said to be encapsulated in the lower layer protocol. The headers included in a packet traversing multiple heterogeneous networks, such as the Internet, typically include a physical (layer 1) header, a data-link (layer 2) header, an internetwork (layer 3) header and a transport (layer 4) header, as defined by the Open Systems Interconnection (OSI) Reference Model.
Some protocols span the layers of the OSI Reference Model. For example, the Ethernet local area network (LAN) protocol includes both layer 1 and layer 2 information. The Institute of Electrical and Electronics Engineers (IEEE) 802.3 protocol, an implementation of the Ethernet protocol, includes layer 1 information and some layer 2 information.
Routers and switches are network devices that determine which communication link or links to employ to support the progress of packets through the network. For example, Ethernet switches forward packets according to the Ethernet protocol. Some current routers implement sophisticated algorithms that provide high performance forwarding of packets based on combining layer 2 and layer 3 header information, or some other combination. For example, instead of making forwarding decisions separately on each packet in a stream of related packets (called a “packet flow” or simply a “flow”), such as a stream directed from the same source node to the same destination node, these routers identify the packet flow from a unique signature derived from the layer 2 or layer 3 header information and forward each member of the flow according to the same decision made for the first packet in the flow.
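The flow-based forwarding described above can be sketched as follows. This is a minimal illustration, not any particular vendor's implementation: the choice of fields (an IP/transport 5-tuple) and the use of MD5 as the signature function are assumptions made for the example only.

```python
import hashlib

def flow_signature(src_ip, dst_ip, src_port, dst_port, protocol):
    """Derive a signature that is identical for every packet in a flow.

    The 5-tuple of fields used here is an illustrative assumption; a real
    router might combine other layer 2 and layer 3 header fields.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return hashlib.md5(key).hexdigest()

# Every packet of one flow yields the same signature, so the forwarding
# decision made for the first packet can be reused for the whole flow.
sig_a = flow_signature("10.0.0.1", "10.0.0.2", 5000, 80, "tcp")
sig_b = flow_signature("10.0.0.1", "10.0.0.2", 5000, 80, "tcp")
assert sig_a == sig_b
```

Because the signature depends only on header fields shared by all packets of a flow, a forwarding table keyed by signature maps the entire flow to one decision.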
The number of bits that are carried over a communication link in a unit time is called the speed or bandwidth of the communication link. The bandwidth of a particular link is limited by the physical characteristics of the cable and the port on each network node to which the cable is connected. As used here, a port is a physical interface on a network device that is, or can be, connected to a cable to serve as a communication link with a port on another network device. For example, three types of widely used Ethernet links have three different bandwidths of 100 Megabits per second (Mbps, where 1 Megabit=10^6 binary digits called bits), 1 Gigabit per second (Gbps, where 1 Gigabit=10^9 bits), or 10 Gbps. These three bandwidths are termed Fast Ethernet, Gigabit Ethernet and 10 Gigabit Ethernet, respectively.
In some circumstances, the bandwidth needed between two nodes does not match one of the readily available bandwidths. In such circumstances, some networks bundle multiple communication links between a pair of network nodes. For example, if network traffic between a particular server and an Ethernet switch in an office building needs bandwidth up to 500 Mbps, then it might be more cost-effective to connect five Fast Ethernet ports on each device rather than to install a Gigabit Ethernet port on each device and string a Gigabit cable in the walls between them. The five Fast Ethernet links in this example constitute a bundle of communication links. Similarly, if network traffic needs exceed 10 Gbps, then these needs can be met with a bundle of two or more 10 Gigabit Ethernet communication links. Link Aggregation Control Protocol (LACP) is part of an IEEE specification (802.3ad) that allows several physical ports to be bundled together to form a single logical channel. LACP allows a switch to negotiate an automatic bundle by sending LACP packets to the peer.
Bundled communication links are commercially available. For example bundled Ethernet links are available from Cisco Systems, Inc. of San Jose, Calif. as ETHERCHANNEL™ capabilities on Ethernet switches and routers. Bundled links are also available on routers for use with a Synchronous Optical Network (SONET) for optical communication links as part of packet over SONET (POS) technology from Cisco Systems.
A load-balancing process is used on the sending network node of the pair connected by a bundle of communication links for the purpose of determining which communication link to use for sending one or more data packets to the receiving network node of the pair. Current balancing algorithms use a fixed mapping to associate data packets with a specific port in a set of ports connected to the communication links in the bundle. Typically, information in a header portion of a data packet is used to derive a value that is associated with one port of the set. The algorithm is designed to generate a value in a range of values that are associated with the full set of ports. Thus data packets directed to the receiving node are distributed over all communication links in the bundle by the load balancing process. Many load-balancing processes are designed so that all data packets in the same data flow are sent through the same port.
While suitable for many purposes, fixed-mapping load balancing suffers some deficiencies that result in poor utilization of the entire bandwidth available on a bundle of communication links.
Typically, the fixed-mapping takes several bits from one or more fields in layer 2 or layer 3 headers, or both, and inputs those bits to a hash function that produces an output with a certain number of bits. The output is then used directly or indirectly to select a port among the set connected to the bundle of communication links. By judicious choice of the fields, data packets from the same flow may be mapped to the same port.
For example, if there are eight communication links in a bundle, some fixed-mapping load-balancing processes map different data packets to one of eight values, such as by using a hash function with a three-bit output. Three bits represent eight different values (0 to 7) which are associated with the eight different ports in the set connected to the eight communication links. While such an approach may cause data packets with similar values in their layer 2 and layer 3 headers to be directed to different ports of the set, there is no guarantee that the process will distribute traffic uniformly across the set of ports. For example, a disproportionate number of data packets might be mapped to the value 5. Thus some ports may still become overused, causing a reduction in the effective bandwidth.
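The three-bit selection described above can be sketched as follows. The use of CRC-32 as the hash function and the particular flow identifiers are assumptions made for the example; the point is only that a hash with a three-bit output selects among eight ports with no guarantee of a uniform spread.

```python
import zlib
from collections import Counter

NUM_PORTS = 8  # bundle of eight communication links

def select_port(header_bits: bytes) -> int:
    # Hash the selected header fields and keep a three-bit output (0 to 7),
    # which directly names one of the eight ports.
    return zlib.crc32(header_bits) & 0b111

# Count how many of these example flows land on each port. The hash spreads
# flows around, but nothing prevents one port from drawing a
# disproportionate share.
flows = [f"10.0.0.{i}->10.0.1.{i}".encode() for i in range(24)]
load = Counter(select_port(f) for f in flows)
```

Because the mapping is fixed, two flows with headers that happen to hash alike are pinned to the same port for their entire lifetimes, regardless of how busy that port becomes.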
In some approaches, the fixed-mapping is adjusted to accommodate different bundle sizes. For example, if bundle sizes are allowed to vary from two to eight communication links per bundle, some fixed-mapping load-balancing processes map different data packets to one of eight values, such as by using a hash function with a three-bit output, as described above, to accommodate the largest bundle. In a smaller bundle of, say, three links, the eight possible output values are then mapped onto the three active communication links. For example, output values 0, 3 and 6 are mapped to the first port, values 1, 4 and 7 to the second port, and values 2 and 5 to the third port. Because three values map to each of the first two ports but only two values map to the third port, the third port is underutilized compared to the first two. The underutilization of the third port occurs even if the distribution of the original eight values is uniform.
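The folding of eight hash values onto three ports can be verified with a few lines of arithmetic. A minimal sketch, assuming the modulo mapping given in the example above:

```python
NUM_HASH_VALUES = 8   # three-bit hash output, sized for the largest bundle
ACTIVE_LINKS = 3      # a smaller bundle currently in use

# Fold the eight hash values onto three ports with a modulo, as in the
# example above: values 0,3,6 -> port 0; 1,4,7 -> port 1; 2,5 -> port 2.
values_per_port = [0] * ACTIVE_LINKS
for value in range(NUM_HASH_VALUES):
    values_per_port[value % ACTIVE_LINKS] += 1

# Even with perfectly uniform hash values, the third port receives only
# 2/8 of the traffic while each of the first two receives 3/8.
print(values_per_port)  # [3, 3, 2]
```

The imbalance is structural: whenever the number of hash values is not a multiple of the number of active links, some ports necessarily receive more values than others.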
Even if the mapping of data flows to ports is uniform, underutilization of a bundle can occur. For example, consider the situation in which two packet flows each include 10,000 data packets and 10 other packet flows each include 100 packets, all packets of the same size. If there are three communication links in the bundle, then the fixed-mapping process is likely to send four packet flows to each of the three ports connected to corresponding communication links in the bundle. Two of the three ports might then carry 10,300 data packets each, while the third carries only 400 data packets. The first two ports might become overused even while the third port is underused, leading to a reduction in the rate at which the 10,300 data packets are sent over the first and second communication links. The bundle as a whole performs at a rate less than its advertised capability. The situation could be even worse if both large data flows are sent over the same port; then as many as 20,200 packets are sent over the first port while the second and third ports carry only 400 packets each. It would likely be preferable to send one flow of 10,000 data packets over each of the first two ports and ten flows totaling 1,000 data packets over the third port.
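The per-port arithmetic of this example can be checked directly. The flow sizes are taken from the text; the particular assignment of flows to ports is one possible outcome of a fixed mapping, assumed for illustration:

```python
# Flow sizes from the example: two flows of 10,000 packets and ten
# flows of 100 packets, spread four flows per port by a fixed mapping.
flows = [10_000, 10_000] + [100] * 10

# One possible fixed assignment: the two large flows land on different ports.
ports = [flows[0:1] + flows[2:5],   # 10,000 + 3 x 100
         flows[1:2] + flows[5:8],   # 10,000 + 3 x 100
         flows[8:12]]               # 4 x 100
loads = [sum(p) for p in ports]
print(loads)  # [10300, 10300, 400]
```

A mapping that counted flows but ignored flow sizes is "uniform" in the sense of four flows per port, yet the packet load differs across ports by a factor of more than twenty-five.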
There are disadvantages in approaches that distribute data packets from the same packet flow on different communication links of a bundle. A major problem arises because variable delays are experienced on every communication link, caused for example by congestion, noise and errors. If successive data packets from the same flow are placed on different links, the later packet might experience a smaller delay than the earlier packet and arrive at the destination node out of order. Out-of-order data packets create problems for the receiving node. For example, in some protocols, out-of-order data packets cause a receiving node to determine that there is missing data, and the receiving node may add to congestion on the link by sending requests on the link to resend several data packets and then receiving the resent data packets on that link. In some protocols, out-of-order packets are simply discarded.
Another problem that occurs with fixed mapping is hash polarization. When several intermediate network nodes use the same hash function in the fixed mapping, the output hash values tend to bunch on the same values. Among other factors, this occurs because once two flows have hashed to the same value, they arrive at the next network node on the same link and tend to be grouped under the same value again. Thus, once joined, the two flows will not separate.
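Hash polarization can be demonstrated with a short simulation. This is a sketch under simplifying assumptions: CRC-32 stands in for whatever hash the nodes share, both nodes have four outgoing links, and the flow identifiers are hypothetical.

```python
import zlib

def pick_link(flow_id: bytes, num_links: int) -> int:
    # Every node applies the identical hash function - the root cause
    # of polarization.
    return zlib.crc32(flow_id) % num_links

flows = [f"flow-{i}".encode() for i in range(100)]

# Group flows by the link chosen at a first node, then check which links
# that group chooses at a downstream node using the same hash.
first_hop = {f: pick_link(f, 4) for f in flows}
for link in range(4):
    group = [f for f in flows if first_hop[f] == link]
    second_hop = {pick_link(f, 4) for f in group}
    # All flows that merged onto one upstream link choose the same
    # downstream link again: the merged groups never separate.
    assert len(second_hop) <= 1
```

In practice this is why some devices perturb the hash per node, for example by mixing in a node-specific seed, so that downstream nodes do not reproduce the upstream grouping.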
In general, fixed-mapping load balancing can result in nonuniform distribution of data packets across the bundle of communication links, and thus result in performance below the full capacity of the bundle.
In a more recent approach, load balancing selects a port based on the degree to which the buffers that hold data being sent out on each port are filled. A data packet from a new flow is directed to a port that has a buffer that is not full. While this approach tends to distribute data packets from new flows to ports more able to handle the new flows, it experiences some problems. For example, if a long sequence of data packets from the same flow is directed to the ports, all of those packets are directed to the same port to preserve sequence order. The buffer of the port receiving the data packets from this flow can become full. The next data packet from that flow cannot be placed until the port buffer fill level drops. Data transmission on the entire bundle halts until that next data packet can be placed in its target port buffer. The bundle thus does not perform at advertised capacity.
Based on the foregoing, there is a clear need for a load-balancing process for bundles of communication links that does not suffer all the deficiencies of the prior art approaches. In particular, there is a need for a dynamic load-balancing process that distributes data packets among a bundle of communication links based on the actual utilization of those links other than fill level. There is also a need for a dynamic load-balancing process that selects data packets from different packet flows so that a data flow directed to a filled port does not overly limit capacity of the bundle.