The present invention is directed to communications networks. It particularly concerns congestion avoidance.
Internetwork communications are based on operations of routers, which are network devices that determine, on the basis of destination information in packets that they receive, where to forward the packets so that they are likely to reach the intended destinations.
Router configurations vary widely, but FIG. 1 depicts a typical approach. Router 10 includes a plurality of communications interfaces 12, 14, and 16, which send and receive communications packets to and from remote locations. When one of the interface modules receives an incoming packet, it places header information from that packet onto an internal communications bus 18 by which it communicates with a forwarding engine 20, typically a high-performance processor and associated storage circuitry, that determines where the packet should be sent. Once the decision has been made, an output packet is formed from the input packet by packet-assembly circuitry that may reside in one or more of the interface modules and/or the forwarding engine, and the forwarding engine causes another interface to send the output packet to a further remote location.
FIG. 2 depicts the router 10 in a local-network environment in which it communicates through one of its interfaces with a local-area-network bus 22. There are typically a number of network devices, such as network devices 24, 26, 28, and 30, that receive the resultant signals, but the packet is not usually intended for all of them. To enable its various network devices to distinguish the packets they should read from the ones they should not, a system may employ a packet format such as the Ethernet format that FIG. 3's second row depicts. An Ethernet frame's link-layer header 32 includes, among other fields, one that contains a link-layer destination address, as FIG. 3's first row indicates. Any network-device interface whose link-layer address does not match the packet's destination-address field will ignore the packet.
For present purposes, we will assume that FIG. 2's router 10 intends for a further router 30 to receive the packet, so the link-layer header's destination-address field contains the link-layer address of router 30's interface with network link 22. That interface accordingly reads the remainder of the packet, verifying that the packet and its cyclic-redundancy-code ("CRC") trailer's contents are consistent. Router 30 then proceeds to process the link-layer packet's payload 36 in accordance with a protocol that the link-layer header's type field specifies.
In the illustrated case, the type field specifies that the link-layer packet's payload is an Internet Protocol ("IP") datagram, which is a network-layer protocol data unit ("PDU") that consists of an IP header and payload. The IP header is similar to the Ethernet header in the sense that it has source- and destination-address fields. But the destination-address field in this case does not identify the next network node to handle the packet. Instead, it contains a route identifier in the form of the network address of the destination node to which the packet should ultimately be forwarded. That is, a router to which the Ethernet (or other link-level) header directs the packet will read the IP header's destination-address field and identify, on the basis of that field's contents, the "next-hop" router to which forwarding the packet will advance it toward that ultimate destination. Similarly, the IP header's source address does not identify the forwarding router but rather the node that was the IP datagram's initial source.
The IP protocol provides for what is termed an "unreliable" delivery. That is, there is nothing in the protocol itself to ensure that the host with which the packet's internetwork address is associated will actually receive that packet. For various reasons, routers along the way to the destination may not be able to forward that particular packet, so the packet will not reach its destination. The IP datagram's payload may therefore include information that source- and destination-host processes can use to ensure proper information delivery, i.e., to determine whether the destination has received all the source's transmissions and cause the source to re-send any that are missing.
One way to achieve this result is to employ a reliable transport protocol, such as the Internet community's Transmission Control Protocol ("TCP"). Specifically, if the IP header's protocol field contains the code that specifies TCP, the ultimate-destination host will use its TCP process to deal with the IP payload. FIG. 3's third, fourth, and fifth rows show how the TCP process interprets that payload. Specifically, the IP datagram's payload is considered to be a TCP segment, consisting of a TCP header and a TCP payload. Among its other fields, the TCP header includes source- and destination-address fields, which specify by "port numbers" the host applications that send and receive the TCP payload.
Of particular interest in connection with the transmission's reliability are the TCP header's sequence-number and acknowledgment-number fields. A sequence of TCP segments sent from one port of one node to the same or another port of another node is considered to constitute a single session. If set, a "SYN" flag in FIG. 3's fifth row indicates that the TCP segment containing it is the first in a session. A TCP segment containing a set "FIN" flag is the session's last. The sequence-number field in a segment carrying the SYN flag contains a number arbitrarily assigned to the first byte of that session's data. Beginning with that number, the TCP process increments an internal count for each byte sent in each of the same session's (multiple-byte) segments. Each segment's sequence-number field contains the sequence number thereby assigned to that segment's first byte.
The receiving TCP process is expected to acknowledge each received segment, and a set "ACK" flag in a segment from the receiving process indicates that the segment's acknowledgment-number field contains the sequence number of the next byte that it expects. Such an acknowledgment means that the receiving process has received bytes associated with all sequence numbers from the SYN-segment sequence number to the number just before the current segment's acknowledgment number.
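The numbering scheme just described can be illustrated with a minimal Python sketch. It follows the simplified model of this description, in which the SYN segment's sequence number is assigned to the first data byte; the function names and the dictionary representation of received segments are assumptions made for illustration only and are not part of any TCP implementation.

```python
SEQ_MODULUS = 2**32  # the TCP sequence-number field is 32 bits wide


def segment_sequence_numbers(initial_seq, segment_lengths):
    """Return (per-segment sequence numbers, sequence number of the next unsent byte).

    Each segment's sequence number is the number assigned to its first byte.
    """
    seq_numbers = []
    next_seq = initial_seq
    for length in segment_lengths:
        seq_numbers.append(next_seq)
        next_seq = (next_seq + length) % SEQ_MODULUS
    return seq_numbers, next_seq


def cumulative_ack(initial_seq, received):
    """Return the acknowledgment number: the next byte the receiver expects.

    `received` maps a segment's sequence number to its (positive) length.
    The count advances only across contiguous data, so a gap stops it.
    """
    expected = initial_seq
    while expected in received:
        expected = (expected + received[expected]) % SEQ_MODULUS
    return expected


if __name__ == "__main__":
    seqs, nxt = segment_sequence_numbers(1000, [100, 200, 50])
    print(seqs, nxt)
    # The middle segment (sequence number 1100) is missing here.
    print(cumulative_ack(1000, {1000: 100, 1300: 50}))
```

In the usage shown, the acknowledgment number stays at the first missing byte even though a later segment has arrived, which is what lets the sender identify the data it must re-send.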
By its response, a receiving TCP process controls the rate at which the sending process transmits data to it. Specifically, it places in the TCP header's window-size field an indication of the number of bytes the sender can transmit before it has to stop to wait for an acknowledgment. If the receiving TCP process has two kilobytes of capacity left in its input buffer, for instance, it may specify a window size of two kilobytes to ensure that the sending TCP process transmits no more bytes than that before it receives further acknowledgment. If the receiving TCP process acknowledges received bytes only after it has removed them from its input queue, the sender process will not send the receiver more bytes than its queue can handle. A relatively slow receiving TCP process can thereby prevent a higher-capacity sending TCP process from overwhelming it with data.
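As a rough illustration of this window mechanism, the following Python sketch computes how many further bytes a sender may transmit given the most recent acknowledgment number and advertised window size. The function and parameter names are hypothetical, and sequence-number wraparound is ignored for simplicity.

```python
def bytes_sender_may_transmit(last_ack, next_to_send, window_size):
    """Return how many more bytes may be sent before waiting for an acknowledgment.

    last_ack:      acknowledgment number most recently received
    next_to_send:  sequence number of the next byte the sender would transmit
    window_size:   window advertised by the receiver, in bytes
    """
    in_flight = next_to_send - last_ack  # bytes sent but not yet acknowledged
    return max(0, window_size - in_flight)


if __name__ == "__main__":
    # Receiver advertises two kilobytes; nothing is outstanding yet.
    print(bytes_sender_may_transmit(1000, 1000, 2048))
    # 1500 bytes are outstanding, so only the remainder of the window is usable.
    print(bytes_sender_may_transmit(1000, 2500, 2048))
```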
But this does not prevent an intervening router from being overwhelmed by the resultant flow or, more typically, by that flow in combination with the others that the router must handle. When a router is overwhelmed, it typically simply discards the excessive IP packets. This discarding does not impair the TCP processes' reliability, because a sending TCP process re-transmits bytes that have remained unacknowledged for more than a predetermined time interval. But discarding packets and re-sending them wastes bandwidth.
To avoid such waste-causing congestion, a TCP sending process often employs what is known as a "slow start," in which it initially transmits at a rate that is less than the receiver process's advertised window size would otherwise permit. In a typical slow-start operation, the sending process sends only a single segment and then waits for the resultant acknowledgment. If it receives the resultant acknowledgment, it sends two and then again waits for an acknowledgment. It then sends four after those two have been acknowledged. The number of permitted unacknowledged segments thus increases exponentially until it is limited by the receiving TCP process's window size, or until the transmitting TCP process fails to receive an acknowledgment within the required time limit. If such a failure occurs, the sending TCP process concludes that the segment may not have successfully reached its destination, possibly because an intervening router's capacity has been exceeded. It therefore stops the exponential increase and limits the permitted number of outstanding segments to a level that does not provoke unacknowledged segments. The sending and receiving processes thereby accommodate not only each other's limitations but also those of the routers employed in communicating between them.
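The slow-start behavior described above can be sketched in Python as follows. This is an illustrative model only: the function name and parameters are hypothetical, the receiver's window is expressed in segments, and the reaction to a timeout (falling back to half the level that provoked it and holding there) is an assumption standing in for "a level that does not provoke unacknowledged segments."

```python
def slow_start_schedule(window_segments, timeout_round=None, rounds=8):
    """Return the number of segments permitted in flight in each round.

    window_segments: receiver's advertised window, expressed in segments
    timeout_round:   index of the round whose acknowledgment times out, or None
    rounds:          how many rounds to simulate
    """
    schedule = []
    permitted = 1
    growing = True
    for r in range(rounds):
        schedule.append(permitted)
        if r == timeout_round:
            # Assumed fallback: return to the last level that was acknowledged
            # in full and stop the exponential increase.
            permitted = max(1, permitted // 2)
            growing = False
        elif growing:
            # All segments acknowledged: double, up to the receiver's window.
            permitted = min(permitted * 2, window_segments)
    return schedule


if __name__ == "__main__":
    print(slow_start_schedule(16, rounds=6))                   # no loss
    print(slow_start_schedule(16, timeout_round=3, rounds=6))  # timeout in round 3
```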
We have recognized, though, that this approach to congestion avoidance has certain limitations. Not the least of these is that it forces, say, an Internet-service provider ("ISP") to rely for congestion avoidance on client-network nodes, i.e., nodes over which it has no control. We have solved this problem by adapting to it a congestion-avoidance approach commonly employed in Asynchronous Transfer Mode ("ATM") networks.
ATM networks sometimes employ a congestion-avoidance approach known as available bit rate (ABR), in accordance with which a source of ATM communication cells transmits resource-management cells (RM cells) along a given virtual channel. Before it communicates on such a channel, the source asks for an ATM virtual channel with certain ABR-related parameters. These include the Peak Cell Rate (PCR, the maximum rate at which it may send cells) and the Minimum Cell Rate (MCR, the lowest rate at which it can be told to send cells). These parameters do not change so long as the virtual channel remains in place. If the switches along the virtual channel's path do not have enough capacity to accommodate the MCR, the virtual channel will not be established.
Once the connection is established, the source sends periodic resource-management cells. The resource-management cell specifies a desired cell rate in an explicit-rate ("ER") field. ATM switches forward the resource-management cell along the virtual channel's route and back again to the source, and each switch along the route determines whether it can allocate the requested bandwidth to that virtual channel.
If a switch that receives a resource-management cell can accommodate the requested bandwidth, it forwards the resource-management cell without revising the explicit-rate field. If the switch cannot accommodate the requested rate but can accommodate the minimum rate, it forwards the resource-management cell with the explicit rate set to the lesser rate that the switch can accommodate. When the resource-management cell finally returns to the source, it thereby tells the source the rate at which the source may send. The source then limits itself to this rate, so the ATM system suffers virtually no cell loss.
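The effect of this explicit-rate handling is that the returned cell carries the smallest rate any switch on the path can accommodate. A minimal Python sketch follows; the function name and the list of per-switch available rates are hypothetical, and the sketch assumes every switch can supply at least the MCR, since the virtual channel would otherwise not have been established.

```python
def explicit_rate_after_round_trip(requested_rate, switch_available_rates):
    """Return the ER value the source sees when its RM cell comes back.

    requested_rate:          rate the source placed in the ER field
    switch_available_rates:  rate each switch on the path can allocate to
                             this virtual channel (assumed >= MCR)
    """
    explicit_rate = requested_rate
    for available in switch_available_rates:
        if available < explicit_rate:
            # The switch cannot meet the request, so it lowers the ER field
            # to the rate it can accommodate.
            explicit_rate = available
        # Otherwise the cell is forwarded with the ER field unchanged.
    return explicit_rate


if __name__ == "__main__":
    # The second switch is the bottleneck, so the source is told to send at 60.
    print(explicit_rate_after_round_trip(100.0, [150.0, 60.0, 80.0]))
```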
We have recognized that this approach is nonetheless unacceptable, or at least undesirable, for a great number of applications. It is true that such an arrangement permits the ISP to avoid bandwidth waste within its network. But an attempt to establish a channel may fail due to lack of bandwidth. The ISP's customers can thereby be denied access, and customers interpret this as a lack of ISP reliability.
We have solved this problem by employing a system that replaces minimum rates with what we refer to as weights. Specifically, if a router does not have the capacity to meet the total bandwidth requested for all the flows that propose to share it, it simply allocates in proportion to the different routes' advertised weights the bandwidth that it can provide. In doing so, it imposes no minimum rate at all on at least some flows. Those flows are therefore guaranteed some access to the ISP network, regardless of how little bandwidth is available.
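One simple reading of this weight-based allocation can be sketched in Python as follows. The function name, the dictionary inputs, and the example figures are hypothetical. The sketch divides the whole capacity strictly in proportion to weight whenever the requests cannot all be met; refinements such as capping a flow at its own request and redistributing the surplus are not addressed in this passage and are omitted.

```python
def allocate_by_weight(capacity, requests, weights):
    """Return a per-flow bandwidth allocation.

    capacity: bandwidth the router can provide
    requests: mapping of flow id -> requested bandwidth
    weights:  mapping of flow id -> advertised (positive) weight
    """
    if sum(requests.values()) <= capacity:
        # Every request can be met in full.
        return dict(requests)
    total_weight = sum(weights[flow] for flow in requests)
    # Otherwise share the available capacity in proportion to weight, so every
    # flow with a positive weight receives some bandwidth.
    return {flow: capacity * weights[flow] / total_weight for flow in requests}


if __name__ == "__main__":
    requests = {"a": 80.0, "b": 80.0, "c": 40.0}
    weights = {"a": 2, "b": 1, "c": 1}
    print(allocate_by_weight(100.0, requests, weights))
```

Because no flow is subject to a minimum rate in this scheme, the allocation never fails outright; a shortage of capacity only shrinks each flow's share.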
In addition, the existing ABR approach is only available in a network using ATM technology. We have extended the ABR technique to networks of IP routers and label-switching routers that may utilize a wide range of underlying network technologies.