Network congestion arises when traffic sent or injected into a communications network (i.e., the number of injected packets or bytes per unit of time) exceeds the capacity of the network. Congestion causes the throughput of useful traffic (i.e., traffic that reaches its destination) to be reduced because when the network is congested, packets hold onto network resources for longer times and/or network resources are consumed by packets that are later discarded.
Congestion control processes can be used to alleviate performance degradation during times of network congestion. Congestion control processes include:
(i) a congestion detection process for detecting congestion in the network;
(ii) a congestion notification process for signaling the congestion state of the network to appropriate nodes in the network; and
(iii) a congestion response process for reacting to congestion, such that network performance is not degraded or is degraded to a lesser degree.
Processes to detect network congestion can be implemented in end nodes of the network and in switches internal to the network. Congestion detection processes executed by end nodes infer congestion based on network behavior attributes such as packet loss and round-trip latency that can be observed from the end nodes. For example, the Transmission Control Protocol (TCP), widely deployed in the Internet, uses packet loss as an indication of congestion in the network, as described in V. Jacobson, “Congestion avoidance and control”, ACM SIGCOMM 88, pp. 314-329, August 1988 (“Jacobson”). Other processes for congestion control in TCP infer congestion based on observations of network latency, including round-trip packet latency and variations in one-way packet latency, as respectively described in L. S. Brakmo and L. L. Peterson, “TCP Vegas: End to end congestion avoidance on a global internet,” IEEE Journal on Selected Areas in Communications, Vol. 13, No. 8, pp. 1465-1480, October 1995, and C. Parsa and J. J. Garcia-Luna-Aceves, “Improving TCP congestion control over Internets with heterogeneous transmission media,” Seventh International Conference on Network Protocols (ICNP'99), IEEE Computer Society, pp. 213-221, October-November 1999.
Congestion detection processes executed by internal components of a network (such as routers and switches) infer congestion when internal network resources such as link bandwidth or network buffers are overloaded. For example, the DECbit congestion detection process detects congestion at a switch when the average size of the switch's output queues exceeds a predetermined threshold, as described in K. K. Ramakrishnan and S. Floyd, “A Proposal to add Explicit Congestion Notification (ECN) to IP,” IETF RFC-2481, January, 1999 (“Ramakrishnan”). As described in S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, Vol. 1, No. 4, pp. 397-413, August 1993, the RED congestion detection process also uses the average output queue size to infer congestion, but uses two thresholds. Because congestion detection processes executed by network elements watch for particular events at individual network components, they are likely to be more precise in their information than processes executed by end nodes. Moreover, they allow congestion to be detected earlier, even before it manifests as lost packets or changed latencies at network end nodes.
When congestion is detected at internal network elements, a congestion notification process is executed to communicate the congestion state to other nodes in the network. These notification processes are referred to as Explicit Congestion Notification (ECN) processes, as described in Ramakrishnan. With Forward Explicit Congestion Notification (FECN), congestion detected at a network switch element is signaled to the destination nodes of the packets involved in the congestion. The destination nodes can, subsequently, propagate this information to the respective source nodes. Signaling of the destination node as well as the subsequent signaling of the source node can occur in-band using congestion marker bits in the network packets themselves or can occur out-of-band using congestion control packets dedicated to carrying congestion information. The DECbit and RED processes modify ECN bits in packet headers to notify (in-band) the destination nodes of congestion. Network switch elements can also communicate congestion information to source nodes directly without communicating it through the destination node. With this Backward Explicit Congestion Notification (BECN) approach, the switch creates and transmits a congestion control packet carrying congestion information back to the source node.
Congestion response processes determine how traffic injection is adjusted in response to changes in the congestion state of the network. Response processes are typically executed by end nodes of the network (e.g., TCP); however, in some cases (e.g., ATM), these can be executed by network switches. These response processes can control traffic injection in two ways. One method is to limit the number of packets that can be concurrently ‘in flight’ in the network between a pair of communicating source and destination nodes. This window control technique uses acknowledgment messages from the destination to the source to indicate which messages have been received (i.e., which messages are no longer in flight). A second method is to control the rate at which packets are injected (or, equivalently, the time interval between packets) into the network. Unlike window control, the rate control technique does not necessitate acknowledgment messages from the destination. Both these techniques have been widely used. Congestion control in TCP, as described in Jacobson, is a well-known example of window control, and traffic management in ATM is a well-known example of rate control. In either case, congestion response processes limit traffic injection when congestion is detected and increase traffic injection when congestion is not detected for a length of time. Policies that determine the adjustments to the traffic injection window or rate, as described in D. Chiu and R. Jain, “Analysis of the increase and decrease algorithms for congestion avoidance in computer networks,” Computer Networks and ISDN Systems, 17(1), June 1989, pp. 1-14, are an important design parameter. In addition to controlling congestion, these policies are designed to support high throughput, fairness among competing network flows, fast response to congestion and low oscillations in network behavior.
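The window adjustment described above can be illustrated with a minimal sketch of additive increase, multiplicative decrease, one widely known policy of the kind analyzed by Chiu and Jain. The function names, the increase step of 1.0, the decrease factor of 0.5, and the minimum window of 1.0 are illustrative assumptions, not values taken from the cited works.

```python
def aimd_step(window: float, congested: bool,
              increase: float = 1.0, decrease: float = 0.5,
              min_window: float = 1.0) -> float:
    """Return the next window size for one control interval.

    All constants are hypothetical defaults chosen for illustration.
    """
    if congested:
        # Congestion detected: cut the window multiplicatively.
        return max(min_window, window * decrease)
    # No congestion detected: probe for bandwidth additively.
    return window + increase


def run_aimd(signals, window: float = 1.0):
    """Apply aimd_step to a sequence of congestion signals."""
    history = []
    for congested in signals:
        window = aimd_step(window, congested)
        history.append(window)
    return history


trace = run_aimd([False, False, False, True, False])
print(trace)  # [2.0, 3.0, 4.0, 2.0, 3.0]
```

The same step function could drive a rate instead of a window; only the unit of the controlled quantity changes.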
A number of attributes of network architecture influence the design of network congestion control processes. One of these is the manner in which the network is designed to handle packets when buffer space is not available to buffer them at a switch. Many networks, such as Ethernet networks, permit network switches to drop incoming packets if space is not available to buffer them. In this scenario, packet losses are available as hints to detect network congestion. Many other networks, such as Infiniband, as described in “Infiniband Architecture Specification Release 1.0.a,” are designed to avoid packet dropping due to buffer overruns. These networks incorporate a link level flow control process which blocks a switch to prevent it from transmitting a packet over a link if the downstream switch at the other end of the link does not have sufficient buffering to receive the packet. Link level flow control is typically implemented using a credit based method in which receiver logic at one end of the link periodically sends control packets granting credits to transmitter logic on the other end of the link. The transmitter can send as many packets as are permitted by these credits and blocks when it has exhausted its credits. The transmitter remains blocked until it receives more credits. In networks with link level flow control, packets are not discarded by the network (except under error conditions such as failure of a switch or link). Hence packet losses are not available as hints to detect congestion in such networks.
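The credit-based method described above can be sketched as a toy model of the transmitter side of a link. The class name `CreditLink`, its methods, and the assumption that one credit permits exactly one packet are hypothetical and chosen only for illustration; real link level flow control may account for credits in other units.

```python
from collections import deque


class CreditLink:
    """Toy transmitter with credit-based flow control (illustrative only)."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits
        self.pending = deque()   # packets waiting at the transmitter
        self.delivered = []      # packets sent over the link

    def enqueue(self, packet):
        self.pending.append(packet)
        self._drain()

    def grant(self, credits: int):
        """Receiver grants credits, e.g. after freeing buffer space."""
        self.credits += credits
        self._drain()

    def _drain(self):
        # Send while both a packet and a credit are available.
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.delivered.append(self.pending.popleft())

    @property
    def blocked(self) -> bool:
        # Blocked: traffic is waiting but no credits remain.
        return bool(self.pending) and self.credits == 0


link = CreditLink(initial_credits=2)
for packet in ["a", "b", "c"]:
    link.enqueue(packet)
blocked_before_grant = link.blocked   # "c" waits; nothing is dropped
link.grant(1)
```

In this model a packet is never discarded; it simply waits until a credit arrives, which is why packet loss is unavailable as a congestion hint.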
Congestion occurs when the demand for a network resource (such as a link) exceeds the capacity of the network resource. For example, two flows 102, 104 can share a single bottleneck link 106, as shown in FIG. 1. If the two flows 102, 104 provide sufficiently heavy loads, the bottleneck link 106 will not be able to simultaneously accommodate all the traffic from both flows 102, 104. In all networks, this congestion will first manifest as an increased number of packets buffered in the switch 108 at the congested link, soon growing to the extent that no additional packets destined for the congested link 106 can be buffered at that switch 108. The subsequent development of the congestion depends on the manner in which the network is designed to handle packets when buffer space is not available to buffer them at a switch.
If the network permits switches to drop incoming packets upon congestion, some packets that would otherwise traverse the congested link 106 will be discarded as long as congestion persists. Buffers for the congested link 106 will remain fully occupied and the network's useful packet throughput will drop. However, because the congested switch 108 does not block upstream switches from transmitting packets, buffers in upstream switches will continue to drain. This allows packets that are not traversing the congested link 106 to flow through the network with little, if any, additional delay or loss in throughput.
In networks with link level flow control, packets are not dropped. However, congestion can lead to an undesirable effect known as congestion spreading or tree saturation. When a switch buffer fills up due to congestion, it blocks the buffer's upstream node. This blocking can spread further upstream until buffers fill all the way back to the source nodes of the affected traffic flows. The particular disadvantage of congestion spreading is that it affects flows that do not exert any load on the oversubscribed link resource. For example, consider the scenario shown in FIG. 2 with two switches 202, 204, each with buffering at its input ports, and four traffic flows, 206 to 212, each of which injects packets as rapidly as possible. Three traffic flows 206, 210, and 212 are all directed to a first destination link 214, and a fourth flow 208 is directed from a source link 216 to a second destination link 218. The fourth flow 208 shares an inter-switch link 220 with the first flow 206. Ideally, the sum of the throughputs of the first 206 and fourth 208 flows should equal the capacity of the inter-switch link 220. However, if the first destination link 214 is oversubscribed, then the input buffers at the switches 202, 204 become full with packets. In particular, the input buffer at the inter-switch link 220 of the right switch 204 will fill with packets and block flow on the inter-switch link 220. Therefore, the inter-switch link 220 goes idle, wasting bandwidth that could be used for transmitting packets from the second source link 216 to the second destination link 218 by the fourth flow 208. Assuming a fair switch scheduling process, each of the three flows 206, 210, 212 directed to the first destination link 214 uses approximately one third of the bandwidth of the bottleneck or first destination link 214. 
Assuming further a fair switch scheduling policy, the left switch 202 will alternate packets from the first flow 206 and the fourth flow 208 when scheduling packets on the inter-switch link 220. Therefore, the fourth flow 208 will be transmitted at the same rate as the first flow 206; i.e., assuming equal bandwidth for the inter-switch link 220 and the bottleneck link 214, the fourth flow 208 and the first flow 206 will each consume one third of the inter-switch link bandwidth. Therefore, one third of the inter-switch link bandwidth, which could be used to send packets from the fourth flow 208, is wasted. If the upstream source node on the source link 222 of the first flow 206 were informed that it could not transmit at the full link bandwidth and reduced its rate to the rate determined by the bottleneck link 214, i.e., one third of the link bandwidth in this example, the buffers at the switch 204 would not fill up, and the bandwidth at the inter-switch link 220 could be more efficiently utilized by the fourth flow 208.
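The bandwidth arithmetic of the FIG. 2 example can be checked with a short calculation. The function name and dictionary keys are hypothetical, and the sketch assumes, as the text does, fair scheduling and equal bandwidth on the inter-switch link 220 and the first destination link 214.

```python
from fractions import Fraction


def fig2_link_shares(flows_on_bottleneck: int = 3,
                     link_bandwidth: Fraction = Fraction(1)) -> dict:
    """Shares of inter-switch link 220, as fractions of link bandwidth."""
    # Fair scheduling gives each flow into link 214 an equal share.
    first_flow = link_bandwidth / flows_on_bottleneck
    # With congestion spreading, fair alternation at switch 202 holds
    # flow 208 to the same rate as flow 206.
    fourth_flow_spreading = first_flow
    idle = link_bandwidth - first_flow - fourth_flow_spreading
    # If the source of flow 206 limited itself to its bottleneck share,
    # flow 208 could use the remainder of link 220.
    fourth_flow_throttled = link_bandwidth - first_flow
    return {
        "first_flow": first_flow,
        "fourth_flow_spreading": fourth_flow_spreading,
        "idle": idle,
        "fourth_flow_throttled": fourth_flow_throttled,
    }


shares = fig2_link_shares()
```

With three flows on the bottleneck, the idle share is one third, and throttling the first flow at its source would let the fourth flow rise from one third to two thirds of the inter-switch link bandwidth.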
Effective network congestion control requires an effective congestion detection process. Congestion detection processes in network switches can infer congestion by detecting oversubscription of link and/or buffer resources. However, these processes should be capable of distinguishing oversubscription due to persistent congestion from transient oversubscription due to bursty behavior in network traffic. It is also desirable for the congestion detection process to be applicable to networks that drop packets when buffers fill up as well as networks that implement a link level flow control process to avoid packet losses. The congestion detection process should also be compatible with switch architectures that differ in their buffering organization (e.g., whether they use buffers at their input ports or their output ports).
The ATM Forum has proposed congestion control for its ABR service class by executing a congestion detection process in switches. Switches monitor the current number of connections routed through the switch as well as the traffic generated by these connections. The switch determines the distribution of the available bandwidth among all the active connections and sends control packets to end nodes, informing them of the rate at which packets should be generated. The main disadvantage of this process is that switches must maintain state information for each connection to manage traffic on a per-connection basis. This increases switch complexity and limits the ability to scale to large networks with a large number of connections.
Congestion detection processes used in the DECbit and RED processes are simpler and more scalable because they do not require the maintenance of per-connection state. The DECbit process detects congestion by comparing the average size of switch output queues to a predefined threshold. The average queue size is periodically computed and congestion is reported (through an ECN process) when the average queue size exceeds the threshold. The RED process is similar, but uses two thresholds. When the average queue size exceeds the first threshold, the ECN bits of packets in the queue are randomly set with a given probability, and when the second threshold is reached, the ECN bits of all packets in the queue are set.
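The threshold tests described above can be sketched as follows. This is a simplified illustration that follows the description in this section (a single fixed marking probability between the two thresholds); the function names, parameters, and the injectable random source are assumptions, and the published RED process differs in detail, for example in how the marking probability varies with the average queue size.

```python
import random


def decbit_congested(avg_queue: float, threshold: float) -> bool:
    """Single-threshold test: report congestion above the threshold."""
    return avg_queue > threshold


def red_mark(avg_queue: float, first_threshold: float,
             second_threshold: float, mark_probability: float,
             rng=random.random) -> bool:
    """Two-threshold test: decide whether to set a packet's ECN bits."""
    if avg_queue >= second_threshold:
        # At or above the second threshold, every packet is marked.
        return True
    if avg_queue > first_threshold:
        # Between the thresholds, mark randomly with the given probability.
        return rng() < mark_probability
    # Below the first threshold, no packet is marked.
    return False


# Hypothetical thresholds and probability, for illustration only.
never = red_mark(1.0, first_threshold=2.0, second_threshold=8.0,
                 mark_probability=0.5)
always = red_mark(9.0, first_threshold=2.0, second_threshold=8.0,
                  mark_probability=0.5)
```

Neither test needs per-connection state; both depend only on the average queue size, which is the property that makes these processes scalable.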
These and other previously proposed processes for congestion detection in network switches have been directed at networks that permit packets to be dropped if buffer space is unavailable. Because high (or complete) buffer utilization is limited to the congested switch in such networks, these processes are likely to identify the congestion point reasonably accurately. However, in networks that employ link level flow control (such as Infiniband), congestion spreading can also fill buffers in switches other than the congested switch. In this environment, network flows that are causing congestion spreading should be distinguished from flows that are suffering from it.