A data source faces a dilemma whenever it has little or no information about how much capacity is available, yet needs to send data as fast as possible without causing undue congestion. It faces this dilemma every time it starts a new data flow, every time it re-starts after an idle period, and every time a flow that has been sharing the same capacity finishes.
The family of congestion control algorithms that has been proposed for TCP combines two forms of operation: one dependent on congestion feedback (closed-loop control), the other for times when there is no feedback (open-loop control). On the current Internet, open-loop control has to be used at the start or re-start of a flow, or at the end of a competing flow, when the sender has little or no information on how much capacity is available.
For instance, the large majority of TCP algorithms use the same ‘slow-start’ algorithm to increase the sending rate exponentially, probing for more capacity by doubling the sending rate every round trip, until the receiver feeds back that it has detected a loss as the first signal of congestion. The sender receives this feedback one round trip time after its sending rate exceeded the available capacity, by which time it will already be sending more than twice as fast as the available capacity.
A concept called the congestion window is used within the TCP algorithm to control its rate. The window is the amount of data that can be sent in excess of the data that has been acknowledged. With little or no knowledge of the available capacity (open-loop) it is difficult to argue whether one congestion window is better than another—any behaviour could be safe in some circumstances and unsafe in others. Internet standards say a flow should start with a window of no more than 4380B (3 full-sized packets over Ethernet), and a window of 10 packets is currently being experimented with. Numbers like these are set by convention to control a flow's behaviour while it has no better information about actual available capacity (open-loop control). Similarly, there is no particular reason why TCP doubles its window every round trip during its start-up phase. Doubling certainly matches the halving that another part of the TCP algorithm does during its closed-loop (or ‘congestion avoidance’) phase. However, the choice of the number two for doubling and halving was fairly arbitrary.
This doubling does not always interact well with non-TCP traffic. Consider the case of a low-rate (e.g. 64 kb/s) constant-bit-rate voice flow in progress over an otherwise empty 1 Gb/s link. Further imagine that a large TCP flow starts on the same link with an initial congestion window of ten 1500 B packets and a round trip time of 200 ms. To discover how much capacity is available, the flow keeps doubling its window every round trip until, after nearly eleven round trips, its window reaches 16,666 packets per round (1 Gb/s). In the next round the window will double to the equivalent of 2 Gb/s before the sender gets the first feedback, from detected drops, implying that it exceeded the available capacity a round trip earlier. About 50% of the packets sent in this next round (some 16,666 packets) will be dropped. This huge loss of packets is the best-case scenario, which occurs only if the buffer is correctly sized.
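The arithmetic behind this example can be checked with a short sketch. All figures below are the assumptions stated in the text (a 1 Gb/s link, 200 ms round trip, 1500-byte packets, an initial window of ten packets), not measurements:

```python
# Back-of-envelope check of the slow-start example above.
link_bps = 1_000_000_000        # 1 Gb/s link
rtt_s = 0.2                     # 200 ms round trip time
pkt_bits = 1500 * 8             # 1500-byte packets

# Packets per round trip when the window exactly fills the link.
capacity_pkts_per_rtt = int(link_bps * rtt_s / pkt_bits)   # 16,666

# Slow-start: double the window every round trip from 10 packets.
window, rounds = 10, 0
while window < capacity_pkts_per_rtt:
    window *= 2
    rounds += 1

# After about eleven round trips the window (20,480 packets) has
# overshot to roughly twice the link capacity, so about half the
# packets sent in that round are dropped even in the best case.
```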
In this example TCP has already taken eleven round trip times, over 2 seconds in this case, to find its correct operating rate. Further, when TCP drops such a large number of packets, it can take a long time to recover, sometimes leading to a black-out of many more seconds (100 seconds has been reported [Ha08], due to long time-outs or the time it takes for the host to free up large numbers of buffers). In the process, the voice flow is also likely to black out for at least 200 ms, and often much longer, because at least 50% of the voice packets are dropped over this period.
This shows there are two problems during flow-startup: i) a long time before a flow stabilises on the correct rate for the available capacity and ii) a very large amount of loss damage to itself and to other flows before a newly starting flow discovers it has increased its rate beyond the available capacity (overshoot).
These problems do not only arise when a new flow starts up. A very similar situation occurs when a flow has been idle for a time, then re-starts. When a flow restarts after idling, it is not sufficient for it to remember what the available capacity was when it was last active, because in the meantime other traffic might have started to use the same capacity, or flows that were using the same capacity might have finished, leaving much more available capacity than earlier.
These problems do not even arise only when a flow starts or restarts. If two flows are sharing the same capacity, each will continually, if slowly, try to use more capacity, deliberately causing regular buffer overflows and losses. When either flow detects a loss, it responds by slowing down. The outcome of all the increases and decreases is that each flow consumes a proportion of the capacity on average. However, when one flow finishes, the other is never told explicitly that more capacity is available. It merely continues its slow increase, often for a very long time, before it eventually consumes all the capacity the other flow freed up.
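The slowness of this reclaim can be illustrated with a sketch that assumes the classic additive increase of one packet per round trip during the closed-loop phase (an assumption for illustration; real algorithms vary):

```python
# Hypothetical illustration: with additive increase of one packet per
# round trip, a flow needs one RTT for every packet of window it must
# gain before it consumes the capacity freed by a departing flow.
def rtts_to_reclaim(current_window_pkts, capacity_pkts_per_rtt):
    """Round trips of +1 packet/RTT growth to reach the link capacity."""
    return max(0, capacity_pkts_per_rtt - current_window_pkts)

# A flow holding half of a 16,666 packet/RTT link needs ~8,333 RTTs,
# i.e. nearly half an hour at a 200 ms RTT, to absorb the freed half.
```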
Recently, new TCP algorithms such as Cubic TCP have been designed that seek out newly available capacity more quickly. However, the faster they find new capacity, the more they overshoot between reaching the new limit of available capacity and detecting that they have reached it a round trip later.
As the capacity of Internet links increases, and the bit-rates that flows use increase, this open-loop control dilemma between increasing too slowly and overshooting too much gets progressively more serious.
A number of different methods for signaling congestion in packet networks, i.e. signaling that queues are building up, are known in the prior art. For example, active queue management (AQM) techniques (e.g. RED, REM, PI, PIE, CoDel) can be configured to drop a proportion of packets when it is detected that a queue is starting to grow, but before the queue is full. All AQM algorithms drop more packets as the queue grows longer.
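The common shape of these algorithms can be sketched as follows, using RED-style thresholds (the parameter values are illustrative, not a recommended configuration):

```python
import random

def red_drop(avg_queue_pkts, min_th=5, max_th=15, max_p=0.1):
    """RED-style drop decision: no drops below min_th, certain drop at
    or above max_th, and a probability rising linearly in between."""
    if avg_queue_pkts < min_th:
        return False
    if avg_queue_pkts >= max_th:
        return True
    drop_p = max_p * (avg_queue_pkts - min_th) / (max_th - min_th)
    return random.random() < drop_p
```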
An active queue management algorithm can be arranged to discard a greater proportion of traffic marked with a lower class-of-service, or marked as out-of-contract. For instance, weighted random early detection [WRED] determines whether to drop an arriving packet using the RED AQM algorithm but the parameters used for the algorithm depend on the class of service marked on each arriving packet.
Explicit Congestion Notification (ECN) [RFC3168] conveys congestion signals in TCP/IP networks by means of a two-bit ECN field in the IP header, whether in IPv4 (FIG. 2) or IPv6 (FIG. 3). Prior to the introduction of ECN, these two bits were present in both types of IP header, but always set to zero. Therefore, if these bits are both zero, a queue management process assumes that the packet comes from a transport protocol on the end-systems that will not understand the ECN protocol so it only uses drop, not ECN, to signal congestion.
The meaning of all four combinations of the two ECN bits in IPv4 or IPv6 is shown in FIG. 4. If either bit is one, it tells a queue management process that the packet has come from an ECN-capable transport (ECT), i.e. both the sender and receiver understand ECN marking, as well as drop, as a signal of congestion.
When a queue management process detects congestion, for packets with a non-zero ECN field, it sets the ECN field to the Congestion Experienced (CE) codepoint. On receipt of such a marked packet, a TCP receiver sets the Echo Congestion Experienced (ECE) flag in the TCP header of the packets it sends to acknowledge the data packets it has received. A standard TCP source interprets ECE feedback as if the packet had been dropped, at least for the purpose of its rate control. It does not, of course, have to retransmit the ECN-marked packet.
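The forwarding behaviour described above can be sketched as follows (the codepoint values are those of RFC 3168 shown in FIG. 4; the function name is illustrative):

```python
# ECN field codepoints from RFC 3168 (see FIG. 4).
NOT_ECT = 0b00   # not ECN-capable transport
ECT_1   = 0b01   # ECN-capable transport, ECT(1)
ECT_0   = 0b10   # ECN-capable transport, ECT(0)
CE      = 0b11   # Congestion Experienced

def on_congestion(ecn_field):
    """Queue management action on a packet when congestion is detected.
    Returns (new_ecn_field, dropped)."""
    if ecn_field == NOT_ECT:
        return ecn_field, True    # non-ECN transport: signal by dropping
    return CE, False              # ECN-capable: mark CE, keep the packet
```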
Drop and ECN marking are not mutually exclusive congestion signals, and flows that enable ECN have the potential to detect and respond to both.
The ECN standard [RFC3168] deliberately assigns the same meaning to both the ECN codepoints with one bit set (01 and 10). They both mean that the transport is ECN-capable (ECT), and if they need to be distinguished they are termed ECT(1) and ECT(0) respectively. The intention was to allow scope for innovative new ways to distinguish between these fields to be proposed in future.
There are some known alternative uses for the two ECN-capable transport (ECT) codepoints.
One idea has been to use the ECT(1) value to signal an intermediate level of congestion between uncongested (ECT(0)) and congested (CE). This idea has been standardised in one variant of an approach termed pre-congestion notification (PCN) [RFC5670]. PCN uses a virtual queue, which is not actually a queue; rather, it is a number that represents the length of the queue that would have formed if the buffer were drained more slowly than the real buffer drains. One variant of PCN uses two virtual queues, one configured to drain at a slower rate than the other. When the slower virtual queue fills, packets are marked with the ECT(1) codepoint, and when the faster virtual queue fills, packets are marked with the CE codepoint. The PCN approach is not standardised to be used as a signal to end-systems, only within the network.
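A virtual queue of this kind can be sketched as a simple counter (the class name, drain fraction and units are illustrative assumptions, not part of the PCN standard):

```python
class VirtualQueue:
    """Counts the backlog that would build if the link drained at only a
    fraction of its real rate; no packets are actually queued here."""
    def __init__(self, real_rate_bps, fraction=0.9):
        self.drain_rate = real_rate_bps * fraction   # virtual drain rate
        self.backlog_bits = 0.0
        self.last_t = 0.0

    def on_packet(self, t, size_bits):
        # Drain the virtual backlog for the elapsed time, then add the
        # new packet; marking triggers when the backlog exceeds a threshold.
        elapsed = t - self.last_t
        self.backlog_bits = max(0.0,
                                self.backlog_bits - elapsed * self.drain_rate)
        self.last_t = t
        self.backlog_bits += size_bits
        return self.backlog_bits
```

Two such counters with different drain fractions would give the two PCN severity levels: the slower one triggering ECT(1) marking, the faster one CE marking.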
AQM and ECN are not exclusive to IP-aware devices. Many non-IP devices use AQM to drop packets before the queue fills the buffer and some protocols layered below IP include the facility to signal congestion explicitly instead of dropping packets (e.g. MPLS [RFC5129]). Such lower-layer protocols typically encapsulate an IP packet. Taking the case of MPLS as an example, when the inner IP packet is decapsulated at the egress edge of an MPLS subnet, any ECN marking on the outer MPLS header is propagated into the inner IP header to be forwarded onward to its destination.
The encoding of ECN in the MPLS header is flexible enough to be able to define more than one level of severity for congestion notification, at least within the constraints of the size of the MPLS header. Then, for instance, the two levels of PCN marking can be encoded.
In “Single PCN threshold marking by using PCN baseline encoding for both admission and termination controls”, appendix D, by D. Satoh et al [1], a mechanism is described for marking the proportion of packets that represents the instantaneous utilisation of a logical link. Utilisation of the logical link is signaled by marking the ECN field of every packet that arrives when the virtual queue is non-empty. The proportion of bits in marked packets relative to all bits then represents instantaneous utilisation, but the representation is only precise for a Poisson distribution of inter-arrival times.
There have been other proposals from the research community for a network node to signal an early warning of impending congestion to end-systems, as well as signaling actual queue growth, in order to address the open-loop control problem at the start of a new data flow. For instance VCP, in “One more bit is enough” by Yong Xia et al [2], uses the ECT(1) codepoint of the ECN field to signal when utilisation of a link has exceeded a set threshold, in a similar way to PCN.
In “AntiECN Marking: A Marking Scheme for High Bandwidth Delay Connections”, S. Kunniyur [3], each packet carries a bit called the Anti-ECN bit in its header. The bit is initially set to zero. Each router along the packet's route checks to see if it can allow the flow to increase its sending rate by determining whether the packet has arrived at an empty virtual queue. If so, the router sets the bit to one. If on arrival the virtual queue is non-empty, it sets the bit to zero. The receiver then echoes the bit back to the sender using the ACK packet. If the bit is set to one, the sender increases its congestion window and hence its rate.
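The per-router marking rule described by Kunniyur can be sketched as follows (the function names are illustrative):

```python
def anti_ecn_mark(virtual_queue_len_pkts):
    """A router sets the Anti-ECN bit only if the packet arrives at an
    empty virtual queue; otherwise it clears the bit."""
    return 1 if virtual_queue_len_pkts == 0 else 0

def on_ack(cwnd_pkts, echoed_bit):
    """The sender grows its congestion window only when the echoed bit is
    set, i.e. when every router on the path had an empty virtual queue."""
    return cwnd_pkts + 1 if echoed_bit else cwnd_pkts
```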
Patent application US2002009771 [6] discloses a traffic shaping and scheduling function for the release of packets from a queue. A packet eligible for transmission is provided with a tag, created for the packet, which may operate as a criterion for sorting the packet into a binary tree of tags. The tag is one of the determinants of the order in which packets are selected for transmission; however, the tag is not used to indicate any kind of congestion at the node.