A data source faces a dilemma whenever it has little or no information about how much capacity is available, but it needs to send data as fast as possible without causing undue congestion. A data source faces this dilemma every time it starts a new data flow, every time it re-starts after an idle period, and every time another flow finishes that has been sharing the same capacity.
The family of congestion control algorithms that have been proposed for TCP combines two forms of operation: one dependent on congestion feedback (closed-loop control), the other used at times when there is no feedback (open-loop control). On the current Internet, open-loop control has to be used at the start or re-start of a flow or at the end of a competing flow, when the sender has little or no information on how much capacity is available.
For instance, a large majority of TCP algorithms use the same ‘slow-start’ algorithm to exponentially increase the sending rate, probing for more capacity by doubling the sending rate every round trip, until the receiver feeds back that it has detected a loss as the first signal of congestion. The sender receives this feedback one round trip time after its sending rate exceeded available capacity. By the time it receives this signal it will already be sending more than twice as fast as the available capacity.
A concept called the congestion window is used within the TCP algorithm to control its rate. The window is the amount of data that can be sent in excess of the data that has been acknowledged. With little or no knowledge of the available capacity (open-loop) it is difficult to argue whether one congestion window is better than another—any behaviour could be safe in some circumstances and unsafe in others. Internet standards say a flow should start with a window of no more than 4380 B (3 full-sized packets over Ethernet), and a window of 10 packets is currently being experimented with. Numbers like these are set by convention to control a flow's behaviour while it has no better information about actual available capacity (open-loop control). Similarly, there is no particular reason why TCP doubles its window every round trip during its start-up phase. Doubling certainly matches the halving that another part of the TCP algorithm does during its closed-loop (or ‘congestion avoidance’) phase. However, the choice of the number two for doubling and halving was fairly arbitrary.
This doubling does not always interact well with non-TCP traffic. Consider the case of a low-rate (e.g. 64 kb/s) constant-bit-rate voice flow in progress over an otherwise empty 1 Gb/s link. Further imagine that a large TCP flow starts on the same link with an initial congestion window of ten 1500 B packets and a round trip time of 200 ms. To discover how much capacity is available, the flow keeps doubling its window every round trip until, after nearly eleven round trips, its window is 16,667 packets per round (1 Gb/s), and at some point during the twelfth round trip it will have filled the buffer of the 1 Gb/s link too. We will assume the buffer has been sized to take a full window of packets (16,667). It will therefore take another round for the sender to fill the buffer, at which point its window will have grown to 33,333 packets (2 Gb/s). One round later, it will get the first feedback detecting drops, which implies that a round trip earlier it exceeded both the available capacity and the buffer, so the sender will halve its window. However, just before that point its window would have been 66,667 packets, representing four times the link rate or 4 Gb/s. About 50% of the packets in this next round (33,333 packets) will be dropped. This huge loss of packets is the best-case scenario, in which the buffer is correctly sized for a single flow. Even if the buffer were sized for multiple flows (say 25), 20,000 packets would still have to be discarded (16,667 × (1 + 1/√25) = 20,000).
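The headline figures in this example can be checked with a few lines of arithmetic. The following is only a sketch: the function name and its parameters are illustrative and are not part of any TCP implementation.

```python
import math

def slow_start_figures(link_bps, rtt_s, packet_bytes, initial_window):
    """Return (full_window_packets, doublings_needed) for a doubling start-up."""
    # Packets that fit into one round trip at the full link rate.
    full_window = link_bps * rtt_s / (packet_bytes * 8)
    # Doublings needed for the initial window to grow to that size.
    doublings = math.log2(full_window / initial_window)
    return full_window, doublings

# Values taken from the example: 1 Gb/s link, 200 ms round trip,
# 1500 B packets, initial window of ten packets.
full_window, doublings = slow_start_figures(1e9, 0.2, 1500, 10)
print(round(full_window))      # 16667 packets per round trip
print(round(doublings, 2))     # 10.7, i.e. "nearly eleven" round trips

# Discard estimate quoted for a buffer sized for 25 flows.
print(round(full_window * (1 + 1 / math.sqrt(25))))   # 20000
```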
In this example TCP has already taken 12 round trip times, over 2 seconds in this case, to find its correct operating rate. Further, when TCP drops such a large number of packets, it can take a long time to recover, sometimes leading to a black-out of many more seconds (100 seconds has been reported [Ha08] due to long time-outs or the time it takes for the host to free up large numbers of buffers). In the process, the voice flow is also likely to black out for at least 200 ms and often much longer, due to at least 50% of the voice packets being dropped over this period.
This shows there are two problems during flow start-up: i) a long time before a flow stabilises on the correct rate for the available capacity and ii) a very large amount of loss damage to itself and to other flows before a newly starting flow discovers it has increased its rate beyond the available capacity (overshoot).
These problems do not only arise when a new flow starts up. A very similar situation occurs when a flow has been idle for a time, then re-starts. When a flow restarts after idling, it is not sufficient for it to remember what the available capacity was when it was last active, because in the meantime other traffic might have started to use the same capacity, or flows that were using the same capacity might have finished, leaving much more available capacity than earlier.
Nor do these problems arise only when a flow starts or restarts. If two flows are sharing the same capacity, they will each continually and slowly try to use more capacity, deliberately causing regular buffer overflows and losses. When either flow detects a loss, it responds by slowing down. The outcome of all the increases and all the decreases is that each flow consumes a proportion of the capacity on average. However, when one flow finishes, the other flow is never told explicitly that more capacity is available. It merely continues to increase slowly for what can be a very long time before it eventually consumes all the capacity the other flow freed up.
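To give a feel for how long "a very long time" can be, the sketch below reuses the 1 Gb/s, 200 ms figures from the earlier example. It rests on two assumptions that the text does not state: the two flows were sharing the link equally, and the remaining flow grows its window by one packet per round trip, as classic TCP congestion avoidance does. The function name is illustrative.

```python
def rounds_to_fill(current_window, target_window, increase_per_round=1):
    """Round trips of additive increase needed to grow from current to target."""
    missing = max(0, target_window - current_window)
    # Ceiling division without importing math.
    return -(-missing // increase_per_round)

RTT_S = 0.2                       # round trip time from the earlier example
target = 16667                    # packets per round trip at 1 Gb/s
current = target // 2             # assumed: an equal share while two flows competed

rounds = rounds_to_fill(current, target)
seconds = rounds * RTT_S
print(rounds)                     # 8334 round trips
print(round(seconds))             # 1667 seconds, i.e. close to half an hour
```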
Recently, new TCP algorithms such as Cubic TCP have been designed that seek out newly available capacity more quickly. However, the faster they find new capacity, the more they overshoot between reaching the new limit of available capacity and detecting that they have reached it a round trip later.
As the capacity of Internet links increases, and the bit-rates that flows use increase, this open-loop control dilemma between increasing too slowly and overshooting too much gets progressively more serious.
A number of different methods for signalling congestion in packet networks, i.e. that queues are building up, are known in the prior art. For example, active queue management (AQM) techniques (e.g. RED, REM, PI, PIE, CoDel) can be configured to drop a proportion of packets when it is detected that a queue is starting to grow but before the queue is full. All AQM algorithms drop more packets as the queue grows longer.
An active queue management algorithm can be arranged to discard a greater proportion of traffic marked with a lower class-of-service, or marked as out-of-contract. For instance, weighted random early detection [WRED] determines whether to drop an arriving packet using the RED AQM algorithm but the parameters used for the algorithm depend on the class of service marked on each arriving packet.
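A minimal sketch of this class-dependent behaviour follows. It uses the basic RED drop curve, in which the drop probability rises linearly between a minimum and a maximum queue threshold, and omits the queue averaging and drop spacing of a full RED implementation. The class names and parameter values are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical per-class RED parameters:
# (min_threshold, max_threshold, max_probability), thresholds in packets.
WRED_PROFILES = {
    "in_contract": (40, 80, 0.05),
    "out_of_contract": (10, 40, 0.20),
}

def drop_probability(avg_queue, min_th, max_th, max_p):
    """Basic RED curve: 0 below min_th, linear up to max_p, 1 at or above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

def should_drop(avg_queue, service_class, rng=random.random):
    """Decide whether to drop an arriving packet of the given class."""
    min_th, max_th, max_p = WRED_PROFILES[service_class]
    return rng() < drop_probability(avg_queue, min_th, max_th, max_p)

# At the same queue length, out-of-contract traffic sees a higher drop probability.
for cls, params in WRED_PROFILES.items():
    print(cls, drop_probability(30, *params))
```

With these illustrative numbers, a queue of 30 packets leaves in-contract traffic untouched while out-of-contract traffic is already being dropped with a probability of about 13%.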
Explicit Congestion Notification (ECN) [RFC3168] conveys congestion signals in TCP/IP networks by means of a two-bit ECN field in the IP header, whether in IPv4 (FIG. 2) or IPv6 (FIG. 3). Prior to the introduction of ECN, these two bits were present in both types of IP header, but always set to zero. Therefore, if these bits are both zero, a queue management process assumes that the packet comes from a transport protocol on the end-systems that will not understand the ECN protocol so it only uses drop, not ECN, to signal congestion.
The meaning of all four combinations of the two ECN bits in IPv4 or IPv6 is shown in FIG. 4. If either bit is one, it tells a queue management process that the packet has come from an ECN-capable transport (ECT), i.e. both the sender and receiver understand ECN marking, as well as drop, as a signal of congestion.
When a queue management process detects congestion, for packets with a non-zero ECN field, it sets the ECN field to the Congestion Experienced (CE) codepoint. On receipt of such a marked packet, a TCP receiver sets the ECN-Echo (ECE) flag in the TCP header of the packets it sends to acknowledge the data packets it has received. A standard TCP source interprets ECE feedback as if the packet had been dropped, at least for the purpose of its rate control, but it does not have to retransmit the ECN-marked packet.
Drop and ECN marking are not mutually exclusive signals, and flows that enable ECN have the potential to detect and respond to both signals.
The ECN standard [RFC3168] deliberately assigns the same meaning to both the ECN codepoints with one bit set (01 and 10). They both mean that the transport is ECN-capable (ECT), and if they need to be distinguished they are termed ECT(1) and ECT(0) respectively. The intention was to allow scope for innovative new ways to distinguish between these fields to be proposed in future.
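The four codepoints and the marking rule described above can be summarised in a short sketch. The codepoint values follow [RFC3168], and the ECN field occupies the two least significant bits of the IPv4 type-of-service octet or the IPv6 traffic class octet. The function names are illustrative, and the decision that congestion should be signalled on a given packet is assumed to have been taken already by the queue management process.

```python
from enum import IntEnum

class Ecn(IntEnum):
    NOT_ECT = 0b00   # transport does not understand ECN
    ECT_1 = 0b01     # ECN-capable transport, ECT(1)
    ECT_0 = 0b10     # ECN-capable transport, ECT(0)
    CE = 0b11        # Congestion Experienced

def ecn_from_octet(octet):
    """Extract the ECN field from an IPv4 TOS or IPv6 traffic class octet."""
    return Ecn(octet & 0b11)

def signal_congestion(ecn_field):
    """Return (new_ecn_field, dropped) for a packet chosen to carry a congestion signal."""
    if ecn_field == Ecn.NOT_ECT:
        # The end-systems would not understand a mark, so drop is the only signal.
        return ecn_field, True
    # Any non-zero field means ECN-capable: mark instead of dropping.
    return Ecn.CE, False

print(signal_congestion(ecn_from_octet(0x02)))   # ECT(0) packet: marked CE, not dropped
print(signal_congestion(ecn_from_octet(0x00)))   # Not-ECT packet: dropped
```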
A number of authors have proposed techniques to mitigate the dilemma between starting a data flow fast and overshooting. This research has mostly remained relatively obscure, either because it improves only one half of the dilemma at the expense of the other, or because the proposals have been considered impractical to deploy. Also, most researchers have focused on the closed-loop phase of congestion control, perhaps being unaware that the open-loop phase is becoming the dominant problem as rates increase. The proposals fall into two groups: i) those that propose to change end-systems only and ii) those that propose to change both end-systems and queuing algorithms.
Paced Start [Hu03] proposes to solely change the sender, to monitor the queuing delay that a buffer adds between packets when sent in trains during TCP slow-start. Then it paces the packets sent in subsequent rounds. This avoids TCP's overshoot, but it takes even longer than TCP's slow-start to reach the available capacity.
Hybrid slow-start [Ha08] keeps TCP's slow-start algorithm unchanged but the sender attempts to stop doubling the congestion window at the point it will start to overshoot, rather than a round trip time after it has overshot. It does this by monitoring increases in the delays between the early acknowledgements in each round, and by monitoring when the duration of each whole acknowledgement train approaches the round-trip time. Although hybrid slow-start was deployed in Linux, it is typically turned off because it seems to reduce performance more often than it improves it. This is because sometimes it ends the start-up phase too early and then takes a long time to reach the available capacity.
CapStart [Cav09] uses packet-pair delay measurements, similarly to hybrid slow-start, in order to end slow-start early (limited slow-start). However, it makes great gains by reverting to classic slow-start if it measures that the bottleneck is probably at the sender rather than in the network, in which case there will be no large loss episode to avoid. The experimentation with CapStart confined itself to scenarios with no cross-traffic, in order to remain tractable.
Liu et al [Liu07] investigated what the impact would be if every flow simply sent all its data paced out over the first round trip time (termed Jump Start). If acknowledgements report losses or if the first acknowledgement returns while there is still data to send, the algorithm moves into TCP's standard retransmission and congestion avoidance behaviour. The authors monitored current Internet flows and found that only about 7.4% of them comprise more than the three packets that a sender would send immediately anyway under the existing standard behaviour. The paper is inconclusive on whether the edges of the Internet would cope with the very high loss rates that this 7.4% of flows would cause (because they represent a very much larger proportion of the bytes on the Internet).
Although [Liu07] is primarily about a change to the sender only, it mentions that senders could mark any packets in excess of the three allowed in the first round as eligible for preferential discard by switches and routers. This would protect competing flows from any overshoot, but it would require preferential discard to be enabled at all potential bottleneck buffers. The rest of the schemes described below also require both end-systems and network buffers to be modified.
Fast-Start [Padman98] uses a possibly stale congestion window from previous connections during start-up. However, to compensate, it sends packets with higher drop priority (i.e. more likely to be dropped). It also improves TCP's handling of losses to cope with the higher loss probability. Higher drop priority is defined as follows: “The router implements a simple packet drop priority mechanism. It distinguishes between packets based on a 1-bit priority field. When its buffer fills up and it needs to drop a packet, it picks a low-priority packet, if available, first. Since fast start packets are assigned a low priority, this algorithm ensures that an over-aggressive fast start does not cause (non-fast start) packets of other connections to be dropped.”
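The quoted drop priority mechanism can be sketched as a small buffer model. The quotation does not say which low-priority packet is chosen when several are queued, so the choices below (discard a low-priority arrival outright, otherwise evict the most recently queued low-priority packet) are assumptions, as are the class and method names.

```python
from collections import deque

class PriorityDropBuffer:
    """Fixed-size FIFO that sacrifices low-priority packets first when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()          # entries are (packet_id, low_priority)

    def enqueue(self, packet_id, low_priority):
        """Add a packet; return the id of whichever packet was dropped, or None."""
        if len(self.queue) < self.capacity:
            self.queue.append((packet_id, low_priority))
            return None
        if low_priority:
            # Buffer full and the arrival is itself low priority: discard it.
            return packet_id
        # Buffer full, high-priority arrival: evict a queued low-priority packet.
        for i in range(len(self.queue) - 1, -1, -1):
            if self.queue[i][1]:
                victim = self.queue[i][0]
                del self.queue[i]
                self.queue.append((packet_id, low_priority))
                return victim
        # No low-priority packet is available, so fall back to tail drop.
        return packet_id

buf = PriorityDropBuffer(capacity=2)
buf.enqueue("voice-1", low_priority=False)
buf.enqueue("faststart-1", low_priority=True)
print(buf.enqueue("voice-2", low_priority=False))   # faststart-1 is evicted
```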
TCP-Peach [Akyildiz01] also uses probe packets that are marked to be treated by the network with lower priority in order to detect spare capacity in a satellite network context.
Quick-start involves a modification to TCP for the sender to explicitly ask all routers on the path what bit-rate it should start at. Quick-start will not work well unless every router has been upgraded to participate in the signalling. Also, Quick-start does not have a way to signal to lower-layer switches that are not IP-aware, and it requires that all sources are trusted by the network to subsequently send at the rate the network asks them to send at.
U.S. Pat. No. 7,680,038 (Gourlay) discloses a method for optimizing bandwidth usage while controlling latency. Gourlay teaches switching between a “probe mode” and a “steady mode”. In the probe mode, a bandwidth estimation module determines the available bandwidth for a connection by sending “bursts” of packets and ramping up, or increasing, the available bandwidth until an acknowledgment packet indicating a loss of a packet is received; for the next burst the available bandwidth is decreased. After an estimated available bandwidth is determined, data is sent out at a fraction of the estimated available bandwidth.
There are some known alternative uses for the two ECN-capable transport (ECT) codepoints.
One idea has been to use the ECT(1) value to signal an intermediate level of congestion between uncongested (ECT(0)) and congested (CE). This idea has been standardised in one variant of an approach termed pre-congestion notification (PCN [RFC5670]). PCN uses a virtual queue, which is not actually a queue; rather it is a number that represents the length of queue that would have formed if the buffer were drained more slowly than the real buffer drains. One variant of PCN uses two virtual queues, one configured to drain at a slower rate than the other. When the slower virtual queue fills, it marks packets with the ECT(1) codepoint, and when the faster virtual queue fills, it marks packets with the CE codepoint. The PCN approach is not standardised to be used as a signal to end-systems, only within the network. However, virtual queues have been used to signal to end-system algorithms, e.g. High Utilisation Ultra Low Latency (HULL) [Alizadeh12].
In “Single PCN threshold marking by using PCN baseline encoding for both admission and termination controls”, appendix D, by D. Satoh et al. [1], a mechanism is described for marking the proportion of packets that represents the instantaneous utilisation of a logical link. Utilisation of the logical link is signalled by marking the ECN field of every packet that arrives when the virtual queue is non-empty. The proportion of bits in marked packets relative to all bits then represents instantaneous utilisation, but the representation is only precise for a Poisson distribution of inter-arrival times. Again, the technique in Satoh et al. was proposed in the context of admission control signalling, but it could be used in a similar way to HULL by end-systems for congestion control.
There have been other proposals from the research community for a network node to signal an early warning of impending congestion to end-systems, as well as signalling actual queue growth, in order to address the open-loop control problem at the start of a new data flow. For instance, VCP in “One more bit is enough” by Yong Xia et al. [2] uses the ECT(1) codepoint of the ECN field to signal when utilisation of a link has exceeded a set threshold, in a similar way to PCN.
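The two-virtual-queue variant can be sketched as a pair of counters. This is only an illustration of the idea described above, not the marking behaviour specified in [RFC5670]: "fills" is modelled as the counter exceeding a threshold, and the class name, drain rates, and thresholds are hypothetical.

```python
class VirtualQueue:
    """A counter that behaves like a queue drained at a configured rate."""

    def __init__(self, drain_bps, threshold_bytes):
        self.drain_bytes_per_s = drain_bps / 8
        self.threshold_bytes = threshold_bytes
        self.backlog = 0.0            # virtual backlog in bytes
        self.last_time = 0.0          # time of the previous arrival in seconds

    def arrive(self, now, size_bytes):
        """Account for one arrival; return True if the threshold is exceeded."""
        elapsed = now - self.last_time
        self.backlog = max(0.0, self.backlog - elapsed * self.drain_bytes_per_s)
        self.last_time = now
        self.backlog += size_bytes
        return self.backlog > self.threshold_bytes

def mark(slow_vq, fast_vq, now, size_bytes):
    """Return the codepoint to write into an arriving ECT(0) packet."""
    slow_full = slow_vq.arrive(now, size_bytes)
    fast_full = fast_vq.arrive(now, size_bytes)
    if fast_full:
        return "CE"
    if slow_full:
        return "ECT(1)"
    return "ECT(0)"

# Hypothetical configuration: the slower virtual queue drains at half the rate.
slow = VirtualQueue(drain_bps=8_000, threshold_bytes=3_000)
fast = VirtualQueue(drain_bps=16_000, threshold_bytes=3_000)
print([mark(slow, fast, t, 1500) for t in range(5)])
```

Because the slower virtual queue builds a backlog at a lower arrival rate than the faster one, the ECT(1) marks appear first and act as the intermediate signal, with CE marks following only if the load keeps rising.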
In “AntiECN Marking: A Marking Scheme for High Bandwidth Delay Connections”, S. Kunniyur [3], each packet carries a bit called the Anti-ECN bit in its header. The bit is initially set to zero. Each router along the packet's route checks to see if it can allow the flow to increase its sending rate by determining whether the packet has arrived at an empty virtual queue. If so, the router sets the bit to one. If on arrival the virtual queue is non-empty, it sets the bit to zero. The receiver then echoes the bit back to the sender using the ACK packet. If the bit is set to one, the sender increases its congestion window and hence its rate.