A network flow is a sequence of data packets that carry identical values in the source address field of the Internet Protocol (IP) packet header, in the destination address field of the IP header, possibly in other fields of the IP header, and also in fields of other protocol headers, such as the source and destination port number fields of the Transmission Control Protocol (TCP) header. An example of a network flow is the sequence of packets generated by the traffic source (or the sender) of a TCP connection.
Packet switches and routers in many portions of the network allocate a single queue to multiple flows whose packets are to be dispatched over the same output link. An alternative to using a single queue for multiple flows is to assign packets of different flows to different queues, but this may be impractical in front of links that accommodate large numbers (thousands or more) of network flows. In cases where a single queue is used to accommodate packets of multiple flows, the sizing of the buffer space allocated for the queue is typically driven by the need to avoid losses of link utilization in the presence of TCP traffic.
The adaptive behavior of TCP sources makes the utilization of a link traversed by TCP traffic very sensitive to the policy that the buffer in front of the link uses for deciding on the admission of new packets at times of congestion. In the absence of packet losses, the TCP sources keep increasing the amount of traffic they generate, causing further congestion in the packet buffer and filling the available buffer space. Conversely, when a packet is dropped, the corresponding source reduces its activity, which after some time relieves congestion in front of the bottlenecked link. Full utilization of the link is achieved when the buffer is never emptied and is the result of a fine balance between the fraction of TCP sources that recognize packet losses and the fraction of TCP sources that are allowed to keep increasing their traffic generation rate.
For many years, it was commonly accepted within the IP networking research and development community that, in front of a link with capacity C, a buffer space of C·θ should be allocated for a queue handling TCP traffic flows, where θ is the average packet round-trip time (RTT) estimated over all the TCP flows in the queue. The goal of this buffer allocation criterion, first advocated in C. Villamizar and C. Song, “High-Performance TCP in ANSNET,” ACM Computer Communications Review, 24(5):45-60, 1994 [Villamizar, 1994] and commonly referred to as the bandwidth-delay product (BDP) rule, is to avoid queue underflow conditions, and therefore reductions of link utilization, as a consequence of packet losses occurring at the queue at times of traffic congestion. With θ=250 ms and C=40 Gbps, which are typical values for 2010 core network links, the buffer space needed in front of the link is 10 Gbit=1.25 GB. This relatively large buffer size constitutes a major issue for network equipment manufacturers and network operators for at least two reasons. First, the size of the buffer makes it impossible to implement on-chip buffer memories, negatively impacting system density, design cost, and energy consumption. Second, a buffer sized with the BDP rule may easily add a contribution in the order of magnitude of the average RTT to the end-to-end forwarding delay of packets. This large added delay, possibly encountered by a packet multiple times along the data path of its network flow, may cause major degradations in the end-user perception of network applications.
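The arithmetic behind the BDP figure quoted above can be checked with a short sketch. The function name `bdp_buffer_bits` is hypothetical and chosen only for illustration; the inputs are the C=40 Gbps and θ=250 ms values from the text, and gigabytes are taken as 10^9 bytes.

```python
def bdp_buffer_bits(capacity_bps: float, rtt_s: float) -> float:
    """Buffer size in bits under the BDP rule: C * theta."""
    return capacity_bps * rtt_s


# Values from the text: C = 40 Gbps, theta = 250 ms.
bdp_bits = bdp_buffer_bits(40e9, 0.250)

# Convert bits to gigabytes (8 bits per byte, 1 GB = 1e9 bytes).
bdp_gigabytes = bdp_bits / 8 / 1e9

print(f"BDP buffer: {bdp_bits / 1e9:.2f} Gbit = {bdp_gigabytes:.2f} GB")
```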
In S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, 1(4):397-413, 1993 [Floyd, 1993], the authors introduce a buffer management scheme called Random Early Detection (RED) where packets may start being dropped long before the queue occupancy approaches the available buffer space. The purpose of RED is to distribute the losses of packets as fairly as possible across all TCP flows that traverse the queue, and to avoid the global synchronization condition, whereby a large number of TCP sources simultaneously stop sending packets after massive back-to-back packet losses, causing reductions of link utilization. With RED, the decision to drop packets is based on the comparison of a small set of buffer occupancy thresholds (bmin and bmax) with an average queue length (AQL) that is updated at every packet arrival. Together with the maximum drop probability pmax, the current placement of the AQL relative to the buffer occupancy thresholds defines the probability of dropping a packet upon its arrival at the queue. While the merits of RED have been generally established, the technique is only partially utilized in practical network equipment because the performance of the scheme is heavily sensitive to the appropriate tuning of the scheme's parameters according to the characteristics of the TCP traffic in the queue.
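The RED drop decision described above can be sketched as follows. This is a simplified illustration, not the full algorithm of [Floyd, 1993]: it assumes an exponentially weighted moving average for the AQL with averaging weight w, a drop probability that rises linearly from 0 at bmin to pmax at bmax, and forced drops at or above bmax. Refinements such as spacing drops by the count of packets since the last drop, or handling idle periods, are omitted. The class and method names are hypothetical.

```python
import random


class RedQueue:
    """Simplified RED admission logic (illustrative sketch only)."""

    def __init__(self, b_min, b_max, p_max, w, rng=None):
        self.b_min = b_min    # AQL threshold where drops begin
        self.b_max = b_max    # AQL threshold where every packet is dropped
        self.p_max = p_max    # drop probability reached just below b_max
        self.w = w            # averaging weight of the low-pass filter
        self.aql = 0.0        # average queue length
        self.rng = rng or random.Random()

    def update_aql(self, queue_len):
        # Exponentially weighted moving average of the instantaneous length.
        self.aql = (1.0 - self.w) * self.aql + self.w * queue_len
        return self.aql

    def drop_probability(self):
        if self.aql < self.b_min:
            return 0.0
        if self.aql >= self.b_max:
            return 1.0
        # Linear ramp between the two thresholds.
        return self.p_max * (self.aql - self.b_min) / (self.b_max - self.b_min)

    def on_arrival(self, queue_len):
        """Update the AQL at a packet arrival; return True to drop the packet."""
        self.update_aql(queue_len)
        return self.rng.random() < self.drop_probability()
```

Because the drop probability is a direct function of the AQL, a larger required drop rate can only be obtained at a larger average queue length, which is one source of the tuning sensitivity noted above.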
Having recognized the sensitivity of the RED performance to the degree of matching between the traffic characteristics (mainly qualified by the number of active TCP flows and by the per-flow distribution of RTT values) and the chosen values for the RED configuration parameters (those that define the profile of the packet drop probability curve, plus the averaging weight w, which approximately defines the cutoff frequency of the low-pass filter that implements the AQL), both authors of the original RED have subsequently proposed modifications aimed at improving the performance of the algorithm.
In V. Jacobson, K. Nichols, and K. Poduri, “RED in a Different Light,” Unpublished, 1999, <http://www.cnaf.infn.it/~ferrari/papers/ispn/red_light_9_30.pdf> [Jacobson, 1999], the authors offer useful recommendations to improve the performance of RED and simplify its configuration. Such recommendations include: (a) updating the AQL with instantaneous queue length samples gathered at fixed time intervals, instead of relying on packet arrivals; (b) setting the cutoff frequency of the low-pass filter that defines the AQL at a value that is low enough to smooth out all queue length dynamics that occur at the same timescale as the RTT; and (c) setting the value of the buffer occupancy threshold where packets start being dropped at bmin=0.3·C·θ, where θ=100 ms if the actual distribution of RTT values is not known. While the recommendations contribute to improving the link utilization of RED, they are not sufficient to avoid losses of link utilization under a broad set of traffic scenarios. Furthermore, the choice of bmin=0.3·C·θ fails to deliver substantial reductions of allocated buffer memory compared to the C·θ mandate of the BDP rule.
In S. Floyd, R. Gummadi, and S. Shenker, “Adaptive RED: An Algorithm for Increasing the Robustness of RED's Active Queue Management,” Unpublished, 2001, <http://icir.org/floyd/papers/adaptiveRed.pdf> [Floyd, 2001], the authors take advantage of the recommendations in [Jacobson, 1999] and of concepts newly presented in W. Feng, D. Kandlur, D. Saha, and K. Shin, “A Self-Configuring RED Gateway,” Proceedings of IEEE Infocom 1999 [Feng, 1999] to define an Adaptive RED (ARED) algorithm where the slope of the drop probability function dynamically adjusts to the evolution of the AQL, increasing it when the AQL exceeds an upper threshold bu and decreasing it when the AQL falls below a lower threshold bl. Compared to the original formulation of RED, the ARED upgrade improves both performance and ease of configuration, and leaves the network administrator with only two parameters to configure: the expected value of θ and the desired value of the average queueing delay d. The algorithm that controls the slope of the packet drop probability is not optimized in [Floyd, 2001] for robustness and speed of convergence. Furthermore, it still relies on the mapping of AQL levels onto packet drop probabilities: the higher the drop rate needed to maintain the packet buffer within stability boundaries, the higher the AQL that sets that drop rate and therefore also the contribution of the buffer to the overall RTT experienced by the TCP flows. Finally, the linear dependency of the packet drop probability on the AQL remains a cause of instability for ARED as it is for the native formulation of RED, and leads to losses of link utilization that may be substantial under ordinary traffic configurations.
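The slope adaptation described above can be sketched as a periodic adjustment of the maximum drop probability. This is a minimal sketch under stated assumptions: the function name is hypothetical, the additive-increase and multiplicative-decrease form is one plausible realization of "increasing" and "decreasing" the slope, and the constants (increment cap 0.01, decrease factor 0.9, bounds 0.01 and 0.5) are illustrative values not given in the text above.

```python
def adapt_p_max(p_max, aql, b_l, b_u,
                beta=0.9, p_floor=0.01, p_ceil=0.5):
    """Return the adjusted maximum drop probability for one adaptation step.

    b_l and b_u are the lower and upper AQL thresholds from the text;
    beta, p_floor and p_ceil are assumed, illustrative constants.
    """
    if aql > b_u and p_max <= p_ceil:
        # AQL too high: steepen the drop curve (additive increase).
        return p_max + min(0.01, p_max / 4.0)
    if aql < b_l and p_max >= p_floor:
        # AQL too low: flatten the drop curve (multiplicative decrease).
        return p_max * beta
    # AQL within the target band: leave the slope unchanged.
    return p_max


# Example: one adaptation step in each of the three regimes.
steeper = adapt_p_max(0.1, aql=500, b_l=100, b_u=200)
flatter = adapt_p_max(0.1, aql=50, b_l=100, b_u=200)
unchanged = adapt_p_max(0.1, aql=150, b_l=100, b_u=200)
```

Even with this adaptation, the drop probability at any instant is still obtained from the AQL through a linear ramp, so the dependency criticized in the paragraph above is unchanged.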
More recently, in G. Appenzeller, I. Keslassy, and N. McKeown, “Sizing Router Buffers,” Proceedings of ACM SIGCOMM 2004, Portland, Oreg., August 2004 [Appenzeller, 2004], the authors study the sizing requirements for a buffer that accommodates a large number of desynchronized TCP flows in front of a high-speed network link, concluding that a buffer size in the order of magnitude of C·θ/√N, where N is the number of desynchronized long-lived TCP flows at the link, is sufficient to keep the probability of occurrence of the buffer underflow condition under a controllable portion of the total time. In a network link with many thousands of desynchronized long-lived bottlenecked TCP flows (a set of TCP flows is desynchronized when the transmission windows of the flows in the set reach their peaks at different times; a long-lived TCP flow is one whose source has left the slow-start state at least once; a bottlenecked TCP flow is one for which the average end-to-end throughput is set by the fair share that the flow receives at the congested buffer under consideration), the authors of [Appenzeller, 2004] state that their small-buffer rule should yield a reduction in the size of the packet buffer that is sufficient to enable its implementation using on-chip memory. However, the robustness and scope of this buffer-size reduction approach have been successfully challenged in many papers that followed, finally inducing the original authors of the proposal to drastically revise their conclusions.
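To give a sense of scale for the small-buffer rule, the following sketch evaluates C·θ/√N for the same link used in the BDP example (C=40 Gbps, θ=250 ms). The flow count N=10,000 is an assumed value chosen only for illustration, and the function name is hypothetical.

```python
import math


def small_buffer_bits(capacity_bps: float, rtt_s: float, n_flows: int) -> float:
    """Buffer size in bits under the small-buffer rule: C * theta / sqrt(N)."""
    return capacity_bps * rtt_s / math.sqrt(n_flows)


# C and theta from the text; N = 10,000 is an illustrative assumption.
small_bits = small_buffer_bits(40e9, 0.250, 10_000)
small_megabytes = small_bits / 8 / 1e6

print(f"Small-buffer rule: {small_bits / 1e6:.1f} Mbit = {small_megabytes:.1f} MB")
```

Under these assumptions the buffer shrinks by a factor of √N = 100 relative to the BDP rule, from 1.25 GB to 12.5 MB, which is the kind of reduction the on-chip memory claim relies on.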
What is desirable is a buffer management/packet admission scheme that allows network system designers to reduce the amount of memory needed for buffering packets in front of network interfaces, so that the same memories can be integrated in the same hardware components that process and forward packets, instead of requiring separate hardware components only for buffering purposes.