1. Field of the Invention
The present invention relates to the field of networks and to methods and apparatus for congestion control.
Portions of the disclosure of this patent document contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all rights whatsoever.
2. Background Art
Computer networks allow communication between two or more computers. Networks include local area networks (LANs), wide area networks (WANs), the Internet, wireless networks, mixed device networks, and others. One limitation to the efficient use of networks is network congestion, which occurs when the number of message sources and destinations, and the amount of message traffic, exceed what the network can handle efficiently. In the prior art, such congestion problems are handled by implementing congestion control.
Congestion control is a distributed algorithm to share network resources among competing users. It is used in situations where the availability of resources and the set of competing users vary over time unpredictably, yet efficient sharing is desired. These constraints, unpredictable supply and demand and efficient operation, have been solved in the prior art by using feedback control. Feedback control, also referred to as “closed loop” control, involves the use of some metric to determine dynamic and typically real-time adjustment of a system to provide optimum results. Such systems are distinguished from so called “open loop” control systems where there is no feedback (for example, cooking a turkey without using a meat thermometer is open loop and using a meat thermometer is a closed loop feedback system).
In this prior art approach, traffic sources dynamically adapt their rates in response to congestion in their paths. An example of a network that uses feedback control as a congestion control is the Internet (using Transmission Control Protocol (TCP) in source and destination computers involved in data transfers). Note that although we discuss the Internet, the present application applies to other networks as well.
The congestion control algorithm in the current TCP, also known as “Reno”, was developed in 1988 and has gone through several changes since. Current research predicts that as bandwidth-delay product continues to grow, TCP Reno will eventually become a performance bottleneck. In other words, the very control system used to manage congestion will lead to unsolvable congestion even as the network (Internet) continues to offer higher bandwidth and performance. Initially this may seem counter-intuitive. How can a system perform worse when it has greater resources with which to work? The following four difficulties contribute to the poor performance of TCP Reno in networks with large bandwidth-delay products.
1. At the packet level, linear increase by one packet per Round-Trip Time (RTT) is too slow, and multiplicative decrease per loss event is too drastic. Current schemes use this “speed up slowly/slow down quickly” approach to packet traffic control and it is not effective in high bandwidth systems.
2. At the flow level, maintaining large average congestion windows requires an extremely small equilibrium loss probability, and maintaining such a small loss probability is not practical in prior art systems.
3. At the packet level, oscillation is unavoidable because TCP uses a binary congestion signal (packet loss).
4. At the flow level, the dynamics are unstable, leading to severe oscillations that can only be reduced by the accurate estimation of packet loss probability and a stable design of the flow dynamics. Current systems do not allow for accurate enough estimation of packet loss.
Flow Level and Packet Level
A congestion control algorithm can be designed at two levels. The flow-level (macroscopic) design aims to achieve high utilization, low queuing delay and loss, fairness, and stability. The packet-level design implements these flow level goals within the constraints imposed by end-to-end control. Historically for TCP Reno, packet-level implementation was introduced first. The resulting flow-level properties, such as fairness, stability, and the relationship between equilibrium window and loss probability, were then understood as an afterthought. In contrast, other prior art packet-level designs such as HSTCP and STCP are guided by flow-level goals.
Packet and Flow Level Modeling
The congestion avoidance algorithm of TCP Reno and its variants uses a window adjustment technique (from the AIMD algorithm) as follows:

$$\text{Ack:}\quad w \leftarrow w + \frac{1}{w}$$

$$\text{Loss:}\quad w \leftarrow w - \frac{1}{2}\,w$$

Although this is a packet-level model, it induces certain flow-level properties, such as throughput, fairness, and stability, which can be understood through a flow-level model of the AIMD algorithm. The window $w_i(t)$ of source $i$ increases by one packet per RTT and decreases per unit time by

$$x_i(t)\,q_i(t)\cdot\frac{1}{2}\cdot\frac{4}{3}\,w_i(t)\ \text{packets}$$

where $x_i(t) := w_i(t)/T_i(t)$ pkts/sec, $T_i(t)$ is the round-trip time, and $q_i(t)$ is the delayed end-to-end loss probability in period $t$. Here, $4w_i(t)/3$ is the peak window size that gives the average window of $w_i(t)$. Hence, a flow-level model of AIMD is:

$$\dot{w}_i(t) = \frac{1}{T_i(t)} - \frac{2}{3}\,x_i(t)\,q_i(t)\,w_i(t) \tag{1}$$

Setting $\dot{w}_i(t) = 0$ in (1) yields the $1/\sqrt{q}$ formula for TCP Reno, which relates loss probability to window size in equilibrium:
$$q_i^* = \frac{3}{2\,w_i^{*2}} \tag{2}$$
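The relationship between the flow-level model (1) and the equilibrium (2) can be checked numerically. The following is a minimal sketch, not part of the disclosed method: it integrates (1) with a forward-Euler step under the simplifying assumptions that the loss probability and round-trip time are constant (in the model they are delayed and time-varying). The function names and the step size are illustrative choices.

```python
import math


def reno_equilibrium_window(q):
    """Equilibrium window from (2): q* = 3 / (2 w*^2), solved for w*."""
    return math.sqrt(3.0 / (2.0 * q))


def simulate_reno_flow(w0, q, rtt, duration, dt=None):
    """Forward-Euler integration of (1): dw/dt = 1/T - (2/3) * x * q * w.

    Assumes constant loss probability q and constant round-trip time rtt,
    which is a simplification made only for this illustration.
    """
    if dt is None:
        dt = rtt / 10.0
    w = w0
    for _ in range(int(duration / dt)):
        x = w / rtt  # sending rate in packets per second
        w += dt * (1.0 / rtt - (2.0 / 3.0) * x * q * w)
    return w


# Example: a constant loss probability of 1.5e-4 with a 100 ms round trip.
final_window = simulate_reno_flow(w0=10.0, q=1.5e-4, rtt=0.1, duration=200.0)
target_window = reno_equilibrium_window(1.5e-4)
```

Under these assumptions the simulated window settles at the value predicted by (2), which illustrates why a large window requires a very small loss probability.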
From (1),
$$q_i^*\,w_i^* = \frac{3}{2\,w_i^*}$$

It can be seen that the number of packet losses per round-trip time decreases in inverse proportion to the equilibrium window size.

Defining

$$\kappa_i(w_i, T_i) = \frac{1}{T_i} \quad\text{and}\quad u_i(w_i, T_i) = \frac{1.5}{w_i^2}$$

and noting that $w_i = x_i T_i$, (1) can be expressed as:

$$\dot{w}_i(t) = \kappa_i(t)\left(1 - \frac{q_i(t)}{u_i(t)}\right) \tag{3}$$

where we have used the shorthand $\kappa_i(t) = \kappa_i(w_i(t), T_i(t))$ and $u_i(t) = u_i(w_i(t), T_i(t))$. It can be shown that different variants of TCP all have the same dynamic structure (3) at the flow level. They differ in the choices of the gain function $\kappa_i$ and marginal utility function $u_i$, and whether the congestion measure $q_i$ is loss probability or queuing delay.

Equilibrium Problem
The equilibrium problem at the flow level is expressed in (2): the end-to-end loss probability must be small to sustain a large window size, making the equilibrium difficult to maintain in practice, as bandwidth-delay product increases.
Even though equilibrium is a flow-level notion, this problem manifests itself at the packet level, where a source increments its window too slowly and decrements it too drastically. Prior art approaches can be compared to driving in a car and only being able to see 10 feet ahead of your car. The car accelerates slowly until the driver sees another car and then rapidly brakes to avoid collision. This works well in a parking lot where speeds are low and space is limited. But the same system on a freeway does not work because the greater speeds that can be obtained are negated by the limited look ahead of the driver. The result is continuous acceleration and braking, eliminating all advantages of the greater room and speed. The same thing applies when the current systems are applied to high bandwidth networks.
For example, when the peak window is 80,000 packets (corresponding to an “average” window of 60,000 packets, necessary to sustain 7.2 Gbps using 1,500-byte packets with an RTT of 100 ms), it takes 40,000 RTTs, or almost 70 minutes, to recover from a single packet loss.
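The figures in this example follow from simple arithmetic, which the sketch below reproduces. It assumes, as in the flow-level model above, that the average window is three quarters of the peak window and that Reno halves the window on a loss and regains one packet per RTT. The function and key names are illustrative only.

```python
def reno_recovery_example(peak_window=80_000, packet_bytes=1_500, rtt_s=0.1):
    """Reproduce the arithmetic behind the single-loss recovery example."""
    avg_window = 0.75 * peak_window              # peak = 4/3 of the average
    throughput_gbps = avg_window * packet_bytes * 8 / rtt_s / 1e9
    window_after_loss = peak_window / 2          # multiplicative decrease by 1/2
    recovery_rtts = peak_window - window_after_loss  # +1 packet per RTT
    recovery_minutes = recovery_rtts * rtt_s / 60
    required_loss_prob = 3 / (2 * avg_window ** 2)   # equation (2)
    return {
        "avg_window": avg_window,
        "throughput_gbps": throughput_gbps,
        "recovery_rtts": recovery_rtts,
        "recovery_minutes": recovery_minutes,
        "required_loss_prob": required_loss_prob,
    }


example = reno_recovery_example()
```

With the default arguments the sketch gives an average window of 60,000 packets, roughly 7.2 Gbps, 40,000 RTTs (about 66.7 minutes) to recover, and a required equilibrium loss probability on the order of 4 × 10⁻¹⁰.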
This disadvantage is illustrated in FIG. 1A, where the size of window increment per RTT and decrement per loss, 1 and 0.5wi, respectively, are plotted as functions of wi. The increment function for Reno (and for HSTCP) is almost indistinguishable from the x-axis. Moreover, the gap between the increment and decrement functions grows rapidly as wi increases. Since the average increment and decrement must be equal in equilibrium, the required loss probability can be exceedingly small at large wi. This picture is thus a visualization of (2).
The causes of the oscillatory behavior of TCP Reno lie in its design at both the packet and flow levels. At the packet level, the choice of a binary congestion signal necessarily leads to oscillation, and the parameter setting in Reno worsens the situation as bandwidth-delay product increases. At the flow level, the system dynamics given by (1) are unstable at large bandwidth-delay products. These problems must be addressed by different means.
FIG. 2A illustrates the operating points chosen by various TCP congestion control algorithms, using the single-link single-flow scenario. It shows queuing delay as a function of window size. Queuing delay starts to build up after point C where window equals bandwidth-propagation-delay product, until point R where the queue overflows. Since Reno oscillates around point R, the peak window size goes beyond point R. The minimum window in steady state is half of the peak window. This is the basis for the rule of thumb that bottleneck buffer should be at least one bandwidth-delay product. The minimum window will then be above point C, and the buffer will not empty in steady state operation, yielding full utilization.
In the loss-based approach, full utilization, even if achievable, comes at the cost of severe oscillations and potentially large queuing delay. The DUAL scheme proposes to oscillate around point D, the midpoint between C and R when the buffer is half-full. DUAL increases the congestion window linearly by one packet per RTT, as long as queuing delay is less than half of the maximum value, and decreases multiplicatively by a factor of 1/8, when queuing delay exceeds half of the maximum value. The scheme CARD (Congestion Avoidance using Round-trip Delay) proposes to oscillate around point C through AIMD with the same parameters (1, 1/8) as DUAL, based on the ratio of round-trip delay and delay gradient, to maximize power. In all these schemes, the congestion signal is binary, and hence the congestion window must oscillate.
The congestion window can be stabilized if multi-bit feedback is used. The congestion window is adjusted in an equation based control scheme based on the estimated loss probability in an attempt to stabilize around a target value given by (2). Its operating point is T in FIG. 2B, near the overflowing point. This approach eliminates the oscillation due to packet-level AIMD, but two difficulties remain at the flow level.
First, equation-based control requires the explicit estimation of end-to-end loss probability. This is difficult when the loss probability is small. Second, even if loss probability can be perfectly estimated, Reno's flow dynamics, described by equation (1), lead to a feedback system that becomes unstable as feedback delay increases and, strikingly, as network capacity increases. The instability at the flow level can lead to severe oscillations that can be reduced only by stabilizing the flow-level dynamics.
Loss Based Approach
Two loss-based approaches to these problems are HSTCP and STCP, but neither provides a complete solution to the prior art disadvantages.
HSTCP
The design of HSTCP proceeded almost in the opposite direction to that of TCP Reno. The system equilibrium at the flow level is first designed, and then the parameters of the packet-level implementation are determined to implement the flow-level equilibrium. The first design choice decides the relation between window $w_i^*$ and end-to-end loss probability $q_i^*$ in equilibrium for each source $i$:

$$q_i^* = \frac{0.0789}{w_i^{*1.1976}} \tag{4}$$
The second design choice determines how to achieve the equilibrium defined by (4) through packet-level implementation. The (congestion avoidance) algorithm is AIMD, as in TCP Reno, but with parameters a(wi) and b(wi) that vary with source i's current window wi. The pseudo code for window adjustment is:
$$\text{Ack:}\quad w \leftarrow w + \frac{a(w)}{w}$$

$$\text{Loss:}\quad w \leftarrow w - b(w)\,w$$
The design of a(wi) and b(wi) functions is as follows. From a discussion of the single-flow behavior, this algorithm yields an equilibrium where the following holds
$$\frac{a(w_i^*)}{b(w_i^*)}\cdot\left(1 - \frac{b(w_i^*)}{2}\right) = q_i^*\,w_i^{*2} = 0.0789\,w_i^{*0.8024} \tag{5}$$

where the last equality follows from (4). This motivates the design that, when loss probability $q_i$ and the window $w_i$ are not in equilibrium, one chooses $a(w_i)$ and $b(w_i)$ to force the relation (5) “instantaneously”:
$$\frac{a(w_i)}{b(w_i)}\cdot\left(1 - \frac{b(w_i)}{2}\right) = 0.0789\,w_i^{0.8024} \tag{6}$$
The relation (6) defines a family of $a(w_i)$ and $b(w_i)$ functions. Picking either one of the two functions uniquely determines the other. The next design choice is to pick $b(w_i)$, which also fixes $a(w_i)$. For $w_i$ between 38 and 83,333 packets, the choice of $b(w_i)$ is

$$b(w_i) = -k_1 \log_e w_i + k_2 \tag{7}$$

where $k_1 = 0.0520$ and $k_2 = 0.6892$. This fixes $a(w_i)$ to be, from (6),

$$a(w_i) = 0.1578\,w_i^{0.8024}\,\frac{b(w_i)}{2 - b(w_i)}$$

where $b(w_i)$ is given by (7). For $w_i$ less than or equal to 38 packets, $a(w_i) = 1$, $b(w_i) = 0.5$, and HSTCP reduces to TCP Reno. For $w_i$ from 38 to 83,000 packets, $b(w_i)$ varies within $[0.1, 0.5]$. The flow-level model of HSTCP can be derived using an argument similar to the one used to derive (1) for TCP Reno:
$$\dot{w}_i(t) = \frac{a(w_i(t))}{T_i(t)} - \frac{2\,b(w_i(t))}{2 - b(w_i(t))}\,x_i(t)\,q_i(t)\,w_i(t)$$

$$\phantom{\dot{w}_i(t)} = \frac{2\,b(w_i(t))}{T_i(t)\,\bigl(2 - b(w_i(t))\bigr)}\cdot\left(\frac{a(w_i(t))}{b(w_i(t))}\left(1 - \frac{b(w_i(t))}{2}\right) - q_i(t)\,w_i^2(t)\right)$$
Using (6) to replace the first term in parentheses gives:
$$\dot{w}_i(t) = \frac{2\,b(w_i(t))}{T_i(t)\,\bigl(2 - b(w_i(t))\bigr)}\cdot\left(0.0789\,w_i^{0.8024}(t) - q_i(t)\,w_i^2(t)\right) \tag{8}$$
In summary, the model of HSTCP is given by (4), (8) and (7).
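The HSTCP parameter functions and window updates described above can be sketched in a few lines. This is an illustrative reading of (6) and (7), not a reference implementation: the function names are hypothetical, the constants are taken from the text, and the behavior for windows above 83,333 packets is not specified in the text, so the sketch simply applies (7) unchanged there.

```python
import math

K1 = 0.0520          # k1 in equation (7)
K2 = 0.6892          # k2 in equation (7)
LOW_WINDOW = 38      # at or below this window HSTCP reduces to TCP Reno


def hstcp_b(w):
    """Decrease parameter b(w): 0.5 for small windows, else equation (7)."""
    if w <= LOW_WINDOW:
        return 0.5
    return -K1 * math.log(w) + K2


def hstcp_a(w):
    """Increase parameter a(w), fixed by b(w) through relation (6)."""
    if w <= LOW_WINDOW:
        return 1.0
    b = hstcp_b(w)
    return 0.1578 * w ** 0.8024 * b / (2.0 - b)


def hstcp_on_ack(w):
    """Window update on each acknowledgment: w <- w + a(w)/w."""
    return w + hstcp_a(w) / w


def hstcp_on_loss(w):
    """Window update on each loss event: w <- w - b(w) * w."""
    return w - hstcp_b(w) * w
```

At a window of 38 packets the two branches of `hstcp_b` nearly coincide (equation (7) gives approximately 0.5), and at 83,333 packets `hstcp_b` is approximately 0.1, matching the range stated above.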
Scalable TCP (STCP)
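The (congestion avoidance) algorithm of STCP is MIMD:

$$\text{Ack:}\quad w \leftarrow w + a$$

$$\text{Loss:}\quad w \leftarrow w - b\,w$$

for some constants $0 < a, b < 1$. Note that in each round-trip time without packet loss, the window grows multiplicatively, by a fraction $a$ of its current value, because each of the roughly $w$ acknowledgments in the round trip adds $a$. The recommended values in some implementations are $a = 0.01$ and $b = 0.125$.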
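The multiplicative behavior of these updates can be illustrated with a short sketch. It assumes that a loss-free round trip delivers one acknowledgment per packet in the window at the start of that round trip, which is a simplification, and the function names are hypothetical. The recovery estimate treats each loss-free round trip as multiplying the window by (1 + a).

```python
import math


def stcp_on_ack(w, a=0.01):
    """Window update on each acknowledgment: w <- w + a."""
    return w + a


def stcp_on_loss(w, b=0.125):
    """Window update on each loss event: w <- w - b * w."""
    return w - b * w


def stcp_loss_free_rtt(w, a=0.01):
    """Apply one ACK update per packet in the window at the start of the RTT."""
    for _ in range(int(w)):
        w = stcp_on_ack(w, a)
    return w


def stcp_recovery_rtts(a=0.01, b=0.125):
    """RTTs needed to regain the pre-loss window after one loss event,
    assuming growth by a factor of (1 + a) per loss-free RTT."""
    return math.ceil(math.log(1.0 / (1.0 - b)) / math.log(1.0 + a))


window_after_rtt = stcp_loss_free_rtt(1000.0)  # about 1010 packets
recovery = stcp_recovery_rtts()
```

Unlike the Reno case, the recovery time computed this way does not depend on the window size, only on the constants a and b.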
As for HSTCP, the flow-level model of STCP is
$$\dot{w}_i(t) = \frac{a\,w_i(t)}{T_i} - \frac{2b}{2 - b}\,x_i(t)\,q_i(t)\,w_i(t)$$

where $x_i(t) := w_i(t)/T_i$. In equilibrium, we have

$$q_i^*\,w_i^* = \frac{a}{b}\left(1 - \frac{b}{2}\right) =: \rho \tag{9}$$

This implies that, on average, there are $\rho$ loss events per round-trip time, independent of the equilibrium window size. Using (9), the flow-level model of STCP can be rewritten in the form of (3) with the gain and marginal utility functions:

$$\kappa_i(w_i, T_i) = \frac{a\,w_i}{T_i}, \qquad u_i(w_i, T_i) = \frac{\rho}{w_i}$$
The increment and decrement functions of HSTCP and STCP are plotted in FIG. 1A. Both upper bound those of Reno: they increase more aggressively and decrease less drastically, so that the gap between the increment and decrement functions is narrowed. At the flow level, this means that, in equilibrium, both HSTCP and STCP can tolerate larger loss probabilities than TCP Reno (compare (4) and (9) with (2)). This alleviates some of the problems with TCP Reno. It does not, however, solve the dynamic problems at the packet and the flow levels.
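The comparison of (2), (4) and (9) can be made concrete by evaluating the equilibrium loss probability each scheme requires at a given window. The sketch below is illustrative only: the function names are hypothetical, the STCP constants are the recommended values quoted above, and the HSTCP relation (4) is applied only within the window range for which it was designed.

```python
def reno_equilibrium_loss(w):
    """Equation (2): q* = 3 / (2 w*^2)."""
    return 1.5 / w ** 2


def hstcp_equilibrium_loss(w):
    """Equation (4): q* = 0.0789 / w*^1.1976 (for windows of 38 to 83,333)."""
    return 0.0789 / w ** 1.1976


def stcp_equilibrium_loss(w, a=0.01, b=0.125):
    """Equation (9): q* w* = (a / b) * (1 - b / 2) = rho, so q* = rho / w*."""
    rho = (a / b) * (1.0 - b / 2.0)
    return rho / w


# Required loss probability at an average window of 60,000 packets.
window = 60_000
comparison = {
    "reno": reno_equilibrium_loss(window),
    "hstcp": hstcp_equilibrium_loss(window),
    "stcp": stcp_equilibrium_loss(window),
}
```

At this window, Reno requires a loss probability on the order of 10⁻¹⁰, while HSTCP and STCP tolerate values several orders of magnitude larger, which is the sense in which they narrow the gap described above.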