The present invention relates to quality of service (QoS) in a computer network, such as those complying with the Internet Protocol (IP).
Routers, switches, and other devices have output ports that interface equipment to a packet network. Examples of output ports include network interface cards (NICs), line cards, links, network interfaces, etc. An output port's packet rate is the rate at which it receives packets from the equipment or when packets otherwise become ready to transport. An output port's link rate is the rate at which it can send the packets into the network, typically related to available bandwidth.
When the packet rate exceeds the link rate, the output port must either discard packets, store packets temporarily in a memory, or perform a combination of these functions. The data structure used to hold packets temporarily may be a queue, but may be more elaborate.
In one conventional technique, when a packet becomes ready to be sent on an output port, it is inserted into that port's queue and each time the port is available to send a new packet, a packet is taken from the queue and transmitted out. If the output port drops packets, a higher-level protocol can deal with recovery from the drop, using FEC, retransmission or other approaches. This may lead to delays for the data in those packets.
An output port is not limited in scope to a physical layer device such as a T-1 interface card or a SONET interface. More generally, a port can be a transmission engine that sends packets according to a bandwidth shaping rule, where the bandwidth may be fixed or may vary with time. For example, an output port may correspond to a virtual private network (VPN) tunnel where network traffic is groomed to a specified transmission rate over that tunnel, or the port may correspond to a rate-limited transmission of network traffic over a higher-capacity physical interface, e.g., an interface that sends packets at 1.5 Mb/s over a 1 Gb/s Ethernet connection coupled to a router that is, in turn, coupled to a T1 connection. Thus, the network traffic is groomed to 1.5 Mb/s in a device that has a 1 Gb/s network interface so that those packets in turn can be transmitted smoothly over a slower speed link.
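The rate-limited transmission described above can be sketched as simple pacing. This is only an illustration under stated assumptions: the function name `paced_send_times` and the packet sizes are hypothetical, and a real shaping device may use a different mechanism.

```python
def paced_send_times(packet_sizes_bytes, rate_bps):
    """Return the earliest start time (seconds) of each packet so that the
    output never exceeds rate_bps, regardless of the physical interface speed."""
    times = []
    t = 0.0
    for size in packet_sizes_bytes:
        times.append(t)
        # Time this packet occupies at the shaped rate (not the 1 Gb/s rate).
        t += size * 8 / rate_bps
    return times


# Three 1500-byte packets groomed to 1.5 Mb/s: one packet every 8 ms.
send_times = paced_send_times([1500, 1500, 1500], rate_bps=1_500_000)
```

Each 1500-byte packet occupies 12,000 bits, so at 1.5 Mb/s the packets are spaced about 8 ms apart even though the underlying interface could send them far faster.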
FIG. 1 is a block diagram of a number of devices forming a network 10. Rate-shaping device 22 is disposed between a set of clients and/or servers 12 on a local area network (LAN) and a wide area network (WAN) router 24. Rate-shaping device 22 grooms the network traffic to a profile that moves the congestion point from WAN router 24 to rate-shaping device 22. Since the traffic is groomed to fit WAN router 24, no queue builds up on WAN router 24, and QoS mechanisms can be effectively implemented at rate-shaping device 22.
The rate-shaping function may be implemented within any device that processes network packets, whether the device operates at the link layer (e.g., a LAN switch, bridge, etc.), at the network layer (e.g., a router, VPN device, NAT, a WAN packet compressor, etc.), at the transport layer (e.g., a layer-4 switch, a transparent TCP proxy, etc.), at the application layer (e.g., a Web proxy, a file cache, an application accelerator, etc.), or any combination thereof. Throughout this disclosure, the term “networking device” is used to refer to any device that performs any combination of functions at any layer in the protocol stack by sending packets to or receiving packets from a network interface port. In general, the term “link rate” refers interchangeably herein either to a physical interface rate or to the rate defined by bandwidth shaping rules associated with a virtual port or the like.
An important consequence of the queuing behavior of IP networks is that packets must spend time waiting in the queues of networking devices. This waiting time, often called the queuing time or queuing delay, may degrade the performance of higher layer protocols and applications that utilize the network path through such devices. Moreover, when the packet rate on a given output port exceeds the port's link capacity for a sustained period of time (a phenomenon called network “congestion”), the queue for that output port continues to grow and, at some point, the networking device will have to discard some packets. There is a delicate tradeoff in how such decisions are made, because if the queue is allowed to grow very large, then the queuing delays become large and adversely impact performance. Conversely, if the queue is limited to be very small, then the networking device is not able to absorb bursts of traffic and may drop packets too frequently, likewise causing an adverse impact on performance. Sometimes, rather than being dropped, packets are marked to indicate congestion (using explicit congestion notification, or ECN), which signals the end points to lower their transmission rates.
The above problems with IP networks are known, and a number of techniques have been developed to manage the manner in which queuing delays manifest themselves and to determine how and which packets should be dropped in the event of congestion. While queuing delays in an IP network cannot be completely eliminated, they can be managed such that the more important or delay sensitive applications receive preferential service over less important traffic, and when congestion occurs less important packets can be dropped before the more important ones.
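The tradeoff between queue depth and delay can be made concrete with a small calculation. This is a rough sketch: the function name `queuing_delay_s` and the numbers are illustrative assumptions, and it considers only the time needed to drain the bytes already queued ahead of an arriving packet.

```python
def queuing_delay_s(queued_bytes, link_rate_bps):
    """Time an arriving packet waits while the bytes ahead of it are sent."""
    return queued_bytes * 8 / link_rate_bps


# A deep queue (100 packets of 1500 bytes) versus a shallow one (5 packets),
# both draining at 1.5 Mb/s.
deep = queuing_delay_s(100 * 1500, 1_500_000)
shallow = queuing_delay_s(5 * 1500, 1_500_000)
```

The deep queue adds roughly 0.8 seconds of delay, while the shallow queue adds about 40 ms but can absorb only a five-packet burst before dropping.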
In general, the problem of providing differing levels of quality of service (QoS) to network traffic is decomposed into traffic classification, queue management, and scheduling algorithms. Traffic classification entails assigning each packet to a class, which is typically specified by a network operator. For example, a class might be voice traffic, or file server traffic, or Web traffic between the New York and Orlando offices, etc. Typically, each class is assigned to a particular queue. More than one class may be assigned to the same queue, causing traffic from those multiple classes to be treated as a single aggregate. When different flows or collections of application sessions are aggregated in this fashion, the resulting scheme is often called Class of Service (CoS) resource management rather than QoS to emphasize the notion that network traffic is managed in a coarser grained fashion.
Queue management entails how a queue is maintained as packets are inserted and removed from the queue and which packets are dropped when the queue becomes full, or begins to become full, in the event of congestion. A first-in, first-out (FIFO) queue with a drop-tail drop policy is a simple example of a queue management scheme. More elaborate schemes such as random early detection (RED), weighted random early detection (WRED), fair queuing (FQ), weighted fair queuing (WFQ), deficit round-robin (DRR), etc., have been developed. In a common configuration, a networking device manages multiple queues for each output port. Packets are placed in the different queues according to policy that is controlled by traffic classification.
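A FIFO queue with a drop-tail policy, the simplest scheme mentioned above, might be sketched as follows. The class name `DropTailQueue` and the packet-count capacity are assumptions made for illustration; real devices often bound queues in bytes instead.

```python
from collections import deque


class DropTailQueue:
    """FIFO queue that discards arriving packets once it holds `capacity` packets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, packet):
        # Drop-tail: when the queue is full, the arriving packet is discarded.
        if len(self.packets) >= self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        # FIFO: the oldest queued packet is transmitted first.
        return self.packets.popleft() if self.packets else None


q = DropTailQueue(capacity=2)
results = [q.enqueue(p) for p in ("p1", "p2", "p3")]
first_out = q.dequeue()
```

Here the third packet arrives at a full queue and is dropped, and the first packet enqueued is the first one sent.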
When there are multiple queues on a given port, a scheduling algorithm determines how and what queues are serviced each time there is an opportunity to transmit a packet over the output port. A scheduling algorithm is typically represented by a program code, circuit logic, or a combination, that when executed or operated by processing equipment or devices performs a process detailed by steps of the scheduling algorithm.
One of the simpler scheduling algorithms is a static-priority scheduler. In this algorithm, each queue is assigned a priority, and at each service time, the non-empty queue with the highest priority is chosen to be serviced. Another example is WFQ. While WFQ can be realized as a queue management scheme, the WFQ algorithm can also be deployed as a scheduler. For example, a collection of FIFO queues might be serviced according to a WFQ schedule, a collection of RED queues might be serviced according to a DRR schedule, or a collection of WFQ queues might be serviced according to a WFQ scheduler. This latter approach is sometimes called “hierarchical packet fair queuing” (H-PFQ), described in Bennett and Zhang, “Hierarchical Packet Fair Queuing Algorithms”, Proc. ACM SIGCOMM 1996.
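The static-priority rule can be sketched in a few lines. The function name `pick_static_priority` and the convention that a larger number means a higher priority are assumptions made for this illustration.

```python
from collections import deque


def pick_static_priority(queues):
    """Return the next packet from the highest-priority non-empty queue.

    `queues` is a list of (priority, deque) pairs; a larger number is
    assumed to mean a higher priority.
    """
    for _, q in sorted(queues, key=lambda pq: pq[0], reverse=True):
        if q:
            return q.popleft()
    return None  # every queue is empty


queues = [(1, deque(["bulk1", "bulk2"])), (7, deque(["voice1"]))]
order = [pick_static_priority(queues) for _ in range(4)]
```

The voice packet is sent first even though the bulk packets were listed first; lower-priority queues are served only when every higher-priority queue is empty.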
A key problem with known scheduling and queue management algorithms is that the amount of queuing delay a flow or class experiences is related to the bandwidth or rate that is allocated to that flow. For example, in class-based WFQ, weights are assigned to each class and the link bandwidth is divided among the different classes in proportion to the weight assignment. To achieve a lower average delay for a class, the weight must be increased, which results in an increase in the rate allocated to that class. In other words, the only way to increase a class' delay priority in WFQ is to allocate a greater amount of bandwidth to that traffic class. As such, priority and bandwidth are intrinsically coupled together and are thus controlled by a single parameter. A QoS policy for traffic underlying a remote terminal application, which requires high priority but only needs moderate bandwidth, cannot be efficiently achieved. Either an excessive amount of bandwidth must be allocated or the traffic's priority must be sacrificed.
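The coupling can be seen in a short calculation of the rates that class-based WFQ assigns. This is a sketch under assumptions: the function name `wfq_rates`, the class names, and the weights are hypothetical, and all classes are taken to be continuously active.

```python
def wfq_rates(link_rate_bps, weights):
    """Split the link rate among active classes in proportion to their weights."""
    total = sum(weights.values())
    return {name: link_rate_bps * w / total for name, w in weights.items()}


# A remote terminal class next to a bulk class on a 1.5 Mb/s link.
before = wfq_rates(1_500_000, {"terminal": 1, "bulk": 9})
# Raising the terminal weight to lower its delay also raises its rate.
after = wfq_rates(1_500_000, {"terminal": 5, "bulk": 5})
```

With weights 1 and 9 the terminal class is allocated 150 kb/s. Raising its weight to improve delay gives it 750 kb/s, which is far more bandwidth than it needs and is taken directly from the bulk class.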
An important scheme for overcoming this undesirable coupling of delay and bandwidth management uses service curves, formalized by R. L. Cruz, “Service burstiness and dynamic burstiness measures: a framework”, Journal of High Speed Networks, Vol. 1, No. 2, 1992. A service curve defines how much network service is guaranteed to be allocated to a given network flow or traffic class at any given point in time, expressed as bits serviced versus time, presuming the flow or traffic class is active, i.e., has packets queued and ready to send. For example, FIG. 2 depicts a service curve (200) that is a straight line with slope m. A scheduler that guarantees this service curve to a network flow or class would service packets from that flow or class frequently enough to ensure a service of at least m bits per second.
In a publication entitled “Scheduling for Quality of Service Guarantees via Service Curves”, Proc. ICCCN September 1995, authors H. Sariowan, R. Cruz, and G. Polyzos proposed a specific scheduling policy called “Service Curve-based Earliest Deadline First” (SCED). While SCED represented a scheduling policy using service curves, the problem of developing a scheduling algorithm that efficiently implements guarantees for arbitrary service curves was not solved.
Generally speaking, a scheduler that is configured with service curves and can schedule traffic to adhere to the service curve specifications is called a service curve scheduler, and the guarantee of service provided to each class is called the service curve guarantee. Such a guarantee can be met by providing service in excess of the service curve requirement, and in general, when a service curve scheduler has additional available bandwidth after all guarantees are met, it can distribute that excess bandwidth in a deliberate and controlled fashion. The actual service received by a class can be any non-decreasing function of time that is equal to or greater than the service curve for all times.
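The service curve guarantee can be expressed as a simple check on measured service. The sketch below assumes a straight-line curve like that of FIG. 2 and a class that is continuously active from time 0; the helper names `linear_curve` and `meets_guarantee` and the sample values are hypothetical.

```python
def linear_curve(m):
    """Service curve that is a straight line through the origin with slope m bits/s."""
    return lambda t: m * t


def meets_guarantee(service_samples, curve):
    """Check sampled cumulative service against a service curve.

    `service_samples` is a time-ordered list of (time_s, bits_served) pairs
    taken while the class is continuously active.
    """
    non_decreasing = all(
        b1 <= b2
        for (_, b1), (_, b2) in zip(service_samples, service_samples[1:])
    )
    at_or_above_curve = all(bits >= curve(t) for t, bits in service_samples)
    return non_decreasing and at_or_above_curve


curve = linear_curve(64_000)
ok = meets_guarantee([(0, 0), (1, 70_000), (2, 128_000)], curve)
late = meets_guarantee([(0, 0), (1, 50_000), (2, 128_000)], curve)
```

The first trace receives excess service at t = 1 and still satisfies the guarantee; the second falls below the curve at t = 1 and therefore violates it, even though it catches up by t = 2.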
A scenario where the service curve of each traffic class or flow has the form of a straight line through the origin is equivalent to WFQ where the WFQ weights are defined by the slopes of the service curves. However, even with a service curve model, such a configuration suffers from the undesirable coupling of delay and bandwidth. To decouple priority and bandwidth, the service curve must have additional degrees of freedom. For example, a two-piece, linear service curve can be employed to decouple bandwidth and delay. As shown in FIG. 3, a two-piece curve has a first slope, m1, and an x-offset, x, used to determine the traffic class' delay priority, and a second slope, m2, used to determine the long-term bandwidth allocation. This allows traffic patterns such as interactive sessions that have different priority and bandwidth allocation requirements to be efficiently represented within a single curve. In this scenario, a traffic class is allowed to burst at a relatively higher rate of m1, thereby optimizing delay, for a certain period of time x. But after that time, the scheduler throttles the rate of the class down to a lower rate of m2, which can be independent of the priority delay factor.
Benefits of a service curve scheduler can be clearly seen when more than one traffic pattern with different requirements is vying for the same resources. For example, the policy depicted in FIG. 4 specifies that voice over IP (VoIP) traffic requires low delay (high priority) and FTP traffic requires high bandwidth allocation but no delay guarantees. This policy cannot be satisfied with a one-piece linear curve, such as that shown in FIG. 2. Yet a service curve scheduler can satisfy such requirements by using four slopes to define two different service curves, one service curve 401 for the FTP traffic class and one service curve 400 for the VoIP traffic class, as shown in FIG. 4. In this case, packets from the VoIP class are ideally scheduled before the FTP class as long as the long-term rate of the VoIP traffic remains below the m2 rate. When this is the case, the VoIP traffic effectively earns credit against its long term allocation such that it is allowed to burst for x time units at the higher m1 rate, while the FTP traffic is delayed. By choosing m1 and m2+m3 both equal to the link rate, the resulting outcome is that VoIP traffic has delay priority over FTP traffic, while the long-term rates of m2 and m3 are allocated to the VoIP traffic and the FTP traffic, respectively, independent of the degree of delay priority afforded to the VoIP traffic.
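A two-piece linear service curve of this kind can be written as a small function. The name `two_piece_curve` and the numeric parameters are assumptions chosen for illustration.

```python
def two_piece_curve(m1, x, m2):
    """Two-piece linear service curve: slope m1 up to time x, slope m2 afterwards.

    Returns a function mapping time (seconds) to guaranteed bits of service.
    """
    def curve(t):
        if t <= x:
            return m1 * t
        return m1 * x + m2 * (t - x)
    return curve


# An interactive class: burst at 1 Mb/s for 250 ms, then 100 kb/s long term.
interactive = two_piece_curve(m1=1_000_000, x=0.25, m2=100_000)
burst_bits = interactive(0.25)
later_bits = interactive(1.25)
```

The burst slope m1 and duration x set how quickly the first 250,000 bits are guaranteed, while m2 alone sets the long-term rate, so the two can be tuned separately.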
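The VoIP and FTP policy can be checked numerically. The sketch below assumes that the FTP curve has slope 0 until time x and slope m3 afterwards (the text states only that four slopes are used), and the rates and the helper `two_piece_curve` are hypothetical.

```python
def two_piece_curve(m1, x, m2):
    """Two-piece linear service curve: slope m1 up to time x, slope m2 afterwards."""
    def curve(t):
        if t <= x:
            return m1 * t
        return m1 * x + m2 * (t - x)
    return curve


R = 1_000_000                 # link rate in bits/s (illustrative)
x = 0.25                      # VoIP burst duration in seconds (illustrative)
m2, m3 = 200_000, 800_000     # long-term VoIP and FTP rates, m2 + m3 == R

voip = two_piece_curve(R, x, m2)   # m1 equals the link rate
ftp = two_piece_curve(0, x, m3)    # assumed: no guarantee during the VoIP burst

# The guarantees are jointly satisfiable only if they never demand more than
# the link can carry, i.e. the curves sum to at most R * t at every time t.
times = [0.0, 0.125, 0.25, 0.5, 1.0, 2.0]
feasible = all(voip(t) + ftp(t) <= R * t for t in times)
voip_long_term = voip(2.0) - voip(1.0)
```

With these choices the two curves sum to exactly R * t at every sampled time, so the link is fully allocated, VoIP gets the whole link during its burst, and its long-term rate remains m2.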
While the service curve framework provides a flexible and general approach to controlling and providing QoS for network traffic, a question arises as to how to distribute the excess service when a traffic class does not fully utilize the service defined by the service curve. Rather than re-distributing the excess service, a service curve scheduler could simply leave the link idle, wasting the resource, and still meet the requirements of all of the specified service curves. A more efficient approach, however, is to redistribute the excess service in some fashion.
The distribution of excess available service has been referred to as link sharing in the literature and was studied by S. Floyd and V. Jacobson, “Link-Sharing and Resource Management Models for Packet Networks”, IEEE/ACM Transactions on Networking, Vol. 3, No. 4, August 1995, in a system they called class based queuing (CBQ). In CBQ, traffic classes are arranged in a hierarchy. The hierarchy can be expressed as a tree where leaf classes represent actual traffic classes, with each leaf class having its own queue and queue management scheme. Internal nodes of the tree represent sharing policies. The root node represents the full link bandwidth. Each node is assigned a percentage of the bandwidth of its parent node such that the percentages assigned to a set of sibling nodes of a given parent node sum to a value equal to or less than 100%. In this fashion, bandwidth is apportioned to the leaf classes according to these percentages. When a leaf class does not fully utilize its allocated percentage of the bandwidth, that bandwidth is propagated to the parent and subsequently shared among the active siblings that can make use of it. A sibling node is thereby allowed to exceed its allocated percentage of the bandwidth by effectively borrowing bandwidth from the sibling that is not using it. In turn, if those sibling nodes do not have use for the excess bandwidth, that bandwidth is propagated further up the tree to be redistributed to yet other nodes in the bandwidth hierarchy.
Using such a hierarchy allows a network operator to create hierarchical policies. FIG. 5, for example, depicts a hierarchy where 50% of the bandwidth is allocated to class A, 25% to class B, and 25% to class C. In turn, class A1 receives 50% of the class A bandwidth, or 25% of the whole link. If class A1 does not use that bandwidth, then it is redistributed to other members of class A, namely class A2. If class A2 does not use that bandwidth, then it is further redistributed to class B and C in proportion to their allocations. This provides a tool to create desirable policy structures where bandwidth limits between competing classes are enforced but if the bandwidth is not otherwise being used, it is redistributed in a meaningful fashion.
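The redistribution in this hierarchy can be illustrated with an idealized calculation. This sketch is not the CBQ algorithm itself, which relies on operational bandwidth estimation; it simply divides each node's bandwidth among its active children in proportion to their percentages. The 50% share for class A2 and the names `TREE`, `is_active`, and `allocate` are assumptions.

```python
# Percentages from the hierarchy described above; A2's 50% is assumed.
TREE = {
    "root": {"A": 50, "B": 25, "C": 25},
    "A": {"A1": 50, "A2": 50},
}


def is_active(node, active):
    """A leaf is active if listed; an internal node if any descendant is."""
    children = TREE.get(node)
    if children is None:
        return node in active
    return any(is_active(child, active) for child in children)


def allocate(node, bandwidth, active, out=None):
    """Divide `bandwidth` among the active children of `node`, recursively."""
    out = {} if out is None else out
    children = TREE.get(node)
    if children is None:
        out[node] = bandwidth
        return out
    live = {c: pct for c, pct in children.items() if is_active(c, active)}
    total = sum(live.values())
    for child, pct in live.items():
        allocate(child, bandwidth * pct / total, active, out)
    return out


all_busy = allocate("root", 100.0, {"A1", "A2", "B", "C"})
a1_idle = allocate("root", 100.0, {"A2", "B", "C"})
a_idle = allocate("root", 100.0, {"B", "C"})
```

When every class is busy, each leaf receives 25% of the link. When A1 is idle, its share stays within class A and goes to A2; when all of class A is idle, classes B and C split the link in proportion to their allocations.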
However, the CBQ framework is not based on the service curve model and instead is defined through a set of operational descriptions and heuristic bandwidth estimation techniques. As a consequence, it has not been possible to prove definitive and useful properties of the overall system. In fact, for certain workloads, CBQ has been shown to deviate from its desired behavior.
To address these problems, I. Stoica, H. Zhang, and T. S. E. Ng in an article entitled “A Hierarchical Fair Service Curve Algorithm for Link-Sharing, Real-Time and Priority Service”, Proc. ACM SIGCOMM, 1997, the content of which is incorporated herein by reference in its entirety, proposed a solution for hierarchical link sharing built upon a service curve scheduler. Their scheme, called Hierarchical Fair Service Curve (HFSC), like CBQ, uses a tree to define the resource sharing policy but, in contrast to CBQ, assigns a service curve to each node in the tree rather than a percentage of bandwidth. By employing service curves, HFSC is able to control the tradeoffs between bandwidth allocation and delay priority. In addition, the service curve formalism enables the user to better define the operating behavior of an HFSC scheduler and to develop an explicit proof of the correctness of the algorithm in achieving the behavior.
While the HFSC framework is described in terms of arbitrary service curves, efficient implementations appear to be restricted to a very limited class of service curves. In particular, the systems disclosed by I. Stoica, H. Zhang, and T. S. E. Ng limit implementation to service curves that are composed of two linear pieces that must be convex or concave, whereby the first segment passes through the origin. To highlight this limitation of the known practical realization of HFSC, the term HFSC2 is used herein to denote the HFSC algorithm when used with two-piece service curves.
FIG. 6 shows a pair of service curves 605 and 610 associated respectively with class 1 and class 2 traffic. The link rate is R bits/second, and m2 and m3 are selected such that m2+m3=R. Whenever class 1 is ready to be sent, its traffic is sent before traffic of class 2, up to the time x1. Because m1 is selected to be equal to the link rate, class 1 is transmitted at the link rate during the burst defined by times 0 and x1 and no other traffic is sent during this burst. If class 1 continues to be active after time x1, then the scheduler allocates service between class 1 and class 2 in proportion to the parameters m2 and m3. Because m2+m3 is selected equal to the link rate, class 1 will be serviced at m2 bits/second and class 2 at m3 bits/second. In this example, class 1 may correspond to delay sensitive traffic like VoIP, and class 2 may correspond to all other traffic. As seen from the example shown in FIG. 6, the HFSC2 algorithm implements the bandwidth and priority policies for these two classes in a manner consistent with the policy goals defined by service curves 605 and 610. However, when there are more than two traffic classes with differing delay priorities, the problems with HFSC2 become evident, as described further below.
FIG. 7 shows service curves for three traffic classes, namely traffic class 1, class 2, and class 3, defined respectively by service curves 705, 710, and 715. In addition, there are three delay priorities corresponding to regions 720, 722 and 724 along the x-axis. The leaf nodes associated with these traffic classes are arranged into a link sharing hierarchy assuming all three classes are siblings of the parent root node, whose link rate is R bits/second. The slopes of the service curve 705 are m1 in region 720 and m2 in region 722. The slope of service curve 710 is m3, and the slope of the service curve 715 is m4. To cause class 1 to have the highest priority, m1 is chosen equal to the link rate R and x1 is chosen to be the maximum burst time for the class 1 traffic. Likewise, it would be desirable to choose m2+m3 also equal to the link rate R. Yet herein lies a problem.
Because HFSC2 allows only for two-piece linear curves, slopes m2 and m3 on service curves 705 and 710 must continue from region 722 into region 724 at the same slope. Since service curve 715 becomes non-zero in region 724, m2+m3+m4 must be less than or equal to the link rate R. Thus, if m2+m3 is selected to be equal to link rate R, m4 must be equal to 0. However, to create a usable service curve, m4 must be greater than 0 which, in turn, means that m2+m3 must be chosen so as to be smaller than the link rate R in region 722. If m2+m3+m4 is selected to be equal to the link rate R (a desirable outcome since, over the long term, all of the link rate should be fully allocated across all traffic classes), then m4 is defined by R−(m2+m3). That is, the amount of unallocated service in region 722 is m4.
Given that there must be unallocated service left in region 722, the question arises as to what the HFSC2 scheduler will do with that unallocated service when all classes are active. To illustrate that, assume that all classes become active at time t=0. At this point, class 1 is serviced for x1 seconds at the link rate R. Since all the service is allocated to class 1 in this time frame, no other class is serviced. As time passes into region 722 at time x1, class 1 and class 2 traffic are serviced. While in region 722, class 1 and class 2 traffic will be serviced according to the real-time criterion in order to meet the service curves of those two leaf classes. However, since m2+m3<R, there is spare, unallocated service that will be served according to the link sharing criterion. In this situation, the HFSC2 algorithm chooses the class whose virtual time is the smallest. Since class 3 has not yet been serviced at all, its virtual time is 0 while classes 1 and 2 have virtual times larger than 0 because they have been active and have been given service. Hence, class 3 will be serviced at this time. Moreover, class 1 and class 2 will continue to be serviced in region 722 to meet their real-time requirements, causing each of class 1 and class 2's virtual times to be increased according to the algorithm.
As time continues to proceed through region 722, it turns out that the virtual times of the classes move forward in a manner such that class 3 receives all of the unallocated bandwidth (i.e., m4 bits/sec) of region 722, while class 1 and class 2 are given the minimum amount required to meet the service curve requirements. In effect, regions 724 and 722 begin to merge and the class 3 service curve is translated from point x2 toward point x1. In other words, the HFSC2 algorithm treats the service curve specifications in FIG. 8 the same as it does the service curve specification shown in FIG. 7. This means that there is no delay priority achievable between class 2 and class 3 even though this was the apparent design goal of the service curves in FIG. 7. This is a major shortcoming associated with the HFSC2 algorithm.
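The slope constraint can be worked through with concrete numbers. The function name `remaining_slope` and the rates below are illustrative assumptions; the arithmetic is simply m4 = R − (m2+m3).

```python
def remaining_slope(link_rate, m2, m3):
    """Largest m4 permitted by the long-term constraint m2 + m3 + m4 <= R."""
    m4 = link_rate - (m2 + m3)
    if m4 < 0:
        raise ValueError("m2 + m3 already exceeds the link rate")
    return m4


R = 1_000_000
# Fully allocating region 722 to classes 1 and 2 leaves nothing for class 3.
m4_if_full = remaining_slope(R, 300_000, 700_000)
# Giving class 3 a usable slope forces m2 + m3 below R in region 722.
m4_usable = remaining_slope(R, 300_000, 500_000)
```

In the second case 200 kb/s is reserved for class 3, which means the same 200 kb/s is unallocated in region 722, before the class 3 curve becomes non-zero.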
FIG. 9 depicts service curves that achieve the policy goals of the service curves associated with FIG. 7. Service curves 905 and 915 associated with traffic classes 1 and 3 are the same as service curves 705 and 715 shown in FIG. 7, but the service curve 710 associated with class 2 has been modified to be a 3-piece curve 910, having a first piece with slope 0, a second piece with slope m3 between times x and x1, and the third piece with slope m4 starting at time x2. Slope m3 is selected to be equal to the link rate R minus m2. Since m2+m3=R, all of the service will be allocated to classes 1 and 2 before it is allocated to class 3 when class 2 goes from an inactive to an active state. In other words, this arrangement of service curves achieves the desired priority and bandwidth guarantees.
The various operations that are performed on a 2-piece curve always result in another 2-piece curve. Thus, the computation and data structures required to implement the process with 2-piece curves remain relatively simple. However, the various operations that are performed on 3-piece curves no longer result in another 3-piece curve. Rather, each operation can increase the number of pieces. Thus, the data structures and computations required to implement the processes associated with a 3-piece curve can grow with each operation. This can lead to impractical computational complexity.
While service curve schedulers have been proposed and studied in the research community, they have not had much impact in practice. Their lack of widespread adoption by industry is likely rooted in the abstract and complex nature of the service curve model. A typical network operator would have difficulty not only understanding the mathematical principles and formalism of the service curve model, but would also likely be at a loss as to how to configure service curves in a networking device to effect desired QoS policies. It is neither easy for a network operator to understand and reason about service curves nor obvious how to relate such service curves to administrative policies. And the research literature has devoted no attention to the problem of designing auxiliary support systems to make an underlying service curve scheduler understandable and manageable.