Driven by increasing usage of a variety of network applications, such as those involving the Internet, computer networks are of increasing interest. In order to couple portions of a network together or to couple networks, switches are often used. For example, FIG. 1 depicts a high-level block diagram of a switch 10 which can be used in a computer network. The switch 10 includes a switch fabric 24 coupled with blades 7, 8 and 9. Each blade 7, 8 and 9 is generally a circuit board and includes at least a network processor 2 coupled with ports 4. The ports 4 are, in turn, coupled with hosts (not shown). The blades 7, 8 and 9 can provide traffic to the switch fabric 24 and accept traffic from the switch fabric 24. Thus, any host connected with one of the blades 7, 8 or 9 can communicate with another host connected to another blade 7, 8 or 9 or connected to the same blade.
FIG. 2A depicts another simplified block diagram of the switch 10, illustrating some of the functions performed by network processors. The switch 10 couples hosts (not shown) connected with ports A 12 with those hosts (not shown) connected with ports B 36. The switch 10 performs various functions, including classification of data packets provided to the switch 10, transmission of data packets across the switch 10 and reassembly of packets. These functions are provided by the classifier 18, the switch fabric 24 and the reassembler 30, respectively. The classifier 18 classifies packets which are provided to it and breaks each packet up into convenient-sized portions, which will be termed cells. The switch fabric 24 is a matrix of connections through which the cells are transmitted on their way through the switch 10. The reassembler 30 reassembles the cells into the appropriate packets. The packets can then be provided to the appropriate port of the ports B 36 and output to the destination hosts. The classifier 18 may be part of one network processor 1, while the reassembler 30 may be part of another network processor 5. The depicted portions of the network processors 1 and 5 perform functions for traffic traveling from ports A 12 and to ports B 36, respectively. However, the network processors 1 and 5 also perform functions for traffic traveling from ports B 36 and to ports A 12, respectively. Thus, each network processor 1 and 5 can perform both classification and reassembly functions. Furthermore, each network processor 1 and 5 can be a network processor 2 shown in FIG. 1.
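The segmentation performed by the classifier 18 and the reassembly performed by the reassembler 30 can be sketched as follows. This is a minimal illustration, not the network processor's actual cell format: the 64-byte cell payload, the `Cell` fields and the `segment`/`reassemble` names are assumptions, and cells belonging to different packets are allowed to arrive interleaved.

```python
from dataclasses import dataclass

CELL_PAYLOAD = 64  # hypothetical cell payload size in bytes


@dataclass(frozen=True)
class Cell:
    packet_id: int   # which packet this cell belongs to
    seq: int         # position of the cell within its packet
    last: bool       # True for the final cell of the packet
    payload: bytes


def segment(packet_id: int, packet: bytes, size: int = CELL_PAYLOAD) -> list[Cell]:
    """Classifier side: break a packet into fixed-size cells."""
    chunks = [packet[i:i + size] for i in range(0, len(packet), size)] or [b""]
    return [Cell(packet_id, n, n == len(chunks) - 1, c) for n, c in enumerate(chunks)]


def reassemble(cells: list[Cell]) -> dict[int, bytes]:
    """Reassembler side: rebuild each packet once all of its cells have arrived."""
    pending: dict[int, dict[int, bytes]] = {}  # packet_id -> {seq: payload}
    last_seq: dict[int, int] = {}              # packet_id -> seq of its last cell
    done: dict[int, bytes] = {}
    for cell in cells:
        pid = cell.packet_id
        pending.setdefault(pid, {})[cell.seq] = cell.payload
        if cell.last:
            last_seq[pid] = cell.seq
        # A packet is complete when its last cell and every earlier cell are present.
        if pid in last_seq and len(pending[pid]) == last_seq[pid] + 1:
            parts = pending.pop(pid)
            done[pid] = b"".join(parts[i] for i in range(last_seq.pop(pid) + 1))
    return done
```

Tagging each cell with a packet identifier and a sequence number is what lets the reassembler tolerate interleaving across the switch fabric; a practical design would also bound the amount of reassembly state it keeps.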
Referring back to FIG. 2A, due to bottlenecks in transferring traffic across the switch 10, data packets may be required to wait prior to execution of the classification, transmission and reassembly functions. As a result, queues 16, 22, 28 and 34 may be provided. Coupled to the queues 16, 22, 28 and 34 are enqueuing mechanisms 14, 20, 26 and 32. The enqueuing mechanisms 14, 20, 26 and 32 place the packets or cells into the corresponding queues 16, 22, 28 and 34 and can provide a notification which is sent back to the host from which the packet originated.
Although the queues 16, 22, 28 and 34 are depicted separately, one of ordinary skill in the art will readily realize that some or all of the queues 16, 22, 28 and 34 may be part of the same physical memory resource. FIG. 2B depicts one such switch 10′. Many of the components of the switch 10′ are analogous to components of the switch 10. Such components are, therefore, labeled similarly. For example, the ports A 12′ in the switch 10′ correspond to the ports A 12 in the switch 10. In the switch 10′, the queue A 16 and the queue B 22 share a single memory resource 19. Similarly, the queue C 28 and the queue D 34 are part of another single memory resource 31. Thus, in the switch 10′, the queues 16, 22, 28 and 34 are logical queues partitioned from the memory resources 19 and 31.
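The partitioning of a single memory resource, such as the memory resource 19, into logical queues such as the queue A 16 and the queue B 22 might look like the following sketch. The cell-counted capacity, the per-queue limits and the class name `SharedMemoryResource` are hypothetical; the point is only that an arrival can be refused either because its own logical queue is full or because the shared physical pool is exhausted.

```python
from collections import deque


class SharedMemoryResource:
    """One physical buffer pool partitioned into logical FIFO queues.

    Capacity and per-queue limits are counted in cells (hypothetical units)."""

    def __init__(self, capacity: int, limits: dict[str, int]):
        self.capacity = capacity          # total cells the physical memory can hold
        self.used = 0                     # cells currently stored across all queues
        self.limits = limits              # maximum cells per logical queue
        self.queues = {name: deque() for name in limits}

    def enqueue(self, name: str, cell) -> bool:
        q = self.queues[name]
        if self.used >= self.capacity or len(q) >= self.limits[name]:
            return False                  # dropped: shared pool or logical queue is full
        q.append(cell)
        self.used += 1
        return True

    def dequeue(self, name: str):
        q = self.queues[name]
        if not q:
            return None
        self.used -= 1                    # freed space becomes available to any queue
        return q.popleft()
```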
Currently, most conventional switches 10 treat all flows of traffic across the network in which they are used in the same way. There is, however, a trend toward providing customers with different services based, for example, on the price paid by a customer for service. A customer may wish to pay more to ensure a faster response or to ensure that the customer's traffic will be transmitted even when traffic for other customers is dropped due to congestion. Thus, the concept of differentiated services has been developed. Differentiated services can provide different levels of service, or flows of traffic through the network, for different customers.
DiffServ is an emerging Internet Engineering Task Force (IETF) standard for providing differentiated services (see IETF RFC 2475 and related RFCs). DiffServ is based on behavior aggregate flows. A behavior aggregate flow can be viewed as a pipeline from one edge of the network to another edge of the network. Within each behavior aggregate flow, there could be hundreds of sessions between individual hosts. However, DiffServ is unconcerned with the sessions within a behavior aggregate flow. Instead, DiffServ is concerned with allocation of bandwidth between the behavior aggregate flows. According to DiffServ, excess bandwidth is to be allocated fairly between behavior aggregate flows. Furthermore, DiffServ provides criteria, discussed below, for measuring the level of service provided to each behavior aggregate flow.
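The text does not define what "fairly" means for excess bandwidth. One common interpretation, assumed here purely for illustration, is max-min fairness computed by water-filling: every behavior aggregate flow that still wants more receives an equal share of what remains, and no flow is given more than it demands. The function name `max_min_share` and the demand figures are hypothetical.

```python
def max_min_share(excess: float, demands: dict[str, float]) -> dict[str, float]:
    """Water-filling split of excess bandwidth among behavior aggregate flows.

    Unsatisfied flows receive equal shares; a flow never gets more than its demand."""
    alloc = {f: 0.0 for f in demands}
    active = {f for f, d in demands.items() if d > 0}
    remaining = excess
    while active and remaining > 1e-12:
        share = remaining / len(active)          # equal split among flows still wanting more
        for f in sorted(active):                 # sorted only for deterministic order
            give = min(share, demands[f] - alloc[f])
            alloc[f] += give
            remaining -= give
        # Flows whose demand is now met drop out; their unused share is redistributed.
        active = {f for f in active if demands[f] - alloc[f] > 1e-12}
    return alloc
```

For example, 10 units of excess bandwidth shared among flows demanding 2, 5 and 8 units yields 2, 4 and 4: the smallest demand is met in full and the remainder is split evenly between the other two.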
One conventional mechanism for providing different levels of service utilizes a combination of weights and queue level thresholds. FIG. 3 depicts such a conventional method 50. The queue level thresholds and weights are set, via step 52. Typically, the queue level thresholds are set in step 52 by a network administrator turning knobs. The weights can be set for different pipes, or flows, through a particular queue, switch 10 or network processor 1 or 5. Thus, the weights are typically set for different behavior aggregate flows. The queue levels are observed, typically at the end of a period of time known as an epoch, via step 54. The flows for the pipes are then changed based on how the queue level compares to the queue level threshold and on the weights, via step 56. Flows for pipes having a higher weight undergo a greater change in step 56. The flow for a pipe determines what fraction of the traffic offered by the pipe to a queue, such as the queue 16, will be transmitted to that queue by the corresponding enqueuing mechanism, such as the enqueuing mechanism 14. Traffic is thus transmitted to the queue or dropped based on the flows, via step 58. A network administrator then determines whether the desired levels of service are being met, via step 60. If so, the network administrator has completed his or her task. However, if the desired level of service is not achieved, then the queue level thresholds and, possibly, the weights are reset via step 52, and the method 50 repeats.
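Steps 56 and 58 of the method 50 can be sketched as below. This is a minimal illustration under stated assumptions, not the method's actual update rule: flows are expressed as transmit fractions clamped to [0, 1], a single queue level threshold is used, and each epoch moves every flow by a fixed `step` scaled by the pipe's weight. The names `adjust_flows`, `admit` and `step` are hypothetical.

```python
import random


def adjust_flows(flows: dict[str, float], weights: dict[str, float],
                 queue_level: int, threshold: int, step: float = 0.05) -> dict[str, float]:
    """Step 56 (sketch): move every pipe's flow in the same direction,
    by an amount proportional to that pipe's weight."""
    # Queue above threshold: cut flows; at or below threshold: raise them.
    direction = -1.0 if queue_level > threshold else 1.0
    return {pipe: min(1.0, max(0.0, f + direction * step * weights[pipe]))
            for pipe, f in flows.items()}


def admit(pipe: str, flows: dict[str, float], rng: random.Random) -> bool:
    """Step 58 (sketch): transmit an offered packet to the queue with
    probability equal to the pipe's flow; otherwise drop it."""
    return rng.random() < flows[pipe]
```

With weights 1 and 4, a queue level above threshold cuts the two flows from 1.0 to 0.95 and 0.8 respectively. Nothing in the update ties a threshold to a target rate for any individual pipe, which is why the resulting service levels are hard to predict in advance.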
Although the method 50 functions, one of ordinary skill in the art will readily realize that it is difficult to determine what effect changing the queue level thresholds will have on particular pipes through the network. Thus, the network administrator using the method 50 may have to engage in a great deal of experimentation before reaching the desired flow rate for different customers, or pipes (behavior aggregate flows), in a computer network.
Moreover, “absolute priority bandwidth allocation” is required by some customers. This means that traffic is organized into N priorities with N>1. Each pipe may be assigned a priority class, with the classes conventionally designated from highest priority to lowest by the labels P0, P1, . . . , PN-1. The lowest priority, PN-1, might also be called “Best Effort.” For example, for a given customer, an email data packet may not require rapid delivery, such as within one second, but the customer may require that a file transfer protocol (FTP) session involving an inventory update be transmitted as soon as possible. Therefore, the FTP session may be assigned the highest priority, and the email a lower priority.
Absolute priority bandwidth allocation typically means that if any Pi packets are awaiting service in a queue, then they must all be served before any Pi+1 packets. With infinite storage and infinite time to live (no expiration date and constant value over time), one could go through all stored packets of one priority class on a first-in/first-out basis (FIFO), and then serve all of the packets of next priority class. However, if the amount of system storage is finite, or if the time to live of a packet is finite (as is almost always the case), then the definition and practice of optimal performance with priority becomes difficult. In particular, strict adherence to priority might in some cases imply that if the lowest class, “Best Effort”, is ever served, then only very stale Best Effort packets are processed on a FIFO basis. Therefore, the concept of absolute priority bandwidth allocation requires clarification in any real system.
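The difficulty with strict adherence to priority under finite storage and a finite time to live can be illustrated with a small simulation. The sketch below assumes one FIFO per priority class, a single shared capacity counted in packets and a fixed TTL counted in ticks (all hypothetical parameters); a packet that has waited longer than its TTL is discarded when it reaches the head of its class, and an arrival to a full buffer is dropped regardless of priority.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    priority: int    # 0 = P0 (highest) ... N-1 = Best Effort
    arrival: int     # arrival time in ticks
    name: str = ""


class StrictPriorityQueue:
    """Absolute-priority service over finite storage with a finite time to live."""

    def __init__(self, n_priorities: int, capacity: int, ttl: int):
        self.queues = [deque() for _ in range(n_priorities)]
        self.capacity = capacity      # total packets storable across all classes
        self.ttl = ttl                # maximum wait in ticks before a packet is stale
        self.dropped = []             # packets lost to overflow or expiry

    def __len__(self) -> int:
        return sum(len(q) for q in self.queues)

    def enqueue(self, pkt: Packet) -> bool:
        if len(self) >= self.capacity:
            self.dropped.append(pkt)  # overflow drops the arrival regardless of priority
            return False
        self.queues[pkt.priority].append(pkt)
        return True

    def dequeue(self, now: int):
        """Serve every waiting Pi packet before any Pi+1 packet, FIFO within a class."""
        for q in self.queues:
            while q:
                pkt = q.popleft()
                if now - pkt.arrival <= self.ttl:
                    return pkt
                self.dropped.append(pkt)  # expired while waiting behind higher classes
        return None
```

For example, if one Best Effort packet arrives at tick 0 and a P0 packet arrives and is served at every tick from 0 through 4, then with a TTL of 3 the Best Effort packet is discarded as stale when it finally reaches service at tick 5.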
The goal of approximating absolute priority bandwidth allocation must be balanced with other goals as follows:
(1) good approximation of absolute priority bandwidth allocation with finite storage capacity and limited “times to live” for packets handled;
(2) high utilization of the processor;
(3) fast reaction to changing mixes of offered traffic and, in particular, fast allocation to a burst of relatively high priority traffic;
(4) simplicity of implementation;
(5) flexibility (handling any number of priorities, preferably up to about eight);
(6) resistance to storage overflow under oversubscription, preferably up to about four to one (overflow would cause dropping of the next packet regardless of priority); and
(7) stability as the mix of priorities and the rates of offered traffic change, wherein the system does not severely punish low priority traffic due to a brief burst of high priority traffic.
What is needed is a system and method for absolute priority bandwidth allocation that can meet the above seven goals.