The present invention relates to computer networks, and more particularly to a method and system for improving control over and resource allocation for a computer network, particularly a computer network capable of providing differentiated services.
Driven by increasing usage of a variety of network applications, such as those involving the Internet, computer networks are of increasing interest. In order to couple portions of a network together or to couple networks, switches are often used. For example, FIG. 1A depicts a high-level block diagram of a switch 10 which can be used in a computer network. The switch 10 includes a switch fabric 3 coupled with blades 7, 8 and 9. Each blade 7, 8 and 9 is generally a circuit board and includes at least a network processor 2 coupled with ports 4. Thus, the ports 4 are coupled with hosts (not shown). The blades 7, 8 and 9 can provide traffic to the switch fabric 3 and accept traffic from the switch fabric 3. Thus, any host connected with one of the blades 7, 8 or 9 can communicate with another host connected to another blade 7, 8 or 9 or connected to the same blade.
FIG. 1B depicts a high-level block diagram of one embodiment of a network processor 2. The network processor 2 includes an ingress switch interface (ingress SWI) 11, an ingress enqueue/dequeue/scheduling logic (ingress EDS) 12, an embedded processor complex (EPC) 13, an ingress physical MAC multiplexer (ingress PMM) 14, an egress physical MAC multiplexer (egress PMM) 15, an egress enqueue/dequeue/scheduling logic (egress EDS) 16 and an egress switch interface (egress SWI) 17. The network processor 2 may also contain other storage and processing devices. The ingress SWI 11 and egress SWI 17 are coupled with the switch fabric 3 (depicted in FIG. 1A) for the switch 10. Referring back to FIG. 1B, the EPC 13 includes a number of protocol processors plus co-processors. The ingress EDS 12 and egress EDS 16 can perform certain enqueuing, dequeuing and scheduling functions for traffic traveling from devices, such as Ethernet devices, to the switch fabric and for traffic traveling from the switch fabric to the devices, respectively. The ingress SWI 11 and egress SWI 17 provide links for connecting to other devices, such as another network processor or switch (not shown in FIG. 1B). The ingress PMM 14 and egress PMM 15 receive traffic from and transmit traffic to, respectively, physical layer devices.
FIG. 2A depicts another simplified block diagram of the switch 10, illustrating some of the functions performed by network processors 2. Although some of the functions are performed by the same components as shown in FIG. 1A, these components may be labeled differently. For example, for the purposes of explaining the path of traffic through the switch 10, the switch fabric 3 of FIG. 1A is depicted as switch fabric 26 in FIG. 2A. The switch 10 couples hosts (not shown) connected with ports A 18 with those hosts (not shown) connected with ports B 36. Thus, the switch 10 allows packets of data to be transferred from the source to the destination. Data packets could include a number of different types of data packets. For example, Ethernet packets (usually termed frames), ATM packets (usually termed cells) and IP packets (usually termed packets) will all be packets herein. The switch 10 performs various functions including classification of data packets provided to the switch 10, transmission of data packets across the switch 10 and reassembly of packets. These functions are provided by the classifier 22, the switch fabric 26 and the reassembler 30, respectively. The classifier 22 classifies packets which are provided to it and breaks each packet up into convenient-sized portions, which will be termed cells. The switch fabric 26 is a matrix of connections through which the cells are transmitted on their way through the switch 10. The reassembler 30 reassembles the cells into the appropriate packets. The packets can then be provided to the appropriate port of the ports B 36, and output to the destination hosts. The classifier 22 may be part of one network processor 1, while the reassembler 30 may be part of another network processor 5. The portions of the network processor 1 and the network processor 5 depicted perform functions for traffic traveling from ports A 18 and to ports B 36, respectively.
However, the network processors 1 and 5 also perform functions for traffic traveling from ports B 36 and to ports A 18, respectively. Thus, each network processor 1 and 5 can perform classification and reassembly functions. Furthermore, each network processor 1 and 5 can be a network processor 2 shown in FIGS. 1A and 1B.
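For purposes of illustration only, the division of a packet into cells by a classifier and the later reassembly of those cells can be sketched as follows. The cell size of 64 bytes, the function names and the assumption that cells arrive in order are hypothetical and are not taken from the description above.

```python
from typing import List

CELL_SIZE = 64  # hypothetical cell size in bytes; the text gives no figure


def segment(packet: bytes, cell_size: int = CELL_SIZE) -> List[bytes]:
    """Break a packet into cells of at most cell_size bytes."""
    return [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]


def reassemble(cells: List[bytes]) -> bytes:
    """Rebuild the packet from its cells, assuming they arrive in order."""
    return b"".join(cells)


packet = bytes(range(150))   # a 150-byte example packet
cells = segment(packet)      # cells crossing the switch fabric
restored = reassemble(cells) # packet handed to the output port
```

In this sketch the final cell is simply shorter than the others; an actual switch might instead pad the last cell or carry a length field.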
Referring back to FIG. 2A, due to bottlenecks in transferring traffic across the switch 10, data packets may be required to wait prior to execution of the classification, transmission and reassembly functions. As a result, queues 20, 24, 28 and 34 may be provided. Coupled to the queues 20, 24, 28 and 34 are enqueuing mechanisms 19, 23, 27 and 32. The enqueuing mechanisms 19, 23, 27 and 32 place the packets into the corresponding queues 20, 24, 28 and 34 and can provide a notification which is sent back to the host from which the packet originated. The classification, enqueuing, and scheduling functions are preferably provided by the ingress EDS 12 and egress EDS 16 in the network processor depicted in FIG. 1B. Referring to FIGS. 1B and 2A, the enqueuing mechanisms 19 and 23, the queues 20 and 24, the classifier 22 and the schedulers 21 and 25 are controlled using the ingress EDS 12. Similarly, the enqueuing mechanisms 27 and 32, the queues 28 and 34, the reassembler 30 and the schedulers 29 and 35 are controlled using the egress EDS 16.
Also depicted in FIG. 2A are schedulers 21, 25, 29 and 35. The schedulers control the scheduling of individual packets which are to leave the queues 20, 24, 28 and 34, respectively. In general, the concern of the present application is the egress portion of the network processor 2, depicted by egress PMM 15, egress EDS 16 and egress SWI 17 in FIG. 1B. Thus, referring back to FIG. 2A, one focus of the present invention includes the scheduler 35 which controls the traffic to ports B 36. Consequently, for clarity, the function of schedulers is discussed with regard to the scheduler 35 and the queue 34. Typically, the scheduler 35 is provided with information relating to each packet in the queue 34. This information may include the type of the packet, such as a real-time packet for which time of transmission is important, or a data packet for which the speed of transmission is not important. Based on this information and other information provided to it, the scheduler 35 determines when each individual packet in the queue 34 will be removed from the queue and sent on towards its destination. For example, the scheduler 35 may include one or more calendars (not shown), each including a number of positions, and a weighted fair queuing ring (not shown) including another number of positions. The scheduler 35 may place certain packets in the calendar and other packets in the ring. The scheduler allocates a certain amount of time to each position in the calendar. Each position in the calendar can have a single packet, typically represented by an identifier, or can be empty. When the scheduler reaches a certain position, a packet placed at that position will be retrieved from the queue and sent toward its destination. If, however, the position in the calendar is empty, the scheduler 35 waits until a particular amount of time has passed, then moves to the next position in the calendar.
Similarly, the scheduler 35 places other packets in positions of the weighted fair queuing ring of the scheduler 35. A position in the weighted fair queuing ring can also be either occupied by a single packet or empty. If the position is occupied, then the scheduler 35 sends the packet in the position upon reaching the position. If the position is unoccupied, the scheduler 35 skips to the next occupied position. Thus, using the scheduler 35 to control individual packets leaving the queue 34, traffic can flow through the switch 10.
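The difference between the two structures can be sketched as follows, for illustration only. The function names, the use of string identifiers for packets and the single pass over each structure are assumptions made for this example; the calendar charges a fixed time for every position, occupied or not, while the ring skips empty positions.

```python
from typing import List, Optional, Tuple


def serve_calendar(slots: List[Optional[str]],
                   slot_time: float) -> List[Tuple[float, str]]:
    """Visit each calendar position in order. Every position consumes
    slot_time, so an empty position still delays later packets."""
    sent = []
    now = 0.0
    for entry in slots:
        if entry is not None:
            sent.append((now, entry))  # packet identifier found: send it
        now += slot_time               # time passes even for an empty slot
    return sent


def serve_ring(slots: List[Optional[str]], start: int = 0) -> List[str]:
    """Walk the ring once from start, skipping unoccupied positions."""
    n = len(slots)
    order = []
    for step in range(n):
        entry = slots[(start + step) % n]
        if entry is not None:
            order.append(entry)
    return order


calendar_out = serve_calendar(["p1", None, "p2", None], slot_time=1.0)
ring_out = serve_ring([None, "a", None, "b", "c"], start=2)
```

Here the packet "p2" leaves the calendar at time 2.0 because the empty second position still consumed its allotted time, whereas the ring visits only its occupied positions.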
Although the queues 20, 24, 28 and 34 are depicted separately, one of ordinary skill in the art will readily realize that some or all of the queues 20, 24, 28 and 34 may be part of the same physical memory resource. FIG. 2B depicts one such switch 10′. Many of the components of the switch 10′ are analogous to components of the switch 10. Such components are, therefore, labeled similarly. For example, the ports A 18′ in the switch 10′ correspond to the ports A 18 in the switch 10. In the switch 10′, the queue A 20 and the queue B 24 share a single memory resource 31. Similarly, the queue C 28 and the queue D 34 are part of another single memory resource 33. Thus, in the switch 10′, the queues 20, 24, 28 and 34 are logical queues partitioned from the memory resources 31 and 33.
Currently, most conventional switches 10 have only two mechanisms for controlling the flow of traffic through the switch 10. One conventional method for controlling the flow of traffic through the switch 10 attempts to ensure that the relevant memory resource, such as the queue 34, is not overloaded. This conventional method is known as RED (random early discard or detection). FIG. 3 depicts the conventional method 40 used in RED. The conventional method 40 is typically used by one of the enqueuing mechanisms 19, 23, 27, 32, 19′, 23′, 27′ and 32′ to control the traffic through the corresponding queue 20, 24, 28, 34, 20′, 24′, 28′ and 34′, respectively. For the purposes of clarity, the method 40 will be explained with reference to the enqueuing mechanism 32 and the queue 34.
At the end of a short period of time, known as an epoch, a queue level of the queue 34 for the epoch is determined by the enqueuing mechanism 32, via step 41. Note that the queue level determined could be an average queue level for the epoch. In addition, the queue level determined could be the total level for the memory resource of which the queue 34 is a part. It is then determined if the queue level is above a minimum threshold, via step 42. If the queue level is not above the minimum threshold, then a conventional transmit fraction is set to one, via step 43. Step 43, therefore, also sets the conventional discard fraction to be zero. The transmit fraction determines the fraction of packets that will be transmitted in the next epoch. The conventional discard fraction determines the fraction of packets that will be dropped. The conventional discard fraction is, therefore, equal to one minus the conventional transmit fraction. A transmit fraction of one thus indicates that all packets should be transmitted and none should be dropped.
If it is determined in step 42 that the queue level is above the minimum threshold, then it is determined whether the queue level for the epoch is above a maximum threshold, via step 44. If the queue level is above the maximum threshold, then the conventional transmit fraction is set to zero and the conventional discard fraction set to one, via step 45. If the queue level is not above the maximum threshold, then the conventional discard fraction is set to be proportional to the queue level of the previous epoch divided by a maximum possible queue level or, alternatively, to some other linear function of the queue level, via step 46. Thus, the conventional discard fraction is proportional to the fraction of the queue 34 that is occupied or some other linear function of the queue level. In step 46, therefore, the conventional transmit fraction is also set to one minus the conventional discard fraction. The conventional transmit fraction and the conventional discard fraction set in step 43, 45 or 46 are then utilized for the next epoch to randomly discard packets, via step 47. Thus, when the queue level is below the minimum threshold, all packets will be transmitted by the enqueuing mechanism 32 to the queue 34 during the next epoch. When the queue level is above the maximum threshold, then all packets will be discarded by the enqueuing mechanism 32 during the next epoch or enqueued to a discard queue. When the queue level is between the minimum threshold and the maximum threshold, then the fraction of packets discarded by the enqueuing mechanism 32 is proportional to the fraction of the queue 34 that is occupied or some other linear function of the queue level. Thus, the higher the queue level, the higher the fraction of packets discarded. In addition, a notification may be provided to the sender of discarded packets, which causes the sender to suspend sending additional packets for a period of time.
The individual packets which are selected for discarding may also be randomly selected. For example, for each packet, the enqueuing mechanism 32 may generate a random number between zero and one. The random number is compared to the conventional discard fraction. If the random number is less than or equal to the conventional discard fraction, then the packet is dropped. Otherwise, the packet is transmitted to the queue 34. This process of discarding packets based on the transmit fraction is continued until it is determined that the epoch has ended, via step 48. When the epoch ends, the method 40 commences again in step 41 to determine the conventional transmit fraction for the next epoch and drop packets in accordance with the conventional transmit fraction during the next epoch.
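Steps 42 through 47 of the method 40 can be sketched as follows, for illustration only. The function names and the threshold values are hypothetical, and the constant of proportionality in step 46 is assumed to be one, so that the discard fraction equals the occupied fraction of the queue.

```python
import random


def red_discard_fraction(queue_level: float, min_threshold: float,
                         max_threshold: float, max_level: float) -> float:
    """Conventional discard fraction for the next epoch."""
    if queue_level <= min_threshold:   # step 43: transmit everything
        return 0.0
    if queue_level > max_threshold:    # step 45: discard everything
        return 1.0
    return queue_level / max_level     # step 46, proportionality assumed = 1


def admit(discard_fraction: float, rng: random.Random) -> bool:
    """Step 47: drop the packet when the random number is less than or
    equal to the discard fraction; otherwise enqueue it."""
    return rng.random() > discard_fraction


rng = random.Random(1)  # seeded only to make the example repeatable
fraction = red_discard_fraction(50.0, min_threshold=20.0,
                                max_threshold=80.0, max_level=100.0)
admitted = sum(admit(fraction, rng) for _ in range(1000))
```

With the queue half full, roughly half of the 1000 offered packets are admitted in this sketch; the conventional transmit fraction is one minus the value returned by red_discard_fraction.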
Because packets can be discarded based on the queue level, the method 40 allows some control over the traffic through the switch 10 or 10′. As a result, fewer packets may be dropped due to droptail than in a switch which does not have any mechanism for discarding packets before the queue 34 becomes full. Droptail occurs when packets must be dropped because a queue is full. As a result, there is no opportunity to account for the packet's priority in determining whether to drop the packet. Furthermore, in some situations, the method 40 can reduce the synchronization of hosts sending packets to the switch 10 or 10′. This occurs because packets may be dropped randomly, based on the conventional transmit fraction, rather than dropping all packets when the queue level is at or near the maximum queue level. Performance of the switch 10 and 10′ is thus improved over a switch that does not utilize RED, that is, a switch that simply drops next arriving packets when its buffer resources are depleted.
Although the method 40 improves the operation of the switches 10 and 10′, one of ordinary skill in the art will readily realize that in many situations, the method 40 fails to adequately control traffic through the switch 10 or 10′. Despite the fact that packets, or cells, may be dropped before the queue becomes full, the hosts tend to become synchronized in some situations. This is particularly true for moderate or higher levels of congestion of traffic in the switch 10 or 10′. The conventional transmit fraction is based on the queue level. However, the queue level may not be indicative of the state of the switch. For example, a queue level below the minimum threshold could be due to a low level of traffic in the switch 10 or 10′ (a low number of packets passing through the switch 10 or 10′). However, a low queue level could also be due to a large number of discards in the previous epoch because of high traffic through the switch 10. If the low queue level is due to a low traffic level, increasing the conventional transmit fraction is appropriate. If the low queue level is due to a high discard fraction, increasing the conventional transmit fraction may be undesirable. The conventional method 40 does not distinguish between these situations. As a result, the conventional transmit fraction may be increased when it should not be. When this occurs, the queue may become rapidly filled. The transmit fraction will then be dropped, and the queue level will decrease. When the queue level decreases the transmit fraction will increase, and the queue may become filled again. The switch 10 or 10′ thus begins to oscillate between having queues full and queues empty. As a result, the average usage of the switch 10 or 10′ becomes quite low and the performance of the network using the switch 10 or 10′ suffers.
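The oscillation described above can be reproduced with a deliberately simplified model, offered for illustration only. The model treats traffic as a fluid, updates the queue level once per epoch, and uses hypothetical values for the offered load, the amount drained per epoch and the thresholds; none of these values comes from the description above.

```python
from typing import List


def red_transmit_fraction(level: float, min_th: float,
                          max_th: float, max_level: float) -> float:
    """Conventional transmit fraction chosen from the queue level alone."""
    if level <= min_th:
        return 1.0
    if level > max_th:
        return 0.0
    return 1.0 - level / max_level


def simulate(epochs: int, offered: float, drained: float,
             min_th: float = 20.0, max_th: float = 80.0,
             max_level: float = 100.0) -> List[float]:
    """Queue level at the end of each epoch, starting from an empty queue."""
    level = 0.0
    history = []
    for _ in range(epochs):
        transmit = red_transmit_fraction(level, min_th, max_th, max_level)
        level = level + transmit * offered - drained
        level = max(0.0, min(max_level, level))  # queue cannot go below
        history.append(level)                    # empty or above full
    return history


levels = simulate(epochs=6, offered=300.0, drained=100.0)
```

Under this heavy offered load the queue level alternates between full and empty from one epoch to the next, because an empty queue caused by discards is treated the same as an empty queue caused by light traffic.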
A second conventional method for controlling traffic across the switch is used to provide customers with different services based, for example, on the price paid by a consumer for service. A consumer may wish to pay more to ensure a faster response or to ensure that the traffic for the customer will be transmitted even when traffic for other customers is dropped due to congestion. Thus, the concept of differentiated services has been developed. Differentiated services can provide different levels of service, or flows of traffic through the network, for different customers.
DiffServ is an emerging Internet Engineering Task Force (IETF) standard for providing differentiated services (see IETF RFC 2475 and related RFCs). DiffServ is based on behavior aggregate flows. A behavior aggregate flow can be viewed as a pipeline from one edge of the network to another edge of the network. Within each behavior aggregate flow, there could be hundreds of sessions between individual hosts. However, DiffServ is unconcerned with sessions within a behavior aggregate flow. Instead, DiffServ is concerned with allocation of bandwidth between the behavior aggregate flows. According to DiffServ, excess bandwidth is to be allocated fairly between behavior aggregate flows. Furthermore, DiffServ provides criteria, discussed below, for measuring the level of service provided to each behavior aggregate flow.
One conventional mechanism for providing different levels of services utilizes a combination of weights and a queue level to provide different levels of services. FIG. 4 depicts such a conventional method 50. The queue level thresholds and weights are set, via step 52. Typically, the queue level thresholds are set in step 52 by a network administrator turning knobs. The weights can be set for different pipes, or flows, through a particular queue, switch 10 or network processor 1 or 5. Thus, the weights are typically set for different behavior aggregate flows. The instantaneous queue levels, averaged queue levels, instantaneous pipe flow rates, or averaged pipe flow rates are observed, typically at the end of a period of time known as an epoch, via step 54. The flows for the pipes are then changed based on how the queue level compares to the queue level threshold and on the weights, via step 56. Flows for pipes having a higher weight undergo a greater change in step 56. The queue value or pipe flow rate for a pipe determines what fraction of traffic offered to a queue, such as the queue 34, by the pipe will be transmitted to the queue 34 by the corresponding enqueuing mechanism, such as the enqueuing mechanism 32. Traffic is thus transmitted to the queue or dropped based on the flows, via step 58. A network administrator then determines whether the desired levels of service are being met, via step 60. If so, the network administrator has completed his or her task. However, if the desired level of service is not achieved, then the queue level or pipe flow level thresholds and, possibly, the weights are reset, via step 52 and the method 50 repeats.
Although the method 50 functions, one of ordinary skill in the art will readily realize that it is difficult to determine what effect changing the queue level thresholds will have on particular pipes through the network. Thus, the network administrator using the method 50 may have to engage in a great deal of experimentation before reaching the desired flow rate for different customers, or pipes (behavior aggregate flows), in a computer network.
Furthermore, the method 50 operates only indirectly on parameters that are typically used to measure the quality of service. Queue levels are not a direct measure of criteria typically used for service. Typically, for example in DiffServ (see IETF RFC 2475 and related RFCs), levels of service are measured by four parameters: drop rate, bandwidth, latency and jitter. The drop rate is the percentage of traffic that is dropped as it flows across a switch. The bandwidth of a behavior aggregate flow is a measure of the amount of traffic for the behavior aggregate flow which crosses the switch and reaches its destination. Latency is the delay incurred in sending traffic across the network. Jitter is the variation of latency with time. The queue levels are not considered to be a direct measure of quality of service. Thus, the method 50 does not directly address any of the criteria for quality of service. Consequently, it is more difficult for a network administrator to utilize the method 50 for providing different levels of service.
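The four parameters can be computed from simple measurements, as sketched below for illustration only. The function names and sample values are hypothetical. In particular, jitter is taken here as the mean absolute change between successive latency samples, which is only one of several possible ways to quantify the variation of latency with time.

```python
from typing import List


def drop_rate_percent(offered_packets: int, delivered_packets: int) -> float:
    """Percentage of offered traffic that was dropped crossing the switch."""
    return 100.0 * (offered_packets - delivered_packets) / offered_packets


def bandwidth_bps(delivered_bytes: int, interval_s: float) -> float:
    """Traffic that reached its destination, in bits per second."""
    return 8 * delivered_bytes / interval_s


def mean_latency(latencies_ms: List[float]) -> float:
    """Average delay over the sampled packets."""
    return sum(latencies_ms) / len(latencies_ms)


def jitter(latencies_ms: List[float]) -> float:
    """Mean absolute change between successive latencies (an assumed
    definition chosen for this sketch)."""
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)


samples_ms = [2.0, 4.0, 3.0, 5.0, 6.0]
report = {
    "drop_rate_percent": drop_rate_percent(1000, 950),
    "bandwidth_bps": bandwidth_bps(125000, 1.0),
    "latency_ms": mean_latency(samples_ms),
    "jitter_ms": jitter(samples_ms),
}
```

None of these quantities can be read directly from a queue level, which is why a threshold on queue level gives an administrator only indirect control over them.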
Another conventional method for controlling traffic utilizes flows, minimum flow rates, weights, priorities, thresholds and a signal indicating that excess bandwidth, or ability to transmit traffic, exists in order to control flows. However, it is not clear that this conventional method is a stable mechanism for controlling traffic through the switch. Consequently, this conventional method may not adequately control traffic through the switch 10.
Moreover, even when using the method 40 depicted in FIG. 3 (conventional RED) and the method 50 depicted in FIG. 4, a scheduler such as the scheduler 35 may be provided with a larger amount of work than can be accomplished in a given time. In particular, if too many packets are desired to be removed from the queue 34 in a particular amount of time, the scheduler may be unable to cope with the traffic through the switch 10. For example, the scheduler 35 may be capable of ensuring that packets removed from the queue 34 are forwarded toward their final destination, such as a target port, at a first rate. The first rate may be limited by a variety of factors, such as the ability of the target port to accept traffic. The method 40 and the method 50 may allow packets to be removed from the queue at a second rate. If the second rate is larger than the first rate, then packets begin to back up in the scheduler 35. For example, if the scheduler 35 includes a calendar and a weighted fair queuing ring, then all of the positions in the calendar and the weighted fair queuing ring can eventually become occupied. The packets will thus be stalled from leaving the queue 34 until a position in the calendar or the weighted fair queuing ring in the scheduler 35 becomes open. As a result, the latency for packets traveling through the switch 10 will increase. Traffic will thus be slowed. Consequently, the switch 10 will not function as efficiently as desired.
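As a worked example of the backlog described above, offered for illustration only, the time until the scheduler runs out of open positions can be estimated from the two rates. The function name, the rates and the number of positions are hypothetical, and both rates are assumed to be constant.

```python
from typing import Optional


def time_until_stall(open_positions: int, second_rate: float,
                     first_rate: float) -> Optional[float]:
    """Seconds until every calendar and ring position is occupied.

    second_rate: packets per second released from the queue to the scheduler
    first_rate:  packets per second the scheduler can forward onward
    Returns None when the scheduler keeps up and no backlog builds.
    """
    if second_rate <= first_rate:
        return None
    return open_positions / (second_rate - first_rate)


stall_s = time_until_stall(open_positions=500,
                           second_rate=1250.0, first_rate=1000.0)
```

With 500 open positions and a surplus of 250 packets per second, the positions fill in two seconds in this example; after that point packets wait in the queue 34 and latency grows.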
Accordingly, what is needed is a system and method for improving the efficiency of a switch while providing differentiated services. The present invention addresses such a need.
The present invention provides a method and system for controlling a plurality of pipes in a computer network. The computer network includes at least one processor and a switch. The at least one processor includes a queue. The plurality of pipes use the queue for transmitting traffic through the switch. The method and system comprise allowing a minimum flow and a maximum flow to be set for each of the plurality of pipes. The method and system also comprise determining if excess bandwidth exists for the queue, a queue level for the queue, and an offered rate of the plurality of packets to the queue. The method and system also comprise controlling a global transmit fraction of the plurality of packets to the queue. The global transmit fraction is controlled based on the queue level and the offered rate so that the transmit fraction and the queue level are critically damped if the queue level is between at least a first queue level and a second queue level. The method and system also comprise setting a transmit fraction for a flow for a pipe of the plurality of pipes to be a minimum of the global transmit fraction and a differential transmit fraction. The differential transmit fraction can linearly increase the flow based on the minimum flow or the maximum flow if excess bandwidth exists and if the flow for the pipe is less than the maximum flow. The differential transmit fraction can also exponentially decrease the flow for the pipe based on the minimum flow or the maximum flow if excess bandwidth does not exist and the flow is greater than the minimum flow. Thus traffic through the queue is stable. The method and system also comprise controlling transmission of traffic to the queue based on the transmit fraction and utilizing a scheduler to control traffic from the queue. The scheduler can therefore send frames in various pipes according to the policies an administrator specifies. 
During a time of oversubscription, the upstream flow control will react first with per pipe per port discards to restrain the workload sent to the scheduler to a feasible workload. Thus, the differential transmit fraction can aid in controlling the workload sent to the scheduler. In the event of a large surge of traffic, the portion of flow control which reacts to the shared memory depletion might also help to keep the workload feasible by declaring a generic transmit rate for the entire blade at a value less than one. In other words, the global transmit fraction may also aid in controlling the workload sent to the scheduler.
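The per-pipe control summarized above can be sketched as follows, for illustration only. The function names, the fixed increment of 0.05 and the decay factor of 0.5 are hypothetical; the summary states that the increase and decrease are based on the minimum flow or the maximum flow, and that dependence is not modeled here. The global transmit fraction is taken as a given input, since its critically damped computation is not detailed in this section.

```python
def update_differential(fraction: float, flow: float, min_flow: float,
                        max_flow: float, excess_bandwidth: bool,
                        step: float = 0.05, decay: float = 0.5) -> float:
    """One update of a pipe's differential transmit fraction."""
    if excess_bandwidth and flow < max_flow:
        fraction = fraction + step    # linear increase
    elif not excess_bandwidth and flow > min_flow:
        fraction = fraction * decay   # exponential (multiplicative) decrease
    return min(1.0, max(0.0, fraction))  # keep within [0, 1]


def pipe_transmit_fraction(global_fraction: float,
                           differential_fraction: float) -> float:
    """Transmit fraction for the pipe: the minimum of the two fractions."""
    return min(global_fraction, differential_fraction)


raised = update_differential(0.5, flow=10.0, min_flow=5.0,
                             max_flow=20.0, excess_bandwidth=True)
lowered = update_differential(0.8, flow=10.0, min_flow=5.0,
                              max_flow=20.0, excess_bandwidth=False)
applied = pipe_transmit_fraction(global_fraction=0.7,
                                 differential_fraction=lowered)
```

Taking the minimum means a pipe is throttled by whichever control is more restrictive at the moment: the per-pipe fraction during ordinary oversubscription, or the global fraction during a large surge.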
According to the system and method disclosed herein, the present invention provides an improved mechanism for controlling traffic through a switch using a scheduler while providing differentiated services.