1. Field of the Invention
The invention relates to packet-based switching fabrics, and more particularly to a load balancing method and apparatus for selecting an appropriate next-stage module for transmission of a data packet of variable size.
2. Description of Related Art
A switch fabric for a data network is a device that allows data from any of several input ports to be communicated switchably to any of several output ports. Early data networks were based on circuit switching, in which fixed routes were established through the fabric for each session. The peak bandwidth demand of each session was allocated to the route for the entire duration of the session. When session traffic was bursty, however, circuit switching resulted in under-utilization of network resources during the time between bursts. Packet switching was developed to overcome this disadvantage, thus improving the network utilization for bursty traffic.
Packet switched networks dynamically allocate bandwidth according to demand. By segmenting the input flow of information into units called “packets,” and processing each packet as a self-contained unit, packet switched networks allow scheduling of network resources on a per-packet basis. This enables multiple sessions to share the fabric resources dynamically by allowing their packets to be interleaved across the fabric. Typically each packet includes a header indicating its destination port, and the fabric includes a routing mechanism for determining a route through the fabric, on a per-packet basis. The present invention is concerned primarily with a routing mechanism for packet switched networks rather than circuit switched networks.
Small switching fabrics can be constructed from crossbar switches, in which input ports are connected to the rows of a grid and the output ports are connected to the columns of the grid (or vice-versa). Each input port then can be connected to any output port merely by activating the switch at the grid junction at which they intersect. Multicast data flow can be supported just as easily, by turning on more than one junction switch to connect more than one output port to a single input port.
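By way of illustration only, the crossbar described above can be modeled as a grid of junction switches. The following is a minimal sketch; the class name `Crossbar`, its methods, and the rule that an output port may be driven by at most one input port are assumptions made for this example and are not taken from any particular device.

```python
class Crossbar:
    """Toy model: closed[i][o] is True when the junction switch
    joining input port i and output port o is turned on."""

    def __init__(self, num_inputs, num_outputs):
        self.closed = [[False] * num_outputs for _ in range(num_inputs)]

    def connect(self, inp, out):
        # Assumption for this sketch: one output cannot be driven by two inputs.
        for other, row in enumerate(self.closed):
            if other != inp and row[out]:
                raise ValueError("output port already driven by another input")
        self.closed[inp][out] = True

    def outputs_of(self, inp):
        # All output ports currently connected to this input port.
        return [o for o, on in enumerate(self.closed[inp]) if on]


xb = Crossbar(num_inputs=2, num_outputs=3)
xb.connect(0, 1)
xb.connect(0, 2)          # multicast: a second junction on the same input row
print(xb.outputs_of(0))   # [1, 2]
```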
Crossbar switches do not scale well to larger fabrics. Many larger fabrics therefore use a multi-stage network topology, in which switching from a number of input ports to a number of output ports is accomplished through one or more intermediate stages. Each stage can have one or more modules, each implementing its own internal switch. In addition, in a fully connected network, all of the modules in each stage of the network have respective communication paths to all of the modules in the next stage. A basic network of this sort has three stages (input, intermediate and output), but networks with any odd number of stages theoretically can be constructed by replacing the modules in any given stage with smaller multi-stage networks in recursive fashion.
A special case of multi-stage switch networks was studied by Clos in C. Clos, “A Study of Non-Blocking Switching Networks”, Bell System Technical Journal, March 1953, vol. 32, No. 3, pp. 406-424, incorporated by reference herein. A so-called Clos network has three stages, any of which can be recursed to create effectively a network with a larger odd number of stages. All input stage modules (sometimes simply called “input modules”) of the network have an equal number of input ports, all output stage modules (sometimes simply called “output modules”) have an equal number of output ports, and all input and output modules are fully interconnected with all intermediate stage modules (sometimes simply called “intermediate modules”). Clos networks can be symmetric, in which case the number of modules and the number of ports per module on the input side match the corresponding values on the output side, or they can be asymmetric, in which case the number of modules or the number of ports per module on the input side do not necessarily match the corresponding values for the output side. A symmetric Clos network, therefore, can be characterized by a triple (m, n, r) where m is the number of modules in the intermediate stage, n is the number of input ports on each input module (the same as the number of output ports on each output module), and r is the number of modules in the input stage (the same as the number of modules in the output stage). An asymmetric Clos network must be characterized by a quintuple (m, n1, r1, n0, r0). The invention is most useful in Clos networks, but under proper circumstances it can also be used in multi-stage networks that do not strictly meet the definition of a Clos network.
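As an informal aid, the (m, n, r) characterization of a symmetric Clos network can be written down as a small record. This is a sketch only; the class name `SymmetricClos` and the two helper methods are hypothetical and are introduced here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymmetricClos:
    m: int  # number of intermediate stage modules
    n: int  # input ports per input module (= output ports per output module)
    r: int  # number of input modules (= number of output modules)

    def fabric_input_ports(self):
        # Total input ports of the fabric: n ports on each of r input modules.
        return self.n * self.r

    def input_to_intermediate_links(self):
        # Fully interconnected: every input module reaches every intermediate module.
        return self.r * self.m


net = SymmetricClos(m=2, n=2, r=2)
print(net.fabric_input_ports(), net.input_to_intermediate_links())  # 4 4
```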
Multi-stage networks scale better than pure crossbar switch networks, to a point, but also introduce the possibility of blocking operation. That is, because data from more than one input port have to share the same intermediate modules, a possibility exists that when data is ready for transmission, all possible routes to the output module having the desired destination output port might be blocked by other data flows. Theoretical formulas exist for calculating the minimum required number of intermediate stage modules and stage-to-stage data link rates in order to provide non-blocking operation given specified maximum input and output port numbers and data rates, but these minimum requirements are only necessary conditions; they are not necessarily sufficient by themselves to achieve non-blocking operation. Networks also must be designed to choose appropriate routes through the intermediate stage modules for individual data packets, and to backpressure them properly.
For example, consider a 3-stage Clos network having two input modules, two output modules, two input ports on each input module, and two output ports on each output module. Assume further that the maximum data rate per input port, the maximum data rate per output port, and the stage-to-stage link data rate, are all R. Then a necessary condition for non-blocking operation is that there be at least two intermediate stage modules. This can be seen because the total output capacity of a given one of the input modules would be 2R (R to each of the two intermediate stage modules), which is no less than the maximum total input data rate of the input module, which in this case is also 2R (R from each of the two input ports to the module). The same is true for every other module in the network. However, assume now the extreme case that the routing algorithm employed by a given one of the input modules is to always send all input packets to the first intermediate stage module and never to the second. In this case, since the data rate from an input module to a single intermediate stage module is only R, the fabric will be able to transport only half the combined data rate that was promised to the two input ports of that module, and the fabric will have to block packets from one or the other or both of such input ports whenever their combined input data rate exceeds R.
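The arithmetic of this example can be checked with a short sketch. The function names below are hypothetical, rates are assumed to be integers in units of R, and only the capacity of one input module is modeled.

```python
def min_intermediate_modules(ports_per_module, port_rate, link_rate):
    """Fewest intermediate modules whose links can carry one input module's full load."""
    offered = ports_per_module * port_rate
    return -(-offered // link_rate)  # ceiling division on integers


def deliverable_rate(ports_per_module, port_rate, link_rate, modules_used):
    """Rate one input module can forward when it stripes over `modules_used` links."""
    offered = ports_per_module * port_rate
    return min(offered, modules_used * link_rate)


R = 1
print(min_intermediate_modules(2, R, R))            # 2
print(deliverable_rate(2, R, R, modules_used=2))    # 2, the full 2R
print(deliverable_rate(2, R, R, modules_used=1))    # 1, only half of 2R
```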
The algorithm used by an input module to decide which intermediate module to send the next packet to is known variously as a load balancing, channel balancing, or striping algorithm. Much research has been conducted into optimum load balancing algorithms. Many of the algorithms apply only to the older circuit switched networks, but many others apply to packet switched networks. The algorithms applicable to packet switched networks are the ones of interest in the present discussion.
It will be appreciated that striping algorithms are different from “fair queuing” algorithms, or queue scheduling algorithms, the purpose of which is to select which of a plurality of non-empty input queues the next packet is to be taken from for transmission across the fabric. Typically an input module requires both kinds of algorithms: a fair queuing algorithm to determine which input queue to service next, and then a striping algorithm to determine how to route the next packet from the input queue chosen by the fair queuing algorithm. A duality does exist between the two kinds of algorithms, but only in certain circumstances can a fair queuing algorithm be converted directly to a load balancing algorithm or vice versa. For example, whereas it might be desired to formulate a striping algorithm that will achieve certain goals under a particular set of striping conditions, there may be no useful dual of such a striping algorithm in the fair queuing arena because there is no useful dual of the goals or set of conditions in the fair queuing arena. In such a situation, it might not be intuitive that direct conversion of any known fair queuing algorithms will be optimal as a load balancing algorithm under the set of conditions for which a striping algorithm is being developed.
A good striping algorithm should be able to minimize the probability of blocking operation while utilizing all of the available channels in proportion to their respective capacities. One way to achieve these goals might be through the use of a global supervisor that is continually aware of queue lengths in all channels, and uses this information to choose the best route for the next packet. This solution does not scale well, however, for a number of reasons. First, as the number of input and output ports grows, and channel data rates increase, it becomes increasingly difficult to design logic circuitry that is fast enough to make all the required calculations in time for each packet. Second, it also becomes increasingly difficult to design in sufficient control signal capacity to transmit the information from all the various queues in the network back to the supervisor. The latter problem is only exacerbated when the various ports, queues and routes are spread out over multiple chips, boards or systems.
Because of these problems, a number of different striping algorithms have been developed for three-stage networks which do not require direct knowledge of downstream queue lengths. These algorithms therefore avoid (or at least reduce the amount of) control signaling required across the network. Because these algorithms rely on probabilities rather than deterministic calculations, they achieve the goals of non-blocking operation and fair channel usage with varying degrees of success in different circumstances.
In one such algorithm, known as round robin (RR) striping, packets are sent from the input stage to the intermediate stage modules in a round-robin order. This algorithm is generally simple to implement, but it does not take account of different bandwidth capacities available on different channels. For switching fabrics having different capacities on different channels, a weighted round robin (WRR) striping algorithm is known, in which during each round robin cycle, the number of packets transmitted on each channel is proportional to the capacity of that channel.
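The two channel sequences can be sketched as follows. The function names are hypothetical, and the ordering of channels within one weighted cycle (all of a channel's packets back to back) is an assumption of this sketch; the description above fixes only the number of packets per channel per cycle.

```python
from itertools import cycle, islice


def round_robin(num_channels):
    """Endless sequence of channel indices 0, 1, ..., num_channels - 1, 0, 1, ..."""
    return cycle(range(num_channels))


def weighted_round_robin(weights):
    """Endless sequence in which channel i appears weights[i] times per cycle.

    weights are assumed to be positive integers proportional to channel capacity.
    """
    one_cycle = [ch for ch, w in enumerate(weights) for _ in range(w)]
    return cycle(one_cycle)


print(list(islice(round_robin(3), 6)))                # [0, 1, 2, 0, 1, 2]
print(list(islice(weighted_round_robin([2, 1]), 6)))  # [0, 0, 1, 0, 0, 1]
```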
Round robin and weighted round robin striping algorithms, however, do not optimize load balancing when the packet size is variable. As an example, consider a Clos network having two modules in the intermediate stage, equal data rates on all channels, and a sequence of packets to send which alternate in size between large and small. In this case an input module implementing a round-robin striping algorithm will alternate striping between the two intermediate stage modules and will do so synchronously with the packet size. All the large size packets will therefore be sent through one of the intermediate stage modules (call it intermediate stage module #1) while all the small size packets will be sent through the other intermediate stage module (call it intermediate stage module #2). The algorithm therefore does not maximally utilize all of the available channels in proportion to their respective capacities. Nor does it ensure non-blocking operation, because the input module might have to hold up large size packets in its output queue for intermediate stage module #1, even though the route through intermediate stage module #2 might be clear. Still further, if the traffic is not well balanced across the links, then some links may be oversubscribed, i.e., presented with traffic whose rate exceeds that of the link. In the event that this imbalance persists for long enough, the node that oversubscribes the link can accumulate excess traffic until it overflows and is forced to drop packets.
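The imbalance in this example is easy to reproduce. In the sketch below the function name is hypothetical and the packet sizes of 1500 and 64 bytes are illustrative values chosen for this example only.

```python
def bytes_per_channel_rr(packet_sizes, num_channels):
    """Total bytes placed on each channel by plain round-robin striping."""
    totals = [0] * num_channels
    for i, size in enumerate(packet_sizes):
        totals[i % num_channels] += size
    return totals


# Packets alternate large, small, large, small, ...
sizes = [1500, 64] * 4
print(bytes_per_channel_rr(sizes, 2))  # [6000, 256]
```

Every large packet lands on the first channel and every small packet on the second, so the byte counts diverge without bound as the sequence continues.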
In order to address issues of variable packet size, a striping algorithm known as deficit round robin (DRR) has been developed. DRR striping is described for example in H. Adiseshu, G. Parulkar, and G. Varghese, “A Reliable and Scalable Striping Protocol,” in Proceedings of ACM SIGCOMM '96, pp. 131-141 (1996), incorporated by reference herein. According to the DRR algorithm, a credit count is maintained for each channel. Before packets are sent on a current channel, a quantum is added to the credit count for that channel. If channel capacities differ, then the quantum for each channel can be proportional to the relative capacity of that channel (Deficit Weighted Round Robin, or DWRR). If the length of the packet is smaller than the credit count for the current channel, the packet is sent on that channel and the credit count for that channel is reduced by the length of the packet. The sender continues sending packets on the current channel, concomitantly reducing the credit count for that channel, until the length of the next packet to send is greater than the credit count for the current channel. The sender then moves on to the next channel in round robin sequence, adds the quantum to the credit count for the new channel, and tests the count against the length of the new packet. As with RR and WRR, DRR and DWRR algorithms can be implemented in a distributed manner to improve scalability.
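The DRR procedure just described can be sketched as follows. This is one possible reading, not a definitive implementation: the function name is hypothetical, a packet whose length exactly equals the credit count is assumed to be sendable, and every quantum is assumed to be positive so that the loop terminates.

```python
def drr_stripe(packet_lengths, quanta):
    """Return the channel chosen for each packet under deficit round robin.

    quanta[i] is the quantum for channel i; unequal quanta give DWRR.
    """
    num_channels = len(quanta)
    credits = [0] * num_channels
    ch = 0
    credits[ch] += quanta[ch]          # quantum added before sending on a channel
    chosen = []
    for length in packet_lengths:
        # Move on while the packet does not fit the current channel's credit.
        while length > credits[ch]:
            ch = (ch + 1) % num_channels
            credits[ch] += quanta[ch]
        credits[ch] -= length          # send and charge the channel
        chosen.append(ch)
    return chosen


print(drr_stripe([100] * 4, [100, 100]))   # [0, 1, 0, 1]        equal sizes: plain RR
print(drr_stripe([100] * 6, [200, 100]))   # [0, 0, 1, 0, 0, 1]  DWRR, 2:1 capacities
```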
It will be appreciated that the DRR algorithm can be expressed in a number of different ways while still achieving the same or approximately the same packet striping sequence. For example, instead of comparing credit counts with packet lengths, the determination of whether to send the next packet on the current channel can be made simply on the basis of whether the credit count for that channel is greater than (or no less than) some fixed threshold, such as zero. Other examples will be apparent.
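A minimal sketch of the threshold formulation follows, assuming a threshold of zero, positive quanta, and credit counts that are allowed to go negative after a long packet. The function name is hypothetical.

```python
def drr_stripe_threshold(packet_lengths, quanta, threshold=0):
    """DRR variant: keep using a channel while its credit count exceeds `threshold`."""
    num_channels = len(quanta)
    credits = [0] * num_channels
    ch = 0
    credits[ch] += quanta[ch]
    chosen = []
    for length in packet_lengths:
        # The packet length is not consulted; only the credit count is tested.
        while credits[ch] <= threshold:
            ch = (ch + 1) % num_channels
            credits[ch] += quanta[ch]
        credits[ch] -= length          # may drive the count below zero
        chosen.append(ch)
    return chosen


print(drr_stripe_threshold([100] * 4, [100, 100]))          # [0, 1, 0, 1]
print(drr_stripe_threshold([250, 50, 50, 50], [100, 100]))  # [0, 1, 1, 1]
```

In the second call the 250-byte packet leaves the first channel with a negative count, so that channel is skipped on the next visit until its accumulated quanta bring the count back above the threshold.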
Both round robin and weighted round robin striping algorithms achieve the goals of non-blocking operation and fair channel usage best when the algorithm can be implemented with fixed size packets, globally across all input queues. Similarly, DRR and DWRR striping algorithms also are most successful when the algorithm can be implemented globally across all input queues. In many kinds of fabrics, however, the input queues are distributed across multiple input modules. Coordination among the input queues becomes more and more difficult as the number of input modules increases, thereby stifling the scalability of the network. In this case a fabric might be designed in which each input module implements its own striping algorithm, without coordinating with the other input modules. This solution, however, leaves open a risk that two or more input modules will synchronize. Synchronization can be problematical especially in the context of a multistage switch fabric, in which multiple senders share common data paths for part of the route to each destination (e.g. the part of the route from the intermediate stage module to the output module).
As an extreme example, consider a DRR fabric having 16 ingress modules, 16 egress modules and 5 intermediate stage modules fully interconnected to all the ingress and egress modules. Ignoring any speedup that might be incorporated into the fabric, assume that each data path internal to the fabric has a data rate of R and each fabric ingress and egress port has a data rate of 4R. The minimum conditions for non-blocking operation are satisfied in such a fabric, since five parallel routes of data rate R to a given destination should together be able to handle an input data rate of 4R to the same destination. But assume that the traffic arriving on the first ingress module consists of packets destined for egress modules 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, and so on in perpetual rotation. Assume further that the traffic arriving on each of the second through fifth ingress modules has the same sequence of destinations, and that at the time this traffic pattern begins, all five of such ingress modules have their DRR pointer pointing to the first channel. Assume further that all packets have equal size, so that DRR reduces to simple RR. In this situation, then, all five of the ingress modules will send their packets destined for egress module i via intermediate module i, where i=1, 2, 3, 4, 5, 1, 2, 3, . . . . Each i'th intermediate module thus receives data from each of the five ingress modules at a rate of 4R/5, for a total rate of data received by the intermediate module of 4R. All the data received by intermediate stage module i is destined for egress module i. But the maximum data rate between any intermediate stage module and any egress module is only R, so the fabric overall will be able to transport data at a rate of only R from each ingress module, which represents only 25% utilization (again, ignoring any speedup).
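The 25% figure can be reproduced with a small counting model. The sketch below is illustrative only: function names are hypothetical, modules are numbered from 0 rather than 1, packets are of equal size, only the five active ingress modules are modeled, and R is taken as 1 so that the five modules together offer 20R.

```python
from collections import Counter


def synchronized_rr_loads(num_ingress, num_intermediate, destinations, packets_per_ingress):
    """Packets offered to each (intermediate, egress) link when every ingress
    module runs its own round-robin pointer, all starting at channel 0."""
    loads = Counter()
    for _ in range(num_ingress):
        pointer = 0
        for k in range(packets_per_ingress):
            egress = destinations[k % len(destinations)]
            loads[(pointer, egress)] += 1
            pointer = (pointer + 1) % num_intermediate
    return loads


def utilization(loads, total_offered_rate, link_rate):
    """Fraction of offered traffic that fits through the intermediate-to-egress links."""
    total_packets = sum(loads.values())
    carried = 0.0
    for count in loads.values():
        offered = total_offered_rate * count / total_packets
        carried += min(offered, link_rate)
    return carried / total_offered_rate


loads = synchronized_rr_loads(5, 5, [0, 1, 2, 3, 4], 100)
print(sorted(loads))                    # only the five links (i, i) carry any traffic
print(utilization(loads, 20.0, 1.0))    # 0.25
```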
It can be seen, therefore, that the unmodified DRR striping algorithm can lead to a blocking fabric with a utilization that is far below 100%, at least in the event of synchronization.
Other deficit-based striping algorithms are also known. As one example, each next packet is striped to whichever route has carried the fewest total number of bytes so far. This technique might succeed at evenly distributing the traffic across multiple routes all originating from one input module, but does not avoid problems of synchronization among multiple input modules. It can be seen, in fact, that if all incoming traffic has a fixed packet size, and depending on the algorithm used to break ties among more than one of the routes, this technique reduces essentially to conventional round robin striping.
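A sketch of this fewest-bytes technique follows. The function name is hypothetical, and breaking ties in favor of the lowest-numbered route is an assumption of this sketch; with that tie-break and fixed-size packets, the output is the plain round-robin sequence noted above.

```python
def least_bytes_stripe(packet_lengths, num_routes):
    """Send each packet on the route that has carried the fewest bytes so far."""
    sent = [0] * num_routes
    chosen = []
    for length in packet_lengths:
        # min() returns the first minimal element, i.e. the lowest-numbered route on ties.
        route = min(range(num_routes), key=lambda r: sent[r])
        sent[route] += length
        chosen.append(route)
    return chosen


print(least_bytes_stripe([100] * 6, 3))          # [0, 1, 2, 0, 1, 2]
print(least_bytes_stripe([1500, 64, 64, 64], 2)) # [0, 1, 1, 1]
```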
Another known striping algorithm involves striping all packets in a strict pseudorandom sequence. This technique might successfully avoid synchronization among multiple input modules if the seeds in each module can be sufficiently randomized, but precludes the determination of a theoretical bound on the differences in channel usage at any point in time. Such a bound is important for limiting the maximum difference in latencies among all the channels, and therefore the amount of re-ordering memory required at the output ports of the system.
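For illustration, pseudorandom striping with a per-module seed might look like the following sketch. The function name is hypothetical, and Python's standard `random.Random` generator merely stands in for whatever pseudorandom source a real fabric would use.

```python
import random


def random_stripe(packet_lengths, num_channels, seed):
    """Choose a channel for each packet from a seeded pseudorandom sequence."""
    rng = random.Random(seed)   # each input module would use its own seed
    return [rng.randrange(num_channels) for _ in packet_lengths]


def bytes_per_channel(packet_lengths, channels, num_channels):
    """Total bytes placed on each channel by a given striping sequence."""
    totals = [0] * num_channels
    for length, ch in zip(packet_lengths, channels):
        totals[ch] += length
    return totals


lengths = [100] * 1000
choice = random_stripe(lengths, 4, seed=1)
print(bytes_per_channel(lengths, choice, 4))
```

The per-channel totals are close to one another on average, but nothing in the procedure caps the gap between the most and least used channel at any given instant.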
Accordingly, there is an urgent need for a switch fabric architecture that can achieve full throughput and maximum channel usage, with either fixed or variable size data packets. As an important part of achieving this goal, there is an urgent need for a new striping algorithm that can be implemented in a distributed manner, that will not bog down due to synchronization, and that permits determination of an upper bound on channel usage differences at any point in time. Furthermore, it would be extremely desirable that the striping algorithm be simple enough to integrate into a small space on an integrated circuit chip, so that it can be used on every ingress line card of a very large packet switching system.